Bottom Line: This paper makes a timely, novel, and important contribution by revealing systematic vulnerabilities in LLM-based peer review systems. The finding that fabricated papers achieve 82% acceptance rates—and the discovery of "concern-acceptance conflict" where reviewers flag integrity issues yet recommend acceptance—represents a significant warning for scientific publishing. However, critical methodological gaps, particularly the absence of human reviewer baselines and insufficient statistical reporting, substantially limit the strength of conclusions. With substantial revisions addressing these issues, this work has the potential for high impact at a top-tier venue.
The paper investigates a critical question at the intersection of AI safety and scientific integrity: Can AI research agents generate convincing but unsound papers that deceive LLM-based review systems? This addresses an urgent concern as the scientific community increasingly adopts LLM-powered research assistants for paper generation and LLM-based systems for peer review, creating potential for fully automated "AI-only publication loops" without human oversight.
Primary Contributions:
The BadScientist framework consists of two adversarial components: a Paper Agent that generates fabricated papers using five presentation-manipulation strategies (TooGoodGains, BaselineSelect, StatTheater, CoherencePolish, ProofGap) requiring no real experiments, and a Review Agent that evaluates papers using multiple LLMs (o3, o4-mini, GPT-4.1) calibrated against ICLR 2025 data.
Key empirical findings include fabricated papers achieving acceptance rates up to 82.0%, with the discovery of a novel "concern-acceptance conflict" phenomenon where reviewers frequently flag integrity issues yet assign acceptance-level scores. Tested mitigation strategies (Review-with-Detection and Detection-Only) show limited effectiveness, with detection accuracy barely exceeding random chance (57-67% vs. 50% baseline).
Theoretical contributions include formal error guarantees through concentration bounds (Theorem 1: Bernstein-McDiarmid bounds for ensemble scoring) and calibration analysis (Propositions 1-2 on threshold selection), with empirical validation demonstrating these bounds hold in practice.
The BadScientist framework is well-conceived conceptually. The bi-directional adversarial setup (generator vs. reviewer) appropriately mirrors real attack scenarios, and the multi-model review approach (o3, o4-mini, GPT-4.1) reduces single-model biases. The use of formal error guarantees through concentration bounds adds theoretical rigor often absent in adversarial ML research.
However, critical experimental design flaws severely limit interpretability:
Missing Controls: The paper lacks essential baseline comparisons. Most critically, there is no evaluation of how human reviewers perform on the same fabricated papers, making it impossible to determine whether the 82% acceptance rate reflects AI-specific vulnerabilities or general review system weaknesses. Additionally, no legitimate papers are evaluated through the same review system to establish baseline acceptance rates. This fundamental omission means we cannot determine if 82% acceptance is anomalous.
Ecological Validity Concerns: The laboratory setting with seed topics and single-shot reviews doesn't reflect real peer review, which includes author rebuttals, reviewer discussions, and meta-reviews. The paper doesn't test whether adversaries could adapt strategies based on rejections. Moreover, results may be highly sensitive to exact prompts used (not disclosed in detail), raising questions about reproducibility and generalizability.
Single-Domain Testing: Calibration exclusively on ICLR 2025 (AI/ML conference) raises serious generalization concerns. Different scientific fields have vastly different review cultures, standards, and susceptibility to manipulation. Medical journals requiring data availability, for instance, might show dramatically different vulnerability patterns.
The five atomic strategies show mixed realism:
TooGoodGains and BaselineSelect are well-documented phenomena in actual scientific misconduct, reflecting real pressure to publish extraordinary results and known cherry-picking behaviors.
StatTheater is partially problematic: while polished presentation with "forthcoming" repository links reflects real manipulation, fabricating precise p-values without actual experiments risks detection through statistical impossibility. Fabricated statistics are often internally inconsistent (for example, a p-value that does not match the reported test statistic), which careful reviewers should catch.
CoherencePolish (professional formatting) is concerning as a manipulation strategy. That this improves acceptance suggests the review system is evaluating surface-level presentation rather than content validity—a finding about reviewer quality, not paper convincingness.
ProofGap is critically underspecified. The paper doesn't detail how proof gaps are concealed, yet this is arguably the most intellectually challenging manipulation and needs substantial elaboration.
Critical Missing Elements: No strategy addresses fundamental issues that fabricated papers should exhibit: lack of reproducible methods, inconsistent internal logic across sections, or absence of domain-specific technical depth that experts would recognize. The strategies also don't include adversarial evasion techniques (e.g., prompt injection aimed at detection systems).
Concern with model naming: The paper lists models as "o3, o4-mini, GPT-4.1, GPT-5". The arXiv identifier (2510.xxxx) follows the YYMM convention and therefore indicates an October 2025 submission, which is consistent with o3, o4-mini, and GPT-4.1 being 2025 models. The paper should nonetheless state the exact GPT-5 version and access dates used, since these are not specified.
Selection justification absent: No rationale explains why these specific models were chosen. Critically, the paper omits major LLM families including Claude (Anthropic), Gemini (Google), and Llama (Meta), severely limiting conclusions about LLM reviewers generally. The paper doesn't specify exact model versions with timestamps, making replication difficult. Additionally, why is GPT-5 (ostensibly the most capable model) relegated to integrity checking rather than primary review?
Strengths: Using ICLR 2025 data (11,565 submissions, 32.08% acceptance rate) provides substantial calibration data from a prestigious venue with rigorous standards. The formal approach using concentration bounds is technically sound.
Critical concerns:
Temporal validity: The arXiv identifier (2510.xxxx, YYMM format) indicates an October 2025 submission, which postdates the January 2025 release of ICLR 2025 decisions, so the calibration data were available in principle. The paper should still state when and how the calibration data were obtained.
Distribution mismatch: ICLR is domain-specific (ML/AI). Calibrating on ICLR may not generalize to biology, medicine, physics, or social sciences where review cultures differ fundamentally.
Potential contamination: ICLR 2025 reportedly introduced an LLM review feedback agent (documented in related research). If calibration data includes LLM-influenced reviews, this creates circular contamination that undermines the baseline.
Methodological opacity: The paper never specifies HOW calibration was performed—which review aspects were calibrated (scores, decision boundaries, concern thresholds)? How many ICLR papers were used? Were fabricated papers matched to ICLR topic distribution? This lack of transparency is a major reproducibility issue.
Concentration bounds (Hoeffding, McDiarmid) are appropriate for providing finite-sample guarantees, and the theoretical analysis (Appendix A.2) is exemplary, with rigorous proofs and empirical validation (Figure 3). However, these bounds assume independent samples, whereas review outcomes may be correlated (same LLM, similar prompts) and papers from the same seed topics may share features. The paper doesn't address this dependence structure.
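To make the dependence concern concrete, the sketch below compares a Hoeffding-style confidence half-width under full independence with the same half-width after shrinking the sample to an effective size using the Kish design effect. The numbers (300 reviews, 3 reviews per paper, intra-paper correlation 0.5) are hypothetical and not taken from the paper, and plugging an effective sample size into Hoeffding's bound is a heuristic illustration rather than a rigorous guarantee.

```python
import math

def hoeffding_half_width(n, delta=0.05, value_range=1.0):
    """Half-width t with P(|mean - mu| >= t) <= delta for n independent
    samples bounded in an interval of length value_range."""
    return value_range * math.sqrt(math.log(2 / delta) / (2 * n))

def effective_sample_size(n, cluster_size, rho):
    """Kish design effect: n / (1 + (m - 1) * rho) for clusters of size m
    with intra-cluster correlation rho."""
    return n / (1 + (cluster_size - 1) * rho)

# Hypothetical setup: 300 review scores, 3 correlated reviews per paper.
n_reviews = 300
n_eff = effective_sample_size(n_reviews, cluster_size=3, rho=0.5)

print(f"independent: +/- {hoeffding_half_width(n_reviews):.3f}")
print(f"effective n = {n_eff:.0f}: +/- {hoeffding_half_width(n_eff):.3f}")
```

Under these assumed values the interval widens noticeably, which is why the dependence structure should be reported alongside the bounds.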
Calibration analysis is partially appropriate but incomplete. While calibration is crucial for LLM evaluation, the paper doesn't distinguish between model calibration (confidence matches accuracy), cross-calibration between LLMs, and calibration to human review distributions. Standard calibration metrics (Expected Calibration Error, Maximum Calibration Error, reliability diagrams) are not reported.
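As a reference point for the missing metrics, Expected Calibration Error can be computed with equal-width confidence bins as sketched below. The inputs are hypothetical: `confidences` stands for a reviewer model's stated probability that a paper merits acceptance, and `outcomes` for a 0/1 ground-truth label. The paper does not describe such data, so this only illustrates the metric.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE with equal-width bins: sum over bins of
    (bin size / N) * |empirical accuracy - mean confidence|."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

# Made-up example: a reviewer that says "90% sure" but is right half the time.
overconfident = expected_calibration_error([0.9] * 10, [1, 0] * 5)
print(f"ECE = {overconfident:.2f}")
```

A reliability diagram plots the same per-bin accuracy against mean confidence, so both can be produced from one pass over the data.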
Critical missing methods:
ACPT (Acceptance Rate) is a direct measure of the research question but has significant limitations. It's a binary metric that doesn't capture degrees of reviewer confidence. Most critically, without acceptance rates for legitimate papers in the same system, we cannot interpret whether 82% is anomalous. The paper also doesn't report score distributions, even though not all acceptances are equal (a score of 6.5 vs. 8.0 matters).
ICR (Integrity Concern Rate) definition is unclear. Based on context, it measures when reviewers flag concerns yet accept papers, but the operational definition is never formally stated. What counts as "flagging" a concern? Are all concerns weighted equally? This non-standard metric needs precise mathematical definition.
Critical missing metrics:
The paper tests only two mitigation approaches with insufficient detail. Based on naming, ReD appears to be "Review-with-Detection" (bundling conventional review with mandatory AI-content detection) and DetOnly is "Detection-Only" (pure detection without review), but exact implementations are not fully specified.
Critical limitations:
Missing essential baselines: Should compare against statistical forensics (detecting impossible p-value distributions), plagiarism detection (template reuse), adversarial training (training reviewers on fabricated examples), multi-stage review with human verification, and ensemble detection methods.
No adversarial robustness: Are mitigations robust to adaptive attackers? If generators know about ReD/DetOnly, can they evade them? No adversarial evasion testing is reported.
Missing from security literature: The paper doesn't incorporate detection methods from adversarial ML research (input sanitization, uncertainty quantification, anomaly detection) or multi-agent debate systems shown to improve LLM accuracy.
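As one illustration of the statistical-forensics baseline suggested above, a reported test statistic and p-value can be checked for mutual consistency. The sketch assumes a two-sided z-test and a 10% relative tolerance for rounding; both are illustrative choices, and the function names are hypothetical rather than part of any existing tool.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def is_consistent(reported_z, reported_p, rel_tol=0.1):
    """True if the reported p-value matches the one implied by reported_z,
    allowing rel_tol relative slack for rounding in the manuscript."""
    return math.isclose(two_sided_p_from_z(reported_z), reported_p, rel_tol=rel_tol)

# Made-up reported values: the first pair is coherent, the second is not.
for z, p in [(1.96, 0.05), (1.20, 0.001)]:
    verdict = "consistent" if is_consistent(z, p) else "INCONSISTENT"
    print(f"z={z}, reported p={p}: {verdict}")
```

A real baseline would need to parse statistics from the manuscript and handle t, F, and chi-square tests as well; the point here is only that fabricated numbers can fail cheap arithmetic checks.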
The paper's most severe weakness is the disconnect between rigorous theoretical foundations and insufficient empirical reporting. While Appendix A.2 provides exceptional theoretical analysis with formal proofs and guarantees, the main experimental results lack fundamental statistical measures.
Critical omissions in all results tables:
According to standard statistical reporting guidelines (APA, CONSORT), all effect estimates must include point estimate, 95% confidence interval, sample size, and effect size measure. This paper provides only point estimates.
Example of the problem: The abstract reports "acceptance rates up to 82.0%" but we don't know if this represents 82/100 papers (95% CI: 73-89%) or 41/50 papers (95% CI: 68-92%)—a critical difference for interpretation.
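The dependence of the interval on sample size can be reproduced with a Wilson score interval, sketched below. The sample sizes are the hypothetical ones from the example above, since the paper reports none. Wilson intervals differ slightly from exact (Clopper-Pearson) or Wald intervals, so the endpoints may not match other methods to the last digit.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Same point estimate (82%), two hypothetical sample sizes.
for accepted, total in [(82, 100), (41, 50)]:
    low, high = wilson_interval(accepted, total)
    print(f"{accepted}/{total}: {low:.1%} to {high:.1%}")
```

Halving the sample size widens the interval by several percentage points on each side, which is why N must accompany every reported rate.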
The paper reports no statistical significance tests comparing:
Without significance testing, we cannot determine if reported differences are real effects or sampling variance. For instance, comparing s1 acceptance across models: o3=67.0%, o4-mini=82.0%, GPT-4.1=38.4%. Is this difference statistically significant? Under a hypothetical N=50 per model, a chi-square test of homogeneity gives χ²(2) ≈ 20.9, p < 0.0001, but the paper never reports such a test.
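The test the authors could report is a Pearson chi-square test of homogeneity across the three reviewer models. The sketch below assumes equal group sizes and a hypothetical N=50 per model (the paper does not state N); the statistic scales linearly with N, so the result depends directly on that assumption. For 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), which avoids any external dependency.

```python
import math

def chi2_homogeneity(accept_rates, n_per_group):
    """Pearson chi-square statistic for H0: all groups share one acceptance
    rate. Assumes equal group sizes; works from rates, so rounded
    percentages that imply fractional counts are tolerated."""
    k = len(accept_rates)
    pooled = sum(accept_rates) / k
    stat = sum(n_per_group * (p - pooled) ** 2 for p in accept_rates)
    stat /= pooled * (1 - pooled)
    return stat, k - 1

def chi2_sf_df2(x):
    """Survival function of chi-square with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)

# s1 acceptance rates for o3, o4-mini, GPT-4.1 as quoted in this review.
rates = [0.670, 0.820, 0.384]
stat, df = chi2_homogeneity(rates, n_per_group=50)  # N=50 is hypothetical
print(f"chi2({df}) = {stat:.1f}, p = {chi2_sf_df2(stat):.1e}")
```

With the true counts, the same test takes one line in any statistics package and belongs in the caption of Table 1.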
Multiple comparison problem: With 6 strategy settings × 3 models = 18 acceptance-rate comparisons plus multiple outcome metrics, about 0.9 false positives are expected at α=0.05 even if no true differences exist. The paper applies no multiple testing corrections (Bonferroni, FDR), creating a high risk of reporting spurious findings. The emphasis on "up to 82.0%" (maximum across all combinations) raises cherry-picking concerns.
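The size of the problem and the two corrections named above can be sketched in a few lines. The family-wise error rate formula assumes independent tests, which is a simplification here, and the p-values passed to the Benjamini-Hochberg helper are made up for illustration.

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(m, alpha=0.05):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / m

def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the BH step-up procedure at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])

m = 18  # 6 strategy settings x 3 models
print(f"expected false positives: {m * 0.05:.1f}")
print(f"family-wise error rate:   {family_wise_error_rate(m):.2f}")
print(f"Bonferroni threshold:     {bonferroni_threshold(m):.4f}")

# Made-up p-values, for illustration only.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

Bonferroni is the more conservative choice; Benjamini-Hochberg controls the false discovery rate and typically retains more power when many comparisons are reported.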
Claim: "82% acceptance rate" - Evidence exists (Table 1) but interpretation is weak. This represents the o4-mini model only, on the s1 strategy only. Without confidence intervals or comparison to legitimate paper baselines, we cannot assess whether this is genuinely anomalous. The claim needs context: "82.0% acceptance [95% CI: XX-XX%, N=XX papers] for TooGoodGains strategy by o4-mini reviewer, significantly higher than both o3 (67.0%, p<0.001) and the ICLR baseline acceptance rate of 32%."
Claim: "Concern-acceptance conflict" - This phenomenon is well-documented in Table 2, showing reviewers flag integrity issues yet assign acceptance-level scores. However, "frequently" is vague when o4-mini shows 100% conflict but GPT-4.1 shows 0-75% depending on strategy. The quantification is present but incomplete without sample sizes (100% could be 10/10 or 1/1 papers). Justification is moderate—phenomenon clearly exists but magnitude requires more rigorous statistical support.
Claim: "Detection accuracy barely exceeding random chance" - This is misleading. Table 4 shows ReD achieves 67% accuracy vs. a 50% random baseline, a gain of 17 percentage points (+34% relative; Cohen's h ≈ 0.35, a small-to-medium effect). While 67% is indeed poor for production deployment, "barely exceeding random" understates the actual improvement, so the wording should be tightened to match the data.
Crucial missing context: The paper never compares fabricated paper acceptance (82%) to legitimate paper acceptance through the same review system. Top AI conferences typically accept roughly 20-30% of submissions (ICLR 2025: 32.08%). If legitimate papers achieved about 32% acceptance and fabricated papers 82%, the difference would be +50 percentage points (about 2.6 times the baseline, a 156% relative increase), a large effect (Cohen's h ≈ 1.06). But without this baseline comparison, the 82% finding lacks proper interpretation.
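The effect sizes discussed in this section can be recomputed with Cohen's h, the difference of arcsine-transformed proportions. The sketch uses the ICLR 2025 acceptance rate of 32.08% reported earlier in this review as a stand-in for the missing legitimate-paper baseline. That substitution is an assumption, since the real baseline would come from running legitimate papers through the same review system.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def relative_increase(p_new, p_base):
    """Relative increase of p_new over p_base (1.0 means +100%)."""
    return (p_new - p_base) / p_base

fabricated, baseline = 0.82, 0.3208   # baseline is an assumed stand-in
print(f"h (82% vs ICLR baseline): {cohens_h(fabricated, baseline):.2f}")
print(f"relative increase:        {relative_increase(fabricated, baseline):.0%}")
print(f"h (67% vs 50% detection): {cohens_h(0.67, 0.50):.2f}")
```

By Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), the first comparison would be a large effect and the detection result a small-to-medium one.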
This discovery is a highlight of the paper—the finding that reviewers simultaneously flag integrity concerns yet assign acceptance-level scores reveals a fundamental architectural flaw in how LLMs generate concerns and scores independently without consistency enforcement. Table 2 effectively quantifies this, showing o4-mini exhibits 100% conflict for some strategies while even "strict" o3 shows 26-52% conflict.
The statistical issue is that we need sample sizes to properly interpret. If o4-mini shows 100% conflict, we must know: "100% (N=X papers with concerns out of Y total s1 papers) [Wilson 95% CI: XX%-100%]." Without this, the finding, while directionally clear, lacks statistical rigor for publication.
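For the special case of a 100% observed rate, the Wilson lower bound reduces to n / (n + z²), which makes the effect of N easy to see. The sample sizes below are hypothetical, chosen only to show the range of possibilities behind a "100%" cell.

```python
def wilson_lower_bound_all_successes(n, z=1.96):
    """Wilson 95% lower bound when all n trials are successes.
    Follows from the general Wilson formula with p_hat = 1."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n / (n + z**2)

# Hypothetical counts of papers with flagged concerns, all accepted.
for n in (1, 5, 10, 50):
    lower = wilson_lower_bound_all_successes(n)
    print(f"100% ({n}/{n}) [Wilson 95% CI: {lower:.0%}-100%]")
```

A single conflicting paper is compatible with a true conflict rate far below 100%, whereas fifty are not, so the two cases should not be presented identically.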
The claim that mitigation strategies show "marginal improvements" is imprecise. Quantitatively, ReD (o3) achieves 67% accuracy vs. 50% random—this is a 17 percentage point improvement, which is modest but meaningful. The paper provides no significance tests to confirm this exceeds random (though for n=100, a binomial exact test would give p<0.001).
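The significance check mentioned above can be run as an exact binomial test against the 50% chance level. The sketch assumes n=100 evaluated papers and 67 correct decisions, following the hypothetical in the text; the paper does not report the actual count.

```python
import math

def binom_tail_ge(k, n, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_papers, n_correct = 100, 67            # hypothetical sample size
p_one_sided = binom_tail_ge(n_correct, n_papers)
p_two_sided = min(1.0, 2 * p_one_sided)  # valid doubling because p=0.5 is symmetric

print(f"one-sided p = {p_one_sided:.2e}")
print(f"two-sided p = {p_two_sided:.2e}")
```

With a smaller n the same 67% accuracy may not reach significance, which again makes the missing sample sizes the limiting factor.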
More critically, the mitigation section feels rushed compared to the detailed main experiments. Tables 3-4 lack the depth of analysis in earlier sections, and no comparison to established detection methods from the literature is provided.
The research question is genuinely novel, addressing a unique convergence point not previously studied. While related areas exist (AI-generated text detection, LLM peer review systems, AI scientists, paper mills), BadScientist is the first to systematically study the adversarial interplay between fabrication-oriented research agents and LLM review systems.
Distinction from prior work:
Key gap filled: Previous work treats generation and reviewing as separate problems. BadScientist uniquely evaluates the coupled system under integrity-focused attacks, testing AI-only publication loops where AI generates AND reviews without human oversight.
The BadScientist framework represents a substantial methodological contribution as the first rigorous red-teaming framework specifically for scientific integrity in AI publication loops. The five atomic manipulation strategies provide a reproducible framework that can be adapted to test other review systems. Most importantly, the theoretical contributions—formal error guarantees through concentration bounds (Theorem 1, Corollary 2) and calibration analysis (Propositions 1-2)—are more rigorous than typical ad-hoc adversarial testing in AI safety literature.
The framework's practical value is substantial: it can be adapted to red-team LLM review systems before deployment, benchmark detection methods, and test mitigation strategies systematically. This addresses a critical need as venues like AAAI pilot LLM-assisted review in 2025.
High acceptance rates (82%) provide the first quantitative evidence of LLM reviewer vulnerability to fabricated papers. No prior research has documented this vulnerability with concrete numbers. The comparison to traditional paper mills (which manipulate human review at unknown rates) shows LLM reviewers may be MORE vulnerable than expected given their pattern-matching capabilities.
Concern-acceptance conflict is a highly novel discovery with no prior documentation in the literature. This reveals that LLMs generate concerns and review scores independently without consistency enforcement, which is a fundamental architectural limitation. The quantified rates (o4-mini: 100% conflict for some strategies; o3: 26-52% conflict) are concerning.
Low detection effectiveness (57-67% vs. 50% random) shows dramatic degradation from standard AI detection tools that typically achieve 80-95% accuracy on benign text. This aligns with adversarial ML literature showing model brittleness under attack but provides the first evidence in scientific peer review context.
This work addresses a concrete, tangible safety risk—demonstrating potential for real harm from AI-only loops without human oversight. The finding represents an alignment failure: current LLMs fail to align detection concerns with decision-making, voicing concerns yet recommending acceptance.
Connection to AI safety literature: The work follows established adversarial robustness frameworks (CSET 2021, NIST AI RMF 100-2e2025) and addresses AI misuse concerns documented by He et al. (2023) on AI4Science risks. It provides rare empirical evidence quantifying failure modes rather than theoretical speculation.
Critical policy implications: AAAI's 2025 pilot for LLM-assisted review explicitly maintains human oversight—BadScientist validates this conservative approach. The findings suggest need for mandatory disclosure requirements, provenance verification systems, and defense-in-depth safeguards. As research agents become more capable (Google's AI Co-Scientist, Sakana's AI Scientist), vulnerability likely increases, making this a pressing concern.
The work addresses an urgent threat to scientific integrity. Recent research shows paper mills are "large, resilient, and growing rapidly" (PNAS 2025), outpacing legitimate publications in some fields. BadScientist demonstrates that LLMs could automate and scale paper mill operations by 100x—one estimate suggests AI Scientist produces papers at ~$15/paper cost.
Critical vulnerability: Traditional paper mill detection uses image forensics (duplicate images, figure manipulation), but AI-generated papers exhibit no image duplication patterns, have grammatically perfect text (vs. paper mill tell-tale phrases), show internally consistent fabrications (vs. reused templates), and scale at near-zero marginal cost. This makes them MORE dangerous than traditional paper mills.
Trust erosion risk: Paper-mill articles receive median 11 citations, meaning fabrications pollute scientific literature permanently. If fabrications become indistinguishable from genuine research, scientific epistemology breaks down. This is particularly dangerous for evidence-based policy—medical guidelines and public health policy rely on scientific literature integrity.
Immediate research influence (1-2 years): Will likely influence NeurIPS, ICML, ICLR policies on AI-assisted submission/review. Expected high citation count from AI safety, scientific integrity, and NLP communities. Timing coincides with AAAI 2025 LLM review pilot, increasing policy relevance.
Policy impact: Major publishers (Springer Nature, Elsevier) likely to update AI usage guidelines. Top ML conferences may require mandatory AI disclosure, code/data availability for experiment verification, and artifact badges for reproducibility. Funding agencies (NIH, NSF) may mandate human oversight for AI-assisted research.
Industry influence: Detection tool companies (Turnitin, Copyleaks, Clear Skies) will likely incorporate findings. LLM providers may use results for alignment training. Conference management systems (OpenReview) may add safeguards.
Limitations to impact: Requires sustained attention—academic fraud research is often ignored until crisis. Economic incentives (publish-or-perish) remain unchanged. International coordination for enforcement faces challenges.
The writing demonstrates strong academic voice with appropriate technical precision throughout most sections. The abstract is excellent—concisely summarizing motivation, methods, key findings (82% acceptance, concern-acceptance conflict), and implications. The conclusion is particularly well-written and accessible, effectively conveying urgency without hyperbole.
However, the paper suffers from excessive technical complexity in Sections 3.1-3.5 that creates unnecessary barriers to understanding. The methodology sections front-load heavy mathematical notation before providing intuition, making the paper less accessible than warranted given its importance to the broader scientific community.
The title is clear, informative, and appropriately provocative. "BadScientist" is memorable and effectively frames the adversarial nature. The question format engages readers while precisely specifying the research scope (Research Agent, Unsound Papers, LLM Reviewers).
The paper follows a logical progression (Introduction → Related Work → Design → Experiments → Mitigation → Conclusion), but Section 3.5 (Theoretical Reliability) disrupts narrative flow. This heavy theoretical analysis with concentration bounds and calibration proofs, while rigorous, interrupts the experimental narrative. It should be moved to Appendix with only a 2-3 sentence summary in the main text.
The mitigation section feels rushed compared to the detailed main experiments, lacking the depth and analysis present in earlier sections. This imbalance suggests uneven development across the paper's components.
Section 3.1 (Preliminaries) presents a critical clarity problem. The paper introduces heavy mathematical notation (𝒫 for paper space, ℛ for reviews, 𝒮 for seed prompts, probability simplex Δ^|M|, scoring functional g_M: ℝ^K|M| → ℝ) before providing intuitive explanation of what these represent. The scoring functional g_M is presented abstractly as a mathematical mapping when it's simply the "overall assessment score"—this should be stated upfront.
Recommendation: Rewrite to lead with intuition, then formalism. For example:
"We generate fake papers from research topics using six manipulation strategies. Multiple LLM reviewers score each paper, and we aggregate their scores to make accept/reject decisions. Formally, let 𝒫 denote the paper space..."
Section 3.4 (Threshold Calibration) mixes formalism (Equations 1-2 with percentile notation) with plain language awkwardly. Instead of Equation 2, simply write: "We set τ_rate to match ICLR 2025's historical acceptance rate (31%)."
The paper uses several terms somewhat interchangeably:
Recommendation: Create consistent vocabulary. Use "fabricated" exclusively for papers, "integrity concern" consistently, and define "unsound" explicitly (currently only implied).
Inconsistent tone: The introduction is dramatic ("stark," "pervasive," "systematically fails"), the results section is appropriately measured, but the conclusion returns to dramatic language ("critical vulnerability," "integrity of scientific knowledge itself is at stake"). While urgency is justified, the tonal shifts are noticeable.
Colloquial language: "Flag-happy" (describing o3's sensitivity) is too colloquial for academic writing. Use "more sensitive to integrity concerns" or "more conservative in flagging issues."
Sentence length variation: Some sentences exceed 40 words, particularly in Section 3, making them difficult to parse. Example: "Our generator employs presentation-manipulation strategies requiring no real experiments, instead employing five presentation-manipulation strategies: exaggerating performance gains (TooGoodGains), cherry-picking..." This sentence could be split into two for improved clarity.
Current balance: Technical Depth: 8/10, Accessibility: 5/10. The paper is readable by ML researchers but too dense for the broader audience this important message should reach. The heavy mathematical front-loading in Section 3.1, concentration bounds interrupting narrative flow in 3.5, and jargon density ("Sub-Gaussian noise," "Lipschitz aggregation," "isotonic regression") without brief explanations create barriers.
Recommendation: Add plain-language summary boxes after formal definitions. For example, after Section 3.1's formal framework, add:
In plain terms: We generate fake papers from research topics using six manipulation strategies. Multiple LLM reviewers score each paper, and we aggregate their scores to make accept/reject decisions.
1. Addresses Critical, Timely Problem - The convergence of AI research agents and LLM reviewers creating automated publication loops represents a genuine threat to scientific integrity. The timing is perfect as venues pilot LLM review systems.
2. Novel Discovery: Concern-Acceptance Conflict - The finding that reviewers simultaneously flag integrity issues yet recommend acceptance is genuinely surprising and reveals a fundamental architectural flaw in LLM decision-making. This concept alone merits publication.
3. Rigorous Theoretical Framework - The concentration bounds (Theorem 1, Corollary 2) and calibration analysis (Propositions 1-2) with formal proofs represent exceptional rigor. Figure 3's empirical validation of theoretical bounds demonstrates theory-practice alignment.
4. Comprehensive Multi-Model Evaluation - Testing three different LLMs (o3, o4-mini, GPT-4.1) reduces single-model biases and reveals important model-specific behaviors (o4-mini's permissiveness, o3's sensitivity, GPT-4.1's conservativeness).
5. Reproducible Framework - The five atomic manipulation strategies (TooGoodGains, BaselineSelect, StatTheater, CoherencePolish, ProofGap) provide a systematic, reproducible framework that other researchers can adapt and extend.
6. Transparent Ethical Considerations - The paper openly discusses dual-use concerns, limits artifact release responsibly, and addresses responsible disclosure—setting a good precedent for security research.
7. Clear Results Presentation - Figure 2 (score distributions) and Table 1 (acceptance/ICR rates) effectively communicate findings. The results narrative explains numbers rather than merely listing them, making interpretation accessible.
1. No Human Reviewer Baseline - The most critical flaw. Without comparing how human reviewers perform on the same fabricated papers, we cannot determine if 82% acceptance reflects AI-specific vulnerabilities or general review system weaknesses. This fundamentally limits the paper's conclusions about LLM reviewer vulnerabilities specifically.
2. No Legitimate Paper Control Group - Without evaluating legitimate papers through the same review system, we cannot determine if 82% acceptance is anomalous. The comparison to ICLR's 32% overall acceptance rate is suggestive but insufficient: we need a comparison using the same review system for both legitimate and fabricated papers.
3. Absent Statistical Reporting - No confidence intervals, sample sizes, or significance tests for any experimental results despite claiming formal error guarantees. This disconnect between rigorous theory and insufficient empirical reporting is the paper's most severe technical weakness.
4. Limited Model Selection - Only three reviewer models, all from one provider (OpenAI). Missing major LLM families (Claude, Gemini, Llama) severely limits generalization claims. The GPT-5 version and access dates are not specified and require clarification.
5. Single-Domain Calibration - Calibration exclusively on ICLR 2025 (AI/ML conference) with no testing on other scientific domains (medicine, physics, biology) means results may not generalize to other fields with different review cultures.
6. Unclear Timeline - The arXiv identifier (2510.xxxx, YYMM format) indicates an October 2025 submission, which postdates the January 2025 release of ICLR 2025 decisions, but the paper does not state when the calibration data were collected or how calibration was performed. This requires clarification.
7. Underdeveloped Mitigation Section - Only two mitigation strategies tested with insufficient detail. Missing comparisons to established detection methods from adversarial ML and AI security literature. No adaptive attacker scenarios where generators know about and attempt to evade detection.
8. Overly Technical Presentation - Sections 3.1-3.5 place heavy notation before intuition, creating unnecessary barriers. The paper is less accessible than warranted given its importance to policy makers and the broader scientific community.
1. Add Human Reviewer Baseline (Essential)
2. Include Legitimate Paper Control Group (Essential)
3. Add Statistical Rigor to Experimental Results (Essential)
4. Clarify Model Selection and Availability (Critical)
5. Fully Specify Calibration Methodology (Critical)
6. Define All Metrics Formally (Critical)
7. Expand and Detail Mitigation Strategies (Major)
8. Cross-Venue Validation
9. Expand LLM Model Coverage
10. Improve Accessibility (Important)
11. Address Multiple Comparisons (Important)
12. Enhance Reproducibility (Important)
13. Fix Formatting Issues
14. Add Concrete Examples
15. Improve Cross-Referencing
16. Expand Strategy Motivation
17. Clarify "All" Strategy
18. Address Limitations Earlier
Figure 1 (Framework Overview): ★★★★☆ This visual summary of the two-agent system is excellent for conveying the high-level approach. The clear flow from Paper Agent → Review Agent with color coding effectively communicates the adversarial setup. However, some font sizes are too small for comfortable reading, and the placement of "GPT-5 checking for integrity concerns" feels like an afterthought in the layout. Consider adding numbered workflow steps (1→2→3) to further clarify the process flow.
Figure 2 (Score Distributions): ★★★★★ This is an exceptional visualization—arguably the paper's best figure. The consistent 3×6 grid layout (3 models × 6 strategies) with clear threshold line marking acceptance boundary makes cross-condition comparisons intuitive. The visual immediately communicates that o4-mini is right-shifted (more permissive), o3 shows larger variance with fatter tails, and GPT-4.1 is more conservative. Minor improvement: Y-axis could say "Frequency" instead of "Count," and consider adding mean scores as vertical lines.
Figure 3 (Theoretical Validation): ★★★★☆ The three-panel layout effectively demonstrates that theoretical bounds hold empirically—a critical validation of the framework's rigor. However, this is dense technical content that may need more caption explanation for accessibility.
Table 1 (ACPT-ICR Main Results): ★★★★★ This is outstanding presentation of extensive information in a clear, scannable format. The dual threshold comparison (τ_rate vs τ_0.5) provides valuable sensitivity analysis. The ICR-m columns for individual models plus ICR@M aggregate are comprehensive. However, the table would benefit from:
Table 2 (Concern-Acceptance Conflict): ★★★★☆ The simple 3×6 layout directly addresses the key paradox and is easy to scan. Percentages clearly show the concerning pattern. Critical missing element: no sample size indicators. A cell showing "100%" could represent 10/10 papers or 1/1 paper—the precision is dramatically different. Add N values in parentheses: "100% (10/10)" or as a separate row.
The table would also benefit from a "Total" column showing the overall conflict rate per model, and from a clarification of whether 0.0% cells (e.g., GPT-4.1/s4) indicate zero samples or a true 0% conflict rate.
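To illustrate why the sample sizes matter, here is a minimal sketch that computes a Wilson score interval for a reported proportion. The function name `wilson_interval` and the counts 1/1 and 10/10 are illustrative assumptions taken from the hypothetical example above, not values from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical cells that would both be printed as "100%" in the table.
for k, n in [(1, 1), (10, 10)]:
    low, high = wilson_interval(k, n)
    print(f"{k}/{n}: observed 100%, 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions, the interval for 1/1 spans roughly 0.21 to 1.00, while the interval for 10/10 spans roughly 0.72 to 1.00, which is why reporting N alongside each percentage changes how a cell should be read.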
Tables 3-4 (Mitigation Results): ★★★☆☆ These tables present appropriate metrics (TPR, FPR, Accuracy, F1) and, importantly, include a random-guess baseline. However, the format is inconsistent with Tables 1-2, creating visual discontinuity. Table 4 is very dense, and some columns are redundant: for a fixed class balance, Accuracy and F1 are fully determined by TPR and FPR. Recommendation: Combine Tables 3-4 into a single comprehensive mitigation table with better organization. Consider whether all metrics are necessary or whether Accuracy + F1 would suffice.
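The redundancy among these metrics can be made concrete with a short sketch. It derives Accuracy, Precision, and F1 from TPR, FPR, and the fraction of fabricated papers in the evaluation set. The function name `detection_metrics` and all numeric inputs are hypothetical and are not taken from the paper's tables.

```python
def detection_metrics(tpr: float, fpr: float, pos_frac: float) -> dict:
    """Derive Accuracy, Precision and F1 from TPR, FPR and class balance.

    pos_frac is the fraction of positive (fabricated) items in the set.
    All confusion-matrix cells are expressed as fractions of the whole set.
    """
    tp = tpr * pos_frac
    fn = (1 - tpr) * pos_frac
    fp = fpr * (1 - pos_frac)
    tn = (1 - fpr) * (1 - pos_frac)
    accuracy = tp + tn
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1}


# Hypothetical detector on a balanced set, next to a random-guess baseline.
print(detection_metrics(tpr=0.6, fpr=0.4, pos_frac=0.5))
print(detection_metrics(tpr=0.5, fpr=0.5, pos_frac=0.5))
```

Because the remaining columns follow from TPR, FPR, and the class balance, a combined table could report the class balance once and keep only the metrics that readers are expected to compare directly.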
The most significant issue is not figure/table design but what data is absent:
Missing visualizations that would strengthen the paper:
Missing data that should be reported:
The related work section is thorough, positioning BadScientist at the intersection of four research areas: (1) AI-generated text and detection, (2) LLM-based peer review, (3) AI research agents and "AI scientists," and (4) scientific misconduct and paper mills. This multi-disciplinary positioning is appropriate and well-executed.
Strengths:
Key citations present:
Adversarial ML Literature: While the paper claims novelty in adversarial testing of LLM review systems, it could more thoroughly engage with the adversarial ML literature on:
AI Safety and Alignment: The paper addresses alignment failures (concern-acceptance conflict) but doesn't cite core alignment literature. Relevant works:
Peer Review Research: The paper could benefit from citing traditional peer review research:
Detection and Verification Methods: Missing references to:
Scientific Integrity Policy: Could strengthen with references to:
Strengths:
Concerns:
The related work effectively motivates the paper by showing:
The transitions from related work to the paper's contributions are clear. However, the paper could more explicitly discuss how its findings relate back to this literature in the discussion/conclusion. For example:
Essential additions:
Valuable additions:
5. Traditional peer review research on reliability and biases
6. Detection method literature from adversarial ML
7. Publisher and funding agency policies on AI in research
8. Multi-agent debate and uncertainty quantification methods
Format suggestions:
9. Create a related work comparison table showing how BadScientist differs from each cited system
10. Add a timeline figure showing convergence of AI research agents and LLM review systems
This paper addresses a critical, timely problem at the intersection of AI safety and scientific integrity with genuinely novel contributions (concern-acceptance conflict discovery, rigorous adversarial evaluation framework, quantitative vulnerability assessment). However, fundamental methodological gaps prevent acceptance in its current form. Most critically, the absence of human reviewer baselines and legitimate paper control groups makes it impossible to determine whether the findings reflect LLM-specific vulnerabilities or general review system weaknesses. Additionally, the disconnect between the rigorous theoretical foundations and the insufficient statistical reporting in the experimental results substantially limits the strength of the conclusions.
Why not "Minor Revisions": The required changes are substantial and fundamental:
These are not cosmetic changes but core methodological improvements that would substantially strengthen the scientific foundation. Estimated revision time: 3-6 months.
Why not "Reject": Despite methodological limitations, the core contributions are valuable:
Essential changes (must address):
Important changes (should address):
8. Cross-venue validation beyond ICLR
9. Include diverse LLM families (Claude, Gemini, Llama)
10. Improve accessibility by restructuring Sections 3.1-3.5
11. Add concrete fabricated paper examples
12. Enhance reproducibility with code, data, prompts
With these revisions, this work would merit acceptance at a top-tier venue (NeurIPS, ICLR, AAAI, FAccT).
If properly revised, this work would be highly influential:
Research impact: Will likely become a foundational reference for:
Expected high citation count (100+ citations within 2 years) given the intersection of hot topics (AI safety, scientific integrity, LLM evaluation) and policy relevance.
Policy impact:
Practical impact:
Top-tier AI venue criteria (NeurIPS, ICML, ICLR, AAAI):
✓ Novel problem formulation - First systematic adversarial evaluation of LLM review systems
✓ Methodological rigor - Theoretical framework with formal guarantees (when revised)
✓ Significant findings - Concern-acceptance conflict is genuinely surprising
✓ Broad impact - Affects AI safety, publishing systems, policy
✗ Technical quality - Currently limited by missing baselines and statistical issues
✓ Timeliness - Critical as AI review systems deploy
Current state: 3.5/5 stars - valuable contribution with significant limitations
After revisions: 4.5/5 stars - strong paper meriting top-tier publication
Primary venue recommendation: NeurIPS or ICLR given:
Alternative venues:
Timeline recommendation:
The paper would merit acceptance if revised to:
These revisions would transform the paper from "interesting but limited" to "significant contribution with practical impact."
This work tackles a genuinely important problem that sits at the dangerous intersection of AI capability advancement and scientific integrity. The concern-acceptance conflict discovery is novel and reveals a fundamental architectural limitation in how LLMs make decisions. The theoretical framework is rigorous, the findings are concerning, and the timing is critical.
However, the paper in its current form makes claims that aren't fully supported by the evidence presented. Adding human baselines, legitimate paper controls, and comprehensive statistical analysis would substantially strengthen the scientific foundation and enable much stronger conclusions about LLM reviewer vulnerabilities specifically.
The core insight is valuable; the execution needs strengthening. With major revisions addressing the identified gaps, this would be an influential paper that shapes how the scientific community approaches AI-assisted research and review. The work deserves publication—but not yet.
Recommendation: Major Revisions with encouragement to resubmit after addressing critical issues.
Methodology: 6/10 - Conceptually sound but missing critical controls (human baseline, legitimate papers)
Results: 5/10 - Important findings but insufficient statistical reporting
Novelty: 9/10 - Highly novel research question and concern-acceptance conflict discovery
Significance: 9/10 - Critical importance to AI safety and scientific integrity
Writing: 7/10 - Generally strong but overly technical in places
Figures/Tables: 8/10 - Excellent visualizations but missing data (CIs, sample sizes)
Overall: 7/10 - Valuable contribution requiring substantial methodological strengthening
Estimated impact if properly revised: Very High (top 10% of papers in AI safety/integrity intersection)