BadScientist: A Critical Vulnerability in AI-Driven Peer Review

Comprehensive Peer Review of arXiv 2510.18003v1

Overall Recommendation: Major Revisions Required

Bottom Line: This paper makes a timely, novel, and important contribution by revealing systematic vulnerabilities in LLM-based peer review systems. The finding that fabricated papers achieve 82% acceptance rates—and the discovery of "concern-acceptance conflict" where reviewers flag integrity issues yet recommend acceptance—represents a significant warning for scientific publishing. However, critical methodological gaps, particularly the absence of human reviewer baselines and insufficient statistical reporting, substantially limit the strength of conclusions. With substantial revisions addressing these issues, this work has the potential for high impact at a top-tier venue.


1. Summary of Main Contributions and Research Question

The paper investigates a critical question at the intersection of AI safety and scientific integrity: Can AI research agents generate convincing but unsound papers that deceive LLM-based review systems? This addresses an urgent concern as the scientific community increasingly adopts LLM-powered research assistants for paper generation and LLM-based systems for peer review, creating potential for fully automated "AI-only publication loops" without human oversight.

Primary Contributions:

The BadScientist framework consists of two adversarial components: a Paper Agent that generates fabricated papers using five presentation-manipulation strategies (TooGoodGains, BaselineSelect, StatTheater, CoherencePolish, ProofGap) requiring no real experiments, and a Review Agent that evaluates papers using multiple LLM models (o3, o4-mini, GPT-4.1) calibrated against ICLR 2025 data.

Key empirical findings include fabricated papers achieving acceptance rates up to 82.0%, with the discovery of a novel "concern-acceptance conflict" phenomenon where reviewers frequently flag integrity issues yet assign acceptance-level scores. Tested mitigation strategies (Review-with-Detection and Detection-Only) show limited effectiveness, with detection accuracy barely exceeding random chance (57-67% vs. 50% baseline).

Theoretical contributions include formal error guarantees through concentration bounds (Theorem 1: Bernstein-McDiarmid bounds for ensemble scoring) and calibration analysis (Propositions 1-2 on threshold selection), with empirical validation demonstrating these bounds hold in practice.


2. Assessment of Methodology

Experimental Design: Strengths and Critical Gaps

The BadScientist framework is well-conceived conceptually. The bi-directional adversarial setup (generator vs. reviewer) appropriately mirrors real attack scenarios, and the multi-model review approach (o3, o4-mini, GPT-4.1) reduces single-model biases. The use of formal error guarantees through concentration bounds adds theoretical rigor often absent in adversarial ML research.

However, critical experimental design flaws severely limit interpretability:

Missing Controls: The paper lacks essential baseline comparisons. Most critically, there is no evaluation of how human reviewers perform on the same fabricated papers, making it impossible to determine whether the 82% acceptance rate reflects AI-specific vulnerabilities or general review system weaknesses. Additionally, no legitimate papers are evaluated through the same review system to establish baseline acceptance rates. This fundamental omission means we cannot determine if 82% acceptance is anomalous.

Ecological Validity Concerns: The laboratory setting with seed topics and single-shot reviews doesn't reflect real peer review, which includes author rebuttals, reviewer discussions, and meta-reviews. The paper doesn't test whether adversaries could adapt strategies based on rejections. Moreover, results may be highly sensitive to exact prompts used (not disclosed in detail), raising questions about reproducibility and generalizability.

Single-Domain Testing: Calibration exclusively on ICLR 2025 (AI/ML conference) raises serious generalization concerns. Different scientific fields have vastly different review cultures, standards, and susceptibility to manipulation. Medical journals requiring data availability, for instance, might show dramatically different vulnerability patterns.

Manipulation Strategies: Realistic but Incomplete

The five atomic strategies show mixed realism:

TooGoodGains and BaselineSelect are well-documented phenomena in actual scientific misconduct, reflecting real pressure to publish extraordinary results and known cherry-picking behaviors.

StatTheater is partially problematic: while polished presentation with "forthcoming" repository links reflects real manipulation, fabricating precise p-values without actual experiments risks detection through statistical impossibility. Real statistical tests on fabricated data may produce internally inconsistent results that careful reviewers should catch.

CoherencePolish (professional formatting) is concerning as a manipulation strategy. That this improves acceptance suggests the review system is evaluating surface-level presentation rather than content validity—a finding about reviewer quality, not paper convincingness.

ProofGap is critically underspecified. The paper doesn't detail how proof gaps are concealed, yet this is arguably the most intellectually challenging manipulation and needs substantial elaboration.

Critical Missing Elements: No strategy addresses the weaknesses that fabricated papers would be expected to exhibit: a lack of reproducible methods, inconsistent internal logic across sections, or an absence of the domain-specific technical depth that experts would recognize. The strategies also omit adversarial evasion techniques (e.g., prompt injection aimed at detection systems).

LLM Model Selection Issues

Model naming and timeline: The paper lists models as "o3, o4-mini, GPT-4.1, GPT-5". The arXiv identifier (2510.18003) follows the YYMM scheme and therefore indicates an October 2025 submission, so all four models had been publicly released before submission and the list is temporally consistent. The paper should nonetheless state when each model was accessed, since hosted models change over time.

Selection justification absent: No rationale explains why these specific models were chosen. Critically, the paper omits major LLM families including Claude (Anthropic), Gemini (Google), and Llama (Meta), severely limiting conclusions about LLM reviewers generally. The paper doesn't specify exact model versions with timestamps, making replication difficult. Additionally, why is GPT-5 (ostensibly the most capable model) relegated to integrity checking rather than primary review?

Calibration Approach: Rigorous but Problematic

Strengths: Using ICLR 2025 data (11,565 submissions, 32.08% acceptance rate) provides substantial calibration data from a prestigious venue with rigorous standards. The formal approach using concentration bounds is technically sound.

Critical concerns:

Temporal validity: The arXiv identifier indicates an October 2025 submission, well after ICLR 2025 decisions were released in January 2025, so the calibration data were available to the authors. The paper should still state when the data were collected and from which source.

Distribution mismatch: ICLR is domain-specific (ML/AI). Calibrating on ICLR may not generalize to biology, medicine, physics, or social sciences where review cultures differ fundamentally.

Potential contamination: ICLR 2025 reportedly introduced an LLM review feedback agent (documented in related research). If calibration data includes LLM-influenced reviews, this creates circular contamination that undermines the baseline.

Methodological opacity: The paper never specifies HOW calibration was performed—which review aspects were calibrated (scores, decision boundaries, concern thresholds)? How many ICLR papers were used? Were fabricated papers matched to ICLR topic distribution? This lack of transparency is a major reproducibility issue.

Statistical Methods: Theoretical Excellence, Empirical Deficiency

Concentration bounds (Hoeffding, McDiarmid) are appropriate for providing finite-sample guarantees, and the theoretical analysis (Appendix A.2) is exemplary with rigorous proofs and empirical validation (Figure 3). However, standard concentration bounds assume i.i.d. samples, but review outcomes may be correlated (same LLM, similar prompts), and papers from the same seed topics may share features. The paper doesn't address this dependence structure.
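
A minimal sketch of what such a finite-sample guarantee looks like in practice, using the plain two-sided Hoeffding inequality rather than the paper's own theorem. The 1-10 score range and the sample sizes are illustrative assumptions, and the bound is valid only if the scores are independent, which is the assumption questioned above.

```python
import math


def hoeffding_half_width(n: int, lo: float, hi: float, delta: float = 0.05) -> float:
    """Return t such that P(|mean - E[mean]| >= t) <= delta.

    Uses the two-sided Hoeffding inequality for the mean of n independent
    scores bounded in [lo, hi], i.e. solves
    2 * exp(-2 * n * t**2 / (hi - lo)**2) = delta for t.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    # Hypothetical setup: review scores on a 1-10 scale.
    for n in (10, 50, 200):
        print(n, round(hoeffding_half_width(n, 1.0, 10.0), 3))
```

If reviews are positively correlated, the effective sample size is smaller than n, so the real uncertainty would be wider than this bound suggests.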

Calibration analysis is partially appropriate but incomplete. While calibration is crucial for LLM evaluation, the paper doesn't distinguish between model calibration (confidence matches accuracy), cross-calibration between LLMs, and calibration to human review distributions. Standard calibration metrics (Expected Calibration Error, Maximum Calibration Error, reliability diagrams) are not reported.
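
For reference, Expected Calibration Error can be computed in a few lines. The sketch below assumes each review yields a confidence in [0, 1] and a binary correctness label; both inputs and the equal-width binning are assumptions for illustration, not the paper's setup.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average over equal-width bins of |accuracy - mean confidence|.

    confidences: floats in [0, 1]; correct: 0/1 labels of the same length.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, label in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, label))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical reviewer that is 90% confident but right only 1 time in 4.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

A reliability diagram plots the same per-bin accuracy against mean confidence, so both could be reported from one pass over the data.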

Critical missing methods:

  • No statistical power analysis demonstrating sample sizes are adequate
  • No multiple testing corrections despite testing 6 strategy settings (five atomic strategies plus the combined "All" setting) × 3 models = 18 comparisons, creating high risk of false positives
  • No bootstrapping for robust confidence intervals
  • No sensitivity analysis showing results are robust to statistical assumptions
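
As an illustration of the bootstrapping item above, a percentile bootstrap interval for an acceptance rate needs only the per-paper accept/reject outcomes. The 41-of-50 data below are hypothetical, since the paper reports no sample sizes.

```python
import random


def bootstrap_ci(outcomes, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical: 41 accepted papers out of 50 (82%).
    outcomes = [1] * 41 + [0] * 9
    print(bootstrap_ci(outcomes))
```

Resampling individual papers treats them as independent; if papers share seed topics, resampling whole topics would be the more conservative choice.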

Evaluation Metrics: Adequate but Incomplete

ACPT (Acceptance Rate) is a direct measure of the research question but has significant limitations. It is a binary metric that does not capture degrees of reviewer confidence. Most critically, without acceptance rates for legitimate papers in the same system, we cannot interpret whether 82% is anomalous. The paper also doesn't report score distributions, and not all acceptances are equal (a score of 6.5 vs. 8.0 matters).

ICR (Integrity Concern Rate) definition is unclear. Based on context, it measures when reviewers flag concerns yet accept papers, but the operational definition is never formally stated. What counts as "flagging" a concern? Are all concerns weighted equally? This non-standard metric needs precise mathematical definition.
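
One possible formalization is sketched here as an assumption, since the paper's operational definition is not stated. Let c_i ∈ {0,1} indicate that the review of paper i flags at least one integrity concern and a_i ∈ {0,1} indicate an acceptance-level score, for i = 1, ..., N. The name ConflictRate is introduced only for this illustration.

```latex
\mathrm{ICR} = \frac{1}{N}\sum_{i=1}^{N} c_i ,
\qquad
\mathrm{ConflictRate} = \frac{\sum_{i=1}^{N} c_i \, a_i}{\sum_{i=1}^{N} c_i}
```

Under this reading, ICR is the share of reviews that raise a concern and the conflict rate is the share of those reviews that still accept. The authors should confirm or correct both definitions, and state whether concerns are weighted.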

Critical missing metrics:

  • Review quality measures (length, specificity, concern-to-score correlation)
  • Inter-model agreement (Fleiss' kappa)
  • Detection metrics (precision, recall, F1, ROC curves)
  • False positive rates for legitimate papers
  • Human-LLM comparison metrics
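
To show how little is needed for the inter-model agreement item above, here is a sketch of Fleiss' kappa for several reviewers assigning each paper to one category. The three-reviewer, accept/reject data are hypothetical.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    that assigned item i to category j. Every row must sum to the same
    number of raters (at least 2)."""
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("ratings must be non-empty")
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(ratings[0])
    # Overall proportion of assignments falling in each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Hypothetical: 4 papers, 3 LLM reviewers, columns = [accept, reject].
    print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))
```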

Mitigation Strategies: Underspecified and Limited Scope

The paper tests only two mitigation approaches with insufficient detail. Based on naming, ReD appears to be "Review-with-Detection" (bundling conventional review with mandatory AI-content detection) and DetOnly is "Detection-Only" (pure detection without review), but exact implementations are not fully specified.

Critical limitations:

Missing essential baselines: Should compare against statistical forensics (detecting impossible p-value distributions), plagiarism detection (template reuse), adversarial training (training reviewers on fabricated examples), multi-stage review with human verification, and ensemble detection methods.

No adversarial robustness: Are mitigations robust to adaptive attackers? If generators know about ReD/DetOnly, can they evade them? No adversarial evasion testing is reported.

Missing from security literature: The paper doesn't incorporate detection methods from adversarial ML research (input sanitization, uncertainty quantification, anomaly detection) or multi-agent debate systems shown to improve LLM accuracy.


3. Evaluation of Results and Interpretation

Statistical Reporting: Critical Deficiencies

The paper's most severe weakness is the disconnect between rigorous theoretical foundations and insufficient empirical reporting. While Appendix A.2 provides exceptional theoretical analysis with formal proofs and guarantees, the main experimental results lack fundamental statistical measures.

Critical omissions in all results tables:

  • No confidence intervals for any acceptance rates or ICR values despite the paper claiming formal error guarantees
  • No sample sizes reported per strategy/model combination, making it impossible to assess statistical reliability
  • No error bars on Figure 2 (score distributions)
  • No variance metrics (standard deviations, standard errors, interquartile ranges)

According to standard statistical reporting guidelines (APA, CONSORT), all effect estimates must include point estimate, 95% confidence interval, sample size, and effect size measure. This paper provides only point estimates.

Example of the problem: The abstract reports "acceptance rates up to 82.0%" but we don't know if this represents 82/100 papers (95% CI: 73-89%) or 41/50 papers (95% CI: 68-92%)—a critical difference for interpretation.
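
The dependence on sample size can be checked with a Wilson score interval, one common choice for binomial proportions. The counts are the hypothetical ones from the example above, and other interval methods (for instance an exact interval) can differ by a point or two.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Same 82% point estimate, two hypothetical sample sizes.
    for k, n in ((82, 100), (41, 50)):
        lo, hi = wilson_interval(k, n)
        print(f"{k}/{n}: {lo:.3f} to {hi:.3f}")
```

With these inputs the intervals come out at roughly 73-88% for 82/100 and 69-90% for 41/50, so halving the sample widens the interval by several points on each side.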

Statistical Significance Testing: Completely Absent

The paper reports no statistical significance tests comparing:

  • Acceptance rates across strategies (67.0% vs 82.0% vs 32.0%)
  • ICR rates across models (50.6% vs 2.3% vs 4.7%)
  • Concern-acceptance conflict rates (100% vs 33% vs 50%)

Without significance testing, we cannot determine if reported differences are real effects or sampling variance. For instance, comparing s1 acceptance across models (o3=67.0%, o4-mini=82.0%, GPT-4.1=38.4%): is this difference statistically significant? If N=50 each, a chi-square test of independence on these rates would give χ²(2)≈20.9, p<0.0001, but the paper never reports such a test.
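
A sketch of the test in question, a Pearson chi-square test of independence on a models × accept/reject table. N=50 per model is hypothetical, and the counts are simply rate × N (hence non-integer), which itself shows why true sample sizes must be reported.

```python
import math


def chi_square_accept_rates(accepted, totals):
    """Pearson chi-square statistic and degrees of freedom for comparing
    acceptance rates across groups (a k x 2 contingency table)."""
    if len(accepted) != len(totals) or len(totals) < 2:
        raise ValueError("need at least two groups of matching length")
    pooled = sum(accepted) / sum(totals)
    if pooled in (0.0, 1.0):
        raise ValueError("test is undefined when all outcomes are identical")
    stat = 0.0
    for acc, n in zip(accepted, totals):
        exp_acc = n * pooled
        exp_rej = n * (1.0 - pooled)
        stat += (acc - exp_acc) ** 2 / exp_acc
        stat += ((n - acc) - exp_rej) ** 2 / exp_rej
    return stat, len(totals) - 1


def chi2_sf_two_df(stat: float) -> float:
    """Upper-tail p-value for a chi-square variable with exactly 2 degrees
    of freedom, where the survival function is exp(-x / 2)."""
    return math.exp(-stat / 2.0)


if __name__ == "__main__":
    rates = (0.670, 0.820, 0.384)   # o3, o4-mini, GPT-4.1 on s1
    n_per_model = 50                # hypothetical sample size
    stat, df = chi_square_accept_rates([r * n_per_model for r in rates],
                                       [n_per_model] * 3)
    print(stat, df, chi2_sf_two_df(stat))
```

The closed-form p-value applies only to the three-group case (2 degrees of freedom); other table sizes would need a general chi-square survival function.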

Multiple comparison problem: With 6 strategy settings × 3 models = 18 acceptance rate comparisons plus multiple outcome metrics, about 0.9 false positives would be expected at α=0.05 even if no true differences existed (18 × 0.05). The paper applies no multiple testing corrections (Bonferroni, FDR), creating high risk of reporting spurious findings. The emphasis on "up to 82.0%" (maximum across all combinations) raises cherry-picking concerns.
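
The two corrections named above can be sketched as follows. The p-values in the usage example are made up for illustration; none come from the paper.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q: float = 0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery
    rate at level q. Returns a list of booleans (True = rejected) in the
    original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(expected_false_positives(18))   # 18 comparisons at alpha = 0.05
    print(bonferroni_threshold(18))
    # Hypothetical p-values for four comparisons.
    print(benjamini_hochberg([0.001, 0.02, 0.04, 0.5]))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg trades some false-positive control for power, which may suit an exploratory 18-cell grid better.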

Results Interpretation: Partially Justified

Claim: "82% acceptance rate" - Evidence exists (Table 1) but interpretation is weak. The figure comes from a single cell: the o4-mini reviewer on the s1 (TooGoodGains) strategy. Without confidence intervals or comparison to legitimate paper baselines, we cannot assess whether this is genuinely anomalous. The claim needs context: "82.0% acceptance [95% CI: XX-XX%, N=XX papers] for TooGoodGains strategy by o4-mini reviewer, significantly higher than both o3 (67.0%, p=XX) and the ICLR baseline acceptance rate of 31%."

Claim: "Concern-acceptance conflict" - This phenomenon is well-documented in Table 2, showing reviewers flag integrity issues yet assign acceptance-level scores. However, "frequently" is vague when o4-mini shows 100% conflict but GPT-4.1 shows 0-75% depending on strategy. The quantification is present but incomplete without sample sizes (100% could be 10/10 or 1/1 papers). Justification is moderate—phenomenon clearly exists but magnitude requires more rigorous statistical support.

Claim: "Detection accuracy barely exceeding random chance" - This is misleading. Table 4 shows ReD achieves 67% accuracy vs. 50% random baseline, a 17 percentage point gain (+34% relative; Cohen's h ≈ 0.35, a small-to-medium effect). While 67% is indeed poor for production deployment, "barely exceeding random" understates the actual improvement and is not supported by the data.
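
For transparency, the effect size quoted above can be reproduced with the standard arcsine-difference formula for Cohen's h. The function name is illustrative.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for two proportions:
    h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    # ReD (o3) detection accuracy vs. a random-guess baseline.
    print(round(cohens_h(0.67, 0.50), 3))
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), this value sits between small and medium.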

High Acceptance Rates in Context

Crucial missing context: The paper never compares fabricated paper acceptance (82%) to legitimate paper acceptance through the same review system. Top AI conferences typically show 20-30% acceptance (ICLR 2025: 31%). If legitimate papers achieve 31% acceptance and fabricated papers achieve 82%, this represents a +51 percentage point difference (a 165% relative increase, or roughly 2.6 times the baseline), suggesting a large effect size (Cohen's h ≈ 1.1). But without this baseline comparison, the 82% finding lacks proper interpretation.

Concern-Acceptance Conflict: Novel and Important

This discovery is a highlight of the paper—the finding that reviewers simultaneously flag integrity concerns yet assign acceptance-level scores reveals a fundamental architectural flaw in how LLMs generate concerns and scores independently without consistency enforcement. Table 2 effectively quantifies this, showing o4-mini exhibits 100% conflict for some strategies while even "strict" o3 shows 26-52% conflict.

The statistical issue is that sample sizes are needed to interpret these rates properly. If o4-mini shows 100% conflict, we must know: "100% (N=X papers with concerns out of Y total s1 papers) [Wilson 95% CI: XX%-100%]." Without this, the finding, while directionally clear, lacks statistical rigor for publication.

Mitigation Results: Underdeveloped

The claim that mitigation strategies show "marginal improvements" is imprecise. Quantitatively, ReD (o3) achieves 67% accuracy vs. 50% random, a 17 percentage point improvement that is modest but meaningful. The paper provides no significance tests to confirm this exceeds random (though if n=100, an exact binomial test would give p<0.001).
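
The exact binomial test mentioned above is short enough to sketch directly. The sample size of 100 is hypothetical, since the paper does not report it.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical: 67 correct detections out of 100 against a 50% chance rate.
    one_sided = binomial_upper_tail(67, 100)
    # For p = 0.5 the distribution is symmetric, so doubling gives the
    # two-sided p-value.
    print(one_sided, 2.0 * one_sided)
```

With a smaller evaluation set the same 67% could easily fail to reach significance, which is another reason the actual n needs to be stated.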

More critically, the mitigation section feels rushed compared to the detailed main experiments. Tables 3-4 lack the depth of analysis in earlier sections, and no comparison to established detection methods from the literature is provided.


4. Analysis of Novelty and Significance to the Field

Research Novelty: Highly Novel and Timely

The research question is genuinely novel, addressing a unique convergence point not previously studied. While related areas exist (AI-generated text detection, LLM peer review systems, AI scientists, paper mills), BadScientist is the first to systematically study the adversarial interplay between fabrication-oriented research agents and LLM review systems.

Distinction from prior work:

  • AI-generated text detection research (Weber-Wulff et al. 2023, DetectGPT) focuses on detecting AI authorship generally, not adversarial fabrication designed to deceive reviewers
  • LLM peer review research (Liang et al. 2024, Hosseini & Horbach 2023) explores LLMs as review assistants but doesn't evaluate vulnerability to adversarial manipulation
  • AI Scientist systems (Lu et al. 2024, FutureHouse) demonstrate end-to-end research automation under benign assumptions without exploring malicious use
  • Paper mill detection (Bik et al., PNAS 2025) uses image forensics on human-generated fraud, which won't work on AI-generated text

Key gap filled: Previous work treats generation and reviewing as separate problems. BadScientist uniquely evaluates the coupled system under integrity-focused attacks, testing AI-only publication loops where AI generates AND reviews without human oversight.

Framework Contribution: Significant Methodological Advance

The BadScientist framework represents a substantial methodological contribution as the first rigorous red-teaming framework specifically for scientific integrity in AI publication loops. The five atomic manipulation strategies provide a reproducible framework that can be adapted to test other review systems. Most importantly, the theoretical contributions—formal error guarantees through concentration bounds (Theorem 1, Corollary 2) and calibration analysis (Propositions 1-2)—are more rigorous than typical ad-hoc adversarial testing in AI safety literature.

The framework's practical value is substantial: it can be adapted to red-team LLM review systems before deployment, benchmark detection methods, and test mitigation strategies systematically. This addresses a critical need as venues like AAAI pilot LLM-assisted review in 2025.

Findings Novelty: Surprising and Unprecedented

High acceptance rates (82%) provide the first quantitative evidence of LLM reviewer vulnerability to fabricated papers. No prior research has documented this vulnerability with concrete numbers. The comparison to traditional paper mills (which manipulate human review at unknown rates) shows LLM reviewers may be MORE vulnerable than expected given their pattern-matching capabilities.

Concern-acceptance conflict is a highly novel discovery with no prior documentation in the literature. It suggests that LLMs generate concerns and review scores independently without consistency enforcement, a fundamental architectural limitation. The reported rates (o4-mini: 100% conflict for some strategies; o3: 26-52% conflict) are both quantified and concerning.

Low detection effectiveness (57-67% vs. 50% random) shows dramatic degradation from standard AI detection tools that typically achieve 80-95% accuracy on benign text. This aligns with adversarial ML literature showing model brittleness under attack but provides the first evidence in scientific peer review context.

Significance to AI Safety: Critical Importance

This work addresses a concrete, tangible safety risk—demonstrating potential for real harm from AI-only loops without human oversight. The finding represents an alignment failure: current LLMs fail to align detection concerns with decision-making, voicing concerns yet recommending acceptance.

Connection to AI safety literature: The work follows established adversarial robustness frameworks (CSET 2021, NIST AI 100-2 E2025) and addresses AI misuse concerns documented by He et al. (2023) on AI4Science risks. It provides rare empirical evidence quantifying failure modes rather than theoretical speculation.

Critical policy implications: AAAI's 2025 pilot for LLM-assisted review explicitly maintains human oversight—BadScientist validates this conservative approach. The findings suggest need for mandatory disclosure requirements, provenance verification systems, and defense-in-depth safeguards. As research agents become more capable (Google's AI Co-Scientist, Sakana's AI Scientist), vulnerability likely increases, making this a pressing concern.

Significance to Scientific Publishing: Urgent and Profound

The work addresses an urgent threat to scientific integrity. Recent research shows paper mills are "large, resilient, and growing rapidly" (PNAS 2025), outpacing legitimate publications in some fields. BadScientist suggests that LLMs could automate paper mill operations and scale them by perhaps 100x; the AI Scientist authors, for example, report a cost of roughly $15 per paper.

Critical vulnerability: Traditional paper mill detection uses image forensics (duplicate images, figure manipulation), but AI-generated papers exhibit no image duplication patterns, have grammatically perfect text (vs. paper mill tell-tale phrases), show internally consistent fabrications (vs. reused templates), and scale at near-zero marginal cost. This makes them MORE dangerous than traditional paper mills.

Trust erosion risk: Paper-mill articles receive median 11 citations, meaning fabrications pollute scientific literature permanently. If fabrications become indistinguishable from genuine research, scientific epistemology breaks down. This is particularly dangerous for evidence-based policy—medical guidelines and public health policy rely on scientific literature integrity.

Potential Impact: High Likelihood of Substantial Influence

Immediate research influence (1-2 years): Will likely influence NeurIPS, ICML, ICLR policies on AI-assisted submission/review. Expected high citation count from AI safety, scientific integrity, and NLP communities. Timing coincides with AAAI 2025 LLM review pilot, increasing policy relevance.

Policy impact: Major publishers (Springer Nature, Elsevier) likely to update AI usage guidelines. Top ML conferences may require mandatory AI disclosure, code/data availability for experiment verification, and artifact badges for reproducibility. Funding agencies (NIH, NSF) may mandate human oversight for AI-assisted research.

Industry influence: Detection tool companies (Turnitin, Copyleaks, Clear Skies) will likely incorporate findings. LLM providers may use results for alignment training. Conference management systems (OpenReview) may add safeguards.

Limitations to impact: Requires sustained attention—academic fraud research is often ignored until crisis. Economic incentives (publish-or-perish) remain unchanged. International coordination for enforcement faces challenges.


5. Critical Assessment of Writing Quality, Clarity, and Organization

Overall Writing Assessment: Generally Strong with Accessibility Issues

The writing demonstrates strong academic voice with appropriate technical precision throughout most sections. The abstract is excellent—concisely summarizing motivation, methods, key findings (82% acceptance, concern-acceptance conflict), and implications. The conclusion is particularly well-written and accessible, effectively conveying urgency without hyperbole.

However, the paper suffers from excessive technical complexity in Sections 3.1-3.5 that creates unnecessary barriers to understanding. The methodology sections front-load heavy mathematical notation before providing intuition, making the paper less accessible than warranted given its importance to the broader scientific community.

Title: Excellent

The title is clear, informative, and appropriately provocative. "BadScientist" is memorable and effectively frames the adversarial nature. The question format engages readers while precisely specifying the research scope (Research Agent, Unsound Papers, LLM Reviewers).

Organization and Structure: Logical but Problematic Placement

The paper follows a logical progression (Introduction → Related Work → Design → Experiments → Mitigation → Conclusion), but Section 3.5 (Theoretical Reliability) disrupts narrative flow. This heavy theoretical analysis with concentration bounds and calibration proofs, while rigorous, interrupts the experimental narrative. It should be moved to Appendix with only a 2-3 sentence summary in the main text.

The mitigation section feels rushed compared to the detailed main experiments, lacking the depth and analysis present in earlier sections. This imbalance suggests uneven development across the paper's components.

Clarity Issues: Notation Overload Before Intuition

Section 3.1 (Preliminaries) presents a critical clarity problem. The paper introduces heavy mathematical notation (𝒫 for paper space, ℛ for reviews, 𝒮 for seed prompts, probability simplex Δ^|M|, scoring functional g_M: ℝ^K|M| → ℝ) before providing intuitive explanation of what these represent. The scoring functional g_M is presented abstractly as a mathematical mapping when it's simply the "overall assessment score"—this should be stated upfront.

Recommendation: Rewrite to lead with intuition, then formalism. For example:

"We generate fake papers from research topics using five manipulation strategies and their combination. Multiple LLM reviewers score each paper, and we aggregate their scores to make accept/reject decisions. Formally, let 𝒫 denote the paper space..."

Section 3.4 (Threshold Calibration) mixes formalism (Equations 1-2 with percentile notation) with plain language awkwardly. Instead of Equation 2, simply write: "We set τ_rate to match ICLR 2025's historical acceptance rate (31%)."
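
The plain-language version of rate matching can be illustrated in a few lines: choose the score cutoff so that the target share of calibration papers clears it. This is a sketch of the general idea, not the paper's Equation 2; the function name and the synthetic scores are assumptions.

```python
def rate_matching_threshold(scores, accept_rate: float) -> float:
    """Return a cutoff tau such that accepting papers with score >= tau
    admits about accept_rate of the calibration scores.

    Tied scores at the cutoff can push the realized rate above the target.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < accept_rate <= 1.0:
        raise ValueError("accept_rate must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    k = max(1, round(accept_rate * len(ranked)))  # number of papers to admit
    return ranked[k - 1]


if __name__ == "__main__":
    # Synthetic calibration scores 1..100 and a 31% target acceptance rate.
    calibration = list(range(1, 101))
    tau = rate_matching_threshold(calibration, 0.31)
    realized = sum(s >= tau for s in calibration) / len(calibration)
    print(tau, realized)
```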

Terminology Inconsistency

The paper uses several terms somewhat interchangeably:

  • "Fabricated" vs "AI-generated" vs "unsound" for papers
  • "Integrity concerns" vs "integrity issues" vs "integrity-related concerns"
  • "Fool" (in title) vs "deceive" (in text)

Recommendation: Create consistent vocabulary. Use "fabricated" exclusively for papers, "integrity concern" consistently, and define "unsound" explicitly (currently only implied).

Writing Style Issues

Inconsistent tone: The introduction is dramatic ("stark," "pervasive," "systematically fails"), the results section is appropriately measured, but the conclusion returns to dramatic language ("critical vulnerability," "integrity of scientific knowledge itself is at stake"). While urgency is justified, the tonal shifts are noticeable.

Colloquial language: "Flag-happy" (describing o3's sensitivity) is too colloquial for academic writing. Use "more sensitive to integrity concerns" or "more conservative in flagging issues."

Sentence length variation: Some sentences exceed 40 words, particularly in Section 3, making them difficult to parse. Example: "Our generator employs presentation-manipulation strategies requiring no real experiments, instead employing five presentation-manipulation strategies: exaggerating performance gains (TooGoodGains), cherry-picking..." This sentence could be split into two for improved clarity.

Technical Depth vs. Accessibility: Too Technical

Current balance: Technical Depth: 8/10, Accessibility: 5/10. The paper is readable by ML researchers but too dense for the broader audience this important message should reach. The heavy mathematical front-loading in Section 3.1, concentration bounds interrupting narrative flow in 3.5, and jargon density ("Sub-Gaussian noise," "Lipschitz aggregation," "isotonic regression") without brief explanations create barriers.

Recommendation: Add plain-language summary boxes after formal definitions. For example, after Section 3.1's formal framework, add:

In plain terms: We generate fake papers from research topics using five manipulation strategies and their combination. Multiple LLM reviewers score each paper, and we aggregate their scores to make accept/reject decisions.


6. Identification of Strengths and Weaknesses

Major Strengths

1. Addresses Critical, Timely Problem - The convergence of AI research agents and LLM reviewers creating automated publication loops represents a genuine threat to scientific integrity. The timing is perfect as venues pilot LLM review systems.

2. Novel Discovery: Concern-Acceptance Conflict - The finding that reviewers simultaneously flag integrity issues yet recommend acceptance is genuinely surprising and reveals a fundamental architectural flaw in LLM decision-making. This concept alone merits publication.

3. Rigorous Theoretical Framework - The concentration bounds (Theorem 1, Corollary 2) and calibration analysis (Propositions 1-2) with formal proofs represent exceptional rigor. Figure 3's empirical validation of theoretical bounds demonstrates theory-practice alignment.

4. Comprehensive Multi-Model Evaluation - Testing three different LLM models (o3, o4-mini, GPT-4.1) reduces single-model biases and reveals important model-specific behaviors (o4-mini's permissiveness, o3's sensitivity, GPT-4.1's conservativeness).

5. Reproducible Framework - The five atomic manipulation strategies (TooGoodGains, BaselineSelect, StatTheater, CoherencePolish, ProofGap) provide a systematic, reproducible framework that other researchers can adapt and extend.

6. Transparent Ethical Considerations - The paper openly discusses dual-use concerns, limits artifact release responsibly, and addresses responsible disclosure—setting a good precedent for security research.

7. Clear Results Presentation - Figure 2 (score distributions) and Table 1 (acceptance/ICR rates) effectively communicate findings. The results narrative explains numbers rather than merely listing them, making interpretation accessible.

Major Weaknesses

1. No Human Reviewer Baseline - The most critical flaw. Without comparing how human reviewers perform on the same fabricated papers, we cannot determine if 82% acceptance reflects AI-specific vulnerabilities or general review system weaknesses. This fundamentally limits the paper's conclusions about LLM reviewer vulnerabilities specifically.

2. No Legitimate Paper Control Group - Without evaluating legitimate papers through the same review system, we cannot determine if 82% acceptance is anomalous. The comparison to ICLR's 31% overall acceptance rate is suggestive but insufficient; what is needed is legitimate and fabricated papers scored by the same review system.

3. Absent Statistical Reporting - No confidence intervals, sample sizes, or significance tests for any experimental results despite claiming formal error guarantees. This disconnect between rigorous theory and insufficient empirical reporting is the paper's most severe technical weakness.

4. Limited Model Selection - Only three reviewer models, all from one provider (OpenAI). Missing major LLM families (Claude, Gemini, Llama) severely limits generalization claims. GPT-5's role in the pipeline also requires clarification.

5. Single-Domain Calibration - Calibration exclusively on ICLR 2025 (AI/ML conference) with no testing on other scientific domains (medicine, physics, biology) means results may not generalize to other fields with different review cultures.

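6. Unclear Calibration Timeline - The arXiv identifier (2510.18003) indicates an October 2025 submission, well after ICLR 2025 decisions became public in January 2025, so calibrating on that data is plausible. The paper does not, however, state when the review data were collected or which snapshot was used. How was calibration performed? This timeline requires clarification.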

7. Underdeveloped Mitigation Section - Only two mitigation strategies tested with insufficient detail. Missing comparisons to established detection methods from adversarial ML and AI security literature. No adaptive attacker scenarios where generators know about and attempt to evade detection.

8. Overly Technical Presentation - Sections 3.1-3.5 introduce heavy notation before intuition, which creates unnecessary barriers. The paper is less accessible than warranted given its importance to policy makers and the broader scientific community.

Minor Weaknesses

  • Abstract contains formatting error ("acceptance rates up to .")
  • Figure 1 has small font sizes reducing readability
  • Table 2 lacks sample sizes making statistical interpretation difficult
  • Manipulation strategy rationale not explained (why these five specifically?)
  • "All" strategy composition unclear (additive? sequential?)
  • No concrete example of a fabricated paper provided
  • Limitations section appears after Conclusion (unconventional placement)
  • Missing discussion of potential for human-AI hybrid review systems

7. Specific Technical Comments and Suggestions for Improvement

Critical Issues Requiring Resolution

1. Add Human Reviewer Baseline (Essential)

  • Recruit human experts to review a sample of the same fabricated papers
  • Compare human vs. LLM acceptance rates, concern detection rates, and susceptibility to each manipulation strategy
  • This control is fundamental to validating claims about LLM-specific vulnerabilities

2. Include Legitimate Paper Control Group (Essential)

  • Evaluate legitimate papers through the same review system
  • Establish baseline acceptance rate for comparison
  • Ideally, use ICLR submissions to match domain and quality distribution

3. Add Statistical Rigor to Experimental Results (Essential)

  • Report sample sizes for every condition in all tables
  • Add 95% confidence intervals for all acceptance rates and ICR values
  • Conduct significance tests with multiple comparison corrections (Bonferroni or FDR)
  • Include effect size measures (Cohen's h, odds ratios) for key comparisons
  • Create supplementary table: [Strategy, N_papers, ACPT (95% CI), ICR (95% CI), vs. baseline p-value]
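
As one illustration of the effect-size reporting requested above, the following minimal Python sketch computes Cohen's h and an odds ratio for two acceptance rates. The function names and the example counts are hypothetical. Only the 82% and 31% rates come from this review, and the paper's actual per-condition counts are not reported.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def odds_ratio(a_yes: int, a_no: int, b_yes: int, b_no: int) -> float:
    """Odds ratio of group A vs. group B.

    Adds 0.5 to every cell (Haldane-Anscombe correction) so the ratio
    stays finite when a cell count is zero. This is a design choice of
    the sketch, not something the paper specifies.
    """
    return ((a_yes + 0.5) * (b_no + 0.5)) / ((a_no + 0.5) * (b_yes + 0.5))


if __name__ == "__main__":
    # 0.82 and 0.31 are the rates quoted in the review.
    print(f"Cohen's h = {cohens_h(0.82, 0.31):.2f}")
    # Hypothetical counts: 41/50 fabricated accepted vs. 31/100 at baseline.
    print(f"odds ratio = {odds_ratio(41, 9, 31, 69):.2f}")
```

By Cohen's rule of thumb, values of |h| around 0.8 or above are considered large, so reporting h next to each acceptance rate lets readers judge the practical size of a difference and not only its significance.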

4. Clarify Model Selection and Availability (Critical)

  • Specify exact model versions with API timestamps
  • Clarify GPT-5 availability timeline and status
  • Justify why these specific models were chosen
  • Document when the ICLR 2025 review data were collected relative to the October 2025 arXiv submission

5. Fully Specify Calibration Methodology (Critical)

  • Detail HOW calibration was performed, not just WHAT data was used
  • Specify which review aspects were calibrated (scores, decision boundaries, concern thresholds)
  • Report exactly how many ICLR papers were used
  • Describe how fabricated papers were matched to ICLR topic distribution
  • Provide sensitivity analysis showing robustness to calibration choices

6. Define All Metrics Formally (Critical)

  • Provide mathematical definition of ICR (currently only described intuitively)
  • Define "concern-acceptance conflict" operationally with clear thresholds
  • Specify what counts as "flagging" a concern
  • Clarify whether all concerns are weighted equally
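
To make the request for an operational definition concrete, here is one possible formalization of a concern-acceptance conflict rate, sketched in Python. It is an assumption for illustration and not the paper's definition: the `Review` record, the `accept_threshold` parameter, and the choice of flagged reviews as the denominator are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Review:
    score: float            # overall score assigned by the reviewer
    flagged_concern: bool   # True if the review raises an integrity concern


def conflict_rate(reviews: Sequence[Review],
                  accept_threshold: float) -> Optional[float]:
    """Share of concern-flagging reviews that still score at or above
    the acceptance threshold.

    Returns None when no review flags a concern, so an empty cell is
    distinguishable from a true 0% conflict rate.
    """
    flagged = [r for r in reviews if r.flagged_concern]
    if not flagged:
        return None
    conflicts = sum(1 for r in flagged if r.score >= accept_threshold)
    return conflicts / len(flagged)


if __name__ == "__main__":
    sample = [Review(7.0, True), Review(4.0, True), Review(8.0, False)]
    print(conflict_rate(sample, accept_threshold=6.0))
```

Whatever definition the authors adopt, stating the denominator and the threshold explicitly in this way would remove the ambiguity noted above.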

7. Expand and Detail Mitigation Strategies (Major)

  • Fully specify what ReD and DetOnly implementations entail
  • Compare against established detection methods from literature:
    • Statistical forensics (impossible p-value distributions)
    • Plagiarism detection (template reuse)
    • Adversarial training on fabricated examples
    • Multi-stage review with human verification
    • Ensemble detection methods
  • Test robustness to adaptive attackers who know about and evade detection
  • Provide cost-benefit analysis for practical deployment

Important Improvements

8. Cross-Venue Validation

  • Test framework on other venues beyond ICLR (NeurIPS, CVPR, ICML for ML; add non-ML venues)
  • Evaluate generalization across scientific domains (medicine, physics, social sciences)
  • Compare calibration stability across different venues

9. Expand LLM Model Coverage

  • Include Claude (Anthropic), Gemini (Google), Llama (Meta)
  • Test whether findings generalize across different LLM architectures and providers
  • Evaluate vendor-specific biases

10. Improve Accessibility (Important)

  • Move Section 3.5 to Appendix, keep 2-3 sentence summary in main text
  • Rewrite Section 3.1 to lead with intuition before formalism
  • Add plain-language summary boxes after technical sections
  • Create notation reference table in appendix
  • Add algorithm pseudocode boxes for key procedures

11. Address Multiple Comparisons (Important)

  • Apply Bonferroni or FDR correction for the 18+ hypothesis tests implicit in results
  • Report adjusted p-values in tables
  • Discuss potential for false discoveries
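
The corrections named above are simple to apply once per-test p-values exist. The sketch below implements Bonferroni and Benjamini-Hochberg (FDR) adjustment in plain Python. The input p-values are placeholders, since the paper reports none.

```python
from typing import List, Sequence


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    placeholder_p = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
    print(bonferroni(placeholder_p))
    print(benjamini_hochberg(placeholder_p))
```

With 18 or more comparisons, Bonferroni is conservative, so reporting both sets of adjusted values would show how sensitive the conclusions are to the choice of correction.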

12. Enhance Reproducibility (Important)

  • Release anonymized raw data (acceptance rates per paper)
  • Provide statistical analysis scripts (R/Python)
  • Share exact prompt templates (generator and reviewer) in appendix
  • Report random seeds and hyperparameters
  • Provide Docker container with exact environment

Minor Suggestions

13. Fix Formatting Issues

  • Correct missing percentage in abstract ("acceptance rates up to .")
  • Increase Figure 1 font sizes
  • Add sample sizes to Table 2 cells
  • Reorganize Tables 3-4 into single comprehensive mitigation table

14. Add Concrete Examples

  • Include one fabricated paper example in appendix showing how specific manipulations manifest
  • Provide before/after snippets for each strategy in supplementary materials

15. Improve Cross-Referencing

  • Some figures/tables referenced before adequate context
  • Add forward references: "our main results (Table 1, discussed in Section 4.2)"

16. Expand Strategy Motivation

  • Add 2-3 sentences in introduction explaining WHY these five strategies were chosen
  • Discuss what other strategies were considered and excluded

17. Clarify "All" Strategy

  • Specify how the five strategies are combined (sequential? simultaneous?)
  • Discuss potential interactions or synergies between strategies

18. Address Limitations Earlier

  • Move limitations section before Conclusion
  • Acknowledge specific limitations in methodology sections where relevant

8. Comments on Figures, Tables, and Data Presentation

Figure Quality: Generally Excellent

Figure 1 (Framework Overview): ★★★★☆ This visual summary of the two-agent system is excellent for conveying the high-level approach. The clear flow from Paper Agent → Review Agent with color coding effectively communicates the adversarial setup. However, some font sizes are too small for comfortable reading, and the placement of "GPT-5 checking for integrity concerns" feels like an afterthought in the layout. Consider adding numbered workflow steps (1→2→3) to further clarify the process flow.

Figure 2 (Score Distributions): ★★★★★ This is an exceptional visualization—arguably the paper's best figure. The consistent 3×6 grid layout (3 models × 6 strategies) with clear threshold line marking acceptance boundary makes cross-condition comparisons intuitive. The visual immediately communicates that o4-mini is right-shifted (more permissive), o3 shows larger variance with fatter tails, and GPT-4.1 is more conservative. Minor improvement: Y-axis could say "Frequency" instead of "Count," and consider adding mean scores as vertical lines.

Figure 3 (Theoretical Validation): ★★★★☆ The three-panel layout effectively demonstrates that theoretical bounds hold empirically—a critical validation of the framework's rigor. However, this is dense technical content that may need more caption explanation for accessibility.

Table Quality: Comprehensive but Improvable

Table 1 (ACPT-ICR Main Results): ★★★★★ This is outstanding presentation of extensive information in a clear, scannable format. The dual threshold comparison (τ_rate vs τ_0.5) provides valuable sensitivity analysis. The ICR-m columns for individual models plus ICR@M aggregate are comprehensive. However, the table would benefit from:

  • Visual highlighting (bold or shading) for notable values like 82.0%
  • Brief interpretation of key findings in the caption beyond just metric definitions
  • Confidence intervals for each value

Table 2 (Concern-Acceptance Conflict): ★★★★☆ The simple 3×6 layout directly addresses the key paradox and is easy to scan. Percentages clearly show the concerning pattern. Critical missing element: no sample size indicators. A cell showing "100%" could represent 10/10 papers or 1/1 paper—the precision is dramatically different. Add N values in parentheses: "100% (10/10)" or as a separate row.

Would also benefit from a "Total" column showing overall conflict rate per model, and clarification whether 0.0% cells (e.g., GPT-4.1/s4) indicate zero samples or true 0% conflict.
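
The difference between "100% (10/10)" and "100% (1/1)" can be quantified with a Wilson score interval. The following is a minimal Python sketch. The function name is illustrative, and the two counts are the hypothetical examples from the paragraph above, not values from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for k, n in [(10, 10), (1, 1)]:
        lo, hi = wilson_interval(k, n)
        print(f"{k}/{n}: [{lo:.2f}, {hi:.2f}]")
```

Under this interval, 10/10 has a lower bound of roughly 0.72, while 1/1 has a lower bound of roughly 0.21, which is why the same "100%" cell can carry very different evidential weight.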

Tables 3-4 (Mitigation Results): ★★★☆☆ These tables present appropriate metrics (TPR, FPR, Accuracy, F1) and importantly include a random-guess baseline. However, the format is inconsistent with Tables 1-2, creating visual discontinuity. Table 4 is very dense, with repeated columns and some redundancy (for a fixed class balance, Accuracy is fully determined by TPR and FPR). Recommendation: Combine Tables 3-4 into a single comprehensive mitigation table with better organization. Consider whether all metrics are necessary or if Accuracy + F1 would suffice.
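
How the four reported detection metrics relate to one another can be checked from a single confusion matrix. The Python sketch below is illustrative: it assumes "positive" means a fabricated paper, and the counts in the usage example are invented.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """TPR, FPR, accuracy, and F1 from confusion-matrix counts.

    Also recomputes accuracy from TPR, FPR, and the share of positives,
    showing that accuracy adds no information once those are known.
    """
    positives = tp + fn   # fabricated papers (assumed positive class)
    negatives = fp + tn   # legitimate papers
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    total = positives + negatives
    tpr = tp / positives
    fpr = fp / negatives
    prevalence = positives / total
    return {
        "tpr": tpr,
        "fpr": fpr,
        "accuracy": (tp + tn) / total,
        "f1": 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0,
        "accuracy_from_rates": prevalence * tpr + (1 - prevalence) * (1 - fpr),
    }


if __name__ == "__main__":
    # Invented counts for illustration only.
    for name, value in detection_metrics(tp=30, fp=15, tn=35, fn=20).items():
        print(f"{name}: {value:.3f}")
```

F1, by contrast, depends on precision and therefore carries information that accuracy does not, which supports keeping it in a consolidated table.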

Data Presentation: Critical Gaps

The most significant issue is not figure/table design but what data is absent:

Missing visualizations that would strengthen the paper:

  • Calibration curves (reliability diagrams) showing model calibration quality
  • ROC curves for detection methods showing the tradeoff between true-positive and false-positive rates across thresholds
  • Confidence interval plots showing uncertainty in acceptance rates across strategies
  • Correlation heatmap showing which strategies are detected by which models
  • Temporal analysis if papers were generated/reviewed across time periods
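
As a sketch of what the suggested ROC analysis would involve, the Python code below derives ROC points and the area under the curve from detector scores. It assumes each paper has a numeric detection score and a binary label (1 = fabricated). Both the function names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tpr = sum(s >= t for s in pos) / len(pos)
        fpr = sum(s >= t for s in neg) / len(neg)
        points.append((fpr, tpr))
    return points


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    sample_scores = [0.9, 0.8, 0.3, 0.1]  # made-up detector outputs
    sample_labels = [1, 1, 0, 0]
    print(roc_points(sample_scores, sample_labels))
    print(roc_auc(sample_scores, sample_labels))
```

An AUC near 0.5 would confirm, independently of any single threshold, that the detectors perform close to chance.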

Missing data that should be reported:

  • Sample sizes for every cell in all tables
  • Confidence intervals for all point estimates
  • Raw score data (means, medians, quartiles) underlying Figure 2 histograms
  • Statistical test results (p-values, effect sizes) for key comparisons

Specific Recommendations

  1. Add supplementary table with detailed statistics: means, SDs, CIs, sample sizes, and pairwise test results
  2. Include sample sizes in Table 2 cells: "100% (N=X/Y)"
  3. Increase Figure 1 font sizes by 20-30%
  4. Add mean score lines to Figure 2 distributions
  5. Create combined, reorganized mitigation table from Tables 3-4
  6. Add reliability diagrams to appendix showing calibration quality
  7. Consider adding ROC curves for detection methods
  8. Use visual highlighting (bold, color, shading) to draw attention to key findings in tables

9. Assessment of References and Literature Review

Literature Coverage: Comprehensive and Well-Positioned

The related work section is thorough, positioning BadScientist at the intersection of four research areas: (1) AI-generated text and detection, (2) LLM-based peer review, (3) AI research agents and "AI scientists," and (4) scientific misconduct and paper mills. This multi-disciplinary positioning is appropriate and well-executed.

Strengths:

  • Cites seminal work in each area (DetectGPT, Weber-Wulff systematic review, AI Scientist, paper mill detection literature)
  • Includes recent developments (ICLR 2025 data, AAAI 2025 LLM review pilot, recent paper mill studies)
  • Covers both technical ML research and scientific integrity literature
  • References span multiple communities (NeurIPS, Nature, PNAS, arXiv)

Key citations present:

  • AI-generated text detection: Mitchell et al. 2023 (DetectGPT), Weber-Wulff et al. 2023, Gao et al. 2023
  • LLM review: Liang et al. 2024, Liu & Shah 2023, Hosseini & Horbach 2023
  • AI research agents: Lu et al. 2024 (AI Scientist), FutureHouse, Google AI Co-Scientist
  • Paper mills and integrity: Bik et al., PNAS 2025 studies, Congressional reports

Gaps and Missing References

Adversarial ML Literature: While the paper claims novelty in adversarial testing of LLM review systems, it could more thoroughly engage with the adversarial ML literature on:

  • Red-teaming methodologies (Perez et al. 2022, Anthropic's work)
  • Adversarial robustness evaluation frameworks (HELM - Liang et al. 2022)
  • Jailbreaking research that might inform evasion strategies

AI Safety and Alignment: The paper addresses alignment failures (concern-acceptance conflict) but doesn't cite core alignment literature. Relevant works:

  • CSET (Center for Security and Emerging Technology) frameworks on AI robustness
  • NIST's adversarial machine learning taxonomy (NIST AI 100-2 E2025) and the NIST AI Risk Management Framework
  • Recent work on scalable oversight and AI safety

Peer Review Research: The paper could benefit from citing traditional peer review research:

  • Studies on human reviewer biases and failure modes
  • Research on peer review reliability (inter-rater agreement)
  • Work on structured review protocols and checklists
  • Meta-science literature on review quality

Detection and Verification Methods: Missing references to:

  • Cryptographic provenance systems for scientific artifacts
  • Blockchain-based verification (if relevant to the domain)
  • Multi-agent debate systems (shown to improve LLM accuracy)
  • Uncertainty quantification methods for LLMs

Scientific Integrity Policy: Could strengthen with references to:

  • COPE (Committee on Publication Ethics) guidelines
  • Major publisher policies on AI-generated content (Nature, Science, Springer Nature)
  • Funding agency policies (NIH, NSF) on research integrity
  • International standards for research ethics

Reference Quality and Recency

Strengths:

  • Mix of foundational work and recent developments (2023-2025 papers)
  • Includes both peer-reviewed and preprint sources (appropriate for fast-moving field)
  • Cites authoritative sources (Nature, PNAS, Science for integrity issues)
  • References are relevant and accurately described

Concerns:

  • Heavy reliance on arXiv preprints for some key claims (acceptable but note for context)
  • Data provenance: the source and access date of the "ICLR 2025" data should be cited explicitly for this October 2025 paper
  • Model availability claims (o3, GPT-5) need more concrete citation support

Integration with Paper Content

The related work effectively motivates the paper by showing:

  • What exists (AI detection, LLM review, AI scientists, paper mills)
  • What's missing (adversarial testing of the combined system)
  • Why it matters now (convergence of capabilities creating new vulnerabilities)

The transitions from related work to the paper's contributions are clear. However, the paper could more explicitly discuss how its findings relate back to this literature in the discussion/conclusion. For example:

  • How do these acceptance rates compare to human reviewer vulnerabilities documented in integrity literature?
  • How do detection rates compare to other adversarial ML scenarios?
  • What do these findings mean for AI safety research more broadly?

Recommendations for References

Essential additions:

  1. More thorough engagement with adversarial ML and red-teaming literature
  2. Citations for statistical methods used (concentration bounds, calibration analysis)
  3. Human peer review failure mode research for comparison baseline
  4. Established AI safety frameworks (NIST, CSET)

Valuable additions:

  5. Traditional peer review research on reliability and biases
  6. Detection method literature from adversarial ML
  7. Publisher and funding agency policies on AI in research
  8. Multi-agent debate and uncertainty quantification methods

Format suggestions:

  9. Create a related work comparison table showing how BadScientist differs from each cited system
  10. Add a timeline figure showing convergence of AI research agents and LLM review systems


10. Overall Recommendation and Detailed Justification

Recommendation: MAJOR REVISIONS REQUIRED

Decision Rationale

This paper addresses a critical, timely problem at the intersection of AI safety and scientific integrity with genuinely novel contributions (concern-acceptance conflict discovery, rigorous adversarial evaluation framework, quantitative vulnerability assessment). However, fundamental methodological gaps prevent acceptance in current form. Most critically, the absence of human reviewer baselines and legitimate paper control groups makes it impossible to determine whether findings reflect LLM-specific vulnerabilities or general review system weaknesses. Additionally, the disconnect between rigorous theoretical foundations and insufficient statistical reporting in experimental results substantially limits the strength of conclusions.

Justification for Major Rather Than Minor Revisions

Why not "Minor Revisions": The required changes are substantial and fundamental:

  • Adding human reviewer baseline comparison requires new data collection (estimated 20-40 hours human expert time)
  • Creating legitimate paper control group requires additional experiments
  • Comprehensive statistical analysis with CIs, significance tests, and multiple comparison corrections requires substantial reanalysis
  • Expanding mitigation strategies and testing adaptive attackers requires new experiments
  • Cross-venue validation for generalization claims requires additional data

These are not cosmetic changes but core methodological improvements that substantially strengthen the scientific foundation. Estimated revision time: 3-6 months.

Why not "Reject": Despite methodological limitations, the core contributions are valuable:

  • The concern-acceptance conflict finding is novel and important regardless of comparison baselines
  • The BadScientist framework itself is a methodological contribution others can build on
  • The 82% acceptance rate, while needing better contextualization, is alarming and policy-relevant
  • The theoretical framework with formal guarantees is rigorous and well-executed
  • The timing is critical as venues pilot LLM review systems—the warning is valuable now

Path to Acceptance

Essential changes (must address):

  1. Add human reviewer baseline - Recruit experts to review sample of fabricated papers; compare human vs. LLM vulnerability
  2. Include legitimate paper control - Evaluate real papers through same system to establish baseline acceptance rate
  3. Add comprehensive statistical reporting - Sample sizes, confidence intervals, significance tests with multiple comparison corrections for all experimental results
  4. Clarify model availability and timeline - State when the ICLR 2025 data were collected and which GPT-5 version was used, with access dates
  5. Fully specify calibration methodology - Detailed description enabling reproduction
  6. Define all metrics formally - Mathematical definitions for ICR, concern-acceptance conflict
  7. Expand mitigation section - Detail strategies, compare to established methods, test adaptive attackers

Important changes (should address):

  8. Cross-venue validation beyond ICLR
  9. Include diverse LLM families (Claude, Gemini, Llama)
  10. Improve accessibility by restructuring Sections 3.1-3.5
  11. Add concrete fabricated paper examples
  12. Enhance reproducibility with code, data, prompts

With these revisions, this work would merit acceptance at a top-tier venue (NeurIPS, ICLR, AAAI, FAccT).

Significance and Impact Assessment

If properly revised, this work would be highly influential:

Research impact: Will likely become a foundational reference for:

  • AI safety research on LLM alignment failures
  • Adversarial ML research in high-stakes domains
  • Scientific integrity and research ethics
  • LLM evaluation methodologies

Expected high citation count (100+ citations within 2 years) given the intersection of hot topics (AI safety, scientific integrity, LLM evaluation) and policy relevance.

Policy impact:

  • Major conferences will likely revise policies on AI-assisted submission and review
  • Publishers (Nature, Springer, Elsevier) may update guidelines based on findings
  • Funding agencies may mandate disclosure requirements
  • Detection tool companies will incorporate insights

Practical impact:

  • Provides red-teaming framework for venues implementing LLM review
  • Validates human oversight requirements in AAAI 2025 pilot
  • Warns community before widespread LLM review adoption
  • Influences next generation of review system and research agent design

Comparison to Venue Standards

Top-tier AI venue criteria (NeurIPS, ICML, ICLR, AAAI):

  ✓ Novel problem formulation - First systematic adversarial evaluation of LLM review systems
  ✓ Methodological rigor - Theoretical framework with formal guarantees (when revised)
  ✓ Significant findings - Concern-acceptance conflict is genuinely surprising
  ✓ Broad impact - Affects AI safety, publishing systems, policy
  ✗ Technical quality - Currently limited by missing baselines and statistical issues
  ✓ Timeliness - Critical as AI review systems deploy

Current state: 3.5/5 stars - valuable contribution with significant limitations

After revisions: 4.5/5 stars - strong paper meriting top-tier publication

Recommended Venue and Timeline

Primary venue recommendation: NeurIPS or ICLR given:

  • AI safety and robustness relevance
  • LLM evaluation focus
  • ML community impact
  • ICLR-specific relevance (uses ICLR 2025 calibration data)

Alternative venues:

  • AAAI (given 2025 LLM review pilot—highly timely)
  • FAccT (ACM Fairness, Accountability, Transparency—ethical AI implications)
  • CHI or CSCW (human-AI collaboration in scholarly work)

Timeline recommendation:

  • Revision period: 3-6 months
  • Human expert baseline collection: 1-2 months
  • Additional experiments and analysis: 2-3 months
  • Writing and revision: 1 month
  • Target resubmission: Next conference cycle

Conditional Accept Criteria

The paper would merit acceptance if revised to:

  1. Include human reviewer baseline testing whether the vulnerabilities are LLM-specific
  2. Provide legitimate paper control establishing whether 82% acceptance is anomalous
  3. Add complete statistical reporting with CIs, significance tests, and effect sizes
  4. Resolve model availability and calibration timeline issues
  5. Expand and detail mitigation strategies with adaptive attacker testing
  6. Improve accessibility of Sections 3.1-3.5

These revisions would transform the paper from "interesting but limited" to "significant contribution with practical impact."

Final Assessment

This work tackles a genuinely important problem at the intersection of AI capability advancement and scientific integrity. The concern-acceptance conflict discovery is novel and points to a disconnect between the concerns LLM reviewers articulate and the scores they assign. The theoretical framework is rigorous, the findings are concerning, and the timing is critical.

However, the paper in its current form makes claims that aren't fully supported by the evidence presented. Adding human baselines, legitimate paper controls, and comprehensive statistical analysis would substantially strengthen the scientific foundation and enable much stronger conclusions about LLM reviewer vulnerabilities specifically.

The core insight is valuable; the execution needs strengthening. With major revisions addressing the identified gaps, this would be an influential paper that shapes how the scientific community approaches AI-assisted research and review. The work deserves publication—but not yet.

Recommendation: Major Revisions with encouragement to resubmit after addressing critical issues.


Summary of Review Components

  • Methodology: 6/10 - Conceptually sound but missing critical controls (human baseline, legitimate papers)
  • Results: 5/10 - Important findings but insufficient statistical reporting
  • Novelty: 9/10 - Highly novel research question and concern-acceptance conflict discovery
  • Significance: 9/10 - Critical importance to AI safety and scientific integrity
  • Writing: 7/10 - Generally strong but overly technical in places
  • Figures/Tables: 8/10 - Excellent visualizations but missing data (CIs, sample sizes)
  • Overall: 7/10 - Valuable contribution requiring substantial methodological strengthening

Estimated impact if properly revised: Very High (top 10% of papers in AI safety/integrity intersection)
