Content is user-generated and unverified.

Peer Review

Manuscript: Detecting AI-Generated Content in Academic Peer Reviews
Authors: Siyuan Shen, Kai Wang
Venue: [Submitted manuscript, arXiv:2602.00319v1]
Review Date: February 2026


Summary

This paper investigates the temporal emergence of AI-generated content in academic peer reviews using a binary classification approach. A Longformer model fine-tuned with Low-Rank Adaptation (LoRA) is trained on ICLR and Nature Communications (NC) peer reviews from 2021, with synthetic AI-generated counterparts created via the DeepSeek Reasoner API. The trained detector is then applied to reviews from 2022 to 2025 to track longitudinal trends. The authors report near-zero AI detection rates before 2022, rising to approximately 20% for ICLR and 12% for NC by 2025. The paper addresses a timely and important problem for the scholarly publishing community.


Strengths

Timely and relevant contribution. The question of AI infiltration into peer review is of pressing concern across academic disciplines and publishing venues. This paper makes a meaningful early empirical contribution to what remains an under-evidenced debate, providing concrete longitudinal data points that go beyond anecdote or conjecture.

Cross-venue design. The decision to study both a conference-based venue (ICLR) and a journal-based venue (NC) is well-conceived. The parallel trends observed across these structurally distinct settings add credibility to the findings and reduce the risk that results are merely venue-specific artefacts.

Transparent limitations section. The authors demonstrate commendable candour in their limitations discussion, acknowledging the binary classification oversimplification, the risk of overfitting to synthetic training data, and the single-model provenance of all synthetic reviews. This intellectual honesty is to be credited.

Quarterly granularity for NC. The quarterly breakdown of NC data (Figure 3) adds genuine analytical value and supports a more nuanced interpretation of trend dynamics than annual aggregates alone would permit.


Weaknesses and Concerns

Critical methodological circularity in the training data. This is the most significant concern. The detector is trained on synthetic reviews generated by DeepSeek Reasoner, yet the stated goal is to detect AI-generated reviews submitted by actual academic reviewers, who may have used ChatGPT, Claude, Gemini, Copilot, or other tools. The authors acknowledge this in their limitations section but do not adequately address its implications for result validity. If the model has learned to recognise the stylistic fingerprint of DeepSeek Reasoner specifically, then the "detections" in 2024–2025 reviews may reflect the increasing adoption of DeepSeek itself rather than AI-assisted reviewing as a broader phenomenon. This confound is potentially fatal to the paper's core claims and demands either empirical investigation or a substantially revised framing of the results as DeepSeek-specific detection.

Perfect training set accuracy as a warning signal, not a strength. The model achieves 100% accuracy on the 2021 training data for both ICLR and NC. This is presented neutrally, but in the context of a small, highly curated training set (only 160 or 120 pairs per venue), it is more suggestive of overfitting to superficial distributional differences between real and synthetically generated text than of genuine semantic discrimination. No validation set performance is reported. Without a held-out validation split distinct from the test years, there is no reliable basis to evaluate model quality.
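
To make the held-out validation request concrete, the following is a minimal sketch of a pair-level split. It assumes, hypothetically, that each 2021 human review and its synthetic counterpart share a pair identifier; the manuscript does not describe its data layout, and the function name and the 20% validation fraction are illustrative choices only.

```python
import random


def split_pairs(pair_ids, val_fraction=0.2, seed=0):
    """Split pair identifiers into (train, validation) lists.

    Each identifier stands for one human review plus its synthetic
    counterpart (an assumed layout), so both members of a pair always
    land in the same split and cannot leak across it.
    """
    ids = sorted(set(pair_ids))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


# Illustration with 160 pairs, one of the training-set sizes quoted above.
train_ids, val_ids = split_pairs(range(160))
```

Splitting by pair rather than by individual review matters here: if a human review sits in the training set while its synthetic rewrite sits in the validation set, shared topical content could inflate validation accuracy.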

No ground truth for the inference results. The paper's central empirical claim, that approximately 20% of ICLR 2025 reviews are AI-generated, rests entirely on the output of a model whose real-world precision and recall are unknown. There are no spot checks, no human annotations of a detection sample, and no comparison to gold-standard labelled data from 2022–2025. In the absence of any ground truth validation, the reported percentages are model outputs, not established facts. The paper should either commission human annotation of a stratified sample or engage a secondary detection system for cross-validation.
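
To illustrate how strongly the headline figure depends on unknown error rates, here is a small sketch of the standard Rogan-Gladen correction, which converts a raw flagged rate into an estimated true prevalence given a detector's sensitivity and specificity. The sensitivity and specificity values below are invented for illustration; the manuscript reports neither.

```python
def corrected_prevalence(flagged_rate, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence from a raw flagged rate.

    flagged_rate: fraction of reviews the detector labels as AI-generated.
    sensitivity:  P(flagged | truly AI-generated), assumed known.
    specificity:  P(not flagged | truly human-written), assumed known.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("detector is no better than chance")
    estimate = (flagged_rate + specificity - 1.0) / denom
    # Clamp to [0, 1], since sampling noise can push the estimate outside.
    return min(1.0, max(0.0, estimate))


# Hypothetical error rates applied to a 20% raw flagged rate.
optimistic = corrected_prevalence(0.20, sensitivity=0.90, specificity=0.95)
pessimistic = corrected_prevalence(0.20, sensitivity=0.90, specificity=0.80)
```

Under the first set of assumed error rates the corrected prevalence stays in the high teens; under the second, a 20% false-positive rate alone would account for the entire raw figure. Without measured error rates on real 2022–2025 reviews, neither scenario can be ruled out.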

Sample size and representativeness. The ~2,000 reviews per year per venue represent only a modest fraction of total submissions, and the paper-level sampling strategy — while defensible — is not accompanied by any analysis of whether sampled papers are representative of the broader submission pool in terms of subfield, paper length, or acceptance outcome. It is plausible that AI-assisted reviewing correlates with paper characteristics that are not uniformly distributed across the sample.

The quarterly NC trend is non-monotonic and under-explained. Figure 3 shows that AI-detected percentages in NC peak in Q4 2024 (15.1%) and dip to approximately 10–11% in Q1–Q2 2025 before recovering slightly. This non-monotonic pattern is neither noted nor discussed in the text, which states only that "an increasing proportion is observed across successive quarters." That description is inaccurate, and the pattern warrants investigation. Is this a seasonal effect linked to journal submission cycles? A model calibration artefact? An editing platform policy change? The authors should engage with this directly.
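
One inexpensive first check is whether the quarterly differences exceed sampling noise. The sketch below computes Wilson score intervals for two quarterly proportions. The per-quarter count of 500 reviews is an assumption (roughly 2,000 reviews per year divided by four), and the flagged counts are chosen only to approximate the percentages read from Figure 3; the authors should substitute their actual counts.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts: 76/500 (about 15%) and 53/500 (about 11%).
q4_2024 = wilson_interval(76, 500)
q1_2025 = wilson_interval(53, 500)

# True when the two intervals share any common range.
intervals_overlap = q4_2024[0] <= q1_2025[1] and q1_2025[0] <= q4_2024[1]
```

Overlapping intervals would not by themselves establish that the dip is noise, but reporting intervals of this kind in Figure 3 would let readers judge whether the quarter-to-quarter movement is meaningful.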

Attribution of the 2024 jump to GPT-4o is speculative. The paper proposes that the GPT-4o free-tier release in May 2024 "may partially explain" the NC trend. While plausible, this causal attribution is not substantiated by the data and should either be supported with additional evidence or clearly framed as speculation rather than as an interpretive finding.

Provenance of the study. The disclaimer notes that this manuscript arose from a course project (AMCS 5999) at the University of Pennsylvania. While this does not preclude publication, it does mean the work has not yet been subjected to the iterative development expected of full research papers. The scope, validation, and depth of engagement with the literature reflect this context.


Specific Questions for the Authors

  1. How was the threshold for binary classification set? Was any calibration performed, or is the default decision boundary used? Threshold sensitivity analysis could substantially change the reported percentages.
  2. Were any qualitative examples of flagged reviews examined to assess face validity? Even a small illustrative sample would strengthen confidence in the results.
  3. The Liang et al. (2024) study found no evidence of ChatGPT use in Nature portfolio reviews prior to 2024. The authors suggest this is explained by pre-2024 data inclusion. However, NC is part of the Nature portfolio. Is there any analysis confirming that the NC dataset used here and the data used in Liang et al. are sufficiently comparable to support this reconciliation?
  4. Why was DeepSeek Reasoner selected for synthetic review generation rather than a model more widely documented as being used in academic contexts (e.g., GPT-4)?
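
The threshold sensitivity analysis requested in question 1 could be as simple as the following sketch, which recomputes the flagged fraction at several decision thresholds. It assumes the detector exposes a per-review probability score; the scores and threshold grid below are made up purely for illustration.

```python
def flagged_fraction(scores, threshold):
    """Fraction of reviews whose AI-probability score meets the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s >= threshold for s in scores) / len(scores)


def threshold_sweep(scores, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Map each candidate threshold to the resulting flagged fraction."""
    return {t: flagged_fraction(scores, t) for t in thresholds}


# Made-up scores standing in for detector outputs on ten reviews.
scores = [0.05, 0.10, 0.35, 0.45, 0.55, 0.60, 0.75, 0.80, 0.92, 0.98]
sweep = threshold_sweep(scores)
```

If the reported percentages move substantially across a reasonable range of thresholds, the headline figures should be presented as a range rather than as point estimates.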

Recommendation

Major Revision

The paper addresses a genuinely important question and the broad temporal pattern it describes is both plausible and consistent with external evidence. However, the study's core findings currently rest on a foundation of unvalidated model outputs trained on synthetic data from a single LLM. Before these results can be accepted as credible empirical evidence, the authors must address the ground-truth validation problem, provide a more rigorous treatment of the training methodology, and discuss the non-monotonic quarterly trend. A substantially revised paper with these elements would make a valuable contribution to the literature.


Confidential Comments to Editor

This manuscript originates from a graduate course project, as disclosed in the authors' own disclaimer. While the research question is well-chosen and the approach is reasonable in outline, the paper requires considerably more methodological rigour before it merits publication in a peer-reviewed venue. The most pressing concern — that the detection model may be recognising the outputs of a specific LLM (DeepSeek Reasoner) rather than AI-assisted reviewing as a general phenomenon — is not a minor limitation but a potential confound that fundamentally affects what the reported percentages actually mean. I would encourage the authors to pursue this work further with the revisions described above; it has the potential to become a solid empirical contribution. However, in its current form it should not be accepted.
