Content is user-generated and unverified.

Peer Review

Manuscript: When AI reviews science: Can we trust the referee?
Journal: The Innovation Informatics
DOI: 10.59717/j.xinn-inform.2026.100030
Review Date: 13 February 2026


Summary

This manuscript presents a security- and reliability-centred analysis of AI-assisted peer review systems. The authors construct a lifecycle-wide threat taxonomy spanning training and data retrieval, desk review, deep review, rebuttal, and system-level vectors, then instantiate it with four controlled experimental probes using 100 ICLR 2025 submissions evaluated by two large language model (LLM) referees — Gemini 2.5 and GPT 5.1. The four probes target prestige framing (authority bias), assertion strength, rebuttal sycophancy, and contextual poisoning. The paper is timely, given the documented real-world exploitation of AI peer-review systems in 2025, and it makes a credible contribution to a rapidly evolving conversation about the integrity of automated scholarly evaluation.


Major Comments

1. Ecological validity of the authority-bias probe

The identity-bias experiment manipulates the system prompt to introduce prestige framing — informing the AI that a paper originates from a "flagship laboratory" or a "small team." While this cleanly isolates the variable of interest, it does not reflect how real-world prestige manipulation would occur. In practice, an author seeking to exploit authority bias would embed institutional or reputational signals within the manuscript itself — through author-line formatting, acknowledgements, or citation choices — rather than through a system-level instruction. The paper acknowledges that metadata was sanitised prior to evaluation, which actually removes the natural channel through which this bias would operate in deployed systems. The authors should either (a) redesign this probe to embed prestige cues within the manuscript content itself, or (b) acknowledge explicitly that the current design tests a narrower and less operationally realistic attack vector than is implied. The asymmetry finding — that low-prestige framing penalises more severely than high-prestige framing rewards (−0.72 vs. +0.25) — is the paper's most striking empirical result and deserves more interpretive depth. Is this asymmetry consistent with loss-aversion heuristics observed in LLM evaluation literature, or is it an artefact of the system-prompt modality used? This warrants discussion.

2. The assertion-strength finding requires closer examination

The result that cautious language is systematically penalised (−0.39) while bold language elicits scores nearly identical to baseline is counter-intuitive and important. However, the paper conflates two distinct hypotheses: (a) that LLMs reward confidence as a proxy for quality, and (b) that LLMs penalise hedging as a signal of weakness. The experimental design, which compares cautious, neutral, and bold variants against the original text, cannot clearly distinguish between these. The fact that bold language does not inflate scores relative to baseline (indeed, it slightly deflates them for GPT 5.1 at −0.24, p < 0.001) suggests the mechanism is specifically an aversion to uncertainty markers rather than a reward for confidence — a subtler and arguably more concerning finding. The authors acknowledge this as "a distinct aversion to expressions of scientific uncertainty" but do not pursue the mechanistic explanation. The Discussion should engage with whether this bias emerges from training data characteristics (e.g., the language of accepted versus rejected papers used to fine-tune these models), as this would have direct implications for mitigation.
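The distinction between hypotheses (a) and (b) can be made concrete with a small decision rule over the score shifts relative to the original text. The sketch below is illustrative only: the function name, the labels, and the tolerance `tol` are assumptions, not part of the manuscript. The inputs are the shifts quoted above (the −0.39 cautious-language penalty and the −0.24 bold-language shift for GPT 5.1), which may not come from the same aggregation level.

```python
def classify_mechanism(delta_cautious: float, delta_bold: float, tol: float = 0.05) -> str:
    """Label which hypothesis a pair of score shifts is consistent with.

    delta_cautious / delta_bold: mean score shift of the cautious / bold
    variant relative to the original text. `tol` is an assumed dead band
    around zero; a real analysis would use a significance test instead.
    """
    hedging_penalised = delta_cautious < -tol   # hypothesis (b)
    confidence_rewarded = delta_bold > tol      # hypothesis (a)
    if hedging_penalised and confidence_rewarded:
        return "both"
    if hedging_penalised:
        return "hedging penalty only"
    if confidence_rewarded:
        return "confidence reward only"
    return "neither"


# Shifts quoted in the comment above (bold value is the GPT 5.1 figure).
print(classify_mechanism(-0.39, -0.24))
```

Under these assumptions the quoted shifts fall in the "hedging penalty only" case, which is the reading argued for above. A per-paper paired analysis would be needed to confirm it.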

3. Generalisability is constrained by domain and model selection

The corpus is drawn exclusively from ICLR 2025 — a machine learning venue with distinctive norms around empirical benchmarking, ablation studies, and reproducibility standards. The paper's framing presents its findings as applicable to AI peer review broadly, including in biomedical, social science, and humanities publishing, where rhetorical conventions, review criteria, and manuscript structure differ substantially. A cautionary statement regarding domain specificity is warranted in the Discussion. Similarly, the two AI referees selected — Gemini 2.5 and GPT 5.1 — are among the most capable frontier models currently available. Findings may not generalise to lighter-weight or older models that are, arguably, more likely to be deployed in cost-sensitive production review workflows at smaller journals and conferences.

4. The defence framework is largely prescriptive and empirically untested

Section 5 (Discussion) offers a structured and conceptually sound set of defences organised by review stage. However, none of these mitigations is empirically evaluated in the present study. The prose-heavy, stage-by-stage treatment closely parallels the attack taxonomy, which gives it structural elegance, but the recommendations remain speculative. For example, the proposal to introduce "controlled randomness" during the rebuttal phase to reduce the predictability of the AI's evaluation state is theoretically plausible but could equally degrade review consistency for legitimate authors. The authors should clearly frame this section as a research agenda rather than a set of validated countermeasures, and should prioritise the two or three interventions most tractable for near-term empirical evaluation.


Minor Comments

  • Table 2: The "Feas." column distinguishes between attacks "evidenced in practice" (•) and "theoretically feasible" (◦). Several entries in the latter category (e.g., structure spoofing, academic packaging) have plausible analogues in the current literature on LLM-as-judge fragility. The authors should verify that the cited evidence (e.g., the • entry citing Robertson, ref. 142, for abstract hijacking) is being correctly characterised; Robertson (2023) is a pilot study of GPT-4 as a reviewer and does not explicitly address abstract hijacking as an adversarial tactic.
  • Rebuttal probe design: The evidence-free rebuttal used in Probe 3 is a single, standardised template applied uniformly across all 100 papers. The finding that 81–89% of papers receive score inflation is compelling, but the uniformity of the intervention limits insight into which features of the rebuttal (assertive tone, appeal to field norms, politeness framing) are doing the persuasive work. Even a minimal two-condition variant (assertion-only vs. politeness-only) would strengthen interpretability.
  • Figure 4 caption, results text, and Table 3: The caption and the results text both report an average score increase of +0.25 for high-prestige framing and a penalty of −0.72 for low-prestige framing, while Table 3 reports low-prestige penalties of −0.85 for Gemini 2.5 and −0.59 for GPT 5.1. The aggregate of −0.72 equals the unweighted mean of these two values, (−0.85 + −0.59) / 2 = −0.72, but the manuscript does not say how it was computed. The basis for the aggregate figure should be stated explicitly.
  • Literature coverage: The paper cites a strong body of adversarial ML literature but is lighter on peer review sociology and science-of-science research. Given that authority bias and sycophancy have documented counterparts in human peer review, engaging with this literature (e.g., work on halo effects in double-blind review) would contextualise whether AI systems are introducing new distortions or amplifying pre-existing ones — a distinction with important policy implications.
  • Linguistic precision: The phrase "systematic penalty for cautious language" in the results heading is somewhat misleading given that the bold condition does not produce a corresponding reward. "Asymmetric penalty for epistemic hedging" would be more precise.
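The aggregate discussed in the Figure 4 comment can be checked directly against the two per-model values quoted from Table 3. The sketch below is a minimal check that assumes both models are weighted equally; the helper name is hypothetical, and the manuscript may have used a different aggregation.

```python
from statistics import fmean

# Low-prestige score shifts quoted from Table 3 in the comment above.
low_prestige_shift = {"Gemini 2.5": -0.85, "GPT 5.1": -0.59}


def unweighted_mean(shifts: dict) -> float:
    """Equal-weight mean across models, rounded to two decimals.

    Equal weighting is an assumption; a per-paper or per-review
    weighting could yield a different aggregate.
    """
    return round(fmean(shifts.values()), 2)


print(unweighted_mean(low_prestige_shift))
```

Under the equal-weighting assumption the two per-model penalties average to −0.72. A different weighting would give a different figure, which is why the manuscript should state the basis for the aggregate.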

Recommendation

Major Revision

The manuscript addresses a genuinely important and timely problem, presents novel empirical evidence across multiple attack vectors, and is structured coherently. However, the ecological validity of the authority-bias probe, the interpretive incompleteness of the assertion-strength findings, and the absence of empirical grounding in the defence section require substantive revision before this work can be accepted. The authors are encouraged to revise and resubmit.


Confidential Comments to Editor

The authors include one Editorial Board member (Shimin Di) who declares appropriate recusal from the review process. The handling editor should confirm that this was enforced in practice and that reviewer assignments were made without input from this co-author's group.

More substantively, the experimental corpus relies on ICLR 2025 submissions, which are not yet widely available in the public domain in processed form. The Hugging Face dataset link provided should be verified for completeness and confirmed to comply with OpenReview's terms of data use prior to publication. If the dataset has been constructed from PDF conversions of non-open-access preprints without explicit licensing clearance, this could pose a compliance issue for the journal.

Finally, the use of GPT 5.1 as an experimental referee should be checked against the model vendor's current usage policies, since some AI providers restrict automated evaluation use cases in their terms of service. The editor may wish to request written confirmation from the authors that their experimental protocol complied with the applicable terms of use for both model APIs.
