Content is user-generated and unverified.

Peer Review

Manuscript: ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
Venue: [Redacted for blind review]
Reviewer: [Anonymous]


Summary

This paper introduces ScholarPeer, a search-enabled multi-agent framework for automated scientific peer review. The core argument is that existing systems — whether fine-tuned models or early agentic approaches — evaluate papers in a "parametric vacuum," lacking the live contextual knowledge a human expert brings to review. ScholarPeer addresses this through three specialist agents: a historian that constructs a chronological domain narrative; a baseline scout that adversarially identifies missing comparisons; and a multi-aspect Q&A engine that probes novelty and technical soundness against web-scale retrieval. Evaluation on DeepReview-13K demonstrates strong win-rates against fine-tuned and agentic baselines, and the authors introduce two novel metrics — the H-Max score and Review Diversity Score — to complement standard side-by-side evaluation. The paper is well-motivated and addresses a genuine problem in the field.


Strengths

Conceptual clarity and novelty. The reframing of peer review as a conditional generation problem that conditions on dynamic external context, not frozen parametric weights, is an intellectually crisp contribution. The formal distinction between $P(R|S_{content})$ and $P(R|S_{content}, C_{dynamic})$ usefully anchors the architectural choices.

Specialist agent design. The three-agent decomposition is well-motivated by reviewing practice. The historian agent in particular addresses a gap that naive RAG approaches miss: retrieving raw abstracts is insufficient without understanding the trajectory of a field. The baseline scout operationalises the "chain-of-verification" pattern in a domain-specific and adversarial way that appears genuinely novel among peer review systems.

Evaluation rigour. The combination of LLM-as-a-judge side-by-side scoring, the H-Max score calibrated against real human reviews, and human evaluation on 100 papers is commendably thorough. Judge calibration (Pearson r = 0.53 against human annotators) and the inter-judge agreement tables (Tables 5 and 6) increase confidence in the automated results. The ablation in Table 4 cleanly isolates each module's contribution, with the 26-point drop from removing the Q&A agent being particularly striking.

Honesty about limitations. The qualitative analysis in Figure 4 and Appendix E.3 openly documents failure modes: ScholarPeer is weaker than some baselines on internal consistency checking, occasionally missing logical impossibilities within a paper's own text. This kind of candour strengthens rather than weakens the paper.

Novel metrics. The H-Max score — which calibrates AI review quality against the best human reviewer rather than the average — is a meaningful contribution in its own right and should be adopted more widely. The Review Diversity Score (RDS) addresses the "artificial hivemind" problem identified by concurrent work.


Weaknesses and Concerns

Circular evaluation design. The most significant methodological concern is that Claude Sonnet 4.5 is simultaneously (a) one of the evaluated baselines and (b) the primary LLM judge for all automated evaluation. The authors acknowledge this briefly, noting that ScholarPeer scores higher than Claude-based agentic baselines despite Claude serving as judge. This is treated as evidence of robustness, but the argument is not fully convincing. A systematic bias — in either direction — cannot be excluded without a more formal analysis. The Gemini 3.0 Pro judge results in Appendix D show meaningfully different absolute numbers (e.g., Claude Sonnet 4.5 single-agent H-Max rises from 4.50 to 6.63), confirming that judge identity materially affects scores. The authors should present the Gemini judge results as co-primary rather than relegating them to the appendix, and should discuss the divergence more carefully.

Temporal constraint enforcement. The entire framework's validity hinges on search agents strictly retrieving literature published before a paper's submission date. The cutoff date is threaded through the prompts (Appendix G), but it is unclear how consistently the underlying Google Search tool honours this constraint in practice. Web search engines do not expose submission dates for preprints, and blog posts or repositories may carry inaccurate timestamps. The authors should report empirical failure rates for this constraint, or at minimum discuss how the cutoff is operationalised and verified.
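
To make the request for empirical failure rates concrete, the following is a minimal sketch of a post-hoc audit the authors could run over logged search hits. It assumes each hit can be assigned a publication date or marked as undated; `RetrievedItem`, `audit_cutoff`, and the example URLs are hypothetical names introduced here for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class RetrievedItem:
    """Hypothetical record for one search hit (not a structure from the paper)."""
    url: str
    published: Optional[date]  # None when no trustworthy date could be established


def audit_cutoff(items: List[RetrievedItem], cutoff: date) -> Dict[str, float]:
    """Return the fraction of hits that precede the cutoff, leak past it, or are undated."""
    counts = {"ok": 0, "leak": 0, "undated": 0}
    if not items:
        return {key: 0.0 for key in counts}
    for item in items:
        if item.published is None:
            counts["undated"] += 1   # cannot be verified either way
        elif item.published < cutoff:
            counts["ok"] += 1        # strictly before the submission date
        else:
            counts["leak"] += 1      # published on or after the cutoff
    return {key: value / len(items) for key, value in counts.items()}


if __name__ == "__main__":
    # Illustrative hits only; these are not data from the paper.
    hits = [
        RetrievedItem("https://example.org/a", date(2023, 5, 1)),
        RetrievedItem("https://example.org/b", date(2024, 2, 10)),
        RetrievedItem("https://example.org/c", None),
        RetrievedItem("https://example.org/d", date(2023, 9, 27)),
    ]
    print(audit_cutoff(hits, cutoff=date(2023, 9, 28)))
```

Reporting the "leak" and "undated" fractions separately would distinguish confirmed violations of the cutoff from cases where the constraint simply cannot be verified.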

Domain generalisability. The DeepReview-13K dataset is drawn entirely from ICLR 2024–2025, a single machine learning venue. It is unknown whether ScholarPeer's historian and baseline scout would perform equivalently in the biomedical sciences, economics, or the humanities, where literature is structured differently, where benchmark comparisons are less central to evaluation, and where the most relevant prior art may reside in books or non-indexed sources rather than arXiv. The paper should acknowledge this limitation and either offer preliminary evidence of cross-domain transfer or clearly scope its claims to ML conference reviewing.

Comparison with Stanford Agent Reviewer (SAR). Evaluation against SAR is limited to 50 papers (25 per ICLR vintage), making the reported win-rates of 54% (Claude judge) and 64% (Gemini judge) statistically fragile. Given that SAR is the strongest closed-source comparator, this comparison deserves a larger sample. The authors should either expand this evaluation or provide confidence intervals that make the current sample size's uncertainty explicit.
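
The fragility can be illustrated with a Wilson score interval. The sketch below assumes the reported win-rates correspond to 27 and 32 outright wins out of 50 with no ties, and uses z = 1.96 for a 95% interval; both are assumptions of this reviewer, since the paper's handling of ties is not restated here, and `wilson_interval` is a name introduced for illustration.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Win counts are inferred from the reported percentages (assumption: no ties).
    for judge, wins in [("Claude judge", 27), ("Gemini judge", 32)]:
        low, high = wilson_interval(wins, 50)
        print(f"{judge}: {wins}/50 wins, 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions the intervals come out at roughly [0.40, 0.67] for the Claude judge and [0.50, 0.76] for the Gemini judge, so the former is compatible with no preference between the systems and the latter only marginally excludes it.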

Computational and deployment practicality. At approximately 20 LLM calls per review, ScholarPeer costs roughly 20× as much per review as a single-pass fine-tuned baseline, assuming comparable per-call cost. The paper notes this but does not report wall-clock latency, which matters for integration into real submission systems that process thousands of papers at once. A discussion of cost and latency at realistic conference scale would strengthen the paper's practical impact.
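
A back-of-envelope calculation of the kind requested might look like the sketch below. Only the figure of 20 calls per review comes from the paper; the paper count, per-call latency, and number of parallel workers are illustrative placeholders chosen by this reviewer, and the model assumes calls within a review run sequentially while reviews run in parallel batches.

```python
import math
from typing import Tuple


def conference_load(
    papers: int,
    calls_per_review: int = 20,      # approximate figure reported in the paper
    seconds_per_call: float = 30.0,  # hypothetical average latency per LLM call
    workers: int = 100,              # hypothetical number of reviews run in parallel
) -> Tuple[int, float]:
    """Return (total LLM calls, wall-clock hours) under a simple batching model."""
    total_calls = papers * calls_per_review
    batches = math.ceil(papers / workers)  # each batch reviews up to `workers` papers
    hours = batches * calls_per_review * seconds_per_call / 3600
    return total_calls, hours


if __name__ == "__main__":
    calls, hours = conference_load(papers=10_000)
    print(f"{calls:,} calls, about {hours:.1f} wall-clock hours")
```

With these placeholder values the model gives 200,000 calls and roughly 16.7 hours; the point is that the authors' measured latency and achievable concurrency would need to replace the placeholders before any deployment claim can be assessed.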

Privacy and confidentiality. The impact statement identifies the risk that unpublished manuscripts are processed by external search and LLM APIs. This deserves more than a passing mention. Many conference systems operate under explicit policies prohibiting submission of under-review material to third-party systems. The paper should discuss whether ScholarPeer can function with a private deployment of the search infrastructure, and what degradation in performance would result.


Minor Comments

  • The H-Max anchors (5 = "human level", 10 = "transformative") are intuitive, but the mapping between scores and natural-language descriptions (Appendix F.3) should appear in the main paper so that the scoring rubric is reproducible.
  • Figure 1 (middle panel) would benefit from error bars; it is unclear whether the H-Max differences between neighbouring systems are statistically significant.
  • The paper notes that ScholarPeer's review diversity (0.29) remains below human diversity (0.43) but does not discuss why — whether this reflects a shared retrieval pathway, similar historian narratives, or something intrinsic to the backbone model.
  • Appendix G is admirably transparent in sharing agent prompts, though the literature expansion prompt (G.3) does not specify how many expansion rounds are executed by default.
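
On the error bars requested for Figure 1, one option is a percentile bootstrap over per-paper H-Max scores. The sketch below is a minimal illustration under that assumption; `bootstrap_mean_ci` is a hypothetical helper, the scores in the example are invented placeholders rather than values from the paper, and the authors may prefer a different interval method.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-paper scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample papers with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    low_index = int((alpha / 2) * n_resamples)
    high_index = int((1 - alpha / 2) * n_resamples) - 1
    return means[low_index], means[high_index]


if __name__ == "__main__":
    # Placeholder scores for one system; not data from the paper.
    example_scores = [4, 5, 6, 5, 4, 6, 7, 5, 5, 6]
    low, high = bootstrap_mean_ci(example_scores)
    print(f"mean = {statistics.fmean(example_scores):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Overlapping intervals between neighbouring systems would indicate that their ordering in the figure should not be over-interpreted.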

Recommendation

Major Revision

ScholarPeer presents a conceptually sound and empirically strong contribution to automated peer review. The specialist agent architecture, the H-Max evaluation metric, and the honest qualitative analysis are all genuine advances. However, the circular evaluation design, uncertainty over temporal constraint enforcement, limited cross-domain evidence, and the underpowered SAR comparison collectively require substantive revision before acceptance. These are addressable concerns rather than fundamental flaws, and the core contribution should survive revision intact.


Confidential Comments to Editor

This reviewer has a potential indirect interest to disclose: the reviewing field is one in which the reviewer works professionally, though not in AI-generated reviewing specifically. This has been considered carefully and is not believed to have materially influenced the assessment above.

The circular evaluation issue — Claude Sonnet 4.5 as both evaluated system and judge — is the most pressing editorial concern. If the venue allows, the editor may wish to request that a revised submission elevate the Gemini judge results to co-primary status and provide a more thorough sensitivity analysis. The core results appear robust, but the optics of the current presentation are avoidably awkward and are likely to attract criticism post-publication.

It is also worth noting the somewhat recursive nature of this submission: the paper being reviewed uses Claude Sonnet 4.5 as its evaluation judge, and the reviewer is Claude. The authors may wish to acknowledge this in any public version of the review discussion, as it represents an interesting edge case in the context of their own work on automated reviewing.
