Deep Analysis: Unjournal Evaluator Suggestions vs. Paper Revisions
Paper: Karger et al. "Forecasting Existential Risks: Evidence from a Long-Run Forecasting Tournament"
- Working Paper: August 2023
- Published Version: International Journal of Forecasting 41 (2025) 499–516
- Unjournal Evaluation: August 2024
Summary of Evaluator Suggestions (from Evaluation Summary)
The Unjournal evaluation identified the following areas for improvement:
A. Data Sharing
- "Data sharing was suggested"
- "Is the data shared in a clear and useful way? How could it be made more useful?"
B. Design and Implementation
- Questions framing - "Were the prediction questions well-framed?"
- Definition of expertise - "Was the choice of 'experts' reasonable?"
- Lack of training - noted as a criticism
- Rare events description - "What specifically are the most appropriate methods for eliciting these rare event forecasts?"
- Anchoring behavior - "Could there be substantial anchoring behavior as a result of their displaying the 'Prior Forecasts'?"
- Delphi process considerations (implied in "Issues meriting further evaluation")
C. Statistical/Quantitative Analysis
- "CI not capturing the uncertainty properly"
- "Dependence measures not estimated"
- "Statistical inference not performed"
- "Aggregation methods not appropriate"
D. Attrition Bias
- "Did they adequately consider the potential for attrition bias?"
Detailed Comparison by Suggestion
1. DATA SHARING / REPLICATION PACKAGE
2023 Working Paper:
- No formal data availability statement
- No replication package mentioned
- Some scattered references to "data available here" in question descriptions
2025 Published Paper:
- Formal data availability statement added
- Links to GitHub repositories (xpt-lib and the xpt-ijf-replication package) included
VERDICT: ✅ ADDRESSED
Probability that change was caused by Unjournal feedback: 25-35%
Rationale:
- The International Journal of Forecasting likely requires data availability statements as standard practice
- However, Unjournal feedback was explicit on this point and could have reinforced attention to data sharing
- The xpt-lib repository was created before the evaluation (mentioned as "previously-published"), but the xpt-ijf-replication package may have been enhanced post-evaluation
- Most likely this reflects journal requirements + general good practice, with possible reinforcement from Unjournal feedback
2. DELPHI PROCESS DISCUSSION
2023 Working Paper:
- ZERO mentions of "Delphi" (confirmed via text search)
- No comparison to Delphi methodology
- No discussion of anonymity trade-offs
2025 Published Paper:
- Substantial new discussion of Delphi methods
Methods section quote:
"the multi-stage design of this process was heavily inspired by Delphi processes (Rowe & Wright, 2001) but deviated from traditional Delphi methods in significant ways. In particular, forecasters were not necessarily anonymous beyond Stage 1, and forecasters were given access to each other's forecasts and rationales at earlier stages than in many Delphi processes."
Discussion section quotes:
"One question is whether the XPT would have yielded more evidence of opinion convergence if it had been designed more like a Delphi process, with anonymity (e.g., randomly generated usernames, norms against participants naming themselves in discussions) and more tightly controlled feedback (for a review of Delphi approaches, see Rowe and Wright (1999))."
"The XPT already bears some family resemblance to Delphi, given the independent forecasts in Stage 1 that participants brought to the group discussions. The Delphi family of methods is designed to encourage participants to focus on the substantive merits of arguments for and against points of view, regardless of the status of the person advancing the arguments."
"But that judgment call may have been a mistake. People find it easier to change their minds when they are not anchored down by past public commitments to positions (Festinger, 1957), and they also find it easier to change their minds if they can do so in privacy, reducing the risk of appearing weak or confused. The next operationalization of the XPT might do well to encourage or require the expression of anonymous judgments throughout all four stages of deliberation."
VERDICT: ✅ SUBSTANTIALLY ADDRESSED
Probability that change was caused by Unjournal feedback: 30-45%
Rationale:
- The International Journal of Forecasting readership would expect Delphi discussion - this is a core methodology in the forecasting literature
- The Unjournal evaluation explicitly flagged Delphi considerations
- The addition is substantial and directly addresses evaluator concerns about design trade-offs
- However, IJF peer reviewers would likely have requested this comparison independently
- The timeline suggests the IJF submission was concurrent with, or shortly after, the Unjournal evaluation's publication
3. ANCHORING BEHAVIOR FROM DISPLAYING "PRIOR FORECASTS"
2023 Working Paper:
- 28 mentions of "anchor" - BUT almost all refer to "biological anchors" (a technical forecasting methodology reference, citing Ajeya Cotra's work)
- One mention of anchoring methods: "How much lower or higher is your extinction risk estimate than an anchor or comparison value?"
- No discussion of PSYCHOLOGICAL anchoring from displaying prior forecasts
2025 Published Paper:
- 1 mention of "anchor" - specifically addressing psychological anchoring to public positions
- Direct quote: "People find it easier to change their minds when they are not anchored down by past public commitments to positions (Festinger, 1957)"
- Explicitly acknowledges this as a potential design limitation
VERDICT: ✅ PARTIALLY ADDRESSED
Probability that change was caused by Unjournal feedback: 35-50%
Rationale:
- The evaluators explicitly raised this concern: "Could there be substantial anchoring behavior as a result of their displaying the 'Prior Forecasts'?"
- The 2025 paper now directly acknowledges psychological anchoring as a design limitation
- This appears to be a direct response to critiques (though could also be from IJF reviewers)
- The specific focus on anchoring to "past public commitments" aligns precisely with the evaluator concern
4. ATTRITION BIAS
2023 Working Paper:
- Mentions 34% attrition rate explicitly
- Substantial discussion of recruitment challenges and attrition as "down-to-earth problems"
- Quote: "The project was time-consuming, and our attrition rate was roughly 34% from initial forecasts to completion of the tournament four months later."
- Discusses reasons for attrition and plans for future improvements
2025 Published Paper:
- Brief footnote: "Of the 111 forecasters who completed all four stages of the tournament, 72 were superforecasters and 39 were experts. Although 111 completed all stages of the tournament, we report data from forecasters who attrited from the tournament in relevant analyses below."
- Less detailed discussion than 2023 version
- No formal attrition bias analysis added
VERDICT: ❌ NOT ADDRESSED (actually less content)
Probability that change was caused by Unjournal feedback: 0%
Rationale:
- The 2025 paper has LESS discussion of attrition, not more
- No formal attrition bias analysis was added as evaluators suggested
- The streamlining likely reflects journal word limits, not evaluation feedback
- This suggestion was clearly NOT incorporated
5. STATISTICAL INFERENCE / DEPENDENCE MEASURES / AGGREGATION METHODS
2023 Working Paper:
- 44 mentions of statistical terms (regression, correlation, bootstrap, confidence interval, etc.)
- Bootstrapped confidence intervals presented for medians
- Some correlation discussion (with caveats about outliers)
2025 Published Paper:
- 9 mentions of statistical terms (substantially fewer)
- Same bootstrapped CI approach maintained
- Quote: "We also present bootstrapped confidence intervals for each median."
- No new statistical inference added
- No dependence measures added
- No changes to aggregation methods
VERDICT: ❌ NOT ADDRESSED
Probability that change was caused by Unjournal feedback: 0%
Rationale:
- The statistical methodology is essentially unchanged
- The 2025 paper is actually MORE condensed statistically (fewer details)
- Evaluator criticisms about inadequate statistical inference were not incorporated
- No dependence measures were added
- Aggregation methods remain the same (median with bootstrapped CIs)
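For concreteness, the aggregation approach both versions of the paper retain (the median forecast with a percentile-bootstrap confidence interval) can be sketched as below. This is a minimal illustration, not the authors' code: the function name and the forecast values are hypothetical, and the paper does not specify its resample count or quantile convention.

```python
import numpy as np

def bootstrap_median_ci(data, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median: resample with
    replacement, take the median of each resample, and read off
    the alpha/2 and 1 - alpha/2 quantiles of those medians."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    boot_medians = np.median(
        rng.choice(data, size=(n_boot, data.size), replace=True),
        axis=1,
    )
    lo, hi = np.quantile(boot_medians, [alpha / 2, 1 - alpha / 2])
    return float(np.median(data)), (float(lo), float(hi))

# Hypothetical low-probability forecasts from seven forecasters
forecasts = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1]
point, (lo, hi) = bootstrap_median_ci(forecasts)
```

Note that this resampling treats forecasters as independent draws, which is exactly the assumption the evaluators questioned when they noted that dependence measures were not estimated.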
6. QUESTIONS FRAMING
2023 Working Paper:
- 17 mentions of framing issues
- Extensive appendices with detailed question wordings
- Some discussion of how questions were framed
2025 Published Paper:
- 0 mentions of question framing methodology
- Streamlined presentation
- No additional discussion of framing limitations
VERDICT: ❌ NOT ADDRESSED
Probability that change was caused by Unjournal feedback: 0%
Rationale:
- No additional discussion of framing limitations was added
- Content was actually reduced for journal format
7. DEFINITION OF EXPERTISE / EXPERT SELECTION
2023 Working Paper:
- Describes expert recruitment process
- Notes limitations of sample representativeness
2025 Published Paper:
- Same description of recruitment
- Quote: "We received more than 500 expressions of interest, screened these respondents for expertise, and offered slots to the best qualified after a review of their backgrounds."
- Footnote: "Two independent analysts categorized applicants based on publication records and work history. When the analysts disagreed, a third independent rater resolved disagreement after a group discussion."
VERDICT: ⚠️ NO CHANGE
Probability that change was caused by Unjournal feedback: 0%
Rationale:
- Expert selection methodology is presented identically
- No additional justification or critique incorporated
8. RARE EVENTS ELICITATION METHODS
2023 Working Paper:
- 18 mentions of rare events/low probability concepts
- Explicit "Next Steps" section: "Identify and validate better methods of eliciting low-probability forecasts"
- Discussion of alternative elicitation methods tested (1-in-X format)
- Quote: "Researchers have long known about the instability of tiny probability estimates"
2025 Published Paper:
- 0 direct mentions of rare events elicitation methodology
- Only brief acknowledgment in abstract: "we acknowledge limits on what even skilled forecasters can achieve in anticipating rare or unprecedented events"
VERDICT: ❌ NOT ADDRESSED (less content)
Probability that change was caused by Unjournal feedback: 0%
Rationale:
- The 2025 paper has LESS methodological discussion of rare events elicitation
- The "Next Steps" section with research agenda was removed
- This reflects journal condensation, not incorporation of evaluation feedback
SUMMARY TABLE
| Evaluator Suggestion | Addressed? | Probability Causal |
|---|---|---|
| Data sharing/replication | ✅ Yes | 25-35% |
| Delphi process discussion | ✅ Yes | 30-45% |
| Anchoring behavior | ✅ Partially | 35-50% |
| Attrition bias | ❌ No | 0% |
| Statistical inference | ❌ No | 0% |
| Dependence measures | ❌ No | 0% |
| Aggregation methods | ❌ No | 0% |
| Questions framing | ❌ No | 0% |
| Definition of expertise | ❌ No | 0% |
| Rare events elicitation | ❌ No | 0% |
OVERALL ASSESSMENT
What Changed:
- Data availability/replication package - Formal statement and GitHub links added
- Delphi discussion - Substantial new content positioning XPT relative to Delphi methodology
- Anchoring acknowledgment - New discussion of psychological anchoring to public positions
- Academic structure - New Discussion section with methodological reflections
- Author list - Reduced from 11 to 5 authors
What Did NOT Change (Despite Evaluator Suggestions):
- Statistical methodology (no new inference, dependence measures, or aggregation methods)
- Attrition bias analysis (actually less discussion)
- Question framing justification
- Expert selection methodology
- Rare events elicitation discussion (actually less)
Overall Probability Assessment
Probability that Unjournal evaluation materially influenced ANY revision: 20-35%
Probability that Unjournal evaluation was the PRIMARY cause of revisions: 10-20%
Reasoning:
- Timeline Constraints: The Unjournal evaluation was published in August 2024. The paper was published in IJF 2025 (accepted ~November 2024). This is an extremely tight window for evaluation → revision → resubmission → acceptance.
- Parallel Processes: The paper was almost certainly already in IJF peer review when the Unjournal evaluation was published. IJF reviewers likely made similar suggestions independently.
- Standard Journal Requirements: Data availability statements are standard for IJF. Delphi discussion would be expected by forecasting journal readership.
- Selective Incorporation: Only 3 of 10 specific suggestions show any evidence of being addressed. The most technical statistical suggestions (which would require substantial new analysis) were not incorporated.
- Author Non-Response: The Unjournal summary explicitly states: "The authors have not provided a written response to these evaluations."
Most Likely Scenario:
The authors were preparing the paper for academic journal submission when the Unjournal evaluation was published. IJF peer reviewers independently requested similar changes (Delphi discussion, data availability). The Unjournal feedback may have:
- Reinforced the importance of certain points
- Provided additional motivation for addressing specific issues
- Influenced wording or framing of new sections
However, the IJF peer review process was almost certainly the primary driver of substantive revisions. The tight timeline and the lack of a formal author response to Unjournal strongly suggest parallel rather than causal processes.