Content is user-generated and unverified.

LLM Reasoning Degrades with Context Length — Even with Perfect Retrieval

The Core Finding

Even when LLMs can perfectly retrieve all relevant information from their context window, reasoning performance still degrades substantially as input length increases (13.9%–85% in the Du et al. study listed first below). This is not a retrieval problem: input length by itself limits how well these models reason over what they have retrieved.

Key Research

1. "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval" — Du et al., October 2025 (EMNLP Findings / Amazon Science)

arxiv.org/abs/2510.05381

  • Tested 5 LLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.0, Llama 3.1, Mistral) on math, QA, and coding tasks
  • Performance degraded 13.9%–85% as input length grew, even within models' claimed context limits
  • Degradation occurred even when irrelevant tokens were replaced with whitespace (minimal distraction)
  • Degradation occurred even when irrelevant tokens were fully masked and the model could only attend to relevant tokens
  • Placing all relevant evidence immediately before the question still showed performance drops
  • Proposed mitigation: "retrieve then solve" — prompt the model to recite retrieved evidence first, converting it to a short-context task. Showed up to 31.2% improvement on math tasks and 4% on GPT-4o's already-strong baseline
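
The "retrieve then solve" idea can be sketched as two model calls. This is a minimal illustration, not the paper's implementation: the prompt wording is invented here, and `complete` is a hypothetical callable standing in for whatever LLM client is in use.

```python
from typing import Callable


def build_recite_prompt(context: str, question: str) -> str:
    """Stage 1 prompt: ask the model to quote the evidence, not to answer."""
    return (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Do not answer yet. Quote verbatim the passages from the text above "
        "that are needed to answer the question."
    )


def build_solve_prompt(evidence: str, question: str) -> str:
    """Stage 2 prompt: contains only the recited evidence, so it is short."""
    return (
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n"
        "Answer using only the evidence above."
    )


def retrieve_then_solve(
    context: str, question: str, complete: Callable[[str], str]
) -> str:
    # `complete` is a hypothetical prompt -> text function (an assumption).
    evidence = complete(build_recite_prompt(context, question))
    # The long context is dropped here; only the recited evidence is passed on.
    return complete(build_solve_prompt(evidence, question))
```

The design point is that the second call never sees the long input, so the reasoning step runs as a short-context task.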

2. "Context Rot" — Chroma Research, July 2025

research.trychroma.com/context-rot

  • Evaluated 18 state-of-the-art LLMs (GPT-4.1, Claude 4, Gemini 2.5, Qwen3)
  • Model reliability decreased significantly with longer inputs, even on simple tasks like retrieval and text replication
  • Adding full conversation history (~113k tokens) dropped accuracy by 30% compared to a focused 300-token version of the same content
  • Models lost 20–50% accuracy going from 10k to 100k+ tokens on needle-in-a-haystack (NIAH) tasks
  • No single model ranked first across all experiments — performance was highly task-dependent
  • Claude models decayed the slowest overall but tended to refuse rather than hallucinate on long tasks
  • GPT models were more erratic with confident but incorrect answers
  • Lower semantic similarity between the question and the text containing the answer (so that matching requires inference rather than word overlap) caused faster degradation

3. Adobe Research — Multi-hop Reasoning, February 2025

  • Showed that tasks requiring two reasoning "hops" degraded much more severely than simple retrieval as context grew
  • LLMs exhibit more severe performance degradation on more complex tasks, not just longer ones

4. "Lost in the Middle" — Liu et al., 2023

arxiv.org/abs/2307.03172

  • Performance degrades significantly when relevant information is in the middle of long contexts
  • Models perform best when relevant info is at the beginning or end of input
  • Foundational paper that identified the positional bias problem
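
One common heuristic motivated by this finding (not something the paper itself prescribes) is to reorder retrieved passages so the strongest ones sit at the edges of the prompt. A minimal sketch, assuming the input list is already sorted from most to least relevant; `reorder_for_edges` is a name made up for this example.

```python
def reorder_for_edges(ranked_docs):
    """Place top-ranked items at the start and end, weakest in the middle.

    Assumes `ranked_docs` is sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        # Even ranks fill the front; odd ranks fill the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


print(reorder_for_edges(["a", "b", "c", "d", "e"]))
# → ['a', 'c', 'e', 'd', 'b']
```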

5. "The Limits of Long-Context Reasoning in Automated Bug Fixing" — February 2026

arxiv.org/html/2602.16069

  • Even with perfect retrieval recall (relevant files injected directly into context), performance degraded sharply
  • Qwen3-Coder achieved only 7% resolve rate at 64k context
  • GPT-5-nano solved zero tasks
  • Common failures: hallucinated diffs, incorrect file targets, malformed patch headers
  • Concluded that single-shot prompting under long contexts exceeds the effective reasoning capacity of current LLMs
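
Two of the failure modes listed above, incorrect file targets and malformed patch headers, can be caught mechanically before a patch is applied. The sketch below is an illustration only and is not taken from the paper; `check_patch` and `known_files` are hypothetical names, and the check covers only unified-diff file and hunk headers.

```python
import re

# Unified-diff hunk header, e.g. "@@ -12,7 +12,8 @@"
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")


def check_patch(patch_text, known_files):
    """Return a list of problems found in a model-generated unified diff."""
    problems = []
    for line_no, line in enumerate(patch_text.splitlines(), start=1):
        if line.startswith(("--- ", "+++ ")):
            # Simplification: a removed content line that itself begins with
            # "-- " also looks like this and would be reported as a target.
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":
                continue
            if path.startswith(("a/", "b/")):
                path = path[2:]
            if path not in known_files:
                problems.append(f"line {line_no}: unknown file target {path!r}")
        elif line.startswith("@@") and not HUNK_RE.match(line):
            problems.append(f"line {line_no}: malformed hunk header")
    return problems
```

An empty result means only that the headers look well-formed; it says nothing about whether the diff content is correct.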

What About Opus 4.6's 1M Context Window?

  • Opus 4.6 scored 76% on MRCR v2 (long-context retrieval benchmark) vs. Sonnet 4.5's 18.5% — a massive improvement
  • However, MRCR v2 is a retrieval benchmark (finding needles), not a reasoning-over-context benchmark
  • The retrieval vs. reasoning distinction is critical: even when models can perfectly retrieve evidence, the volume of surrounding context degrades their ability to reason over it
  • Prefill latency at 1M tokens can exceed two minutes before the first output token
  • Premium pricing applies beyond 200k input tokens: $10/$37.50 per million tokens (input/output) vs. the standard $5/$25
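
To make the pricing difference concrete, here is a rough cost estimate using the rates quoted above. Two assumptions are made: each price pair is read as input/output, and the premium rate is applied to the whole request once the input exceeds 200k tokens. Check the provider's current pricing page before relying on either.

```python
STANDARD = (5.00, 25.00)    # USD per million tokens (input, output), from the text
PREMIUM = (10.00, 37.50)    # USD per million tokens beyond the threshold
THRESHOLD = 200_000         # input tokens


def estimate_cost_usd(input_tokens, output_tokens):
    # Assumption: the premium rate covers the entire request, not just the
    # tokens past the threshold.
    in_rate, out_rate = PREMIUM if input_tokens > THRESHOLD else STANDARD
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


print(estimate_cost_usd(100_000, 1_000))    # 100k-token prompt, 1k-token answer
print(estimate_cost_usd(1_000_000, 1_000))  # 1M-token prompt, 1k-token answer
```

Under these assumptions the 100k-token request costs about $0.53 and the 1M-token request about $10.04, roughly a 19x difference for the same answer length.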

Bottom Line

  • Retrieval accuracy has genuinely improved in newer models
  • Reasoning accuracy still degrades with context length — this appears fundamental to transformer attention mechanisms
  • Effective context engineering (curating what goes into the window) outperforms brute-force context stuffing
  • RAG with focused context windows should therefore outperform "dump everything into 1M tokens" for planning, design, and reasoning tasks
  • The 1M window has legitimate use cases for search/navigation over large corpora, but not for sustained reasoning
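
As a sketch of what curating the window can look like in practice, the function below greedily keeps the highest-scoring chunks that fit a token budget and returns them in their original order. The relevance scores, the budget, and the whitespace-based token count are all assumptions for illustration; a real pipeline would use its own retriever scores and the model's tokenizer.

```python
def select_context(chunks, budget_tokens, count_tokens=lambda s: len(s.split())):
    """Pick chunks for the prompt under a token budget.

    chunks: list of (score, text) pairs, where a higher score means more relevant.
    count_tokens: crude whitespace counter by default (an assumption).
    """
    # Visit chunks from most to least relevant, remembering original positions.
    ranked = sorted(enumerate(chunks), key=lambda item: item[1][0], reverse=True)
    chosen, used = [], 0
    for idx, (_score, text) in ranked:
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(idx)
            used += cost
    # Restore document order so the selected text still reads coherently.
    return [chunks[i][1] for i in sorted(chosen)]


chunks = [(0.1, "a b c"), (0.9, "d e"), (0.5, "f g h i"), (0.8, "j")]
print(select_context(chunks, budget_tokens=4))
# → ['d e', 'j']
```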