Content is user-generated and unverified.

LLM Reasoning Degrades with Context Length — Even with Perfect Retrieval

The Core Finding

Even when LLMs can perfectly retrieve all relevant information from their context window, reasoning performance still degrades substantially (by 13.9% to 85%) as input length increases. Retrieval failure does not explain the drop; the studies below point to a limitation in how these models process long inputs.

Key Research

1. "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval" — Du et al., October 2025 (EMNLP Findings / Amazon Science)

arxiv.org/abs/2510.05381

Tested 5 LLMs (GPT-4o, Claude 3.7 Sonnet, Gemini 2.0, Llama 3.1, Mistral) on math, QA, and coding tasks
Performance degraded 13.9%–85% as input length grew, even within models' claimed context limits
Degradation occurred even when irrelevant tokens were replaced with whitespace (minimal distraction)
Degradation occurred even when irrelevant tokens were fully masked and the model could only attend to relevant tokens
Placing all relevant evidence immediately before the question still showed performance drops
Proposed mitigation, "retrieve then solve": prompt the model to recite the retrieved evidence first, which converts the problem into a short-context task. Reported gains of up to 31.2% on math tasks and up to 4% for GPT-4o, whose baseline was already strong
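
The "retrieve then solve" idea can be sketched as two model calls. This is a minimal illustration, not the paper's code: the prompt wording, the two-call split, and the names `build_recite_prompt`, `build_solve_prompt`, `retrieve_then_solve`, and the `llm` callable are all assumptions made for the example.

```python
from typing import Callable


def build_recite_prompt(long_context: str, question: str) -> str:
    """Stage 1 prompt: ask the model to quote evidence, not to answer."""
    return (
        "Read the document below and copy, word for word, only the passages "
        "needed to answer the question. Do not answer the question yet.\n\n"
        f"Document:\n{long_context}\n\n"
        f"Question: {question}\n\n"
        "Relevant passages:"
    )


def build_solve_prompt(evidence: str, question: str) -> str:
    """Stage 2 prompt: short context holding only the recited evidence."""
    return (
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\n\n"
        "Answer step by step:"
    )


def retrieve_then_solve(
    long_context: str, question: str, llm: Callable[[str], str]
) -> str:
    """Recite evidence from the long input, then reason over the short recital.

    `llm` is a hypothetical stand-in for any function that maps a prompt
    string to a completion string; no particular vendor API is assumed.
    """
    evidence = llm(build_recite_prompt(long_context, question))
    return llm(build_solve_prompt(evidence, question))
```

The second call never sees the original long input, so whatever reasoning happens there runs over a context only as long as the recited evidence.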

2. "Context Rot" — Chroma Research, July 2025

research.trychroma.com/context-rot

Evaluated 18 state-of-the-art LLMs (including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3)
Model reliability decreased significantly with longer inputs, even on simple tasks like retrieval and text replication
Adding full conversation history (~113k tokens) dropped accuracy by 30% compared to a focused 300-token version of the same content
Models lost 20–50% accuracy going from 10k to 100k+ tokens on needle-in-a-haystack (NIAH) tasks
No single model ranked first across all experiments — performance was highly task-dependent
Claude models decayed the slowest overall but tended to refuse rather than hallucinate on long tasks
GPT models were more erratic with confident but incorrect answers
Lower semantic similarity between the question and the passage containing the answer (i.e., matching requires inference rather than lexical overlap) caused faster degradation

3. Adobe Research — Multi-hop Reasoning, February 2025

Showed that tasks requiring two reasoning "hops" degraded much more severely than simple retrieval as context grew
LLMs exhibit more severe performance degradation on more complex tasks, not just longer ones

4. "Lost in the Middle" — Liu et al., 2023

arxiv.org/abs/2307.03172

Performance degrades significantly when relevant information is in the middle of long contexts
Models perform best when relevant info is at the beginning or end of input
Foundational paper that identified the positional bias problem
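
One practical reading of this positional bias is to order retrieved passages so the strongest ones sit at the edges of the prompt. The sketch below is an illustration, not something the paper prescribes; the function name `reorder_for_edges` and the alternating scheme are assumptions.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_edges(docs_by_relevance: List[T]) -> List[T]:
    """Place the most relevant items at the start and end of the list.

    Input must be sorted from most to least relevant. Items are dealt
    alternately to the front and the back, so the least relevant ones
    end up in the middle, where the paper reports the weakest recall.
    """
    front: List[T] = []
    back: List[T] = []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best item lands in last position.
    return front + back[::-1]


ranked = ["a", "b", "c", "d", "e"]  # "a" is the most relevant
print(reorder_for_edges(ranked))  # ['a', 'c', 'e', 'd', 'b']
```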

5. "The Limits of Long-Context Reasoning in Automated Bug Fixing" — February 2026

arxiv.org/html/2602.16069

Even with perfect retrieval recall (relevant files injected directly into context), performance degraded sharply
Qwen3-Coder achieved only 7% resolve rate at 64k context
GPT-5-nano solved zero tasks
Common failures: hallucinated diffs, incorrect file targets, malformed patch headers
Concluded that single-shot prompting under long contexts exceeds the effective reasoning capacity of current LLMs

What About Opus 4.6's 1M Context Window?

Opus 4.6 scored 76% on MRCR v2 (long-context retrieval benchmark) vs. Sonnet 4.5's 18.5% — a massive improvement
However, MRCR v2 is a retrieval benchmark (finding needles), not a reasoning-over-context benchmark
The retrieval vs. reasoning distinction is critical: even when models can perfectly retrieve evidence, the volume of surrounding context degrades their ability to reason over it
Prefill latency at 1M tokens can exceed two minutes before first output token
Premium pricing applies beyond 200k tokens ($10/$37.50 per MTok for input/output vs. the standard $5/$25)
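
To put the pricing in concrete terms, here is a rough cost calculation using the rates quoted above. It assumes the premium rate applies to the whole request once the input exceeds 200k tokens, and it reads the two prices as input and output rates; both are assumptions to check against the current pricing page. The function name `request_cost_usd` is made up for this example.

```python
def request_cost_usd(
    input_tokens: int, output_tokens: int, threshold: int = 200_000
) -> float:
    """Estimate the cost of one request in US dollars.

    Assumption: once input_tokens exceeds the threshold, every token in
    the request is billed at the premium rate, not only the excess.
    """
    premium = input_tokens > threshold
    # Rates are dollars per million tokens, taken from the notes above.
    input_rate, output_rate = (10.00, 37.50) if premium else (5.00, 25.00)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# A focused 150k-token prompt versus a 900k-token prompt, each with 2k output.
print(request_cost_usd(150_000, 2_000))  # 0.8
print(request_cost_usd(900_000, 2_000))  # 9.075
```

Under these assumptions the 900k-token request costs roughly eleven times as much as the focused one, before accounting for the latency difference.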

Bottom Line

Retrieval accuracy has genuinely improved in newer models
Reasoning accuracy still degrades with context length — this appears fundamental to transformer attention mechanisms
Effective context engineering (curating what goes into the window) outperforms brute-force context stuffing
RAG with focused context windows is likely to outperform "dump everything into 1M tokens" for planning, design, and reasoning tasks
The 1M window has legitimate use cases for search/navigation over large corpora, but not for sustained reasoning
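
As a toy illustration of curating the window instead of filling it, the sketch below keeps only the chunks that share terms with the query, within a fixed budget. Everything here is an assumption made for the example: the name `select_context`, the word-overlap score (a real system would more likely use embeddings or a reranker), and whitespace word counts standing in for tokenizer counts.

```python
from typing import List


def select_context(chunks: List[str], query: str, token_budget: int) -> List[str]:
    """Pick the chunks most related to the query without exceeding a budget.

    Relevance is the number of distinct query words that appear in a chunk.
    Token cost is approximated by counting whitespace-separated words.
    """
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_terms & set(chunk.lower().split()))

    selected: List[str] = []
    used = 0
    # sorted() is stable, so equally scored chunks keep their original order.
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = len(chunk.split())
        if score(chunk) == 0 or used + cost > token_budget:
            continue  # skip unrelated chunks and chunks that do not fit
        selected.append(chunk)
        used += cost
    return selected


chunks = [
    "the cache uses LRU eviction",
    "lunch menu for friday",
    "eviction runs when the cache is full",
]
print(select_context(chunks, "how does cache eviction work", token_budget=12))
```

With a budget of 12 words, the two cache-related chunks are kept and the unrelated one is dropped; lowering the budget to 8 keeps only the first.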
