How it works
Think of a model reading a long document the way a person scans a very long email thread for one specific detail. They'll likely catch it near the top. By page seven of the thread, concentration drifts.
LLMs have a version of this. The Chroma research identified four specific failure modes:
1. Wording mismatch. When a user's question is phrased differently from the answer in the document — even if the meaning is identical — accuracy collapses with length. This is the everyday scenario. Your customers don't phrase queries the way your policy manual is written.
2. Relevant-but-wrong content. A single piece of related-but-incorrect information — a distractor, in research terms — degrades answers significantly. Four distractors compound the damage. This is worse than feeding the model completely unrelated filler.
3. Document structure. Across all 18 models tested, randomly shuffled text actually outperformed logically structured documents. The model's attention mechanism anchors on local patterns — nearby sentences — rather than the narrative as a whole.
4. Position bias. Content near the start of a document is processed more reliably than content near the end. A detail buried 40 pages into a contract is genuinely less likely to surface than the same detail on page one.
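A rough way to see failure mode 4 in your own stack is to plant one known fact at several depths in a long filler document and check whether the model surfaces it. A minimal sketch in Python, where `call_llm`, the planted fact and the filler text are all placeholders for your own setup:

```python
# Minimal probe for position bias: plant the same fact at several depths in a
# long filler document and check whether the model still surfaces it.
FACT = "The early-termination fee is 250 GBP."
QUESTION = "What is the early-termination fee?"
FILLER = "Routine clause text about definitions and obligations. " * 40

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual model API call."""
    raise NotImplementedError

def build_document(depth: float, n_sections: int = 50) -> str:
    """Insert FACT at a relative depth: 0.0 = start of document, 1.0 = end."""
    sections = [FILLER] * n_sections
    sections.insert(int(depth * n_sections), FACT)
    return "\n\n".join(sections)

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    answer = call_llm(build_document(depth) + "\n\nQuestion: " + QUESTION)
    print(f"depth={depth:.2f} found={'250' in answer}")
```

If accuracy falls as the depth increases, you are seeing position bias directly, not inferring it from a benchmark.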
The benchmark most commonly used to claim "long context works" — Needle-in-a-Haystack — only tests exact word matching. It doesn't reflect how users actually query AI systems, which is why models can pass it while still failing in production.
Where it matters for SMEs
Document Q&A and legal / compliance tools. If you're using AI to answer questions against policy documents, contracts, or procedure manuals, feeding the whole document in is not safe practice. A 200-page contract processed naively will produce confident-sounding answers that are quietly wrong in ways that don't trigger obvious errors. Structured retrieval — finding the relevant sections first, then asking the question — produces dramatically more reliable results.
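As a sketch of what structured retrieval looks like, with TF-IDF standing in for a production embedding model and an arbitrary chunk size:

```python
# Structured retrieval: find the relevant sections first, then ask the
# question against only those sections, not the whole 200-page contract.
# TF-IDF is a stand-in for a production embedding model; the chunk size
# is an illustrative choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, size: int = 800) -> list[str]:
    """Naive fixed-width chunking; real systems split on clauses or headings."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_sections(document: str, question: str, k: int = 5) -> list[str]:
    chunks = chunk(document)
    vec = TfidfVectorizer().fit(chunks + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(chunks))[0]
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:k]]

# Usage with your own inputs:
# context = "\n\n".join(top_sections(contract_text, user_question))
# prompt = f"Answer using only this context:\n{context}\n\nQuestion: {user_question}"
```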
Customer-facing chatbots with long conversation histories. A chatbot that retains the full text of every customer interaction will degrade as the conversation lengthens. The practical fix is periodic summarisation: every ten or so exchanges, replace the verbatim history with a compact summary, keeping the session coherent without ballooning the input.
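A minimal sketch of that summarisation loop, assuming a placeholder `llm` function and illustrative prompt wording:

```python
# Periodic summarisation: every ten or so exchanges, replace the verbatim
# history with a compact summary so the input never balloons.
SUMMARISE_EVERY = 10

def llm(prompt: str) -> str:
    """Placeholder: swap in your actual model call."""
    raise NotImplementedError

class Conversation:
    def __init__(self) -> None:
        self.summary = ""   # compact memory of older exchanges
        self.recent = []    # verbatim recent turns

    def add_exchange(self, user_msg: str, bot_msg: str) -> None:
        self.recent.append(f"User: {user_msg}\nBot: {bot_msg}")
        if len(self.recent) >= SUMMARISE_EVERY:
            history = (self.summary + "\n" + "\n".join(self.recent)).strip()
            self.summary = llm(
                "Summarise this support conversation, keeping names, "
                "order numbers and unresolved issues:\n" + history
            )
            self.recent = []  # verbatim turns are now folded into the summary

    def context(self) -> str:
        """What gets sent to the model alongside the next user message."""
        return (f"Summary of conversation so far: {self.summary}\n"
                + "\n".join(self.recent)).strip()
```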
Internal knowledge tools (Notion, SharePoint, CRM integrations). The temptation is to retrieve everything related to a topic and hand it to the model. A better pattern: retrieve broadly, rank by relevance, keep the top eight to ten results, and summarise or discard the rest. This is called rerank-and-prune — it's the architecture that separates production-grade AI tools from proof-of-concept demos.
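A sketch of rerank-and-prune, assuming the `sentence-transformers` library and one commonly used public cross-encoder; any relevance scorer would do:

```python
# Rerank-and-prune: retrieve broadly, rescore every candidate against the
# specific query, keep only the top handful. The cross-encoder model named
# here is one common public choice, not a requirement.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_prune(query: str, candidates: list[str], keep: int = 10) -> list[str]:
    # Score each (query, candidate) pair for relevance to this exact query.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:keep]]

# candidates = your_search_connector(query, limit=50)  # broad retrieval
# context = rerank_and_prune(query, candidates)        # narrow, focused input
```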
What to watch out for
"Large context window" is a marketing spec, not a quality guarantee. A one-million-token window means you can send that much text. It does not mean the model will reason over it uniformly. Treat the advertised window as a ceiling, not a safe working range.
The quiet failure mode. Context rot rarely produces obvious errors. The model doesn't say "I'm confused." It produces fluent, confident answers that are wrong. In a customer-support context or a compliance use case, that's the most dangerous kind of failure.
Model choice matters, but not in the way you'd expect. Claude models (Anthropic) showed the lowest hallucination rates in the Chroma research, largely because they abstain more often, saying "I don't know" rather than guessing. That's a feature in legal or regulated contexts; it can feel like a bug in high-volume customer chat where any response is expected.
Getting started
The single highest-value change most SMEs can make is to audit any existing AI tool that reads documents or conversation history and ask: is this tool retrieving everything and passing it in, or is it selecting and filtering first?
If the answer is "everything goes in," that's your first project. Work with whoever built the tool, or with an AI consultancy, to introduce a reranking step: retrieve a wide set of candidates, score them for relevance to the specific query, keep the best eight to ten, and discard the rest. Research consistently shows that this narrow, focused input outperforms broad stuffing, and it costs a fraction of the tokens.
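A first-pass audit can be as simple as instrumenting the existing call site. A sketch, assuming you can wrap the function your tool already uses and accepting a rough four-characters-per-token estimate:

```python
# First-pass audit: log how much text each AI call actually sends. If every
# request carries tens of thousands of tokens, the tool is stuffing context
# rather than selecting it. Wrap whatever function your tool calls today.
import functools
import logging

logging.basicConfig(level=logging.INFO)

def audited(model_call):
    @functools.wraps(model_call)
    def wrapper(prompt: str, *args, **kwargs):
        est_tokens = len(prompt) // 4  # rough rule: ~4 characters per token
        logging.info("LLM call with ~%s tokens of input", f"{est_tokens:,}")
        return model_call(prompt, *args, **kwargs)
    return wrapper

# ask = audited(ask)  # wrap your existing call site, then watch the logs
```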
The rule of thumb: a focused 300-word passage that directly addresses the question will outperform 100,000 words of loosely relevant material. Every time.
Sources:
- Chroma Research — Context Rot: How Increasing Input Tokens Impacts LLM Performance (Hong, Troynikov, Huber): trychroma.com/research/context-rot
- Yannic Kilcher — video analysis of the Chroma paper: youtube.com/watch?v=hpC4qjWu_aY
- LongMemEval benchmark — 113k-token conversational evaluation referenced in the Chroma paper