Implementation considerations
RAG pipeline architecture
The canonical failure mode isn't using RAG — it's stopping at retrieval without reranking. The typical stack:
embed query → ANN search (FAISS / pgvector / Pinecone) → top-50 candidates
→ cross-encoder rerank (Cohere Rerank v3 / BGE-Reranker-v2) → top 8
→ LLM answer step
Skipping the reranker is where the majority of production deployments silently degrade. ANN retrieval operates on embedding cosine similarity, which is a coarse semantic signal. Cross-encoders see the full (query, passage) pair and produce significantly more discriminative relevance scores — but at O(n) inference cost over the candidate set. Cohere Rerank v3 is the current practitioner standard for hosted deployments; BGE-Reranker-v2-M3 for self-hosted.
Top-k selection after rerank is a judgement call, but 5–10 chunks is the validated sweet spot for most SME document Q&A workloads. Going beyond 15 is almost always counterproductive given the distractor-density findings.
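A minimal self-hosted sketch of this flow, assuming sentence-transformers for both stages (the hosted path swaps the CrossEncoder call for Cohere Rerank v3). Model names and k values are illustrative, not prescriptive:

```python
# Rerank-and-prune sketch: flat FAISS ANN retrieval -> cross-encoder rerank -> top-8.
import faiss
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # bi-encoder for ANN recall
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")         # cross-encoder for precision

def build_index(chunks: list[str]) -> faiss.Index:
    """Embed chunks and build a flat inner-product index (cosine, as vectors are normalised)."""
    embs = embedder.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(np.asarray(embs, dtype="float32"))
    return index

def retrieve(query: str, chunks: list[str], index: faiss.Index,
             ann_k: int = 50, final_k: int = 8) -> list[str]:
    """Top-50 ANN candidates, reranked by the cross-encoder, pruned to the top 8."""
    q = embedder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), ann_k)
    candidates = [chunks[i] for i in ids[0] if i != -1]
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:final_k]]
```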
Hierarchical summarisation
For corpora exceeding ~50k tokens (long contracts, matter files, multi-threaded email chains), a two-stage architecture performs better:
- Split into logical sections.
- Summarise each section with a cheap model (Haiku 3.5, GPT-4o-mini). Target ~500 tokens per summary.
- Answer over summaries.
- If confidence is low or the answer requires verbatim evidence, drill into the specific source sections.
This trades latency and cost for accuracy on very long documents, and scales beyond what any single-pass context window can handle reliably regardless of the stated limit.
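A minimal sketch of the two-stage flow. `cheap_llm` and `answer_llm` are hypothetical prompt-to-text callables standing in for whatever cheap/strong model pair you run, and the DRILL convention for low-confidence escalation is illustrative:

```python
# Two-stage hierarchical flow: summarise sections cheaply, answer over the summaries,
# drill into a source section only when verbatim evidence is needed.
from dataclasses import dataclass

@dataclass
class Section:
    section_id: str
    text: str
    summary: str = ""

def summarise_sections(sections: list[Section], cheap_llm, target_tokens: int = 500) -> list[Section]:
    for s in sections:
        s.summary = cheap_llm(
            f"Summarise the following section in at most ~{target_tokens} tokens, keeping "
            f"figures, dates, party names and obligations verbatim:\n\n{s.text}"
        )
    return sections

def answer(query: str, sections: list[Section], cheap_llm, answer_llm) -> str:
    sections = summarise_sections(sections, cheap_llm)
    summary_context = "\n\n".join(f"[{s.section_id}] {s.summary}" for s in sections)
    draft = answer_llm(
        "Answer from the section summaries below. If the answer requires verbatim wording "
        "you cannot see in the summaries, reply exactly DRILL:<section_id>.\n\n"
        f"Summaries:\n{summary_context}\n\nQuestion: {query}"
    )
    if draft.startswith("DRILL:"):          # low confidence or verbatim evidence needed
        wanted = draft.removeprefix("DRILL:").strip()
        source = next((s.text for s in sections if s.section_id == wanted), None)
        if source:
            draft = answer_llm(f"Answer from this source section only:\n\n{source}\n\nQuestion: {query}")
    return draft
```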
Conversation compression
Agent and chatbot deployments are where context rot is most insidious — history grows turn by turn with no natural pruning event. Practical threshold: every 10 turns or ~4k tokens of history, replace verbatim turns with a structured summary (topic, decisions made, outstanding actions). Keep the summary structured rather than prose — structured text is more reliably retrievable under the lexical-alignment findings above.
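A sketch of the compression step under those thresholds. `summarise_llm` is a hypothetical callable, and the token count is a crude word-count proxy; substitute a real tokenizer in production:

```python
# Turn-based history compression: once verbatim history crosses a turn or token threshold,
# older turns collapse into a structured summary block and only recent turns stay verbatim.
MAX_TURNS = 10
MAX_HISTORY_TOKENS = 4_000

def approx_tokens(text: str) -> int:
    return int(len(text.split()) * 1.3)     # rough heuristic, not a real tokenizer

def compress_history(history: list[dict], summarise_llm) -> list[dict]:
    """history is a list of {"role": ..., "content": ...} turns, oldest first."""
    total = sum(approx_tokens(turn["content"]) for turn in history)
    if len(history) <= MAX_TURNS and total <= MAX_HISTORY_TOKENS:
        return history
    old, recent = history[:-4], history[-4:]            # keep the last few turns verbatim
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in old)
    summary = summarise_llm(
        "Compress this conversation into a structured note with three headed fields: "
        "TOPIC, DECISIONS MADE, OUTSTANDING ACTIONS. Be terse and factual.\n\n" + transcript
    )
    return [{"role": "system", "content": f"Conversation summary so far:\n{summary}"}] + recent
```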
Distractor scrubbing (two-pass)
For high-stakes retrieval (legal, financial, regulated), add a pre-answer passage-selection step:
- Ask the model to return relevant passage IDs only — no answer.
- Run the actual answer step over only the selected passages.
This costs one extra LLM call but dramatically reduces the hallucination surface by making the distractor problem explicit rather than hoping attention handles it.
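A sketch of the two passes, assuming a hypothetical `llm` prompt-to-text callable and passages keyed by ID:

```python
# Two-pass selection: pass 1 returns relevant passage IDs only, pass 2 answers over
# just those passages.
import re

def two_pass_answer(query: str, passages: dict[str, str], llm) -> str:
    numbered = "\n\n".join(f"[{pid}] {text}" for pid, text in passages.items())
    # Pass 1: selection only. No answer allowed, which makes the distractor problem explicit.
    selection = llm(
        "List the IDs of the passages needed to answer the question, comma-separated. "
        "Return IDs only, no answer text, or NONE if nothing is relevant.\n\n"
        f"{numbered}\n\nQuestion: {query}"
    )
    keep = [pid for pid in re.findall(r"[\w-]+", selection) if pid in passages]
    if not keep:
        return "No supporting passages found."
    # Pass 2: answer over the selected passages only.
    selected = "\n\n".join(f"[{pid}] {passages[pid]}" for pid in keep)
    return llm(f"Answer strictly from these passages:\n\n{selected}\n\nQuestion: {query}")
```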
Trade-offs
| Approach | Token cost | Accuracy | Latency | Maintenance |
|---|---|---|---|---|
| Full-context stuffing | Cheapest | Worst at scale | Lowest | Minimal |
| Naive RAG (top-50, no rerank) | Medium | Mediocre | Medium | Low |
| Rerank-and-prune | Higher | Best for most workloads | +200–400ms | Medium |
| Hierarchical summary | Highest | Excellent on long docs | Highest | High |
The "1M-token context → no RAG needed" argument is seductive but wrong. Gemini 2.5 Pro's million-token window is useful for first-pass discovery retrieval — identifying which sections of a document are relevant. It is not a substitute for reranking before the answer step. Treat large-context models as a chunking-and-retrieval pre-pass, not as an accuracy guarantee.
Fine-tuning is not a solution here. Context rot is an attention mechanism property, not a knowledge deficit. Fine-tuning on domain content does not alter how the model attends to long inputs.
Failure modes and mitigations
Silent accuracy loss. The most dangerous property of context rot is that it's not obviously broken — the model returns a confident-sounding answer that happens to be wrong. Standard operational monitoring (error rate, latency) won't catch this. Mitigation: run a LongMemEval-lite evaluation harness monthly against a representative sample of production queries and a known-answer corpus. LongMemEval is the de facto NIAH replacement for serious evaluation.
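A minimal known-answer harness along those lines (not the LongMemEval code itself). Here `pipeline` is the production answer function, `grade_llm` a hypothetical judge callable, and the evaluation set a JSONL file of query/expected pairs:

```python
# Known-answer evaluation harness: run the production pipeline over a fixed eval set and
# grade each answer against the expected facts with a judge model.
import json

def run_eval(eval_path: str, pipeline, grade_llm) -> float:
    with open(eval_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    correct = 0
    for row in rows:
        answer = pipeline(row["query"])
        verdict = grade_llm(
            "Does the ANSWER contain the same facts as EXPECTED? Reply YES or NO only.\n"
            f"EXPECTED: {row['expected']}\nANSWER: {answer}"
        )
        correct += verdict.strip().upper().startswith("YES")
    score = correct / len(rows)
    print(f"{correct}/{len(rows)} correct ({score:.1%})")
    return score
```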
Model-specific pathologies. GPT-4.1 refuses ~2.55% of queries around 2,500 words; Gemini 2.x hallucinates spurious content at 500–750 tokens under specific distractor configurations; Qwen3-8B degrades hard past 5,000 tokens. These aren't edge cases — they're predictable failure bands that need to be in your evaluation baseline, not discovered by a client. Claude Opus 4 / Sonnet 4 abstain heavily under ambiguity rather than confabulate, which is the correct failure mode for regulated applications and an annoyance for consumer-facing chat.
Coherent document ordering. If your chunking strategy preserves logical narrative flow (chapter order, thread chronology), you may be inadvertently triggering the coherence-bias pathology. Shuffling chunk order before context injection is a low-cost mitigation worth testing — counterintuitive, but empirically validated across all 18 models in the study.
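A sketch of that mitigation. Seeding the shuffle on the query text keeps the chunk order deterministic for a given question, which simplifies A/B comparison:

```python
# Shuffle retrieved chunks before injecting them into the prompt so coherent document
# order doesn't trigger the coherence-bias pathology.
import random

def shuffled_context(query: str, chunks: list[str]) -> str:
    chunks = list(chunks)                   # don't mutate the caller's list
    random.Random(query).shuffle(chunks)    # str seeds are deterministic across runs
    return "\n\n".join(chunks)
```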
Reranker latency at scale. Cross-encoder reranking at top-50 adds ~200–400ms on hosted Cohere, more for self-hosted. At high QPS, this matters. Options: reduce the initial ANN candidate set (top-20 rather than top-50); use a lighter reranker (BGE-Reranker-v2-M3 at quantised precision); cache reranker scores for common query patterns.
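A sketch of the score-caching option, the last of those three. The cache key is the (query, passage) pair, so it only pays off where query patterns genuinely repeat; the in-process dict is illustrative, and multi-worker deployments would put Redis or similar behind the same interface. `reranker` is assumed to expose a predict(list-of-pairs) method:

```python
# Reranker-score caching: identical (query, passage) pairs skip the cross-encoder call.
import hashlib

_score_cache: dict[str, float] = {}

def cached_scores(query: str, passages: list[str], reranker) -> list[float]:
    keys = [hashlib.sha256(f"{query}\x00{p}".encode()).hexdigest() for p in passages]
    missing = [(k, p) for k, p in zip(keys, passages) if k not in _score_cache]
    if missing:
        fresh = reranker.predict([(query, p) for _, p in missing])
        for (k, _), score in zip(missing, fresh):
            _score_cache[k] = float(score)
    return [_score_cache[k] for k in keys]
```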
Cost and operational profile
For a typical SME document Q&A deployment (100–500 queries/day, ~10k-token document corpus):
- Rerank-and-prune: +~£0.002–0.006 per query for Cohere Rerank v3 at top-50. Negligible at this volume.
- Hierarchical summary pre-pass: 1× cheap-model call per document section, amortised over queries. At Haiku 3.5 pricing, a 100-page document summarisation costs under £0.05.
- LongMemEval-lite harness: run monthly, ~300 queries against a fixed evaluation set. At Sonnet 4 pricing, under £2/month. No excuse not to run it.
Token cost is not the operational bottleneck at SME scale. Latency and maintenance complexity are. Rerank-and-prune sits at the right inflection point — materially better accuracy than naive RAG, manageable latency overhead, low ongoing maintenance relative to hierarchical summarisation.
Recommended approach for SME context
Default to rerank-and-prune for all RAG deployments: retrieve top-50 via ANN, rerank with Cohere Rerank v3, keep top 8, answer. This single change eliminates the majority of distractor-induced hallucinations and is straightforward to retrofit into existing LangChain or LlamaIndex pipelines via a reranker node.
Escalate to hierarchical summarisation only when the source corpus regularly exceeds 50k tokens per query scope (full matter files, entire contract suites, long email threads).
Audit any live deployment where the retrieval step returns more than 15 chunks before the answer call — that's the leading indicator of naive RAG in production.
For model selection: if the application is regulated or high-stakes, Claude Sonnet/Opus 4's abstention behaviour is a feature. If it's consumer-facing chat where "I'm not sure" is worse than a slightly hedged answer, factor that into model choice.
Do not use NIAH benchmark results to validate long-context performance. Run LongMemEval-style evaluations — or build a minimal equivalent — against your actual query distribution and document corpus. What NIAH tests is not what breaks in production.
Sources
- Hong, Troynikov, Huber — Context Rot: How Increasing Input Tokens Impacts LLM Performance — Chroma Research: https://www.trychroma.com/research/context-rot
- Yannic Kilcher — video analysis of the Chroma paper: https://www.youtube.com/watch?v=hpC4qjWu_aY
- LongMemEval benchmark — referenced in the Chroma study as the evaluation standard for conversational long-context performance.