Retrieval-augmented generation looks simple in a notebook: embed documents, embed the question, fetch top‑k, ask the model to synthesize. In production, the hard parts are seldom the vector database—they are document lifecycle, chunk boundaries, query understanding, and the quiet ways answers become plausible but wrong. Below is a field-tested way to think about each layer without pretending one architecture fits every domain.

Chunking is a semantics problem, not a token budget

Fixed-size windows are a reasonable bootstrap, but they split tables mid-row and separate definitions from the terms they explain. Prefer structure-aware splitting when sources allow it: headings, HTML sections, PDF blocks, or code fences. Where structure is messy, use sliding overlaps sparingly—overlap reduces orphaned context but increases redundant hits and cost. Measure chunk utility with offline probes: ask a small model what question each chunk answers, and revise boundaries wherever it cannot say.
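One way to sketch this priority order—structure first, sliding window only as a fallback—is shown below. The heading regex and size limits are illustrative choices, not recommendations:

```python
import re

def split_by_headings(text, max_chars=800, overlap=100):
    """Structure-aware first: split on markdown headings; fall back to a
    sliding character window with overlap only for oversized sections."""
    # Zero-width lookahead keeps each heading attached to its own section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        if len(sec) <= max_chars:
            chunks.append(sec)          # structure alone was enough
            continue
        step = max_chars - overlap      # fallback: overlapping windows
        for start in range(0, len(sec), step):
            chunks.append(sec[start:start + max_chars])
            if start + max_chars >= len(sec):
                break
    return chunks
```

The overlap applies only inside oversized sections, so well-structured sources pay no redundancy cost.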

Freshness beats perfect embeddings

Stale corpora produce confident errors. Treat ingestion as a pipeline with explicit versions: source timestamp, extraction hash, and optional TTL per collection. For rapidly changing domains, pair vector retrieval with metadata filters (date, product line, region) before similarity search. When users ask “what is the latest policy,” your system should route to time-sorted retrieval—not only nearest neighbors in embedding space.
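A minimal sketch of that routing decision, assuming a toy in-memory store (the `Doc` shape and `retrieve` signature are hypothetical, and real systems would push filters into the vector store itself):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Doc:
    text: str
    vec: list
    meta: dict = field(default_factory=dict)   # e.g. region, timestamp

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(docs, query_vec, filters=None, latest_first=False, k=3):
    # Metadata filters run BEFORE similarity search.
    pool = [d for d in docs
            if not filters
            or all(d.meta.get(key) == v for key, v in filters.items())]
    if latest_first:
        # "latest policy" queries: time-sorted, not nearest-neighbor.
        pool.sort(key=lambda d: d.meta.get("timestamp", 0), reverse=True)
        return pool[:k]
    pool.sort(key=lambda d: cosine(d.vec, query_vec), reverse=True)
    return pool[:k]
```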

Query routing before retrieval

Not every question benefits from document search. Maintain a lightweight router (rules + small classifier) that decides among: direct LLM answer, single-doc Q&A, multi-hop retrieval, or refusal when scope is unclear. Routing cuts noise and reduces the model’s temptation to “connect dots” across unrelated chunks.
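The rules half of such a router can be a few lines; the patterns below are placeholders for whatever a given domain needs, with a small classifier taking the fallthrough cases in a real system:

```python
import re

def route(question: str) -> str:
    """Rule-first router over the four outcomes named above:
    direct answer, single-doc Q&A, multi-hop retrieval, or refusal."""
    q = question.lower().strip()
    if not q or len(q.split()) < 2:
        return "refuse"            # scope too unclear to retrieve against
    if re.search(r"\b(compare|versus|vs\.?|relationship between)\b", q):
        return "multi_hop"         # needs evidence from several documents
    if re.search(r"\b(policy|manual|spec|contract|document)\b", q):
        return "single_doc"        # points at a specific source
    return "direct"                # general knowledge: skip retrieval
```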

Failure modes to rehearse

  1. Near-duplicate collision: multiple chunks paraphrase the same fact with conflicting details—tie-break with recency or source authority.
  2. Negative space: the answer exists between chunks; consider parent-child indexing or summarization tiers.
  3. Tool hallucination: the model cites documents that were not in the prompt—enforce citation IDs from retrieved set only.
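The third failure mode is cheap to guard against mechanically. A sketch, assuming answers cite sources as `[doc:ID]` (the citation syntax is an arbitrary convention for illustration):

```python
import re

def validate_citations(answer: str, retrieved_ids: set):
    """Return (ok, hallucinated_ids): any citation not drawn from the
    retrieved set is flagged so the answer can be rejected or retried."""
    cited = set(re.findall(r"\[doc:([A-Za-z0-9_-]+)\]", answer))
    hallucinated = cited - retrieved_ids
    return (not hallucinated, sorted(hallucinated))
```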

Operational checklist

  • Log retrieval sets per request (hashed) for replay during regressions.
  • Canary new embedders or rerankers on shadow traffic before cutover.
  • Alert on sudden shifts in empty-result rate or average similarity score.
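The last item can be as simple as a rolling-window rate compared against a baseline. A minimal sketch (window size and tolerance are illustrative, not tuned values):

```python
from collections import deque

class EmptyRateMonitor:
    """Fire an alert when the rolling empty-result rate drifts
    beyond a tolerance from its expected baseline."""
    def __init__(self, baseline: float, window: int = 100, tolerance: float = 0.1):
        self.baseline = baseline
        self.tolerance = tolerance
        self.window = deque(maxlen=window)

    def record(self, result_count: int) -> bool:
        """Record one request; return True if an alert should fire."""
        self.window.append(1 if result_count == 0 else 0)
        if len(self.window) < self.window.maxlen:
            return False               # not enough data yet
        rate = sum(self.window) / len(self.window)
        return abs(rate - self.baseline) > self.tolerance
```

The same shape works for average similarity score: swap the 0/1 indicator for the score itself.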

RAG is not a model—it is a system. The organizations that win treat retrieval, freshness, and evaluation as first-class engineering concerns, not hyperparameters.

Rollout discipline

Before promoting rerankers or embedders, run shadow comparisons on hashed queries and measure not only score deltas but empty-result rate and citation validity on a fixed probe set. Tie go/no-go criteria to product impact—support ticket themes, not only offline metrics.
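A shadow comparison over a fixed probe set can be reduced to a small metrics report. In this sketch, `baseline_fn` and `candidate_fn` are hypothetical retriever callables returning lists of document IDs:

```python
def shadow_compare(probes, baseline_fn, candidate_fn, k=5):
    """Run both retrievers on a fixed probe set and report the
    deltas called out above: empty-result rate and result overlap."""
    empty_base = empty_cand = 0
    overlaps = []
    for query in probes:
        base = baseline_fn(query)[:k]
        cand = candidate_fn(query)[:k]
        empty_base += not base
        empty_cand += not cand
        union = set(base) | set(cand)
        # Jaccard overlap of the two result sets (1.0 when both empty).
        overlaps.append(len(set(base) & set(cand)) / len(union) if union else 1.0)
    n = len(probes)
    return {
        "empty_rate_baseline": empty_base / n,
        "empty_rate_candidate": empty_cand / n,
        "mean_jaccard_overlap": sum(overlaps) / n,
    }
```

Citation validity on the probe set slots in the same way, one counter per retriever.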

Content governance first

Confident wrong answers often come from dirty corpora: duplicate policy versions, stale PDFs, or sections that contradict each other. Invest in deduplication, authoritative sources, and deletion workflows before chasing marginal gains from new embedders. Many production “RAG incidents” are content operations failures visible only after retrieval exposes them.
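Exact-duplicate removal is the cheapest first step. A sketch using content hashes over normalized text; a production pipeline would add fuzzy matching (e.g. MinHash) on top:

```python
import hashlib
import re

def dedupe(docs):
    """Drop near-verbatim duplicates by hashing whitespace- and
    case-normalized text; first occurrence wins."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = re.sub(r"\s+", " ", doc.lower()).strip()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

"First occurrence wins" is a deliberate simplification—the tie-break should really be source authority or recency, per the failure modes above.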
