Retrieval-augmented generation is a pipeline, not a toggle. Quality emerges—or collapses—at the boundary between document engineering and model behavior: how text is split, how often indexes refresh, how ambiguous queries are routed, and whether citations are constrained to what was actually retrieved. This hub orients you toward the RAG in production essay and the RAG experiment synopsis before you commit weeks to vector tuning alone.

Chunking is semantics-first

Token windows are a constraint, not a strategy. Structure-aware splits (headings, sections, tables) usually beat blind fixed windows for factual Q&A. When chunks split definitions from terms—or table rows from headers—retrieval returns plausible-looking context that misleads the model. That is both a data problem and an evaluation problem: your test suite should include probes for boundary cases.
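A structure-aware split can be sketched in a few lines. This is a minimal illustration, not a production chunker: it assumes markdown-style `#` headings as section boundaries and a hypothetical `max_chars` budget, and it falls back to paragraph boundaries rather than cutting mid-sentence.

```python
import re

def chunk_by_headings(text, max_chars=1200):
    """Split markdown-ish text on headings so definitions stay with
    their terms. Oversized sections fall back to paragraph boundaries."""
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        if len(sec) <= max_chars:
            chunks.append(sec)
            continue
        buf = ""
        for para in sec.split("\n\n"):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(buf)
                buf = para
            else:
                buf = (buf + "\n\n" + para) if buf else para
        if buf:
            chunks.append(buf)
    return chunks
```

A boundary probe for your test suite then becomes simple: feed a document where a term and its definition straddle a fixed-window boundary and assert they land in the same chunk.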

Freshness and authority

Embeddings age silently. Version your corpora, store source timestamps, and fail tests when authoritative documents change without a corresponding re-index. Freshness ties directly to operations (pipelines, alerts) and to policy when outdated guidance could harm users.
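The staleness check itself is cheap once timestamps are stored. A sketch, assuming you keep a mapping of document ID to source modification time and a parallel mapping of last-indexed time (both names are illustrative):

```python
def stale_documents(source_mtimes, index_mtimes):
    """Return IDs whose source changed after their last indexing,
    plus documents that were never indexed at all."""
    stale = []
    for doc_id, src_ts in source_mtimes.items():
        idx_ts = index_mtimes.get(doc_id)
        if idx_ts is None or src_ts > idx_ts:
            stale.append(doc_id)
    return sorted(stale)
```

Wire this into CI so the build fails when an authoritative document drifts ahead of its index, rather than discovering it through user complaints.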

Routing and ambiguity

Not every user message should trigger document search. A lightweight router (rules plus a small classifier) can send simple questions to direct completion, multi-hop questions to retrieval, and out-of-scope requests to refusal—reducing both cost (cost & latency) and hallucination pressure. The data-heavy reading path sequences RAG before layered evaluation on purpose.
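The rules half of such a router fits in one function. This is a hypothetical sketch: the patterns, route names, and thresholds are placeholders, and a small trained classifier would sit behind the final fallback in production.

```python
import re

def route(query):
    """Rules-first router: cheap checks in priority order,
    falling through to single-pass retrieval as the default."""
    q = query.lower().strip()
    if re.search(r"\b(password|ssn|credit card)\b", q):
        return "refuse"             # out-of-scope / sensitive
    if len(q.split()) <= 4 and "?" not in q:
        return "direct"             # trivial; skip retrieval cost
    if re.search(r"\b(compare|versus|vs\.?|why did|between)\b", q):
        return "retrieve_multihop"  # likely spans multiple documents
    return "retrieve"
```

The ordering matters: refusal rules run first so sensitive queries never reach retrieval, and the cheap length check runs before any pattern matching.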

Prompts still matter

Retrieval output is fed through instructions and schemas defined in prompt artifacts. Citation discipline (“only cite retrieved IDs”) is as much interface design as vector math.

Evaluating retrieval in isolation

Offline nDCG or MRR on a static set misses corpus drift. Prefer probes that change when documents update: answerability (“did we retrieve enough to answer?”), authority conflicts (two chunks disagree), and recency for time-sensitive queries. Shadow rerankers before promotion and compare empty-result rate as well as score deltas.
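Two of those probes can be computed from retrieval logs alone. A sketch under stated assumptions: `results` maps each query to its list of `(doc_id, timestamp)` hits, and time-sensitive queries are flagged here by a hypothetical `"latest"` prefix (substitute your router's flag).

```python
from datetime import datetime, timedelta, timezone

def retrieval_probes(results, now=None, recency_window_days=90):
    """Drift-sensitive metrics: empty-result rate over all queries,
    and recency compliance over time-sensitive queries with hits."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=recency_window_days)
    empty = sum(1 for hits in results.values() if not hits)
    recency_ok = recency_total = 0
    for query, hits in results.items():
        if query.startswith("latest") and hits:  # assumed time-sensitivity flag
            recency_total += 1
            if any(ts >= cutoff for _, ts in hits):
                recency_ok += 1
    return {
        "empty_rate": empty / len(results) if results else 0.0,
        "recency_rate": recency_ok / recency_total if recency_total else None,
    }
```

Unlike a frozen nDCG set, both numbers move when the corpus does, which is exactly the property you want from a drift probe.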

Multi-lingual and OCR noise

Embedding similarity is not language-neutral in practice. If your corpus mixes languages or scanned PDFs, bake encoding and normalization into ingestion tests; failures often show up as retrieval variance before the LLM layer looks “wrong.”
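An ingestion-time normalization pass catches much of this before it reaches the embedding model. A minimal sketch using Unicode NFKC plus removal of common OCR artifacts (the junk-character list is illustrative, not exhaustive):

```python
import unicodedata

def normalize_for_ingestion(text):
    """NFKC-normalize (folds ligatures like 'fi') and strip soft
    hyphens, zero-width chars, and BOMs, then collapse whitespace,
    so near-duplicate renderings embed to the same vector."""
    text = unicodedata.normalize("NFKC", text)
    for junk in ("\u00ad", "\u200b", "\ufeff"):
        text = text.replace(junk, "")
    return " ".join(text.split())
```

An ingestion test then asserts that two renderings of the same source (say, a clean export and an OCR scan) normalize to identical text; a mismatch is a retrieval-variance bug waiting to happen.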

Graphs and structured sources

Some answers require traversing relationships—org charts, dependency graphs, ticket links. Pure chunk retrieval may miss multi-hop structure; consider hybrid stores or explicit tool steps for graph queries—and tests for each hop, not only final answers.
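An explicit graph step can be as simple as a bounded breadth-first search over the link structure. A sketch, assuming `graph` is an adjacency dict of IDs to linked IDs (ticket references, dependency edges):

```python
from collections import deque

def multi_hop(graph, start, target, max_hops=3):
    """BFS over an explicit link graph, returning the full hop path
    (or None) so each edge can be tested, not only the final answer."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) > max_hops:
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Returning the path rather than a boolean is the point: your tests can assert the intermediate hops, which pure chunk retrieval cannot even express.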