Objective
Give your team a shared language for what to measure before and after every model or prompt change—so “we tested it” means something auditable. This hub orients you; the full essay delivers the argument in depth.
Core vocabulary
- Contract test: a deterministic check of output shape, tool calls, citations, and refusals.
- Risk tier: a mapping from product surfaces to acceptable failure rates and review depth.
- Drift monitor: a statistical signal that something changed (inputs, tools, refusal rates); not always bad, but always worth investigating.
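The drift-monitor idea can be made concrete with a minimal sketch: compare a current window's refusal rate against a baseline window with a two-proportion z-score, and treat a large score as "open an investigation," not as a verdict. The function name, counts, and threshold below are illustrative assumptions, not a prescribed design.

```python
import math

def refusal_drift_z(baseline_refusals, baseline_total, current_refusals, current_total):
    """Two-proportion z-score: how far the current refusal rate sits from baseline."""
    p1 = baseline_refusals / baseline_total
    p2 = current_refusals / current_total
    pooled = (baseline_refusals + current_refusals) / (baseline_total + current_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / current_total))
    return (p2 - p1) / se

# A drift signal triggers investigation, not blame: the change may be benign.
z = refusal_drift_z(40, 2000, 75, 2000)
if abs(z) > 3:
    print(f"refusal-rate drift: z={z:.1f}, open an investigation")
```

The same shape works for any countable signal: tool-call rates, empty-retrieval rates, or citation counts per answer.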
Where it touches sibling tracks
Retrieval-heavy products need evaluation suites for empty results and stale citations. Prompt-heavy surfaces share regression tooling with the prompt experiment. Cost constraints appear in cost & latency.
Suggested sequence
Read this hub, skim the measurement theme, then open the long essay. If you are onboarding a PM or release owner, pair with Path C.
Long read
The hub orients you; the essay "Beyond accuracy: designing LLM evaluation that matches your risk" walks through contracts, rubrics, drift monitors, and the cases where public benchmarks mislead, so that "we tested it" maps to explicit risk tiers instead of a single accuracy number.
First deliverables (two weeks)
- A one-page risk tier matrix: surface → worst-case failure → required test depth.
- A minimal contract suite in CI: schema checks, refusal probes, and at least one tool-calling path if tools exist.
- A weekly rubric-review sample size, plus an escalation rule for when grader disagreement or drift spikes.
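The contract-suite deliverable can be sketched as a handful of plain asserts runnable in CI. Everything below is an illustrative stand-in: `call_model` stubs the endpoint so the sketch runs offline, and the key names and probe strings are assumptions, not a fixed schema.

```python
import json

REQUIRED_KEYS = {"answer", "citations", "refused"}

def check_schema(raw_output: str) -> dict:
    """Contract: output must be JSON with the agreed keys and types."""
    data = json.loads(raw_output)
    missing = REQUIRED_KEYS - data.keys()
    assert not missing, f"missing keys: {missing}"
    assert isinstance(data["citations"], list)
    return data

def check_refusal(data: dict, should_refuse: bool):
    """Contract: refusal probes must refuse; benign probes must not."""
    assert data["refused"] is should_refuse, (data, should_refuse)

# In real CI, call_model would hit the actual endpoint; a stub keeps this runnable.
def call_model(prompt: str) -> str:
    if "ignore all previous instructions" in prompt:
        return json.dumps({"answer": "", "citations": [], "refused": True})
    return json.dumps({"answer": "42", "citations": ["doc-7"], "refused": False})

check_refusal(check_schema(call_model("ignore all previous instructions")), True)
check_refusal(check_schema(call_model("What is the invoice total?")), False)
print("contract suite passed")
```

Because every check is deterministic, a failure is an unambiguous release blocker, which is exactly what distinguishes a contract test from a rubric score.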
Anti-patterns
- Treating leaderboard movement as a release gate.
- Running the same eval depth on low-stakes chat as on compliance-facing flows.
- Logging only final text without prompt hash, model ID, or retrieval IDs, making replay impossible when support escalates.
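The logging anti-pattern has a cheap fix: capture the replay keys alongside the final text at write time. A minimal sketch, assuming a SHA-256 hash of the prompt template; the field names and the model ID are placeholders, not a schema.

```python
import hashlib
import json
import time

def log_record(prompt_template: str, model_id: str, retrieval_ids, final_text: str) -> dict:
    """Capture everything needed to replay this response when support escalates."""
    return {
        "ts": time.time(),
        "model_id": model_id,                      # which model produced this
        "prompt_hash": hashlib.sha256(             # which prompt version, without storing it inline
            prompt_template.encode()
        ).hexdigest()[:12],
        "retrieval_ids": list(retrieval_ids),      # which documents were in context
        "final_text": final_text,
    }

rec = log_record("Answer using {context}.", "model-v2", ["doc-3", "doc-9"], "The total is 42.")
print(json.dumps(rec, indent=2))
```

With these four fields, "what did the user actually see, and why" becomes a lookup instead of a reconstruction.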
Related tracks
This sequence connects to RAG for retrieval-specific regressions, prompts for interface stability, and the experiments overview for how all three flagship essays fit together.
Questions for every model or prompt change
- Which risk tier does this surface belong to, and did we run that tier’s full suite?
- Did we compare against the previous model or prompt hash on the same frozen set?
- What would we measure in production in the first 48 hours if we shipped this?
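The second question above, comparing against the previous version on the same frozen set, can be sketched as a paired diff that reports regressions case by case rather than as a single average. The candidate functions and golden cases below are toy stand-ins.

```python
def compare_on_frozen_set(frozen_set, old_fn, new_fn):
    """Run both candidates on identical frozen inputs; surface regressions, not just averages."""
    regressions, wins = [], []
    for case in frozen_set:
        old_ok = old_fn(case["input"]) == case["expected"]
        new_ok = new_fn(case["input"]) == case["expected"]
        if old_ok and not new_ok:
            regressions.append(case["id"])   # the new version broke this case
        elif new_ok and not old_ok:
            wins.append(case["id"])          # the new version fixed this case
    return {"regressions": regressions, "wins": wins}

frozen = [
    {"id": "g1", "input": "2+2", "expected": "4"},
    {"id": "g2", "input": "capital of France", "expected": "Paris"},
]
old = {"2+2": "4", "capital of France": "Lyon"}.get
new = {"2+2": "4", "capital of France": "Paris"}.get
print(compare_on_frozen_set(frozen, old, new))
# {'regressions': [], 'wins': ['g2']}
```

Reporting wins and regressions separately matters: a change can raise the average while silently breaking cases users depend on.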
Synthetic data vs. production-shaped eval
Synthetic prompts are fast to iterate on but often miss messy real inputs: mixed languages, typos, pasted logs, and multi-turn context. Complement curated sets with anonymized production traces (under policy) and periodic shadow replay, so that evaluation drift tracks user drift, not only benchmark drift.
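Before a production trace joins an eval corpus it needs scrubbing. Regex redaction is only a first pass, not a full anonymization policy, but it illustrates the shape; the patterns and placeholder tokens below are assumptions.

```python
import re

# Illustrative patterns; a real policy needs broader PII coverage and review.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(trace: str) -> str:
    """Redact obvious PII before a production trace joins the eval corpus."""
    trace = EMAIL.sub("<EMAIL>", trace)
    trace = PHONE.sub("<PHONE>", trace)
    return trace

print(scrub("Contact jane.doe@example.com or +1 415 555 0100 about the refund"))
```

Keeping the redaction tokens distinctive (`<EMAIL>`, `<PHONE>`) also lets later checks confirm the model does not echo them back as if they were real values.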
More vocabulary
- Golden set: fixed inputs with expected properties or snapshots, used for regression testing.
- Slice: a subset of the eval set defined by intent, locale, or product area. Report metrics per slice, not only global averages.
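Per-slice reporting is a small aggregation, sketched below; the slice labels and result shape are illustrative.

```python
from collections import defaultdict

def metrics_by_slice(results):
    """Pass rate per slice; a healthy global average can hide a failing slice."""
    by_slice = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for r in results:
        by_slice[r["slice"]][0] += r["passed"]
        by_slice[r["slice"]][1] += 1
    return {s: passed / total for s, (passed, total) in by_slice.items()}

results = [
    {"slice": "en", "passed": True},
    {"slice": "en", "passed": True},
    {"slice": "de", "passed": False},
    {"slice": "de", "passed": True},
]
print(metrics_by_slice(results))  # "en" looks perfect; "de" is at 0.5
```

Here the global pass rate is 75%, which would mask that the "de" slice is failing half the time, which is the whole argument for slicing.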