Most product teams inherit a split brain: research tracks aggregate accuracy on static tasks, while support hears about “weird” outputs that never show up in aggregate plots. This hub gathers ideas for closing that gap—starting from risk tiers (what a wrong answer costs in your domain) and working backward to tests and monitors that match.
Capability vs. contract
Capability benchmarks answer “how strong is the model in general?” Contract tests answer “does our app keep its promises on this surface?” The latter should include structured outputs, tool usage, citation rules, and refusal behavior. When something regresses, contract failures point to a specific clause you can fix—unlike a two-point drop on a distant benchmark. The long essay Beyond accuracy unpacks this split in depth; the evaluation experiment page summarizes objectives before you dive in.
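A contract test in this sense is just a deterministic checker over one response. A minimal sketch, assuming a hypothetical contract (JSON output with an `answer` string, at least one citation, tool calls restricted to an allow-list) — the clause names and schema are illustrative, not a standard:

```python
import json

def check_contract(raw_response: str) -> list[str]:
    """Return the list of violated contract clauses for one response.

    Assumed contract (illustrative): valid JSON with an "answer" string,
    a non-empty "citations" list, and tool calls only from an allow-list.
    """
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return ["output: not valid JSON"]

    violations = []
    if not isinstance(data.get("answer"), str):
        violations.append("output: missing 'answer' string")

    citations = data.get("citations", [])
    if not isinstance(citations, list) or len(citations) == 0:
        violations.append("citations: at least one required")

    allowed_tools = {"search", "calculator"}  # hypothetical allow-list
    for call in data.get("tool_calls", []):
        if call.get("name") not in allowed_tools:
            violations.append(f"tools: '{call.get('name')}' not on allow-list")
    return violations
```

Because each failure names a clause, a regression report reads "citations: at least one required on surface X" rather than "score dropped two points."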
Humans in the loop—on purpose
Automation scales; judgment does not. Small, rotating rubric samples catch what automation misses: tone, grounding against sources, and “almost right” answers that fool metrics. Pair rubrics with release-oriented reading when you need artifacts that general managers and security reviewers can interpret. Rubric design also intersects safety & policy when refusals and sensitive domains are in scope.
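A "small, rotating sample" is easy to make deterministic so reviewers see a fresh slice each week while the selection stays reproducible for audits. One possible sketch, hashing trace IDs with the week number (the function and its parameters are hypothetical):

```python
import hashlib

def rotating_sample(trace_ids: list[str], week: int, k: int) -> list[str]:
    """Pick k traces for this week's rubric review.

    Hashing each id together with the week number yields a pseudo-random
    but reproducible ordering that rotates as the week changes.
    """
    def sort_key(trace_id: str) -> str:
        return hashlib.sha256(f"{week}:{trace_id}".encode()).hexdigest()
    return sorted(trace_ids, key=sort_key)[:k]
```

Reproducibility matters here: when a reviewer flags an "almost right" answer, anyone can regenerate the exact slice that contained it.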
Drift and production monitors
Drift in query distribution, embedding space, or tool-call rates is not automatically “bad,” but unexplained drift is always worth investigating. Monitors tie naturally to operations and to retrieval-heavy products in retrieval systems, where stale corpora can shift answer quality without a model change.
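One lightweight way to quantify drift in a categorical signal (query type, routed tool, locale) is the Population Stability Index over baseline vs. current counts. A minimal sketch; the alert thresholds cited in the comment are a common rule of thumb, not a fixed standard:

```python
import math

def psi(baseline: dict[str, int], current: dict[str, int],
        eps: float = 1e-6) -> float:
    """Population Stability Index between two categorical count distributions.

    Rule of thumb often used in practice: < 0.1 stable, 0.1-0.25 worth
    investigating, > 0.25 a major shift. eps avoids log(0) for unseen bins.
    """
    keys = set(baseline) | set(current)
    b_total = sum(baseline.values()) or 1
    c_total = sum(current.values()) or 1
    score = 0.0
    for k in keys:
        b = baseline.get(k, 0) / b_total + eps
        c = current.get(k, 0) / c_total + eps
        score += (c - b) * math.log(c / b)
    return score
```

The monitor itself stays dumb; the "is this drift explained?" question goes to whoever owns triage (see escalation below).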
Cost-aware measurement
Every extra review round and every long-context eval run has a price in time and tokens. Measurement strategy cannot ignore cost & latency: stratify evaluation depth by risk tier so you spend heavy human review where failure is unacceptable, not uniformly across all prompts.
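Stratifying review depth by risk tier can be as simple as weighting each stratum's volume by a risk multiplier and splitting the weekly budget proportionally. A sketch with made-up tier names and weights:

```python
def allocate_reviews(budget: int,
                     strata: dict[str, tuple[int, float]]) -> dict[str, int]:
    """Split a weekly human-review budget across strata by risk weight.

    strata maps name -> (prompt_count, risk_weight); a higher weight buys
    proportionally deeper review. Weights here are illustrative.
    """
    total_weight = sum(n * w for n, w in strata.values())
    return {
        name: round(budget * (n * w) / total_weight)
        for name, (n, w) in strata.items()
    }
```

With a 5x weight on a small high-risk stratum, most of the budget lands where failure is unacceptable even though low-risk prompts dominate by volume.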
Signals worth logging early
Per surface, log refusal rate by locale, tool-call success rate, average citation count, and latency percentiles. Pair these with periodic sampling of production traces (policy permitting) so rubrics stay aligned with real queries—not only curated dev sets.
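These signals fit in one small per-surface record. A possible sketch (field names are illustrative; the percentile here is a simple nearest-rank estimate, not an interpolated quantile):

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class SurfaceMetrics:
    """Rolling per-surface counters mirroring the signals above."""
    refusals_by_locale: dict[str, int] = field(default_factory=dict)
    tool_calls: int = 0
    tool_successes: int = 0
    citation_counts: list[int] = field(default_factory=list)
    latencies_ms: list[float] = field(default_factory=list)

    def snapshot(self) -> dict:
        """Summarize counters into the fields a dashboard would plot."""
        lat = sorted(self.latencies_ms)

        def pct(p: float):
            # Nearest-rank percentile; None when no samples yet.
            return lat[min(len(lat) - 1, int(p * len(lat)))] if lat else None

        return {
            "refusals_by_locale": dict(self.refusals_by_locale),
            "tool_success_rate": (self.tool_successes / self.tool_calls
                                  if self.tool_calls else None),
            "avg_citations": (statistics.fmean(self.citation_counts)
                              if self.citation_counts else None),
            "latency_p50_ms": pct(0.50),
            "latency_p95_ms": pct(0.95),
        }
```

Keeping the raw counters (rather than only pre-aggregated rates) is what makes later re-slicing by locale or task type possible.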
When metrics disagree
Flat aggregate scores with rising support volume usually mean your strata are wrong or the benchmark no longer matches user intent. Re-slice by task type before chasing model changes; the fix is often a contract test or routing rule, not a new base model.
Sample size and reviewer load
Rubric throughput is finite. Estimate how many human judgments you can afford per week, then allocate slices proportionally to risk—not evenly across easy prompts. If inter-rater agreement drops, pause any scale-up of reviews until task definitions or worked examples are clarified.
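Inter-rater agreement is commonly measured with Cohen's kappa, which discounts the agreement two raters would reach by chance. A minimal sketch for two raters labeling the same items:

```python
def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items.

    1.0 is perfect agreement; 0.0 is agreement no better than chance.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    p_chance = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    if p_chance == 1.0:
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)
```

A kappa sliding toward zero on a previously stable rubric is the concrete trigger for the pause described above.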
Escalation when monitors fire
Define who owns triage when drift crosses a threshold: freeze prompts, roll back a model, or open an incident. Without ownership, dashboards become wallpaper.