Objective

Give your team a shared language for what to measure before and after every model or prompt change—so “we tested it” means something auditable. This hub orients you; the full essay delivers the argument in depth.

Core vocabulary

  • Contract test: deterministic check of output shape, tools, citations, refusals.
  • Risk tier: mapping product surfaces to acceptable failure rates and review depth.
  • Drift monitor: statistical signal that something changed—inputs, tools, refusals—not always “bad,” always “investigate.”
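The contract-test idea above can be sketched as a deterministic check. Everything here is illustrative: the required keys, the JSON output format, and the `check_contract` helper are assumptions, not a prescribed harness.

```python
import json

# Hypothetical contract: output must be JSON with these keys and at
# least one citation. Adjust to your own schema, tools, and refusals.
REQUIRED_KEYS = {"answer", "citations"}

def check_contract(raw_output: str) -> list[str]:
    """Return a list of contract violations (empty list = pass)."""
    violations = []
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        violations.append(f"missing keys: {sorted(missing)}")
    if not payload.get("citations"):
        violations.append("no citations returned")
    return violations

# A well-formed output passes; a bare string fails deterministically.
ok = check_contract('{"answer": "42", "citations": ["doc-7"]}')
bad = check_contract("plain text, no JSON")
# ok == [], bad == ["output is not valid JSON"]
```

Because the check is deterministic, it can gate CI like any other unit test, with no model-as-judge in the loop.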

Where it touches sibling tracks

Retrieval-heavy products need evaluation suites for empty results and stale citations. Prompt-heavy surfaces share regression tooling with the prompt experiment. Cost constraints appear in cost & latency.

Suggested sequence

Read this hub, skim the measurement theme, then open the long essay. If you are onboarding a PM or release owner, pair with Path C.

Long read

The hub orients you; the essay “Beyond accuracy: designing LLM evaluation that matches your risk” walks through contracts, rubrics, drift monitors, and when public benchmarks mislead—so “we tested it” maps to explicit risk tiers instead of a single accuracy number.

First deliverables (two weeks)

  • A one-page risk tier matrix: surface → worst-case failure → required test depth.
  • A minimal contract suite in CI: schema checks, refusal probes, and at least one tool-calling path if tools exist.
  • A weekly rubric review with a fixed sample size, plus an escalation rule for when rater disagreement or drift spikes.
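The one-page risk tier matrix can live as plain data next to the test suite, so CI can look up the required depth per surface. The tier names, failure descriptions, and thresholds below are placeholders for your own:

```python
# Hypothetical risk tier matrix: surface -> worst-case failure,
# required checks, and acceptable failure rate. All values illustrative.
RISK_TIERS = {
    "internal-chat": {
        "worst_case": "confusing answer",
        "suite": ["schema"],
        "max_fail_rate": 0.05,
    },
    "customer-support": {
        "worst_case": "wrong refund advice",
        "suite": ["schema", "refusal"],
        "max_fail_rate": 0.01,
    },
    "compliance-flow": {
        "worst_case": "regulatory breach",
        "suite": ["schema", "refusal", "tools"],
        "max_fail_rate": 0.001,
    },
}

def required_suite(surface: str) -> list[str]:
    """Which checks must run before shipping a change to this surface."""
    return RISK_TIERS[surface]["suite"]
```

Keeping the matrix in code (or checked-in config) makes “did we run that tier’s full suite?” answerable by the pipeline rather than by memory.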

Anti-patterns

Treating leaderboard movement as a release gate; running the same eval depth on low-stakes chat as on compliance-facing flows; and logging only final text without prompt hash, model ID, or retrieval IDs—making replay impossible when support escalates.

Related tracks

This sequence connects to RAG for retrieval-specific regressions, prompts for interface stability, and the experiments overview for how all three flagship essays fit together.

Questions for every model or prompt change

  • Which risk tier does this surface belong to, and did we run that tier’s full suite?
  • Did we compare against the previous model or prompt hash on the same frozen set?
  • What would we measure in production in the first 48 hours if we shipped this?
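The frozen-set comparison in the second question can be sketched as a diff keyed by example ID. The run format (ID mapped to pass/fail) is an assumption; substitute whatever your harness emits:

```python
# Minimal sketch: compare a candidate run against the previous model or
# prompt hash on the same frozen set, reporting regressions and fixes.
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Diff two runs on their shared example IDs."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before, fails now: these block the release.
        "regressions": sorted(i for i in shared if baseline[i] and not candidate[i]),
        # Failed before, passes now: nice, but not a free pass.
        "fixes": sorted(i for i in shared if not baseline[i] and candidate[i]),
    }

report = compare_runs(
    {"q1": True, "q2": True, "q3": False},
    {"q1": True, "q2": False, "q3": True},
)
# report["regressions"] == ["q2"], report["fixes"] == ["q3"]
```

Diffing on shared IDs (rather than comparing aggregate scores) is what makes a regression attributable to a specific input instead of noise.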

Synthetic data vs. production-shaped eval

Synthetic prompts are fast to iterate but often miss messy real inputs: mixed languages, typos, pasted logs, and multi-turn context. Complement curated sets with anonymized traces (under policy) and periodic shadow replay—so evaluation drift tracks user drift, not only benchmark drift.
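One way to keep the eval set tracking user drift is to rebuild it periodically from a fixed curated core plus a rotating sample of anonymized traces. The mixing ratio, seed, and function name here are assumptions:

```python
import random

def build_eval_set(curated, traces, trace_fraction=0.3, size=200, seed=0):
    """Mix a fixed curated core with a sampled slice of anonymized traces.

    A fixed seed makes a given rebuild reproducible; bump the seed (or
    resample on a schedule) to rotate which production traces appear.
    """
    rng = random.Random(seed)
    n_traces = min(int(size * trace_fraction), len(traces))
    n_curated = min(size - n_traces, len(curated))
    return curated[:n_curated] + rng.sample(traces, n_traces)

eval_set = build_eval_set(
    curated=[f"curated-{i}" for i in range(100)],
    traces=[f"trace-{i}" for i in range(500)],
    size=100,
)
# 70 curated items + 30 sampled traces = 100 examples
```

The curated core preserves regression comparability across rebuilds; the sampled portion is what drags the distribution toward real inputs.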

More vocabulary

  • Golden set: fixed inputs with expected properties or snapshots, used for regression.
  • Slice: subset of eval defined by intent, locale, or product area—report metrics per slice, not only global averages.
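Per-slice reporting from the last bullet can be sketched in a few lines; the `(slice_name, passed)` input shape is an assumption about your harness:

```python
from collections import defaultdict

def slice_pass_rates(results):
    """results: iterable of (slice_name, passed) pairs -> pass rate per slice."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passes, count]
    for name, passed in results:
        totals[name][0] += int(passed)
        totals[name][1] += 1
    return {name: passes / count for name, (passes, count) in totals.items()}

rates = slice_pass_rates(
    [("en", True), ("en", True), ("de", False), ("de", True)]
)
# rates == {"en": 1.0, "de": 0.5}: a 0.75 global average would hide
# that half of the German slice is failing.
```

This is the whole argument for slices in miniature: the global number can look healthy while one locale or intent is quietly broken.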