Objective

Give your team a shared language for what to measure before and after every model or prompt change—so “we tested it” means something auditable. This hub orients you; the full essay delivers the argument in depth.

Core vocabulary

  • Contract test: deterministic check of output shape, tools, citations, refusals.
  • Risk tier: mapping product surfaces to acceptable failure rates and review depth.
  • Drift monitor: statistical signal that something changed—inputs, tools, refusals—not always “bad,” always “investigate.”
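The contract-test idea above can be sketched as a deterministic check. Everything here is illustrative: the required keys, the JSON output format, and the `check_contract` helper are assumptions, not a prescribed harness.

```python
import json

# Hypothetical contract: output must be JSON with these keys and at
# least one citation. Adjust to your own schema, tools, and refusals.
REQUIRED_KEYS = {"answer", "citations"}

def check_contract(raw_output: str) -> list[str]:
    """Return a list of contract violations (empty list = pass)."""
    violations = []
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        violations.append(f"missing keys: {sorted(missing)}")
    if not payload.get("citations"):
        violations.append("no citations returned")
    return violations

# A well-formed output passes; a bare string fails deterministically.
ok = check_contract('{"answer": "42", "citations": ["doc-7"]}')
bad = check_contract("plain text, no JSON")
# ok == [], bad == ["output is not valid JSON"]
```

Because the check is deterministic, it can gate CI like any other unit test, with no model-as-judge in the loop.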

Where it touches sibling tracks

Retrieval-heavy products need evaluation suites for empty results and stale citations. Prompt-heavy surfaces share regression tooling with the prompt experiment. Cost constraints appear in cost & latency.

Suggested sequence

Read this hub, skim the measurement theme, then open the long essay. If you are onboarding a PM or release owner, pair with Path C.

Long read

The hub orients you; the essay “Beyond accuracy: designing LLM evaluation that matches your risk” walks through contracts, rubrics, drift monitors, and when public benchmarks mislead—so “we tested it” maps to explicit risk tiers instead of a single accuracy number.

First deliverables (two weeks)

  • A one-page risk tier matrix: surface → worst-case failure → required test depth.
  • A minimal contract suite in CI: schema checks, refusal probes, and at least one tool-calling path if tools exist.
  • A weekly rubric review with a fixed sample size, plus an escalation rule for when rater disagreement or drift spikes.
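The one-page risk tier matrix can live as plain data next to the test suite, so CI can look up the required depth per surface. The tier names, failure descriptions, and thresholds below are placeholders for your own:

```python
# Hypothetical risk tier matrix: surface -> worst-case failure,
# required checks, and acceptable failure rate. All values illustrative.
RISK_TIERS = {
    "internal-chat": {
        "worst_case": "confusing answer",
        "suite": ["schema"],
        "max_fail_rate": 0.05,
    },
    "customer-support": {
        "worst_case": "wrong refund advice",
        "suite": ["schema", "refusal"],
        "max_fail_rate": 0.01,
    },
    "compliance-flow": {
        "worst_case": "regulatory breach",
        "suite": ["schema", "refusal", "tools"],
        "max_fail_rate": 0.001,
    },
}

def required_suite(surface: str) -> list[str]:
    """Which checks must run before shipping a change to this surface."""
    return RISK_TIERS[surface]["suite"]
```

Keeping the matrix in code (or checked-in config) makes “did we run that tier’s full suite?” answerable by the pipeline rather than by memory.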

Anti-patterns

Treating leaderboard movement as a release gate; running the same eval depth on low-stakes chat as on compliance-facing flows; and logging only final text without prompt hash, model ID, or retrieval IDs—making replay impossible when support escalates.

Related tracks

This sequence connects to RAG for retrieval-specific regressions, prompts for interface stability, and the experiments overview for how all three flagship essays fit together.

Questions for every model or prompt change

  • Which risk tier does this surface belong to, and did we run that tier’s full suite?
  • Did we compare against the previous model or prompt hash on the same frozen set?
  • What would we measure in production in the first 48 hours if we shipped this?
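The frozen-set comparison in the second question can be sketched as a diff keyed by example ID. The run format (ID mapped to pass/fail) is an assumption; substitute whatever your harness emits:

```python
# Minimal sketch: compare a candidate run against the previous model or
# prompt hash on the same frozen set, reporting regressions and fixes.
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Diff two runs on their shared example IDs."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Passed before, fails now: these block the release.
        "regressions": sorted(i for i in shared if baseline[i] and not candidate[i]),
        # Failed before, passes now: nice, but not a free pass.
        "fixes": sorted(i for i in shared if not baseline[i] and candidate[i]),
    }

report = compare_runs(
    {"q1": True, "q2": True, "q3": False},
    {"q1": True, "q2": False, "q3": True},
)
# report["regressions"] == ["q2"], report["fixes"] == ["q3"]
```

Diffing on shared IDs (rather than comparing aggregate scores) is what makes a regression attributable to a specific input instead of noise.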

Synthetic data vs. production-shaped eval

Synthetic prompts are fast to iterate but often miss messy real inputs: mixed languages, typos, pasted logs, and multi-turn context. Complement curated sets with anonymized traces (under policy) and periodic shadow replay—so evaluation drift tracks user drift, not only benchmark drift.
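One way to keep the eval set tracking user drift is to rebuild it periodically from a fixed curated core plus a rotating sample of anonymized traces. The mixing ratio, seed, and function name here are assumptions:

```python
import random

def build_eval_set(curated, traces, trace_fraction=0.3, size=200, seed=0):
    """Mix a fixed curated core with a sampled slice of anonymized traces.

    A fixed seed makes a given rebuild reproducible; bump the seed (or
    resample on a schedule) to rotate which production traces appear.
    """
    rng = random.Random(seed)
    n_traces = min(int(size * trace_fraction), len(traces))
    n_curated = min(size - n_traces, len(curated))
    return curated[:n_curated] + rng.sample(traces, n_traces)

eval_set = build_eval_set(
    curated=[f"curated-{i}" for i in range(100)],
    traces=[f"trace-{i}" for i in range(500)],
    size=100,
)
# 70 curated items + 30 sampled traces = 100 examples
```

The curated core preserves regression comparability across rebuilds; the sampled portion is what drags the distribution toward real inputs.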

More vocabulary

  • Golden set: fixed inputs with expected properties or snapshots, used for regression.
  • Slice: subset of eval defined by intent, locale, or product area—report metrics per slice, not only global averages.
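Per-slice reporting from the last bullet can be sketched in a few lines; the `(slice_name, passed)` input shape is an assumption about your harness:

```python
from collections import defaultdict

def slice_pass_rates(results):
    """results: iterable of (slice_name, passed) pairs -> pass rate per slice."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passes, count]
    for name, passed in results:
        totals[name][0] += int(passed)
        totals[name][1] += 1
    return {name: passes / count for name, (passes, count) in totals.items()}

rates = slice_pass_rates(
    [("en", True), ("en", True), ("de", False), ("de", True)]
)
# rates == {"en": 1.0, "de": 0.5}: a 0.75 global average would hide
# that half of the German slice is failing.
```

This is the whole argument for slices in miniature: the global number can look healthy while one locale or intent is quietly broken.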