Cost and latency are not afterthoughts—they reshape what “good” evaluation means. If every deep eval run burns thousands of dollars in tokens and reviewer hours, you will naturally under-invest in edge cases unless you make that trade-off explicit. This theme threads through evaluation, RAG, and prompt essays rather than living in a separate spreadsheet.

Unit economics

Track cost per successful task, not only per request. Failed retries, oversized contexts, and unnecessary retrieval hops all multiply spend, and they hide when you average over requests. Routing and prompt trimming are the levers that cut that spend, so the same metrics should expose what each lever saves.
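The metric above can be sketched in a few lines. This is a minimal illustration, not a billing integration; the token prices and call records are made up, and real pipelines would pull them from provider invoices and logs.

```python
from dataclasses import dataclass

@dataclass
class Call:
    tokens_in: int
    tokens_out: int
    succeeded: bool

def cost_per_successful_task(calls, price_in_per_1k, price_out_per_1k):
    """Total spend across ALL calls (failed retries included) divided by successes."""
    total = sum(
        c.tokens_in / 1000 * price_in_per_1k + c.tokens_out / 1000 * price_out_per_1k
        for c in calls
    )
    successes = sum(1 for c in calls if c.succeeded)
    return float("inf") if successes == 0 else total / successes

calls = [
    Call(1200, 300, False),  # failed retry still costs money
    Call(1200, 350, True),
    Call(4000, 200, True),   # oversized context inflates the numerator
]
per_task = cost_per_successful_task(calls, 0.003, 0.015)
```

Note that the failed retry raises the per-task figure even though cost per request looks unchanged; that is exactly the signal per-request accounting misses.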

Perceived speed

Users experience latency as time-to-first-token, streaming smoothness, and turnaround for long jobs. Engineering metrics should align with those perceptions; otherwise you optimize GPU utilization while churn rises. Release-focused readers should pair this theme with Path C.
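To keep engineering metrics aligned with perception, instrument the stream itself. A minimal sketch, assuming tokens arrive through any Python iterator; `fake_stream` is a stand-in for a real model's streaming response.

```python
import time

def measure_stream(stream):
    """Return (time_to_first_token, total_latency, text) for a token iterator."""
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # what the user perceives as "it started"
        parts.append(token)
    return ttft, time.monotonic() - start, "".join(parts)

def fake_stream():
    time.sleep(0.05)       # model latency before the first token
    yield "Hello"
    for token in (", ", "world"):
        time.sleep(0.01)   # inter-token gaps users perceive as streaming smoothness
        yield token

ttft, total, text = measure_stream(fake_stream())
```

Tracking time-to-first-token and total turnaround as separate series is the point: a change that improves one while degrading the other is invisible if you only record end-to-end latency.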

Budget-aware evaluation

Stratify measurement depth by risk tier: smoke tests on every change, deep human review on high-stakes surfaces only. That is how teams sustain rigor without infinite headcount.
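One way to make the stratification concrete is a small lookup table per risk tier. The tier names and sample sizes below are illustrative assumptions, not recommendations; calibrate them to your own surfaces.

```python
# Measurement depth by risk tier: cheap checks on every change,
# expensive human review only where the stakes justify it.
EVAL_TIERS = {
    "low":    {"smoke_tests": True, "judge_sample": 0,   "human_review_sample": 0},
    "medium": {"smoke_tests": True, "judge_sample": 200, "human_review_sample": 0},
    "high":   {"smoke_tests": True, "judge_sample": 500, "human_review_sample": 50},
}

def eval_plan(risk_tier: str) -> dict:
    """Look up how deep evaluation should go for a surface's risk tier."""
    return EVAL_TIERS[risk_tier]
```

Keeping the table in code (rather than in reviewers' heads) also makes the trade-off auditable: anyone can see which surfaces get human review and argue about the numbers.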

Caching and reuse

Prompt caching, embedding reuse for repeated queries, and summarization of long histories can cut cost dramatically, but each lever changes the failure modes: stale cache entries, loss of nuance. Document which layers are safe to cache for each surface, and add tests wherever cache invalidation must track document updates.
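One sketch of invalidation that tracks document updates: key the cache on `(doc_id, doc_version)` so a stale entry is simply never looked up. The `embed` stand-in below is hypothetical; substitute your real embedding call.

```python
class VersionedCache:
    """Cache keyed on (doc_id, doc_version): a version bump can never serve stale data."""
    def __init__(self):
        self._store = {}

    def get_or_compute(self, doc_id, doc_version, compute):
        key = (doc_id, doc_version)
        if key not in self._store:
            self._store[key] = compute(doc_id)
        return self._store[key]

cache = VersionedCache()
compute_log = []

def embed(doc_id):
    compute_log.append(doc_id)   # stand-in for an expensive embedding call
    return f"vec:{doc_id}"

cache.get_or_compute("doc-1", 1, embed)
cache.get_or_compute("doc-1", 1, embed)  # hit: no recompute
cache.get_or_compute("doc-1", 2, embed)  # document updated: recompute
```

The test worth writing per surface is exactly the third call: bump the version, assert a recompute happened. Old versions can be evicted lazily or on a TTL.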

Right-sizing models

Smaller models for routing, classification, or extraction, with larger models reserved for generation, often beat "one big model everywhere" on both latency and cost. The split only works if the interfaces between stages are made explicit in schemas and tests.
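A minimal sketch of that split, with the stage boundary made explicit as a schema. The model names are placeholders, and the routing rule is deliberately simple; the point is that the `ExtractionResult` contract, not the models, is what the tests pin down.

```python
from dataclasses import dataclass

SMALL_MODEL = "small-extractor"   # hypothetical model names
LARGE_MODEL = "large-generator"

def pick_model(stage: str) -> str:
    """Cheap model for routing/classification/extraction; big model only for generation."""
    return SMALL_MODEL if stage in {"routing", "classification", "extraction"} else LARGE_MODEL

@dataclass
class ExtractionResult:
    """Explicit schema at the stage boundary, so either side can be swapped and re-tested."""
    intent: str
    entities: list[str]
```

Because the boundary is a typed schema, you can later swap the small model for an even cheaper one and only re-run the extraction-stage tests, not the whole pipeline.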

Batch and offline workloads

Backfills, summarization jobs, and embedding pipelines have different SLOs than online chat. Price them separately and cap concurrency so batch work does not starve interactive traffic or exhaust monthly budgets in a single weekend.
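Capping batch concurrency can be as simple as a semaphore around the batch worker, leaving the remaining capacity for interactive traffic. A threads-and-sleep sketch with an assumed cap of 2; real systems would enforce this at the queue or gateway layer.

```python
import threading
import time

BATCH_SLOTS = threading.BoundedSemaphore(2)  # cap: at most 2 batch jobs in flight
_lock = threading.Lock()
active = 0
peak = 0   # highest concurrency actually observed

def batch_task(_job_id):
    global active, peak
    with BATCH_SLOTS:            # blocks until a batch slot frees up
        with _lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)         # simulate the batch work itself
        with _lock:
            active -= 1

threads = [threading.Thread(target=batch_task, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Eight jobs are submitted but never more than two run at once, so interactive requests keep the rest of the capacity, and a weekend backfill drains at a bounded, budgetable rate.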