Operations is where promises meet reality: on-call, dashboards, and incident timelines. LLM systems need traditional SRE discipline plus artifacts specific to nondeterminism—replay of prompts with hashed retrieval sets, shadow runs for new embedders, and evaluation hooks that fire when distributions shift. This hub links those habits to RAG operations, drift measurement, and safety alerting.
Logging for replay
Store enough context to reproduce a failure without unnecessarily retaining raw PII: model identifiers, prompt version hashes, tool outputs, and retrieval IDs. Replay records drive the postmortems described in the experiments essays.
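A minimal sketch of such a replay record, assuming hypothetical model and prompt identifiers; the point is that retrieval sets are logged as sorted, hashed IDs rather than raw documents:

```python
import hashlib
import json
import time

def hash_ids(ids):
    """Hash the retrieval ID set so replays can be matched without storing content."""
    joined = "\n".join(sorted(ids))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def replay_record(model_id, prompt_version, retrieval_ids, tool_outputs):
    """Capture just enough context to reproduce a failure later."""
    return {
        "ts": time.time(),
        "model_id": model_id,
        "prompt_version_hash": hashlib.sha256(prompt_version.encode()).hexdigest()[:12],
        "retrieval_set_hash": hash_ids(retrieval_ids),
        "retrieval_ids": sorted(retrieval_ids),  # IDs only, never raw documents
        "tool_outputs": tool_outputs,
    }

record = replay_record("model-2025-01", "checkout-v7", ["doc-42", "doc-7"], {"search": "3 hits"})
print(json.dumps(record, indent=2))
```

Because the retrieval hash is order-independent, two incidents that retrieved the same set surface as duplicates even if ranking differed.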
Shadow and canary
Run new retrieval models or rerankers on duplicated traffic before cutover—pair with cost caps so experiments do not surprise finance.
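One way to wire the cost cap into the shadow path itself, sketched with hypothetical sample rates and per-call costs; when the budget is exhausted, shadowing silently stops and the primary path is never affected:

```python
import random

class ShadowRunner:
    """Duplicate a fraction of traffic to a candidate retriever, capped by spend."""

    def __init__(self, candidate, sample_rate=0.1, budget_usd=50.0):
        self.candidate = candidate      # callable under evaluation
        self.sample_rate = sample_rate  # fraction of traffic to duplicate
        self.budget_usd = budget_usd    # hard cost cap for the experiment
        self.spent = 0.0
        self.results = []

    def maybe_shadow(self, query, cost_per_call=0.002):
        if self.spent + cost_per_call > self.budget_usd:
            return None  # cap reached: no surprise bill
        if random.random() > self.sample_rate:
            return None  # not sampled this time
        self.spent += cost_per_call
        out = self.candidate(query)
        self.results.append((query, out))
        return out
```

Comparing `results` against primary-path outputs offline gives the cutover evidence without putting the candidate on the serving path.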
Alerting
Combine empty-result rate, refusal deltas, and latency SLO breaches. Noise is the enemy; tune thresholds with release managers in the loop on what “customer visible” means.
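A sketch of one noise-reduction tactic under these assumptions: fire only when at least two independent signals breach at once, with thresholds (the numbers here are placeholders) agreed with release managers:

```python
def should_alert(metrics, thresholds, min_breaches=2):
    """Combine signals; a single noisy metric alone does not page anyone."""
    breaches = [
        name
        for name in ("empty_result_rate", "refusal_delta", "p95_latency_ms")
        if metrics[name] > thresholds[name]
    ]
    return len(breaches) >= min_breaches, breaches

THRESHOLDS = {"empty_result_rate": 0.05, "refusal_delta": 0.02, "p95_latency_ms": 1200}
```

Returning the breach list alongside the decision keeps the page actionable: the on-call sees which combination fired, not just that "quality" degraded.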
Postmortem template
Capture the timeline, blast radius, prompt and model IDs, retrieval IDs (hashed), whether the failure was a contract, model, or data-pipeline issue, and follow-up items with owners. Store artifacts next to incident tickets so patterns surface across teams.
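The template fields above can be made machine-checkable so every incident ticket carries the same structure; this is a sketch, and the field names are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    timeline: list            # timestamped events, first alert to recovery
    blast_radius: str         # who/what was affected
    prompt_id: str
    model_id: str
    retrieval_ids_hash: str   # hashed, never raw documents
    failure_class: str        # "contract" | "model" | "data_pipeline"
    follow_ups: list = field(default_factory=list)  # (item, owner) pairs
```

Keeping `failure_class` as an explicit field is what makes cross-team pattern mining possible: a quarterly query over stored postmortems can count contract failures versus model regressions versus pipeline breaks.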
Capacity and cost controls
Rate limits, batching, and queueing for heavy jobs protect both SLOs and spend. Pair with cost & latency dashboards so ops changes do not silently shift quality.
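The rate-limit half of this can be as simple as a token bucket in front of heavy jobs; a minimal sketch, with rate and capacity as tuning assumptions rather than recommended values:

```python
import time

class TokenBucket:
    """Allow `rate` requests/second with bursts up to `capacity`; excess is rejected."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Rejected requests go to a queue or get batched rather than dropped, and the rejection count belongs on the same dashboard as cost and latency so a tightened limit does not silently degrade quality.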
Runbooks and on-call
Document first-response steps: when to disable a feature flag, when to roll back a prompt version, and which dashboard confirms recovery. LLM incidents often need both infra and product owners—list both in the escalation path.
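A runbook entry can live as checked-in data rather than a wiki page, so the escalation path is reviewable in the same PRs that change the feature; this sketch uses entirely hypothetical flag names, dashboards, and rotations:

```python
# Hypothetical runbook entry: every name below is a placeholder.
RUNBOOK = {
    "symptom": "refusal rate spike on checkout assistant",
    "first_response": [
        "disable feature flag assistant_checkout",
        "roll back prompt version to last tagged release",
    ],
    "recovery_dashboard": "llm-ops/refusals",
    # LLM incidents need both infra and product owners in the path.
    "escalation": {"infra": "sre-oncall", "product": "assistant-pm"},
}

def escalation_contacts(runbook):
    """Return both owners; a runbook missing either is considered incomplete."""
    esc = runbook["escalation"]
    return esc["infra"], esc["product"]
```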
Vendor and region failover
If you rely on external APIs, note behaviors that differ across regions or providers (token limits, safety filters, tool schemas). Replay tests should cover failover paths, not only single-vendor happy paths.
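A failover wrapper is the piece replay tests should exercise; this sketch records which provider actually served the request so a replay can target the same path, with the provider names purely illustrative:

```python
def call_with_failover(providers, request):
    """Try providers in order; return (name, result) so logs show which path served."""
    errors = []
    for name, fn in providers:
        try:
            return name, fn(request)
        except Exception as exc:  # region/provider quirks (limits, filters) surface here
            errors.append((name, str(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

A replay test then asserts on behavior when the first provider raises, not only on the single-vendor happy path, catching differences such as stricter token limits or safety filters in the fallback region.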