Safety work fails when it lives only in slide decks. Durable practice ties policy to observable behavior: refusal rates on policy triggers, citation integrity when documents are retrieved, and user-visible transparency about limits. Those behaviors belong in evaluation suites and in prompt contracts, not solely in offline red-team theater.
Refusal and scope
Define what “out of scope” means per product surface, then test it with adversarial paraphrases. Regressions here often track model updates—pair policy tests with evaluation experiment workflows.
Citations under retrieval
When RAG is enabled, ban citations that were not present in the retrieved set (or explicitly marked as general knowledge). That rule is enforceable in code and testable in CI.
Transparency
Users deserve clear capability boundaries in product copy; internal prompts may carry more detail, but outward-facing honesty reduces misuse and support load. See prompt engineering essay for the ethical note on manipulation.
Operational alignment
Alerts on sudden refusal drops or toxicity spikes belong in operations runbooks alongside latency dashboards.
Regional and jurisdictional nuance
Policy triggers vary by market. Where possible, encode locale-specific rules in data and tests, not only in free-form system prompts—so updates are auditable and regressions are detectable.
Documentation for support
Publish a living “known limitations” note aligned with what the model is allowed to refuse or defer. It reduces duplicate tickets and sets expectations when marketing copy outpaces engineering guardrails.
Third-party and user-originated content
Web-fetched or user-pasted text can inject instructions or toxic payloads. Treat untrusted input with clear delimiters, allow-list patterns where possible, and include jailbreak-style probes in regular red-team cycles—not only at launch.