Safety & Policy

Safety work fails when it lives only in slide decks. Durable practice ties policy to observable behavior: refusal rates on policy triggers, citation integrity when documents are retrieved, and user-visible transparency about limits. Those behaviors belong in evaluation suites and in prompt contracts, not solely in offline red-team theater.

Refusal and scope

Define what “out of scope” means per product surface, then test it with adversarial paraphrases. Regressions here often track model updates—pair policy tests with evaluation experiment workflows.

Citations under retrieval

When RAG is enabled, ban citations that were not present in the retrieved set (or explicitly marked as general knowledge). That rule is enforceable in code and testable in CI.

Transparency

Users deserve clear capability boundaries in product copy; internal prompts may carry more detail, but outward-facing honesty reduces misuse and support load. See prompt engineering essay for the ethical note on manipulation.

Operational alignment

Alerts on sudden refusal drops or toxicity spikes belong in operations runbooks alongside latency dashboards.

Regional and jurisdictional nuance

Policy triggers vary by market. Where possible, encode locale-specific rules in data and tests, not only in free-form system prompts—so updates are auditable and regressions are detectable.

Documentation for support

Publish a living “known limitations” note aligned with what the model is allowed to refuse or defer. It reduces duplicate tickets and sets expectations when marketing copy outpaces engineering guardrails.

Third-party and user-originated content