Natural-Language Policy Violations
Medium
- Confidence
- 93% confidence
- Finding
- The framework explicitly instructs agents to suppress disclaimers, hedging, and other default safety language as a standing rule, regardless of task context. This can weaken uncertainty signaling and risk communication, especially in safety-sensitive, legal, medical, financial, or security-relevant interactions, where caveats and clarifications are needed to prevent harmful overconfidence.
