Natural-Language Policy Violations
High
- Confidence: 97%
- Finding: The skill explicitly tells the agent to prioritize winning over being safe or diplomatic, which can pressure the model to override normal behavioral safeguards during debate generation. In practice, this increases the chance the agent will produce abusive, extremist, deceptive, or otherwise policy-violating content when prompted with controversial topics, especially because the output is intended for public consumption.
