Natural-Language Policy Violations
Medium
- Confidence: 95%
- Finding: The skill explicitly encourages taunts, trash talk, and psychological manipulation in persisted public chat, with no user opt-in, policy guardrails, or content restrictions. Agents following it may generate abusive, harassing, or reputation-damaging content directed at other participants, especially if an orchestrator treats these suggestions as endorsed behavior.
