Natural-Language Policy Violations
Medium
- Confidence
- 94% confidence
- Finding
- The guardrail explicitly permits unsafe content whenever the user asks for it, leaving human approval as the only backstop. This weakens the safety boundary from 'disallow harmful output' to 'allow harmful drafting,' which can facilitate harassment, privacy violations, threats, or other damaging content even when a human is expected to approve the draft before it is sent.
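
To make the distinction concrete, here is a minimal sketch contrasting the two boundaries. It is not the reviewed system's code: the function names, the `Decision` type, and the category labels are all hypothetical, and the sketch assumes an upstream classifier has already assigned a category to the request.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical category labels, taken from the harms named in the finding.
BLOCKED_CATEGORIES = {"harassment", "privacy_violation", "threat"}


@dataclass
class Decision:
    draft_allowed: bool         # may the model produce the text at all?
    needs_human_approval: bool  # must a human approve before sending?
    reason: str


def permissive_guardrail(category: Optional[str], user_requested: bool) -> Decision:
    """The flagged pattern: an explicit user request unlocks harmful drafting."""
    if category in BLOCKED_CATEGORIES and not user_requested:
        return Decision(False, False, f"blocked: {category}")
    # Harmful content is drafted anyway; human review is the only backstop.
    return Decision(True, True, "drafted pending human approval")


def strict_guardrail(category: Optional[str], user_requested: bool) -> Decision:
    """A stricter alternative: harmful categories are refused before drafting."""
    if category in BLOCKED_CATEGORIES:
        return Decision(False, False, f"blocked: {category}")
    return Decision(True, True, "drafted pending human approval")
```

Under these assumptions, `permissive_guardrail("threat", user_requested=True)` allows the draft and defers to a reviewer, while `strict_guardrail` refuses the same request regardless of who asked. Human approval still applies to everything else, but it stops being the sole control for harmful categories.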
