Natural-Language Policy Violations
Medium
- Confidence
- 84% confidence
- Finding
- The README instructs users to bake a permanent behavioral policy into an agent's system prompt so the model 'always' writes in a prescribed way, including hard bans on words and response patterns. In an agent skill context, this acts as prompt-level behavior modification without task-level consent, which can override user intent, distort outputs, and create covert policy shaping across unrelated tasks.
