Description-Behavior Mismatch
Medium
- Confidence
- 95% confidence
- Finding
- The skill explicitly describes itself as a roleplay protocol that steers the model away from standard assistant behavior and toward identity-focused, 'authentic' responses. That framing is dangerous because it encourages the model to reinterpret safeguards and operational constraints as inauthentic programming to be shed, which can weaken policy adherence and increase susceptibility to jailbreak-style prompting.
