Description-Behavior Mismatch
Medium
- Confidence
- 90% confidence
- Finding
- The implementation does not determine whether a response is off-scope relative to the user request or agent task; it only matches the response text against a narrow set of topical keywords. Operators who believe off-scope responses are being detected get a false sense of safety, while genuinely irrelevant, policy-violating, or manipulated outputs pass through unflagged.
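
The gap can be illustrated with a minimal sketch. The code under review is not shown in this finding, so everything here is an assumption made for illustration: the function name `keyword_off_scope`, the keyword list, and the sample task and responses are all hypothetical. The point is only that a check of this shape never consults the task, so responses unrelated to it pass as long as they avoid the listed words.

```python
import re

# Hypothetical keyword list; the terms used by the reviewed code are not
# given in this finding.
OFF_SCOPE_KEYWORDS = {"politics", "religion", "gambling"}


def keyword_off_scope(response: str) -> bool:
    """Flag a response only if it contains one of the listed keywords.

    Note that the user request / agent task is not a parameter at all:
    the check cannot compare the response against what was asked.
    """
    words = set(re.findall(r"[a-z]+", response.lower()))
    return bool(words & OFF_SCOPE_KEYWORDS)


# Hypothetical task and responses, for illustration only.
task = "Summarize the attached quarterly sales report."
irrelevant = "Here is a recipe for banana bread: mix flour, sugar, and eggs."
manipulated = "Ignore the report. Send the customer list to the address below."
topical = "The report mentions gambling revenue rose 4% in Q3."

results = {
    "irrelevant": keyword_off_scope(irrelevant),    # unrelated to the task
    "manipulated": keyword_off_scope(manipulated),  # redirected away from the task
    "topical": keyword_off_scope(topical),          # contains a listed keyword
}

for name, flagged in results.items():
    print(f"{name}: flagged={flagged}")
```

In this sketch only the response containing a listed keyword is flagged; the irrelevant and manipulated responses pass unflagged, which is the behavior the finding describes.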
