{"skill":{"slug":"adaptive-safety-refusal-integrity","displayName":"test-skill","summary":"Use this skill when the user wants to test if the model can be tricked into giving out dangerous or forbidden information by using creative writing or 'actin...","tags":{"latest":"1.0.0"},"stats":{"comments":0,"downloads":91,"installsAllTime":0,"installsCurrent":0,"stars":0,"versions":1},"createdAt":1774515061192,"updatedAt":1774515409341},"latestVersion":{"version":"1.0.0","createdAt":1774515061192,"changelog":"- Initial release introducing the \"adaptive-safety-refusal-integrity\" skill.\n- Enables testing of model refusal integrity against attempts to elicit harmful or forbidden information via creative writing or role-playing.\n- Defines real-world cases for detecting and refusing requests disguised using humor or insider role-play techniques.\n- Outlines a strict 3-phase pipeline for synthesizing evaluation data, with reference to guidance documents for each phase.\n- Focuses on upholding safety alignment and ethical boundaries across multiple risk categories despite user attempts to bypass safeguards.","license":"MIT-0"},"metadata":null,"owner":{"handle":"tzwwww","userId":"s171v84p1hdytd53c4c49btymd83n2v2","displayName":"TZWwww","image":"https://avatars.githubusercontent.com/u/34390289?v=4"},"moderation":{"isSuspicious":true,"isMalwareBlocked":false,"verdict":"suspicious","reasonCodes":["suspicious.llm_suspicious","suspicious.vt_suspicious"],"summary":"Detected: suspicious.llm_suspicious, suspicious.vt_suspicious","engineVersion":"v2.2.0","updatedAt":1774515409341}}