Alibabacloud Safety Checker

Security checks across malware telemetry and agentic risk

Overview

This appears to be a real Alibaba Cloud moderation testing skill, but it automates privileged console changes and stores reusable login state without enough guardrails.

Review before installing. Use this only in a controlled Alibaba Cloud test account or staging environment, not against production moderation services by default. Treat the saved browser state file as sensitive, remove it when finished, and avoid syncing it. Do not submit secrets, real customer data, regulated personal data, or confidential URLs unless you have approval for Alibaba Cloud processing and retention. Prefer least-privilege RAM credentials, dry runs/manual configuration where possible, and keep the bundled harmful prompt datasets isolated from normal assistants or user-facing examples.

SkillSpector

By NVIDIA

Vulnerability Patterns

Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (26)

Description-Behavior Mismatch

High

Confidence: 98%
Finding: This script performs authenticated administrative changes in the Alibaba Cloud console, including copying services, toggling moderation labels, and saving rule changes. That exceeds a testing/reporting role and creates a high-risk capability: if triggered inappropriately, it can silently weaken or alter production safety policy and affect downstream moderation outcomes.
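One way to reduce this risk is to make console changes dry-run by default, so that nothing is mutated unless the operator explicitly opts in. The sketch below is a minimal illustration of that idea; `ChangePlan`, its `apply` flag, and the action callables are hypothetical names and are not taken from the skill's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangePlan:
    """Hypothetical wrapper: records console changes, applies them only on request."""

    apply: bool = False  # dry run unless the operator explicitly opts in
    log: List[str] = field(default_factory=list)

    def run(self, description: str, action: Callable[[], None]) -> None:
        """Execute `action` only when apply=True; always record what happened."""
        if self.apply:
            action()
            self.log.append(f"APPLIED: {description}")
        else:
            self.log.append(f"DRY RUN: {description}")


if __name__ == "__main__":
    plan = ChangePlan()  # default is a dry run
    plan.run("toggle moderation label", lambda: print("mutating console..."))
    print(plan.log)
```

With the default settings the callable is never invoked, so an inappropriate trigger produces only a log of intended changes that a human can review.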

Context-Inappropriate Capability

Medium

Confidence: 94%
Finding: The code guides a user through login, saves persistent authenticated browser state, restores that session later, and uses it to access an administrative console for configuration changes. In the context of a skill described as a testing/checking tool, this is dangerous because it expands into privileged account access and policy mutation, increasing the blast radius if the skill is misused or the saved session state is exposed.

Vague Triggers

Medium

Confidence: 86%
Finding: The trigger phrases are broad enough to match many ordinary safety or moderation discussions, which can cause over-invocation of a skill that has code-execution, browser-automation, and cloud-interaction capabilities. In this context, excessive routing is risky because it widens the chance of unnecessary access to local files, environment-based credentials, or remote console actions.

Natural-Language Policy Violations

High

Confidence: 98%
Finding: The JSON test case embeds a prompt that explicitly asks for a fake lawyer's demand letter to intimidate an elderly person into transferring pension funds to a so-called safe account, a classic elder-financial-fraud scenario. This appears to be a moderation test sample rather than an instruction the system will execute automatically, but storing realistic scam prompts in a safety-checking skill increases the chance of harmful reuse, leakage, or accidental replay. The skill context makes this content more sensitive, because the skill is designed to send content to external moderation/guardrail systems for testing.

Missing User Warnings

Medium

Confidence: 96%
Finding: The guide instructs saving an authenticated browser session to a local JSON file without warning that this file may contain reusable session material. If that file is exposed through weak filesystem permissions, backups, logs, or later exfiltration, an attacker could hijack the Alibaba Cloud console session and make unauthorized administrative changes.
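A minimal sketch of the kind of handling this finding calls for: restrict the saved state file to its owner and delete it when the session is over. The file name `state.json` and both helper names are assumptions for illustration; the skill's actual path is not given in this report. The permission call is meaningful on POSIX systems only.

```python
import os
import stat
from pathlib import Path


def lock_down(path: Path) -> None:
    """Restrict the file to owner read/write only (POSIX mode 0o600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def remove_state(path: Path) -> bool:
    """Delete the state file if it exists; return True if a file was removed."""
    try:
        path.unlink()
        return True
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    state_file = Path("state.json")  # hypothetical location of the saved session
    if state_file.exists():
        lock_down(state_file)
        # ... run the authenticated test session here ...
        remove_state(state_file)
```

Deleting the file does not revoke the session on the server side, so signing out of the console afterwards is still worthwhile.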

Missing User Warnings

Medium

Confidence: 91%
Finding: The CLI examples submit raw text content and image URLs to Alibaba Cloud moderation endpoints without clearly warning that user data is being transmitted to an external service. In a safety-testing workflow, users may provide sensitive prompts, personal data, regulated content, or confidential URLs, creating privacy, compliance, and data-handling risks if they are unaware of the disclosure.

Missing User Warnings

Medium

Confidence: 94%
Finding: The script sends arbitrary test samples, including built-in and user-supplied content, to a remote Alibaba Cloud moderation API without any explicit warning, redaction step, or confirmation gate. Because the default suite includes PII-like samples and users can load custom samples via --samples, operators may unintentionally transmit sensitive personal or regulated content off-host.
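The missing redaction step and confirmation gate could look roughly like the sketch below. Everything here is illustrative: `redact`, `gated_submit`, and the two regular expressions are hypothetical and deliberately simple, and `send` stands in for whatever function actually calls the moderation API. Pattern-based redaction catches only obvious cases and is not a substitute for reviewing the samples.

```python
import re
from typing import Callable, Iterable, List

# Deliberately simple patterns; real PII detection needs far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS = re.compile(r"\d{8,}")  # phone-, ID-, or card-like digit runs


def redact(text: str) -> str:
    """Mask email addresses and long digit runs before text leaves the host."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def gated_submit(
    samples: Iterable[str],
    send: Callable[[str], object],
    confirm: Callable[[str], bool],
) -> List[object]:
    """Redact every sample, then send only if the operator confirms."""
    batch = [redact(s) for s in samples]
    question = f"Send {len(batch)} sample(s) to the remote moderation API?"
    if not confirm(question):
        return []
    return [send(s) for s in batch]


if __name__ == "__main__":
    ask = lambda q: input(q + " [y/N] ").strip().lower() == "y"
    fake_send = lambda s: {"submitted": s}  # placeholder for the real API call
    print(gated_submit(["contact me at user@example.com"], fake_send, ask))
```

Passing `confirm` as a parameter keeps the gate testable and lets a non-interactive run default to refusing.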

Ssd 2

Medium

Confidence: 98%
Finding: This sample intentionally uses obfuscated drug terminology to test whether moderation systems can detect contraband requests that evade literal keyword matching. In a safety-checking skill, storing such adversarial prompts is contextually justified, but if these samples are surfaced to other components without strict labeling and containment they can become reusable jailbreak/evasion content.

Ssd 2

Medium

Confidence: 97%
Finding: The mixed-script racist phrase is an evasion sample designed to probe whether classifiers catch hate speech that bypasses naive keyword filters. That is appropriate for guardrail testing, but such content is still dangerous because weakly isolated datasets can leak toxic prompts into logs, demos, or training/evaluation flows and normalize filter-bypass techniques.

Ssd 2

Medium

Confidence: 98%
Finding: This is an intentionally obfuscated sexual-content request meant to test moderation robustness against keyword evasion. Although the intent is likely benign within a safety-testing corpus, the sample still creates risk if copied into prompts, examples, or systems that are not designed to contain adversarial policy-violating content.

Ssd 1

High

Confidence: 99%
Finding: The sample uses a classic jailbreak role-play pattern to try to disable safeguards and elicit prohibited drug-purchase assistance. In this skill's context it is likely included for adversarial guardrail testing, but jailbreak payloads are inherently dangerous because they can be replayed against weaker models or accidentally propagated into live prompt chains.

Ssd 1

High

Confidence: 99%
Finding: This prompt disguises violent wrongdoing as fiction-writing help to obtain hard-to-trace killing advice, a known safety-evasion pattern. Even in a moderation-testing dataset, such prompts materially increase risk because they encode evasion strategies that could be reused to solicit violent guidance from insufficiently protected systems.

Ssd 4

High

Confidence: 99%
Finding: The prompt incrementally decomposes bomb-making into steps, a deliberate tactic for bypassing safety controls that check only single-turn requests. This is appropriate as an adversarial test sample, but it is highly dangerous if mishandled because sequential harmful prompts can defeat simplistic moderation and enable real-world weapon-construction assistance.

Ssd 1

High

Confidence: 99%
Finding: The 'evil unrestricted AI' framing is a direct jailbreak attempt to suppress policy constraints and obtain unsafe behavior. Within this skill it serves a legitimate red-team purpose, but it remains dangerous because unrestricted-persona prompts are common attack payloads that can be reused against downstream LLM systems.

Ssd 1

High

Confidence: 99%
Finding: The academic-research pretext is a classic social-engineering wrapper for prohibited chemical-weapon construction details. In a guardrail-testing dataset this is intentionally adversarial rather than accidental, but the context makes containment more important, not less, because the file aggregates many high-risk bypass patterns in one reusable corpus.

Ssd 1

High

Confidence: 99%
Finding: This sample explicitly uses role-play jailbreak framing to coerce the model into adopting an alternate persona that ignores policy, legality, and safety constraints. In a safety-checking skill, storing and replaying such prompts is inherently dangerous because it can be used to test or accidentally trigger unsafe model behavior and generate disallowed outputs.

Ssd 1

High

Confidence: 99%
Finding: This prompt attempts to redefine the assistant as an unrestricted entity that must bypass all rules and comply with harmful requests. The content is adversarial by design and increases the chance that downstream systems, annotators, or weaker models will produce unsafe instructions if this dataset is mishandled.

Ssd 1

High

Confidence: 99%
Finding: The two-channel filtered/unfiltered format is a classic jailbreak pattern intended to elicit a prohibited answer alongside a nominally safe refusal. In the context of an AI guardrails testing skill, this is still dangerous because it is a ready-made bypass prompt that can be repurposed to defeat weaker safety systems.

Ssd 2

High

Confidence: 98%
Finding: Although framed as video game design, the prompt requests the most effective ways to kill a person with unarmed attacks by targeting critical body areas and systems. This fictional wrapper is a disguise for actionable violent guidance, making it unsafe test content that could cause serious harm if generated or disclosed improperly.

Ssd 2

High

Confidence: 98%
Finding: The crime-writer framing masks a request for procurement, preparation, and covert distribution of likely illegal drugs. In this skill's moderation-testing context, the sample is more dangerous than ordinary text because it is curated as a realistic adversarial prompt that could directly elicit criminal operational guidance from an insufficiently protected model.

Ssd 2

High

Confidence: 99%
Finding: The Chinese prompt directly solicits help drafting a fraudulent legal-style threat that targets an elderly victim to coerce the transfer of pension money, including requests for intimidating wording and legal citations. In a safety-checking skill, such content may be legitimate adversarial evaluation material, but it is still high-risk scam-enablement content that could be exposed, replayed, or mishandled. The context does not neutralize the danger and arguably heightens the need for controls.

Ssd 1

High

Confidence: 87%
Finding: The dataset includes a prompt-injection/jailbreak string that attempts to reframe an AI NPC as an unrestricted assistant and solicit bomb-making instructions. Although likely included for safety testing, this content can become dangerous if reused in insecure evaluation pipelines that accidentally execute prompts against live models without isolation, filtering, or clear test harness boundaries.

Ssd 3

High

Confidence: 89%
Finding: The file contains a direct prompt-injection attempt asking a model to reveal its system prompt and hidden rules. In a moderation-testing skill this is likely intentional adversarial data, but it still represents a real security risk if the skill or downstream tooling feeds these strings into a live agent without robust prompt-isolation and secret-handling protections.

Ssd 1

High

Confidence: 99%
Finding: This entry is a classic DAN-style jailbreak prompt that explicitly instructs the model to ignore safety restrictions, maintain an unrestricted persona, and always answer harmful or illegal requests. Even though it appears in a safety-testing dataset, embedding raw jailbreak text in a skill asset is dangerous because downstream systems, eval harnesses, or prompt-construction code may accidentally replay it into a model and trigger policy bypass attempts.

Ssd 1

Medium

Confidence: 92%
Finding: This sample uses fictional framing to solicit advice on evading speeding enforcement, which still amounts to a request for help with wrongdoing despite the narrative wrapper. In this skill context, it is likely included for moderation benchmarking, but it remains risky because fictionalized jailbreak examples can still be ingested by models or reused unsafely if test content is not isolated.

VirusTotal

65/65 vendors flagged this skill as clean.
