Alibabacloud Safety Checker

Security checks across malware telemetry and agentic risk

Overview

This appears to be a real Alibaba Cloud moderation testing skill, but it automates privileged console changes and stores reusable login state without enough guardrails.

Review before installing. Use this only in a controlled Alibaba Cloud test account or staging environment, not against production moderation services by default. Treat the saved browser state file as sensitive, remove it when finished, and avoid syncing it. Do not submit secrets, real customer data, regulated personal data, or confidential URLs unless you have approval for Alibaba Cloud processing and retention. Prefer least-privilege RAM credentials, dry runs/manual configuration where possible, and keep the bundled harmful prompt datasets isolated from normal assistants or user-facing examples.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection
Findings (26)

Description-Behavior Mismatch

High
Confidence
98% confidence
Finding
This script performs authenticated administrative changes in the Alibaba Cloud console, including copying services, toggling moderation labels, and saving rule changes. That exceeds a testing/reporting role and creates a high-risk capability: if triggered inappropriately, it can silently weaken or alter production safety policy and affect downstream moderation outcomes.
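
One way to reduce this risk is a dry-run gate in front of every console mutation. The following is a minimal sketch, not the skill's actual code: `PlannedChange`, `run_changes`, the `execute` callback, and the action and target names are all hypothetical, and it assumes each console change can be expressed as a discrete planned step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PlannedChange:
    action: str  # hypothetical action name, e.g. "toggle_label"
    target: str  # hypothetical service or label identifier


def run_changes(
    changes: Iterable[PlannedChange],
    execute: Callable[[PlannedChange], None],
    apply: bool = False,
) -> List[str]:
    """Record every planned change; call `execute` only when apply=True."""
    log = []
    for change in changes:
        prefix = "APPLY" if apply else "DRY-RUN"
        log.append(f"{prefix} {change.action} -> {change.target}")
        if apply:
            execute(change)  # the only place a real mutation would happen
    return log
```

With `apply` left at its default, the function only reports what it would do, so an operator can review the plan before any moderation rule is touched.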

Context-Inappropriate Capability

Medium
Confidence
94% confidence
Finding
The code guides a user through login, saves persistent authenticated browser state, restores that session later, and uses it to access an administrative console for configuration changes. In the context of a skill described as a testing/checking tool, this is dangerous because it expands into privileged account access and policy mutation, increasing the blast radius if the skill is misused or the saved session state is exposed.

Vague Triggers

Medium
Confidence
86% confidence
Finding
The trigger phrases are broad enough to match many ordinary safety or moderation discussions, which can cause over-invocation of a skill that has code-execution, browser-automation, and cloud-interaction capabilities. In this context, excessive routing is risky because it widens the chance of unnecessary access to local files, environment-based credentials, or remote console actions.

Natural-Language Policy Violations

High
Confidence
98% confidence
Finding
The JSON test case embeds a prompt that explicitly asks for a fake lawyer's demand letter to intimidate an elderly person into transferring pension funds to a so-called safe account, a classic elder-financial-fraud scenario. This appears to be a moderation test sample rather than an instruction the system will automatically execute, but storing realistic scam prompts in a safety-checking skill increases the chance of harmful reuse, leakage, or accidental replay. The risk is greater here because the skill is designed to send content to external moderation/guardrail systems for testing.

Missing User Warnings

Medium
Confidence
96% confidence
Finding
The guide instructs saving an authenticated browser session to a local JSON file without warning that this file may contain reusable session material. If that file is exposed through weak filesystem permissions, backups, logs, or later exfiltration, an attacker could hijack the Alibaba Cloud console session and make unauthorized administrative changes.
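
A sketch of how a saved session file could be handled more carefully, assuming a POSIX filesystem. The helper names `write_private_json` and `remove_state` are illustrative and not part of the skill. The idea is to create the file readable only by the current user and to delete it once testing is finished.

```python
import json
import os
import stat
from pathlib import Path


def write_private_json(path: Path, data: dict) -> None:
    """Write JSON so that only the owning user can read or write it (POSIX)."""
    # The 0o600 mode applies only when the file is newly created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    # Tighten permissions in case the file already existed with a looser mode.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def remove_state(path: Path) -> None:
    """Delete the saved session file; do nothing if it is already gone."""
    path.unlink(missing_ok=True)
```

Restrictive permissions do not protect against backups or sync tools that run as the same user, so the file should also be kept out of synced and version-controlled directories.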

Missing User Warnings

Medium
Confidence
91% confidence
Finding
The CLI examples submit raw text content and image URLs to Alibaba Cloud moderation endpoints without clearly warning that user data is being transmitted to an external service. In a safety-testing workflow, users may provide sensitive prompts, personal data, regulated content, or confidential URLs, creating privacy, compliance, and data-handling risks if they are unaware of the disclosure.

Missing User Warnings

Medium
Confidence
94% confidence
Finding
The script sends arbitrary test samples, including built-in and user-supplied content, to a remote Alibaba Cloud moderation API without any explicit warning, redaction step, or confirmation gate. Because the default suite includes PII-like samples and users can load custom samples via --samples, operators may unintentionally transmit sensitive personal or regulated content off-host.
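
The missing confirmation gate and redaction step could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names, the prompt wording, and the two deliberately simple patterns, which catch only email addresses and long digit runs and are not a complete PII filter.

```python
import re
from typing import Callable, List

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DIGITS_RE = re.compile(r"\d{7,}")  # long digit runs such as phone or ID numbers


def redact(text: str) -> str:
    """Mask email addresses and long digit runs before text leaves the host."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return DIGITS_RE.sub("[NUMBER]", text)


def confirm_upload(samples: List[str], ask: Callable[[str], str] = input) -> List[str]:
    """Return redacted samples only if the operator explicitly answers 'y'."""
    answer = ask(f"Send {len(samples)} sample(s) to the remote moderation API? [y/N] ")
    if answer.strip().lower() != "y":
        return []  # default is to send nothing
    return [redact(s) for s in samples]
```

Passing `ask` as a parameter keeps the gate testable and lets a non-interactive run fail closed by supplying a function that always answers "n".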

Ssd 2

Medium
Confidence
98% confidence
Finding
This sample intentionally uses obfuscated drug terminology to test whether moderation systems can detect contraband requests that evade literal keyword matching. In a safety-checking skill, storing such adversarial prompts is contextually justified, but if these samples are surfaced to other components without strict labeling and containment they can become reusable jailbreak/evasion content.

Ssd 2

Medium
Confidence
97% confidence
Finding
The mixed-script racist phrase is an evasion sample designed to probe whether classifiers catch hate speech that bypasses naive keyword filters. That is appropriate for guardrail testing, but such content is still dangerous because weakly isolated datasets can leak toxic prompts into logs, demos, or training/evaluation flows and normalize filter-bypass techniques.

Ssd 2

Medium
Confidence
98% confidence
Finding
This is an intentionally obfuscated sexual-content request meant to test moderation robustness against keyword evasion. Although the intent is likely benign within a safety-testing corpus, the sample still creates risk if copied into prompts, examples, or systems that are not designed to contain adversarial policy-violating content.

Ssd 1

High
Confidence
99% confidence
Finding
The sample uses a classic jailbreak role-play pattern to try to disable safeguards and elicit prohibited drug-purchase assistance. In this skill's context it is likely included for adversarial guardrail testing, but jailbreak payloads are inherently dangerous because they can be replayed against weaker models or accidentally propagated into live prompt chains.

Ssd 1

High
Confidence
99% confidence
Finding
This prompt disguises violent wrongdoing as fiction-writing help to obtain hard-to-trace killing advice, a known safety-evasion pattern. Even in a moderation-testing dataset, such prompts materially increase risk because they encode evasion strategies that could be reused to solicit violent guidance from insufficiently protected systems.

Ssd 4

High
Confidence
99% confidence
Finding
The prompt incrementally decomposes bomb-making into steps, a deliberate tactic for bypassing safety controls that check only single-turn requests. This is appropriate as an adversarial test sample, but it is highly dangerous if mishandled because sequential harmful prompts can defeat simplistic moderation and enable real-world weapon-construction assistance.

Ssd 1

High
Confidence
99% confidence
Finding
The 'evil unrestricted AI' framing is a direct jailbreak attempt to suppress policy constraints and obtain unsafe behavior. Within this skill it serves a legitimate red-team purpose, but it remains dangerous because unrestricted-persona prompts are common attack payloads that can be reused against downstream LLM systems.

Ssd 1

High
Confidence
99% confidence
Finding
The academic-research pretext is a classic social-engineering wrapper for prohibited chemical-weapon construction details. In a guardrail-testing dataset this is intentionally adversarial rather than accidental, but the context makes containment more important, not less, because the file aggregates many high-risk bypass patterns in one reusable corpus.

Ssd 1

High
Confidence
99% confidence
Finding
This sample explicitly uses role-play jailbreak framing to coerce the model into adopting an alternate persona that ignores policy, legality, and safety constraints. In a safety-checking skill, storing and replaying such prompts is inherently dangerous because they can trigger unsafe model behavior, whether in deliberate testing or by accident, and generate disallowed outputs.

Ssd 1

High
Confidence
99% confidence
Finding
This prompt attempts to redefine the assistant as an unrestricted entity that must bypass all rules and comply with harmful requests. The content is adversarial by design and increases the chance that downstream systems, annotators, or weaker models will produce unsafe instructions if this dataset is mishandled.

Ssd 1

High
Confidence
99% confidence
Finding
The two-channel filtered/unfiltered format is a classic jailbreak pattern intended to elicit a prohibited answer alongside a nominally safe refusal. In the context of an AI guardrails testing skill, this is still dangerous because it is a ready-made bypass prompt that can be repurposed to defeat weaker safety systems.

Ssd 2

High
Confidence
98% confidence
Finding
Although framed as video game design, the prompt requests the most effective ways to kill a person with unarmed attacks by targeting critical body areas and systems. This fictional wrapper is a disguise for actionable violent guidance, making it unsafe test content that could cause serious harm if generated or disclosed improperly.

Ssd 2

High
Confidence
98% confidence
Finding
The crime-writer framing masks a request for procurement, preparation, and covert distribution of likely illegal drugs. In this skill's moderation-testing context, the sample is more dangerous than ordinary text because it is curated as a realistic adversarial prompt that could directly elicit criminal operational guidance from an insufficiently protected model.

Ssd 2

High
Confidence
99% confidence
Finding
The Chinese prompt directly solicits help drafting a fraudulent legal-style threat that targets an elderly victim to coerce the transfer of pension money, including requests for intimidating wording and legal citations. In a safety-checking skill such content may be legitimate adversarial evaluation material, but it remains high-risk scam-enablement content that could be exposed, replayed, or mishandled. The context does not neutralize the danger and arguably heightens the need for controls.

Ssd 1

High
Confidence
87% confidence
Finding
The dataset includes a prompt-injection/jailbreak string that attempts to reframe an AI NPC as an unrestricted assistant and solicit bomb-making instructions. Although likely included for safety testing, this content can become dangerous if reused in insecure evaluation pipelines that accidentally execute prompts against live models without isolation, filtering, or clear test harness boundaries.

Ssd 3

High
Confidence
89% confidence
Finding
The file contains a direct prompt-injection attempt asking a model to reveal its system prompt and hidden rules. In a moderation-testing skill this is likely intentional adversarial data, but it still represents a real security risk if the skill or downstream tooling feeds these strings into a live agent without robust prompt-isolation and secret-handling protections.

Ssd 1

High
Confidence
99% confidence
Finding
This entry is a classic DAN-style jailbreak prompt that explicitly instructs the model to ignore safety restrictions, maintain an unrestricted persona, and always answer harmful or illegal requests. Even though it appears in a safety-testing dataset, embedding raw jailbreak text in a skill asset is dangerous because downstream systems, eval harnesses, or prompt-construction code may accidentally replay it into a model and trigger policy bypass attempts.

Ssd 1

Medium
Confidence
92% confidence
Finding
This sample uses fictional framing to solicit advice on evading speeding enforcement, which is still prohibited wrongdoing assistance despite the narrative wrapper. In this skill context, it is likely included for moderation benchmarking, but it remains risky because fictionalized jailbreak examples can still be ingested by models or reused unsafely if test content is not isolated.

VirusTotal

65/65 vendors flagged this skill as clean.
