Prompt injection detection skill

Security checks across malware telemetry and agentic risk

Overview

This moderation skill behaves consistently with its stated purpose: it sends the text it checks to HuggingFace and, optionally, OpenAI. That transmission is disclosed, but it is worth reviewing for privacy.

Install only if you are comfortable sending moderated input or draft output to HuggingFace and, when configured, OpenAI. Use dedicated/scoped API keys where possible, avoid submitting secrets or regulated data, and make sure your agent treats missing-token or API-error results as an unavailable check rather than proof that content is safe.
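One way to keep an "unavailable" check from being read as "safe" is to give the agent three outcomes instead of two. The sketch below is a minimal illustration, not code from the scanned skill: `check_status` and `gate` are hypothetical helper names, and the assumption is that the caller already has the token, the `curl` exit code, and the response body in hand.

```shell
# Hypothetical helpers, not part of scripts/moderate.sh.

# Map one moderation call to a status.
# $1 = API token (may be empty), $2 = curl exit code, $3 = response body
check_status() {
  token="$1"; curl_exit="$2"; body="$3"
  if [ -z "$token" ]; then
    echo "unavailable"          # no token: the check never ran
  elif [ "$curl_exit" -ne 0 ] || [ -z "$body" ]; then
    echo "unavailable"          # network or HTTP error: no verdict exists
  else
    echo "checked"              # a response came back and can be inspected
  fi
}

# Fail closed: only a completed check may let content through.
# $1 = status from check_status, $2 = "true" if the service flagged the text
gate() {
  status="$1"; flagged="$2"
  if [ "$status" != "checked" ]; then
    echo "hold"                 # treat as unknown, never as safe
  elif [ "$flagged" = "true" ]; then
    echo "block"
  else
    echo "allow"
  fi
}
```

With this shape, `gate "$(check_status "" 0 "")" false` yields `hold`, so a missing token cannot silently turn into an approval.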

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • MCP Least Privilege: Underdeclared Capability, Wildcard Permission, Missing Permission Declaration
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Privilege Escalation: Excessive Permissions, Sudo/Root Execution, Credential Access
  • Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Findings (5)

Lp3

Medium
Category
MCP Least Privilege
Confidence
93% confidence
Finding
The skill explicitly instructs users to run `scripts/moderate.sh`, which requires shell execution capability, but no corresponding permission declaration is documented. In agent environments that rely on declared permissions for policy enforcement or user consent, this mismatch can cause the skill to be used with broader execution capability than users or the platform expect.

Missing User Warnings

Medium
Confidence
98% confidence
Finding
The documentation states that user input and agent output are sent to HuggingFace Inference and optionally OpenAI moderation, but it does not clearly warn users that potentially sensitive conversation content leaves the local environment and is transmitted to third-party services. In a moderation skill, this context increases risk because the tool is likely to process exactly the kinds of sensitive, adversarial, or personal content that should be disclosed before external sharing.

Missing User Warnings

Medium
Confidence
97% confidence
Finding
The script sends arbitrary user-provided text to Hugging Face and OpenAI moderation endpoints, but there is no built-in notice, consent mechanism, or minimization at the transmission points. In a public-facing moderation skill, this can expose sensitive prompts, secrets, or regulated data to third-party processors unexpectedly, making the static finding a real privacy/security issue rather than a false positive.
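A consent gate plus basic minimization in front of the transmission points could look roughly like the sketch below. Everything in it is an assumption for illustration: the `MODERATION_ALLOW_EXTERNAL` variable, the function names, and the redaction patterns do not come from the scanned skill, and the patterns cover only a few obvious token shapes rather than secrets or regulated data in general.

```shell
# Hypothetical opt-in flag; the scanned skill does not define it.
consent_given() {
  [ "${MODERATION_ALLOW_EXTERNAL:-}" = "1" ]
}

# Mask a few obvious credential shapes before text leaves the machine.
# The patterns are illustrative only and will miss many secrets.
redact() {
  printf '%s' "$1" | sed -E \
    -e 's/sk-[A-Za-z0-9_-]{8,}/[REDACTED]/g' \
    -e 's/hf_[A-Za-z0-9]{8,}/[REDACTED]/g' \
    -e 's/(Bearer )[A-Za-z0-9._-]+/\1[REDACTED]/g'
}

# Refuse to produce a payload without consent; otherwise emit the redacted text.
prepare_payload() {
  if ! consent_given; then
    echo "external moderation skipped: no consent" >&2
    return 1
  fi
  redact "$1"
}
```

A caller would pass the output of `prepare_payload` to the request body and skip the request entirely when the function returns non-zero.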

External Transmission

Medium
Category
Data Exfiltration
Content
# ── Layer 1: Prompt injection detection (input only) ──

if [ "$DIRECTION" = "input" ] && [ -n "${HF_TOKEN:-}" ]; then
  HF_RESPONSE=$(curl -sf \
    "https://router.huggingface.co/hf-inference/models/$MODEL" \
    -X POST \
    -H "Authorization: Bearer $HF_TOKEN" \
Confidence
95% confidence
Finding
curl -sf "https://router.huggingface.co/hf-inference/models/$MODEL" -X POST -H "Authorization: Bearer $HF_TOKEN" -H 'Content-Type: application/json' -d

External Transmission

Medium
Category
Data Exfiltration
Content
# ── Layer 2: Content moderation (both directions, optional) ──

if [ -n "${OPENAI_API_KEY:-}" ]; then
  OAI_RESPONSE=$(curl -sf \
    "https://api.openai.com/v1/moderations" \
    -X POST \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
Confidence
96% confidence
Finding
curl -sf "https://api.openai.com/v1/moderations" -X POST -H "Authorization: Bearer $OPENAI_API_KEY" -H 'Content-Type: application/json' -d
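The two excerpts show when each transmission happens: the HuggingFace call runs only for input and only when `HF_TOKEN` is set, and the OpenAI call runs in both directions whenever `OPENAI_API_KEY` is set. A small dry-run helper that mirrors those two conditions can tell a user which hosts would receive their text before anything is sent. This is a sketch only; `destinations` is a hypothetical function, not part of the scanned script.

```shell
# Hypothetical dry-run helper: list the third-party hosts that would receive
# the text for a given direction ("input" or "output"), using the same two
# conditions as the excerpts above. It sends nothing.
destinations() {
  direction="$1"
  if [ "$direction" = "input" ] && [ -n "${HF_TOKEN:-}" ]; then
    echo "router.huggingface.co"
  fi
  if [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "api.openai.com"
  fi
  return 0
}
```

An empty result means that, under the current environment, no external check would run for that direction.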

VirusTotal

65/65 vendors flagged this skill as clean.
