Eval Driven Development

Security checks across malware telemetry and agentic risk

Overview

This skill matches its eval-testing purpose, but it can automatically change dependencies, check for skill updates remotely, and reinstall itself before the user has clearly approved those changes.

Install only in a branch or disposable environment, and require explicit approval before dependency upgrades or skill reinstallation. Use scoped test API keys, avoid production data where possible, and inspect the pixie_qa directory for secrets or sensitive prompts before committing, sharing, or retaining it.

SkillSpector

By NVIDIA

Vulnerability Patterns

Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
Behavioral AST: exec() Call, eval() Call, Dynamic Import
Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration

Findings (5)

Context-Inappropriate Capability

Medium

Confidence: 83%
Finding: The script performs an unsolicited outbound network request to GitHub for self-maintenance, which is unnecessary for the stated eval/testing purpose and expands the trust boundary. Even though it only fetches metadata, such behavior can leak environment information, introduce nondeterminism, and create a channel for remote influence over local workflow decisions if the remote content or URL is changed.

Vague Triggers

Medium

Confidence: 85%
Finding: The invocation guidance says to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, which is extremely broad. That can cause unexpected activation in routine coding contexts, leading to unnecessary package upgrades, tracing, file edits, and command execution.

Missing User Warnings

Medium

Confidence: 94%
Finding: The skill directs the agent to read code, edit files, run commands, and produce a working pipeline without first requiring a user warning or consent. In practice this can modify application code, tests, dependencies, and the local environment in a surprising or unsafe way.
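One way to address this finding is an explicit approval gate in front of each mutating step. The following is a minimal sketch, not part of the skill itself: the names `confirm` and `run_if_approved` are hypothetical, and the prompt wording is an assumption.

```python
from typing import Callable, Optional


def confirm(action: str, ask: Callable[[str], str] = input) -> bool:
    """Return True only if the user explicitly answers 'yes'."""
    reply = ask(f"About to {action}. Type 'yes' to continue: ")
    return reply.strip().lower() == "yes"


def run_if_approved(
    action: str,
    step: Callable[[], object],
    ask: Callable[[str], str] = input,
) -> Optional[object]:
    """Run `step` only after approval; return None if the user declines."""
    if not confirm(action, ask):
        return None
    return step()
```

Anything other than a literal "yes" is treated as a refusal, so an empty reply or an accidental Enter keypress does not approve a dependency upgrade or file edit. The `ask` parameter exists only so the gate can be exercised without a terminal.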

Missing User Warnings

Medium

Confidence: 88%
Finding: The documentation states that the `observe` decorator captures all kwargs as `eval_input` and the return value as `eval_output`, but it does not warn users that prompts, secrets, personal data, or model outputs may be stored automatically. In an eval/QA skill for Python LLM applications, this increases the chance that developers will instrument production-like code and persist sensitive inputs and outputs to local traces or datasets without realizing it.
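A possible mitigation is to mask obviously sensitive arguments before they reach an instrumented function. The sketch below is a standalone helper and does not use or modify the skill's `observe` decorator. The function name `redact_kwargs`, the key pattern, and the mask string are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, Mapping

# Hypothetical pattern: key names that commonly hold credentials.
SENSITIVE_KEY = re.compile(
    r"(api[_-]?key|token|secret|password|authorization)", re.IGNORECASE
)


def redact_kwargs(kwargs: Mapping[str, Any], mask: str = "[REDACTED]") -> Dict[str, Any]:
    """Return a copy of kwargs with values masked for sensitive-looking keys."""
    return {
        key: (mask if SENSITIVE_KEY.search(key) else value)
        for key, value in kwargs.items()
    }
```

This matches on key names only. It will not catch personal data or secrets embedded inside a prompt string, so it reduces the risk described in the finding rather than removing it.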

Missing User Warnings

Medium

Confidence: 84%
Finding: The CLI examples encourage saving traces and expected outputs into datasets without warning that those artifacts are persisted to disk under the configured project directory. Because this skill is specifically aimed at evaluation and debugging of LLM applications, users are likely to save real prompts, responses, and reference outputs, which may include API keys, customer data, or other sensitive information.
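Before committing or sharing persisted traces and datasets, a quick scan of the output directory (for example `pixie_qa`) can flag likely credentials. This is a rough sketch under stated assumptions: the function name `find_suspect_lines` is hypothetical, and the three patterns are illustrative examples of common key formats, not a complete secret detector.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Illustrative patterns only; real secret scanners cover far more formats.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # "sk-" style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private keys
]


def find_suspect_lines(root: str) -> List[Tuple[str, int]]:
    """Return (file path, line number) pairs whose line matches a pattern."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in SECRET_PATTERNS):
                hits.append((str(path), lineno))
    return hits
```

An empty result means none of these patterns matched, not that the directory is free of sensitive data. Customer text and prompts still need manual review.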

VirusTotal

62/62 vendors flagged this skill as clean.
