Eval Driven Development

Security checks across malware telemetry and agentic risk

Overview

This skill matches its eval-testing purpose, but it can automatically change dependencies, check for skill updates remotely, and reinstall itself before the user has clearly approved those changes.

Install only in a branch or disposable environment, and require explicit approval before dependency upgrades or skill reinstallation. Use scoped test API keys, avoid production data where possible, and inspect the pixie_qa directory for secrets or sensitive prompts before committing, sharing, or retaining it.
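The inspection step recommended above can be partly automated. What follows is a minimal sketch, assuming the artifacts are plain-text files under a directory named pixie_qa; the helper name scan_for_secrets and the three regular expressions are illustrative assumptions, not part of the skill, and they will miss many real secret formats.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumptions); dedicated secret scanners use many more rules.
SECRET_PATTERNS = {
    "openai_style_key": re.compile(r"sk-[A-Za-z0-9_-]{20,}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_assignment": re.compile(
        r"(?i)(api[_-]?key|secret|token|password)['\"]?\s*[=:]\s*['\"]?[^\s'\"]{8,}"
    ),
}


def scan_for_secrets(root):
    """Return (path, line_number, pattern_name) for each suspicious line under root."""
    root = Path(root)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for line_number, line in enumerate(text.splitlines(), start=1):
            for name, pattern in SECRET_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), line_number, name))
    return hits
```

Calling scan_for_secrets("pixie_qa") before a commit returns a list of locations to review by hand. An empty list does not prove the directory is clean, since prompts and personal data rarely match fixed patterns.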

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • Behavioral AST: exec() Call, eval() Call, Dynamic Import
  • Prompt Injection: Instruction Override, Hidden Instructions, Exfiltration Commands
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Findings (5)

Context-Inappropriate Capability

Severity: Medium
Confidence: 83%
Finding
The script performs an unsolicited outbound network request to GitHub for self-maintenance, which is unnecessary for the stated eval/testing purpose and expands the trust boundary. Even though it only fetches metadata, such behavior can leak environment information, introduce nondeterminism, and create a channel for remote influence over local workflow decisions if the remote content or URL is changed.

Vague Triggers

Severity: Medium
Confidence: 85%
Finding
The invocation guidance says to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, which is extremely broad. That can cause unexpected activation in routine coding contexts, leading to unnecessary package upgrades, tracing, file edits, and command execution.

Missing User Warnings

Severity: Medium
Confidence: 94%
Finding
The skill directs the agent to read code, edit files, run commands, and produce a working pipeline without first requiring a user warning or consent. In practice this can modify application code, tests, dependencies, and the local environment in a surprising or unsafe way.
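One way to address this finding is to gate every side-effecting command on an explicit confirmation. The sketch below is an assumption about how such a gate could look, not how the skill behaves; run_with_approval and its ask parameter are hypothetical names introduced for illustration.

```python
import shlex
import subprocess


def run_with_approval(command, ask=input):
    """Run command (a list of arguments) only after the user types 'yes'.

    Returns the CompletedProcess, or None when the user declines.
    """
    answer = ask(f"About to run: {shlex.join(command)}\nProceed? [yes/no] ")
    if answer.strip().lower() != "yes":
        return None
    # Passing a list (no shell=True) avoids shell interpretation of the arguments.
    return subprocess.run(command, check=False)
```

Requiring the literal word "yes" rather than accepting an empty reply means that pressing Enter by accident never approves a dependency upgrade or file edit.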

Missing User Warnings

Severity: Medium
Confidence: 88%
Finding
The documentation states that the `observe` decorator captures all kwargs as `eval_input` and the return value as `eval_output`, but it does not warn users that prompts, secrets, personal data, or model outputs may be stored automatically. In an eval/QA skill for Python LLM applications, this increases the chance that developers will instrument production-like code and persist sensitive inputs and outputs to local traces or datasets without realizing it.
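To make the risk concrete, the sketch below uses a hypothetical stand-in decorator, capture_io, that records keyword arguments and return values in the way the finding describes. It is not the skill's `observe` implementation. The redact helper and the SENSITIVE_KEYS set are assumptions showing one possible mitigation.

```python
import functools

CAPTURED = []  # stand-in for a trace store; a real tool may persist this to disk
SENSITIVE_KEYS = {"api_key", "password", "token", "authorization"}  # assumed names


def redact(kwargs):
    """Replace values of sensitive-looking keyword arguments with a placeholder."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in kwargs.items()
    }


def capture_io(func):
    """Hypothetical capture-all decorator (keyword arguments only), with redaction."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        output = func(**kwargs)
        CAPTURED.append({"eval_input": redact(kwargs), "eval_output": output})
        return output
    return wrapper


@capture_io
def answer(question, api_key):
    # Toy function standing in for an LLM call.
    return f"echo: {question}"


answer(question="What is 2 + 2?", api_key="sk-test-123")
print(CAPTURED[0])
```

Without the redact call, the api_key value would be stored verbatim alongside the question. Key-name redaction does nothing for secrets or personal data embedded in the prompt text or in the model output, so captured traces still need review.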

Missing User Warnings

Severity: Medium
Confidence: 84%
Finding
The CLI examples encourage saving traces and expected outputs into datasets without warning that those artifacts are persisted to disk under the configured project directory. Because this skill is specifically aimed at evaluation and debugging of LLM applications, users are likely to save real prompts, responses, and reference outputs, which may include API keys, customer data, or other sensitive information.
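If the persisted artifacts are not meant to be versioned, one precaution is to keep them out of Git entirely. The sketch below assumes the artifacts live in a pixie_qa/ directory at the repository root, which may not match the configured project directory; ensure_ignored is a hypothetical helper, not a command the skill provides.

```python
from pathlib import Path


def ensure_ignored(repo_root, entry="pixie_qa/"):
    """Append entry to <repo_root>/.gitignore unless it is already listed.

    Returns True if the file was changed, False otherwise.
    """
    gitignore = Path(repo_root) / ".gitignore"
    text = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the existing file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with gitignore.open("a", encoding="utf-8") as handle:
        handle.write(f"{prefix}{entry}\n")
    return True
```

This only prevents accidental commits. The traces and datasets remain on disk, so retention and sharing of the directory still need to be handled separately.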

VirusTotal

62/62 vendors flagged this skill as clean.
