Security audit

Improvement Evaluator

Security checks across malware telemetry and agentic risk

Overview

This skill is a disclosed execution evaluator that runs task suites through Claude and optional pytest checks. Its behavior matches its stated purpose, but it should be used only with trusted inputs.

Install only if you intend to run execution-based evaluations. Use --mock for dry runs, review task suites and fixtures before running them, avoid putting secrets or private customer data in candidate skills or prompts, and run pytest-based judges in a sandbox when fixtures are not fully trusted.

SkillSpector

By NVIDIA
Vulnerability Patterns
  • Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
  • Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
  • Output Handling: Unvalidated Output Injection, Cross-Context Output, Unbounded Output
  • Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
  • Behavioral AST: exec() Call, eval() Call, Dynamic Import
Findings (9)

subprocess module call

Medium
Category
Dangerous Code Execution
Content
output_file = Path(tmpdir) / "ai_output.txt"
output_file.write_text(output, encoding="utf-8")

result = subprocess.run(
    ["python3", "-m", "pytest", str(test_path), "-v",
     "--tb=short", f"--rootdir={tmpdir}"],
    capture_output=True,
Confidence
88% confidence
Finding
result = subprocess.run( ["python3", "-m", "pytest", str(test_path), "-v", "--tb=short", f"--rootdir={tmpdir}"], capture_output=True,

subprocess module call

Medium
Category
Dangerous Code Execution
Content
f"Respond with ONLY a JSON object: {{\"score\": <float>, \"reasoning\": \"<str>\"}}"
        )
        try:
            result = subprocess.run(
                ["claude", "-p", "--output-format", "json"],
                input=prompt,
                capture_output=True,
Confidence
91% confidence
Finding
result = subprocess.run( ["claude", "-p", "--output-format", "json"], input=prompt, capture_output=True, text=True,

Context-Inappropriate Capability

Medium
Confidence
86% confidence
Finding
The skill's stated purpose is evaluation, but this implementation achieves that by executing pytest-based test logic, which is effectively arbitrary Python code execution. In skill context, that is more dangerous because evaluators may be assumed to be passive/scoring-only components, while this one can run powerful code paths from test fixtures and plugin hooks.
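
One way to narrow the exposure described in this finding is to launch pytest with a stripped-down environment and with plugin autoloading disabled. The following is a minimal sketch, not the skill's actual code: the function name `build_pytest_invocation` and the `SAFE_ENV_KEYS` allowlist are hypothetical, while `PYTEST_DISABLE_PLUGIN_AUTOLOAD` and `-p no:cacheprovider` are standard pytest options. It does not sandbox anything; test files and any `conftest.py` still execute as ordinary Python.

```python
import os
import sys
from pathlib import Path

# Hypothetical allowlist: only these variables reach the pytest child process.
SAFE_ENV_KEYS = ("PATH", "HOME", "LANG", "TMPDIR")


def build_pytest_invocation(test_path, tmpdir, base_env=None):
    """Return (cmd, env) for a pytest run with a reduced environment.

    This limits accidental exposure of secrets held in environment
    variables. It is NOT a sandbox: the tests themselves can still run
    arbitrary code with the caller's file system permissions.
    """
    source = os.environ if base_env is None else base_env
    # Copy only allowlisted variables so API keys and tokens are not inherited.
    env = {key: source[key] for key in SAFE_ENV_KEYS if key in source}
    # Stop pytest from auto-loading third-party plugins via entry points.
    env["PYTEST_DISABLE_PLUGIN_AUTOLOAD"] = "1"
    cmd = [
        sys.executable, "-m", "pytest", str(Path(test_path)), "-v",
        "--tb=short", f"--rootdir={tmpdir}",
        "-p", "no:cacheprovider",  # avoid writing a .pytest_cache directory
    ]
    return cmd, env
```

A caller could pass the result to `subprocess.run(cmd, env=env, cwd=tmpdir, capture_output=True, text=True, timeout=60)`; the 60-second timeout is an arbitrary illustrative value. For fixtures that are not fully trusted, a container or VM boundary is still needed.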

Context-Inappropriate Capability

Medium
Confidence
84% confidence
Finding
This judge expands the skill from local pass/fail evaluation into invocation of an external model CLI. That broadens the operational and trust boundary beyond the manifest description and can expose data or create non-deterministic behavior not obvious to users of the skill. The context increases concern because evaluators are often expected to be reproducible and self-contained.

Description-Behavior Mismatch

High
Confidence
97% confidence
Finding
This task suite is materially misaligned with the declared purpose of the improvement-evaluator skill. Instead of measuring whether skill changes improve execution pass rate on predefined tasks, it evaluates a different skill's gate-ordering, decision logic, and review workflows, which can cause the evaluator to certify the wrong capabilities and produce misleading pass/fail evidence.

Context-Inappropriate Capability

Medium
Confidence
90% confidence
Finding
The suite tests gate-validation and human-review CLI operations that are outside the stated scope of an execution-evaluation skill. This broadens the skill's apparent authority and can train or validate the agent to perform operational workflow actions unrelated to evaluation, increasing the chance of misuse, confused-deputy behavior, or over-privileged integrations being exercised under the wrong skill boundary.

Missing User Warnings

Medium
Confidence
90% confidence
Finding
The code sends rubric content and AI output to an external CLI without any visible notice, consent, or redaction step. If evaluated outputs contain secrets, proprietary content, or personal data, this can result in unintended disclosure. In this skill context, outputs under evaluation may plausibly include exactly that kind of sensitive material.
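
A redaction pass before the prompt leaves the machine is one possible mitigation for this finding. The sketch below is illustrative only: `redact` and `REDACTION_PATTERNS` are hypothetical names, and the two regular expressions are simple examples that will miss many real secret formats.

```python
import re

# Hypothetical patterns; a real deployment would tune these to its own data.
REDACTION_PATTERNS = [
    # key=value or key: value pairs whose key suggests a credential
    (
        re.compile(
            r"\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)\S+",
            re.IGNORECASE,
        ),
        r"\1\2[REDACTED]",
    ),
    # email addresses
    (
        re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "[REDACTED_EMAIL]",
    ),
]


def redact(text):
    """Return (redacted_text, number_of_replacements)."""
    total = 0
    for pattern, replacement in REDACTION_PATTERNS:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total
```

Returning the replacement count lets the caller warn the user, or ask for confirmation, whenever something sensitive-looking was found in the rubric or the output under evaluation.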

Vague Triggers

Medium
Confidence
84% confidence
Finding
The task suite repeatedly invokes `/tech-article` as a broad, unscoped command without documenting when it should or should not activate. In an agent environment, overly broad triggers can cause accidental routing to the wrong skill, mis-handle unrelated user input, or let prompt patterns unintentionally activate article-generation behavior in contexts where it is not appropriate.

Unvalidated Output Injection

High
Category
Output Handling
Content
prompt_file = Path(tmpdir) / "prompt.txt"
prompt_file.write_text(prompt, encoding="utf-8")

result = subprocess.run(
    ["claude", "-p", "--output-format", "json"],
    input=prompt,
    capture_output=True,
Confidence
89% confidence
Finding
subprocess.run( ["claude", "-p", "--output-format", "json"], input=prompt, capture_output
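
Because the judge's reply is model-generated text, it is safer to validate it strictly before it influences a pass/fail decision. The sketch below assumes the reply text has already been extracted as a string holding the `{"score": ..., "reasoning": ...}` object requested in the prompt. The function name `parse_judge_response`, the `[0, 1]` score range, and the length cap are assumptions for illustration, not details confirmed by the audited code.

```python
import json

# Assumed cap on how much model-written reasoning is kept downstream.
MAX_REASONING_CHARS = 2000


def parse_judge_response(raw):
    """Validate a judge reply and return (score, reasoning).

    Raises ValueError on anything that does not match the expected shape,
    so malformed or manipulated output cannot be read as a score.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    # The range check also rejects NaN and infinity (assumed range: 0 to 1).
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be within [0, 1]")

    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")
    # Truncate rather than trust the length of model-written text.
    return float(score), reasoning[:MAX_REASONING_CHARS]
```

Treating the reasoning string as untrusted display text, rather than as instructions or markup, keeps a crafted output under evaluation from steering later steps.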

VirusTotal

67/67 vendors flagged this skill as clean.

Static analysis

No suspicious patterns detected.