Security audit

Improvement Evaluator

Security checks across malware telemetry and agentic risk

Overview

This skill is a disclosed execution evaluator that runs task suites through Claude and optional pytest checks. Its behavior matches its stated purpose, but it should be run only on trusted inputs.

Install only if you intend to run execution-based evaluations. Use --mock for dry runs, review task suites and fixtures before running them, avoid putting secrets or private customer data in candidate skills or prompts, and run pytest-based judges in a sandbox when fixtures are not fully trusted.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Output Handling: Unvalidated Output Injection, Cross-Context Output, Unbounded Output
Trigger Abuse: Overly Broad Trigger, Shadow Command Trigger, Keyword Baiting Trigger
Behavioral AST: exec() Call, eval() Call, Dynamic Import

Findings (9)

subprocess module call

Medium

Category: Dangerous Code Execution
Content: output_file = Path(tmpdir) / "ai_output.txt" output_file.write_text(output, encoding="utf-8") result = subprocess.run( ["python3", "-m", "pytest", str(test_path), "-v", "--tb=short", f"--rootdir={tmpdir}"], capture_output=True,
Confidence: 88%
Finding: result = subprocess.run( ["python3", "-m", "pytest", str(test_path), "-v", "--tb=short", f"--rootdir={tmpdir}"], capture_output=True,

subprocess module call

Medium

Category: Dangerous Code Execution
Content: f"Respond with ONLY a JSON object: {{\"score\": <float>, \"reasoning\": \"<str>\"}}" ) try: result = subprocess.run( ["claude", "-p", "--output-format", "json"], input=prompt, capture_output=True,
Confidence: 91%
Finding: result = subprocess.run( ["claude", "-p", "--output-format", "json"], input=prompt, capture_output=True, text=True,

Context-Inappropriate Capability

Medium

Confidence: 86%
Finding: The skill's stated purpose is evaluation, but this implementation achieves that by executing pytest-based test logic, which is effectively arbitrary Python code execution. In skill context, that is more dangerous because evaluators may be assumed to be passive/scoring-only components, while this one can run powerful code paths from test fixtures and plugin hooks.
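
One way to narrow the exposure described in this finding is to launch pytest with a stripped-down environment, a timeout, and third-party plugin autoloading disabled. The sketch below is illustrative only: the helper names (`build_pytest_command`, `minimal_env`, `run_judge`) and the 60-second timeout are assumptions, not part of the audited skill, and the environment handling assumes a POSIX-like host. It does not sandbox the test code itself; fixtures and `conftest.py` files still execute as ordinary Python, so untrusted suites still belong in a container or VM.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_pytest_command(test_path: Path, rootdir: Path) -> list[str]:
    """Build the pytest command line (hypothetical helper, not the skill's code)."""
    return [
        sys.executable, "-m", "pytest", str(test_path),
        "-v", "--tb=short", f"--rootdir={rootdir}",
        "-p", "no:cacheprovider",  # do not write a .pytest_cache directory
    ]


def minimal_env() -> dict[str, str]:
    """Return a small environment so API keys and tokens are not inherited."""
    return {
        "PATH": os.environ.get("PATH", ""),
        # Skip plugins that would otherwise load via installed entry points.
        "PYTEST_DISABLE_PLUGIN_AUTOLOAD": "1",
    }


def run_judge(test_path: Path, timeout_s: float = 60.0) -> subprocess.CompletedProcess:
    """Run one pytest judge in a throwaway working directory with a timeout."""
    with tempfile.TemporaryDirectory() as tmpdir:
        return subprocess.run(
            build_pytest_command(test_path.resolve(), Path(tmpdir)),
            capture_output=True,
            text=True,
            cwd=tmpdir,
            env=minimal_env(),
            timeout=timeout_s,  # raises subprocess.TimeoutExpired on a hang
        )
```

A caller would inspect `returncode` on the result and treat `subprocess.TimeoutExpired` as a failed judgment rather than retrying silently.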

Context-Inappropriate Capability

Medium

Confidence: 84%
Finding: This judge expands the skill from local pass/fail evaluation into invocation of an external model CLI. That broadens the operational and trust boundary beyond the manifest description and can expose data or create non-deterministic behavior not obvious to users of the skill. The context increases concern because evaluators are often expected to be reproducible and self-contained.

Description-Behavior Mismatch

High

Confidence: 97%
Finding: This task suite is materially misaligned with the declared purpose of the improvement-evaluator skill. Instead of measuring whether skill changes improve execution pass rate on predefined tasks, it evaluates a different skill's gate-ordering, decision logic, and review workflows, which can cause the evaluator to certify the wrong capabilities and produce misleading pass/fail evidence.

Context-Inappropriate Capability

Medium

Confidence: 90%
Finding: The suite tests gate-validation and human-review CLI operations that are outside the stated scope of an execution-evaluation skill. This broadens the skill's apparent authority and can train or validate the agent to perform operational workflow actions unrelated to evaluation, increasing the chance of misuse, confused-deputy behavior, or over-privileged integrations being exercised under the wrong skill boundary.

Missing User Warnings

Medium

Confidence: 90%
Finding: The code sends rubric content and AI output to an external CLI without any visible notice, consent, or redaction step. If evaluated outputs contain secrets, proprietary content, or personal data, this can result in unintended disclosure. In this skill context, outputs under evaluation may plausibly include exactly that kind of sensitive material.
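
A redaction pass before the prompt leaves the process is one possible mitigation for this finding. The following is a minimal sketch under stated assumptions: the two regular expressions and the `redact` helper are hypothetical and deliberately simple, and a real deployment would tune the patterns to its own secret formats or use a dedicated secret scanner.

```python
import re

# Hypothetical patterns for illustration; they will miss many real secret formats.
_PATTERNS = [
    # key=value or key: value pairs whose key looks like a credential name
    (
        re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[:=]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    # e-mail addresses
    (
        re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
        "[REDACTED_EMAIL]",
    ),
]


def redact(text: str) -> str:
    """Mask credential-like values and e-mail addresses in text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("API_KEY=abc123 contact bob@example.com"))
# prints: API_KEY=[REDACTED] contact [REDACTED_EMAIL]
```

Applying `redact` to both the rubric and the AI output before building the prompt keeps the masking in one place, and a printed notice naming the external CLI would address the missing-consent part of the finding.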

Vague Triggers

Medium

Confidence: 84%
Finding: The task suite repeatedly invokes `/tech-article` as a broad, unscoped command without documenting when it should or should not activate. In an agent environment, overly broad triggers can cause accidental routing to the wrong skill, mishandle unrelated user input, or let prompt patterns unintentionally activate article-generation behavior in contexts where it is not appropriate.

Unvalidated Output Injection

High

Category: Output Handling
Content: prompt_file = Path(tmpdir) / "prompt.txt" prompt_file.write_text(prompt, encoding="utf-8") result = subprocess.run( ["claude", "-p", "--output-format", "json"], input=prompt, capture_output=True,
Confidence: 89%
Finding: subprocess.run( ["claude", "-p", "--output-format", "json"], input=prompt, capture_output
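
The judge prompt quoted earlier asks the model for a JSON object with a `score` and a `reasoning` field, so the reply can be validated against that shape before it is used. The sketch below assumes the caller has already extracted the model's text reply from whatever envelope the CLI returns. The function name `parse_judge_reply`, the 0.0 to 1.0 score range, and the length cap are assumptions for illustration and do not come from the audited code.

```python
import json
import math


def parse_judge_reply(
    raw: str, lo: float = 0.0, hi: float = 1.0, max_reasoning_chars: int = 2000
) -> tuple[float, str]:
    """Validate a judge reply of the form {"score": <float>, "reasoning": "<str>"}.

    The score range and the reasoning length cap are assumed defaults.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or set(data) != {"score", "reasoning"}:
        raise ValueError("expected an object with exactly 'score' and 'reasoning'")

    score, reasoning = data["score"], data["reasoning"]

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    score = float(score)
    # json.loads accepts NaN and Infinity, so check finiteness as well as range.
    if not math.isfinite(score) or not lo <= score <= hi:
        raise ValueError(f"'score' must be a finite number in [{lo}, {hi}]")

    if not isinstance(reasoning, str):
        raise ValueError("'reasoning' must be a string")

    # Cap the free-text field so one reply cannot flood downstream logs.
    return score, reasoning[:max_reasoning_chars]
```

Rejecting malformed replies with an exception, rather than falling back to a default score, keeps a manipulated or off-format model output from silently influencing the evaluation result.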

VirusTotal

67/67 vendors flagged this skill as clean.

Static analysis

No suspicious patterns detected.