Eval Skills

Security checks across malware telemetry and agentic risk

Overview

This appears to be a genuine skill-evaluation framework, but it needs review: normal use can execute local or remote skill code and custom scorer code, and the trust boundaries around that execution are too loosely scoped.

Install only if you are comfortable running a tool that evaluates and may execute skill code. Use it in an isolated workspace, prefer Docker sandboxing with controlled network egress, avoid exposing broad API keys or secrets, review all skill entrypoints and custom scorer paths before running, and treat HTTP/MCP/LLM endpoints as places where benchmark data may leave your machine.

SkillSpector

By NVIDIA

Vulnerability Patterns

Data Exfiltration: External Transmission, Env Variable Harvesting, File System Enumeration
Supply Chain: Unpinned Dependencies, External Script Fetching, Obfuscated Code
Excessive Agency: Unrestricted Tool Access, Autonomous Decision Making, Scope Creep
Behavioral AST: exec() Call, eval() Call, Dynamic Import
MCP Tool Poisoning: Hidden Instructions, Unicode Deception, Parameter Description Injection

Findings (18)

eval() call detected

High

Category: Dangerous Code Execution
Content:
    # Safe subset of operations
    allowed = set("0123456789+-*/.() ")
    if all(c in allowed for c in expression):
        result = eval(expression)
        return {"result": str(result)}
    else:
        return {"result": "error: invalid characters in expression"}
Confidence: 95%
Finding: result = eval(expression)
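
One way to drop the eval() call while keeping the calculator behaviour is to parse the expression and walk only the node types that are explicitly permitted. The sketch below is a possible replacement, not the skill's actual code; `safe_calculate` and its error message are hypothetical names. It also rejects `**`, which the character allowlist in the flagged snippet lets through (both characters are `*`) and which can produce very large intermediate values.

```python
import ast
import operator

# Map the permitted AST operator nodes to their implementations.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> dict:
    """Evaluate +, -, *, / arithmetic on numeric literals without eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Only plain int/float literals are accepted (bool, str, etc. are not).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Calls, names, attributes, ** and everything else end up here.
        raise ValueError("unsupported expression")

    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return {"result": str(_eval(tree))}
    except (SyntaxError, ValueError, ZeroDivisionError, RecursionError):
        return {"result": "error: invalid expression"}
```

For example, `safe_calculate("2 + 3 * 4")` returns `{"result": "14"}`, while `safe_calculate("9**9**9")` returns the error result without evaluating anything.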

Context-Inappropriate Capability

High

Confidence: 96%
Finding: The adapter treats any non-HTTP(S) skill entrypoint as a local command and passes it to StdioClientTransport, which can spawn arbitrary executables with attacker-controlled arguments. In a framework that evaluates externally supplied skills, this expands the trust boundary from invoking skills to executing arbitrary local processes, enabling code execution, data access, and lateral movement on the host.
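
A policy gate in front of the transport choice would narrow this. The following is a minimal sketch under stated assumptions: `classify_entrypoint` and `ALLOWED_COMMANDS` are hypothetical, the allowlist contents are placeholders, and the adapter described in the finding is not written in Python. The point is only that a non-HTTP entrypoint is rejected unless its executable has been reviewed.

```python
import shlex
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would load it from reviewed config.
ALLOWED_COMMANDS = {"node", "python3"}


def classify_entrypoint(entrypoint: str) -> tuple[str, list[str]]:
    """Return ("http", [url]) or ("stdio", argv); raise for anything else."""
    scheme = urlparse(entrypoint).scheme.lower()
    if scheme in ("http", "https"):
        return "http", [entrypoint]
    # Split like a shell would, but never hand the string to a shell.
    argv = shlex.split(entrypoint)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"entrypoint command not allowlisted: {entrypoint!r}")
    return "stdio", argv
```

An allowlisted interpreter can still run arbitrary script arguments, so a check like this reduces surprise execution but does not replace sandboxing.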

Intent-Code Divergence

Medium

Confidence: 96%
Finding: The file advertises a GAIA benchmark that supposedly measures reasoning, tool use, and multimodality, but the actual tasks are trivial arithmetic exact-match checks. This creates a benchmark integrity issue: downstream users may believe a skill has been validated against a meaningful general-assistant benchmark when it has only passed toy tasks, enabling misleading quality claims and unsafe promotion to production.

Context-Inappropriate Capability

High

Confidence: 97%
Finding: The code resolves a user-controlled filesystem path and dynamically imports that module inside a Worker, which still grants the imported code the same host-level capabilities available to the Node.js process such as filesystem, network, subprocess, and environment access. Worker isolation here only provides concurrency and limited memory control, not a security sandbox, so this enables arbitrary code execution from attacker-supplied scorer files.
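
A partial mitigation is to confine scorer paths to a reviewed directory before anything is imported. The sketch below assumes a hypothetical `resolve_scorer_path` helper and an operator-chosen `approved_dir`; it limits which files can be loaded, but, as the finding notes for Workers, it does not sandbox the code once loaded.

```python
from pathlib import Path


def resolve_scorer_path(user_path: str, approved_dir: str) -> Path:
    """Resolve user_path and require that it is a file inside approved_dir."""
    base = Path(approved_dir).resolve()
    # resolve() follows symlinks and collapses "..", so escapes become visible.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise PermissionError(f"scorer path escapes approved directory: {user_path!r}")
    if not candidate.is_file():
        raise FileNotFoundError(f"scorer file not found: {user_path!r}")
    return candidate
```

`Path.is_relative_to` requires Python 3.9 or later.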

Description-Behavior Mismatch

Medium

Confidence: 88%
Finding: The embed() method sends arbitrary input text to OpenAI's embeddings API, which can include skill content, prompts, or evaluation data. In a unit-testing/evaluation framework, that creates a real data exposure risk if users assume processing is local or if sensitive benchmark content is passed through this path without explicit disclosure or consent.

Intent-Code Divergence

High

Confidence: 98%
Finding: The file advertises strong Docker sandboxing guarantees, including seccomp filtering, but `loadSeccompProfile()` falls back to returning `"unconfined"` if the profile file is missing. That creates a fail-open security posture: deployments may believe syscall filtering is active when it is silently disabled, materially weakening containment for untrusted code executed by this skill.
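
A fail-closed loader would raise instead of returning `"unconfined"`. The sketch below illustrates that behaviour in Python; it is not the project's `loadSeccompProfile()`, and `load_seccomp_profile` and `SeccompProfileError` are hypothetical names. The returned string follows Docker's `--security-opt seccomp=<path>` form.

```python
import json
from pathlib import Path


class SeccompProfileError(RuntimeError):
    """Raised when the seccomp profile cannot be loaded."""


def load_seccomp_profile(path: str) -> str:
    """Return a Docker security-opt value, or raise if the profile is unusable."""
    profile = Path(path)
    if not profile.is_file():
        # Fail closed: a missing profile stops the run instead of disabling filtering.
        raise SeccompProfileError(f"seccomp profile not found: {path}")
    try:
        json.loads(profile.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        raise SeccompProfileError(f"seccomp profile is not valid JSON: {path}") from exc
    return f"seccomp={profile.resolve()}"
```

If an unconfined run is ever needed, making it a separate, explicitly named option keeps the weaker mode visible to the operator.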

Description-Behavior Mismatch

Medium

Confidence: 87%
Finding: This monitor introduces side effects outside in-memory evaluation by writing violation data to disk and optionally transmitting alerts over the network. In a testing/evaluation framework, those behaviors can leak sensitive skill identifiers or violation details, create compliance/privacy issues, and violate assumptions that evaluation runs are isolated and non-exfiltrating.

Context-Inappropriate Capability

Medium

Confidence: 93%
Finding: The webhook path performs outbound network communication using violation contents, which can exfiltrate internal metadata such as skill IDs, timestamps, and details to arbitrary URLs. In an evaluation framework, unsolicited network egress is especially risky because test harnesses are often expected to be deterministic, isolated, and safe to run on sensitive internal assets.

Intent-Code Divergence

Medium

Confidence: 97%
Finding: The profile claims certain high-risk syscalls are explicitly forbidden, but at least some of them, such as ptrace and clock_adjtime, are already permitted earlier in the allowlist. In seccomp profiles, rule order and conflict semantics can make such contradictions dangerous because reviewers may believe protections exist when the sandbox actually allows powerful primitives that can aid process inspection, tampering, or broader sandbox escape chains.
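
Contradictions of this kind can be caught mechanically. The sketch below assumes the profile uses Docker's JSON layout (a top-level `syscalls` list whose entries carry `names` and `action`); `find_allowed_forbidden` and the forbidden set passed to it are hypothetical, and the check only reports overlap without modelling rule precedence or argument filters.

```python
def find_allowed_forbidden(profile: dict, forbidden: set[str]) -> list[str]:
    """Return forbidden syscall names that some rule in the profile allows."""
    allowed: set[str] = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    # Sorted so the report is stable across runs.
    return sorted(allowed & forbidden)
```

Run against a parsed profile and the list of syscalls the documentation claims are blocked, a non-empty result means the claim and the allowlist disagree, which is a reasonable condition for failing a CI check.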

Missing User Warnings

Medium

Confidence: 90%
Finding: The eval command is presented as routine benchmark execution, but its documented behavior includes concurrent task execution, retries, and loading skills from local paths/directories without warning that those skills may perform arbitrary side effects. Because the product is specifically meant to run candidate skills before production, users are especially likely to evaluate untrusted or semi-trusted code, making silent code execution and side effects materially risky.

Missing User Warnings

Medium

Confidence: 91%
Finding: The adapter documentation states that HTTP skills can send POST requests and use Bearer/API-Key auth from environment variables, but it does not warn that benchmark inputs and other task data may be transmitted to external endpoints or that sensitive credentials are in scope during execution. This can lead to unintended data exfiltration, credential misuse, or disclosure when users test third-party skills or misconfigured endpoints.
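
One way to keep credentials in scope only where intended is to bind each credential to an approved host, so that a third-party or misconfigured endpoint never receives it. This is a sketch under assumptions: `CREDENTIAL_FOR_HOST`, `auth_headers_for`, the host name, and the variable name are all hypothetical and do not describe the adapter's real configuration.

```python
import os
from urllib.parse import urlparse

# Hypothetical mapping from an approved host to the env var holding its token.
CREDENTIAL_FOR_HOST = {"skills.example.com": "SKILL_API_TOKEN"}


def auth_headers_for(url: str, environ=None) -> dict:
    """Return a Bearer header only for approved HTTPS hosts; otherwise nothing."""
    environ = os.environ if environ is None else environ
    parsed = urlparse(url)
    var_name = CREDENTIAL_FOR_HOST.get(parsed.hostname or "")
    if parsed.scheme != "https" or var_name is None or var_name not in environ:
        return {}
    return {"Authorization": f"Bearer {environ[var_name]}"}
```

Benchmark inputs still leave the machine when an HTTP skill is called, so a check like this addresses credential exposure only, not data disclosure.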

Missing User Warnings

Medium

Confidence: 87%
Finding: This code launches a local subprocess based on the skill entrypoint without any consent, disclosure, or visible policy gate in the adapter. In the context of a skill evaluation framework, users may reasonably expect analysis and testing behavior, not unrestricted local process execution, which increases the risk of surprise execution of malicious binaries.

Missing User Warnings

Medium

Confidence: 83%
Finding: The adapter automatically connects to any HTTP/HTTPS entrypoint via SSE without validating destination trustworthiness or warning the user. For a system that evaluates third-party skills, this can cause unintended outbound connections, metadata leakage, SSRF-like access to internal services, or interaction with attacker-controlled MCP endpoints.
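
A destination check before connecting can rule out the most common SSRF targets. The sketch below is illustrative only: `is_public_destination` is a hypothetical helper, and the `resolve` parameter exists so the check can be exercised without network access. It accepts a URL only if every resolved address is globally routable.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_public_destination(url: str, resolve=socket.getaddrinfo) -> bool:
    """Return True only for http(s) URLs whose addresses are all globally routable."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            return False
        port = parsed.port or (443 if parsed.scheme == "https" else 80)
        infos = resolve(parsed.hostname, port, proto=socket.IPPROTO_TCP)
    except (ValueError, OSError):
        # Malformed URL, invalid port, or a name that does not resolve.
        return False
    # Strip any IPv6 zone suffix ("%eth0") before parsing the address.
    addresses = [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]
    return bool(addresses) and all(address.is_global for address in addresses)
```

Because the later connection resolves the name again, this check alone does not prevent DNS rebinding; connecting to the already validated address, or enforcing egress rules at the network layer, would be needed for that.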

Missing User Warnings

Medium

Confidence: 89%
Finding: The implementation executes user-provided code as part of normal evaluation flow without any explicit warning, consent, or trust check, which can cause operators to treat evaluation inputs as data when they are actually executable code. In a skill-evaluation framework, this increases the likelihood of accidental execution of untrusted repository content and turns a quality-testing feature into an implicit code-execution surface.

Missing User Warnings

Medium

Confidence: 94%
Finding: This code sends both the model output and the expected/reference value to an external LLM provider for scoring. In an evaluation framework, those fields can contain proprietary prompts, test cases, secrets, PII, or regulated data, and this file does not enforce consent, redaction, or any user-visible disclosure before transmission.
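
An explicit opt-in plus a redaction pass before transmission would address the missing consent and redaction noted here. The sketch below uses hypothetical names (`build_scoring_payload`, `external_scoring_approved`), and its two token patterns are illustrative assumptions about what a secret looks like; pattern-based redaction is best-effort and will miss PII and most proprietary content.

```python
import re

# Illustrative patterns only; real secret formats vary widely.
_TOKEN_PATTERNS = [
    re.compile(r"(?i)bearer\s+[a-z0-9._~+/=-]+"),
    re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
]


def redact(text: str) -> str:
    """Replace anything matching a known token pattern with a marker."""
    for pattern in _TOKEN_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def build_scoring_payload(output: str, expected: str, *, external_scoring_approved: bool) -> dict:
    """Build the fields sent to an external scorer, refusing without opt-in."""
    if not external_scoring_approved:
        raise PermissionError("external LLM scoring requires explicit opt-in")
    return {"output": redact(output), "expected": redact(expected)}
```

Requiring the keyword argument at every call site makes the outbound transmission a visible decision rather than a default.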

Missing User Warnings

Medium

Confidence: 91%
Finding: This file performs network transmission of raw text inputs to a third-party API without any user-facing notice, guardrail, or consent mechanism in the implementation shown. That is dangerous because users may unknowingly send proprietary skills, prompts, or test cases outside their environment, creating confidentiality and compliance risks.

Missing User Warnings

Medium

Confidence: 92%
Finding: In "auto" mode, the factory silently falls back from DockerSandbox to ProcessSandbox when Docker is unavailable. That weakens the isolation boundary without forcing explicit acknowledgment, so callers may believe they are running with strong container isolation while actually executing in a less isolated local process context, which is risky for a skill evaluation framework that may execute untrusted code.
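
A factory that refuses to downgrade unless the caller has acknowledged it would remove the silent fallback. The following sketch is an assumption-laden illustration, not the project's factory: `choose_sandbox`, `allow_process_fallback`, and `SandboxUnavailableError` are hypothetical, and `shutil.which("docker")` only checks that a Docker CLI is on the PATH, not that a daemon is reachable.

```python
import shutil


class SandboxUnavailableError(RuntimeError):
    """Raised when the requested isolation level cannot be provided."""


def choose_sandbox(mode: str, *, allow_process_fallback: bool = False,
                   docker_available=None) -> str:
    """Return "docker" or "process"; never downgrade without explicit consent."""
    if mode not in ("docker", "process", "auto"):
        raise ValueError(f"unknown sandbox mode: {mode!r}")
    if mode == "process":
        return "process"
    if docker_available is None:
        # Rough heuristic; a real check would also probe the daemon.
        docker_available = shutil.which("docker") is not None
    if docker_available:
        return "docker"
    if mode == "auto" and allow_process_fallback:
        return "process"
    raise SandboxUnavailableError(
        "Docker is unavailable; pass allow_process_fallback=True to accept weaker isolation"
    )
```

With this shape, `choose_sandbox("auto")` on a host without Docker raises instead of quietly running in a local process.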

Known Vulnerable Dependency: vitest==1.4.0, 1 advisory: CVE-2025-24964 (Vitest allows Remote Code Execution when accessing a malicious website while Vit)

Critical

Category: Supply Chain
Confidence: 98%
Finding: vitest==1.4.0

VirusTotal

64/64 vendors flagged this skill as clean.
