Eval Skills

Advisory: audited by static analysis on May 10, 2026.

Overview

Detected: suspicious.dangerous_exec, suspicious.dynamic_code_execution, suspicious.env_credential_access

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

What this means

A skill you evaluate may run local code under the evaluator's permissions.

Why it was flagged

The framework launches configured skill subprocesses. This fits the stated evaluation purpose, but it means evaluated skills can execute code on the local machine unless containment is strong.

Skill content
const child = spawn(finalExec, finalArgs, {
Recommendation

Use Docker or another strong sandbox for untrusted skills, avoid running with sensitive environment variables, and review skill entrypoints before evaluation.
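The scrubbed-environment part of this recommendation can be sketched as a small launcher that drops all but an allowlisted set of environment variables and bounds the subprocess with a timeout. This is a minimal Python sketch of the pattern (the framework itself is TypeScript, and `SAFE_ENV_VARS` is a hypothetical allowlist, not part of eval-skills); it does not replace a real container sandbox.

```python
import os
import subprocess

# Hypothetical allowlist: only benign variables reach the evaluated skill.
SAFE_ENV_VARS = {"PATH", "HOME", "LANG", "TMPDIR"}

def run_skill(exec_path, args, timeout=30):
    """Launch a skill subprocess with a scrubbed environment and a hard timeout."""
    clean_env = {k: v for k, v in os.environ.items() if k in SAFE_ENV_VARS}
    return subprocess.run(
        [exec_path, *args],
        env=clean_env,        # secrets such as API keys are not inherited
        capture_output=True,  # keep skill output off the parent's streams
        timeout=timeout,      # bound runaway skills
        text=True,
    )
```

Even with a clean environment, the subprocess still runs with the evaluator's filesystem and network access, which is why a container remains the stronger option for untrusted skills.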

Concern: Medium confidence
ASI03: Identity and Privilege Abuse
What this means

A mistaken or malicious HTTP skill configuration could cause a local secret to be sent to a configured network endpoint.

Why it was flagged

The HTTP adapter reads a token from an environment variable chosen at runtime. In a framework that evaluates externally supplied skill configurations, this needs explicit allowlisting/approval; the registry metadata declares no env vars or primary credential.

Skill content
const token = process.env[envKey];
Recommendation

Document credential use in metadata, restrict which environment variables may be read, require explicit user approval before sending bearer tokens, and run evaluations in a clean environment.
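The "restrict which environment variables may be read" step can be expressed as a guard in front of the runtime lookup. This is a Python sketch of the idea (the actual adapter is TypeScript; `APPROVED_TOKEN_VARS` and `read_token` are illustrative names, not eval-skills APIs):

```python
import os

# Hypothetical allowlist of credential variables the adapter may read.
APPROVED_TOKEN_VARS = {"EVAL_SKILLS_LLM_KEY"}

def read_token(env_key: str):
    """Return a credential only if its variable name was explicitly approved."""
    if env_key not in APPROVED_TOKEN_VARS:
        raise PermissionError(f"env var {env_key!r} is not approved for credential use")
    return os.environ.get(env_key)
```

With this shape, an externally supplied skill configuration cannot name an arbitrary variable (say, a cloud provider secret) and have it attached to an outgoing request.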

What this means

The example calculator should be treated as a demo, not as a hardened parser for untrusted input.

Why it was flagged

The bundled calculator example uses Python `eval` on an allowlisted expression. The allowlist reduces arbitrary code risk, but `eval` remains a sensitive pattern and could still allow expensive expressions.

Skill content
if all(c in allowed for c in expression):
    result = eval(expression)
Recommendation

Replace `eval` with a safe expression parser or impose strict complexity and resource limits if this example is used beyond tests.
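One way to replace `eval` here, sketched under the assumption that only basic arithmetic needs to be supported: parse the expression with Python's `ast` module and walk only a fixed set of node types, rejecting everything else. Exponentiation is deliberately left out to avoid the expensive-expression problem the finding mentions.

```python
import ast
import operator

# Arithmetic-only node handlers; any other syntax is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(expression: str):
    """Evaluate a basic arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("disallowed syntax in expression")
    return walk(ast.parse(expression, mode="eval"))
```

Unlike a character allowlist, this rejects whole categories of syntax (calls, attribute access, names), so `__import__('os')` fails at the walk stage rather than reaching an interpreter.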

What this means

Installing from source may execute dependency/build scripts from the cloned project and its package ecosystem.

Why it was flagged

Manual installation clones code from an external Git repository and then executes package-manager build steps. This is normal for a source-installed CLI, but users should verify provenance.

Skill content
git clone https://github.com/isLinXu/eval-skills.git
cd eval-skills
pnpm install && pnpm build
Recommendation

Verify the repository, prefer pinned/locked installs, and review dependency changes before running build or install commands.

What this means

Evaluation inputs, outputs, and reports may remain on disk after the run.

Why it was flagged

Evaluation results are persisted to a local SQLite database by default when using the store option. This is purpose-aligned but can retain evaluated task data.

Skill content
`--store <path>` | SQLite database path for persistent result storage | `./eval-skills.db`
Recommendation

Choose storage paths intentionally, avoid evaluating sensitive data unless needed, and delete or protect the database/reports when appropriate.
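The "protect the database" part of this recommendation can be sketched as two small helpers: open the results file with owner-only permissions, and delete it once reports are exported. This is an illustrative Python sketch (eval-skills itself manages the SQLite file from TypeScript; these function names are not part of its API):

```python
import os
import sqlite3

def open_results_db(path: str) -> sqlite3.Connection:
    """Create or open the results database with owner-only permissions."""
    conn = sqlite3.connect(path)
    os.chmod(path, 0o600)  # keep evaluation data readable by the owner only
    return conn

def purge_results_db(path: str) -> None:
    """Delete the database once reports have been exported."""
    if os.path.exists(path):
        os.remove(path)
```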

What this means

Private task data or skill outputs may be shared with an LLM provider when LLM judging is enabled.

Why it was flagged

The LLM-judge scorer uses an LLM key, implying evaluation content may be sent to an LLM-backed service. This is expected for LLM judging but should be kept in mind when working with private benchmarks or outputs.

Skill content
### LLM Judge
...
Requires `EVAL_SKILLS_LLM_KEY` environment variable.
Recommendation

Use LLM judging only with data you are allowed to send externally, and confirm which provider/key is configured.
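One way to make that confirmation explicit is a guard that enables LLM judging only when the operator has opted in and the key is actually configured. A minimal sketch, where `allow_external_data` is a hypothetical flag (not an eval-skills option) the operator sets after confirming the content may leave the machine:

```python
import os

def llm_judge_enabled(allow_external_data: bool) -> bool:
    """Allow LLM judging only with an explicit opt-in and a configured key."""
    if not allow_external_data:
        # Default-deny: never send evaluation content without a deliberate choice.
        return False
    return bool(os.environ.get("EVAL_SKILLS_LLM_KEY"))
```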

Findings (5)

critical

suspicious.dangerous_exec

Location
packages/cli/src/commands/__tests__/eval.test.ts:44
Finding
Shell command execution detected (child_process).
critical

suspicious.dangerous_exec

Location
packages/core/src/sandbox/ProcessSandbox.ts:85
Finding
Shell command execution detected (child_process).
critical

suspicious.dynamic_code_execution

Location
examples/skills/calculator/skill.py:31
Finding
Dynamic code execution detected.
critical

suspicious.dynamic_code_execution

Location
packages/core/src/sandbox/sandbox.integration.test.ts:333
Finding
Dynamic code execution detected.
critical

suspicious.env_credential_access

Location
packages/core/src/adapters/HttpAdapter.ts:82
Finding
Environment variable access combined with network send.