EvalScope

Review

Audited by ClawScan on May 1, 2026.

Overview

EvalScope is a coherent, instruction-only skill for building and running evalscope benchmark commands. Its use of CLI installs, API keys, network load, and local result files is expected, but users should be aware of it.

This skill appears safe and purpose-aligned for EvalScope benchmarking. Before installing or using it, review any pip install command, run it in an isolated environment if possible, confirm benchmark size and concurrency, protect API keys, and be aware that outputs may contain prompts, predictions, logs, and reports.

Findings (6)

Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.

Low

#ASI02: Tool Misuse and Exploitation

What this means

Once approved, a benchmark could consume compute, disk, and API quota, or run for hours.

Why it was flagged

The skill explicitly runs evalscope CLI commands and may start long-running benchmark jobs. This is central to the skill’s purpose, and the skill asks for confirmation before executing, but users should review resource impact before approving.

Skill content

1. Show the full command to the user for confirmation
2. Execute the command
   - For quick tests (`--limit` <= 20): run in foreground
   - For full evaluations: run in background (may take hours)

Recommendation

Confirm the exact command, dataset size, output path, and whether the run should be foreground or background before execution.
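
The foreground/background rule quoted in the skill content can be sketched as a small pre-flight check. This is a minimal Python sketch, assuming the command is available as a single string; the helper name `run_mode` and the sample command lines are illustrative and do not come from the skill. Only the `--limit` flag and the threshold of 20 are taken from the quoted text.

```python
import shlex

QUICK_TEST_LIMIT = 20  # threshold quoted in the skill content


def run_mode(command: str) -> str:
    """Return 'foreground' for quick tests (--limit <= 20), else 'background'.

    Hypothetical helper: a missing or non-integer --limit is treated
    conservatively as a full evaluation.
    """
    tokens = shlex.split(command)
    limit = None
    for i, token in enumerate(tokens):
        if token == "--limit" and i + 1 < len(tokens):
            limit = tokens[i + 1]
        elif token.startswith("--limit="):
            limit = token.split("=", 1)[1]
    try:
        value = int(limit)
    except (TypeError, ValueError):
        return "background"
    return "foreground" if 0 < value <= QUICK_TEST_LIMIT else "background"
```

A reviewer could call `run_mode(...)` on the command shown for confirmation and compare the result with what the skill proposes before approving the run.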

Medium

#ASI02: Tool Misuse and Exploitation

What this means

Stress tests can affect service availability or create unexpected API costs.

Why it was flagged

Performance benchmarking intentionally sends concurrent requests and can generate load. This is purpose-aligned, but misconfigured values could overload a model service or consume quota.

Skill content

`--parallel` | int (multiple) | 1 | Number of concurrent requests...
`-n, --number` | int (multiple) | 1000 | Total number of requests

Recommendation

Only run perf tests against systems you own or are authorized to test, and start with low concurrency/request counts.
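
One way to enforce "start low" is to check the requested load against caps before the command is approved. The following is a minimal Python sketch, not part of the skill: the function names and the caps (`max_parallel=8`, `max_requests=100`) are assumptions chosen for illustration, and the sample command line is hypothetical. The flag names and their defaults (1 and 1000) are the ones quoted in the skill content.

```python
import shlex


def _int_values(tokens, flags, default):
    """Collect the integers that follow any of `flags` (a flag may take several)."""
    values = []
    for i, token in enumerate(tokens):
        if token in flags:
            for nxt in tokens[i + 1:]:
                if not nxt.isdigit():
                    break
                values.append(int(nxt))
    return values or [default]


def check_perf_load(command, max_parallel=8, max_requests=100):
    """Return a list of warnings when a perf command exceeds the given caps."""
    tokens = shlex.split(command)
    parallel = _int_values(tokens, {"--parallel"}, 1)       # default quoted: 1
    number = _int_values(tokens, {"-n", "--number"}, 1000)  # default quoted: 1000
    warnings = []
    if max(parallel) > max_parallel:
        warnings.append(f"--parallel {max(parallel)} exceeds cap {max_parallel}")
    if max(number) > max_requests:
        warnings.append(f"--number {max(number)} exceeds cap {max_requests}")
    return warnings
```

Note that under these assumed caps, omitting `-n` already triggers a warning, because the quoted default of 1000 requests is above the cap.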

Low

#ASI03: Identity and Privilege Abuse

What this means

API keys entered into commands may grant access to paid or private model services.

Why it was flagged

The skill supports API credentials for model and judge endpoints. This is expected for evaluating hosted models, and the artifacts show placeholders rather than hardcoded secrets.

Skill content

`--api-key` | str | `EMPTY` | API authentication key

Recommendation

Use least-privilege keys, avoid pasting real secrets into shared logs, and prefer environment-variable or secret-manager handling if supported by your workflow.
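
As an illustration of the environment-variable approach, the sketch below reads the key from the environment and masks it before a command is logged or shown for confirmation. It is a minimal Python sketch under stated assumptions: the variable name `EVAL_API_KEY` and both helper names are hypothetical, and only the `--api-key` flag comes from the skill content.

```python
import os


def api_key_from_env(var_name="EVAL_API_KEY"):
    """Read the key from the environment (variable name is an assumption)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set")
    return key


def redact(argv):
    """Return a copy of argv with the --api-key value masked for logging."""
    redacted = []
    hide_next = False
    for arg in argv:
        if hide_next:
            redacted.append("***")
            hide_next = False
        elif arg == "--api-key":
            redacted.append(arg)
            hide_next = True
        elif arg.startswith("--api-key="):
            redacted.append("--api-key=***")
        else:
            redacted.append(arg)
    return redacted
```

The real argument list is passed to the process, while only `redact(argv)` is written to logs or shared transcripts.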

Low

#ASI04: Agentic Supply Chain Vulnerabilities

What this means

Installing unpinned packages can pull newer dependencies than expected.

Why it was flagged

The skill provides user-directed package installation commands without version pinning. This is normal for CLI setup documentation, but package provenance and version should be reviewed.

Skill content

pip install evalscope
...
pip install 'evalscope[all]'

Recommendation

Install from trusted package sources, consider pinning a known-good evalscope version, and use an isolated virtual environment.
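
A quick way to spot unpinned installs is to scan the proposed pip command for packages without an `==` specifier. This is a rough Python sketch, assuming simple `pip install <packages>` commands like the ones quoted; the helper name is hypothetical, options that take a value (such as `-r`) are not handled, and the version `1.2.3` in the usage note is a placeholder rather than a recommended release.

```python
import shlex


def unpinned_packages(pip_command):
    """List package arguments of a `pip install` command that lack `==`."""
    tokens = shlex.split(pip_command)
    if "install" not in tokens:
        return []
    args = tokens[tokens.index("install") + 1:]
    # Skip options (tokens starting with "-"); keep packages without a pin.
    return [a for a in args if not a.startswith("-") and "==" not in a]
```

For example, `unpinned_packages("pip install 'evalscope[all]'")` reports `evalscope[all]`, while a command of the form `pip install evalscope==1.2.3` reports nothing.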

Low

#ASI06: Memory and Context Poisoning

What this means

Private prompts, model outputs, or evaluation details may remain on disk after the run.

Why it was flagged

Evaluation runs persist prompts, predictions, reviews, logs, and reports locally. This is expected output for benchmarking, but it may include sensitive test data or model responses.

Skill content

outputs/<timestamp>/
  configs/task_config.yaml
  logs/eval_log.log
  predictions/<model>/*.jsonl
  reviews/<model>/*.jsonl
  reports/<model>/*.json

Recommendation

Choose an appropriate work directory, avoid sensitive datasets unless intended, and clean or protect output folders when finished.
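
To protect an output folder rather than delete it, one option is to restrict the whole tree to its owner. The sketch below is a minimal Python example assuming a POSIX filesystem; the helper name `restrict_outputs` is hypothetical, and the directory passed in would be whatever work directory the run actually used.

```python
from pathlib import Path


def restrict_outputs(root):
    """Make an output tree accessible to the owner only (POSIX permissions).

    Directories get 0o700, files get 0o600; symlinks are left untouched.
    Returns the number of paths whose mode was set.
    """
    root = Path(root)
    changed = 0
    for path in [root, *root.rglob("*")]:
        if path.is_symlink():
            continue
        path.chmod(0o700 if path.is_dir() else 0o600)
        changed += 1
    return changed
```

Calling `restrict_outputs("outputs")` after a run would cover the predictions, reviews, logs, and reports listed in the skill content, provided they sit under that directory.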

Low

#ASI07: Insecure Inter-Agent Communication

What this means

Benchmark prompts and model answers may be shared with the configured judge provider.

Why it was flagged

The skill can route evaluation judging through another LLM/provider endpoint. This is documented and purpose-aligned, but it means evaluation content may be sent to an external service.

Skill content

--judge-strategy llm ... --judge-model-args '{"model": "gpt-4o", "api_url": "https://api.openai.com/v1/chat/completions", "api_key": "sk-xxx"}'

Recommendation

Use judge endpoints that are approved for your data, and avoid external judging for confidential datasets unless permitted.
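
Checking the judge endpoint against an approved list can be automated before a run is confirmed. The following is a minimal Python sketch, assuming `--judge-model-args` is passed as a JSON string with an `api_url` field, as in the quoted example; the helper name and the allowlist contents are hypothetical.

```python
import json
from urllib.parse import urlparse


def judge_host_allowed(judge_model_args, approved_hosts):
    """Return True if the judge api_url host is in the approved set."""
    args = json.loads(judge_model_args)
    host = urlparse(args.get("api_url", "")).hostname
    return host is not None and host in approved_hosts
```

With an allowlist that contains only internal hosts, the quoted example pointing at `api.openai.com` would be rejected, which signals that evaluation content would leave the approved boundary.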