EvalScope
Review
Audited by ClawScan on May 1, 2026.
Overview
EvalScope is a coherent instruction-only skill for building and running evalscope benchmark commands, with expected but user-noticeable use of CLI installs, API keys, network load, and local result files.
This skill appears safe and purpose-aligned for EvalScope benchmarking. Before installing or using it, review any pip install command, run it in an isolated environment if possible, confirm benchmark size and concurrency, protect API keys, and be aware that outputs may contain prompts, predictions, logs, and reports.
Findings (6)
Artifact-based informational review of SKILL.md, metadata, install specs, static scan signals, and capability signals. ClawScan does not execute the skill or run runtime probes.
A benchmark could consume compute, disk, API quota, or run for hours if approved.
The skill explicitly runs evalscope CLI commands and may start long-running benchmark jobs. This is central to the skill’s purpose and the workflow includes a confirmation step, but users should review the resource impact before approving.
1. Show the full command to the user for confirmation
2. Execute the command
   - For quick tests (`--limit` <= 20): run in foreground
   - For full evaluations: run in background (may take hours)
Confirm the exact command, dataset size, output path, and whether the run should be foreground or background before execution.
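The foreground/background rule quoted above can be sketched in Python. This is a minimal illustration, not the skill's actual logic: the helper name `choose_run_mode` and the constant `QUICK_TEST_LIMIT` are hypothetical, and only the `--limit` flag and the cutoff of 20 come from the quoted excerpt.

```python
import shlex

QUICK_TEST_LIMIT = 20  # cutoff taken from the quoted SKILL.md excerpt


def choose_run_mode(command: str) -> str:
    """Return 'foreground' for quick tests (--limit <= 20), else 'background'."""
    args = shlex.split(command)
    limit = None
    for i, arg in enumerate(args):
        if arg == "--limit" and i + 1 < len(args):
            limit = args[i + 1]          # form: --limit 10
        elif arg.startswith("--limit="):
            limit = arg.split("=", 1)[1]  # form: --limit=10
    try:
        if limit is not None and int(limit) <= QUICK_TEST_LIMIT:
            return "foreground"
    except ValueError:
        pass  # a non-integer limit is treated like a full evaluation
    return "background"
```

A command with no `--limit` at all is treated as a full evaluation, which errs on the side of asking the user before a potentially hours-long run.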
Stress tests can affect service availability or create unexpected API costs.
Performance benchmarking intentionally sends concurrent requests and can generate load. This is purpose-aligned, but misconfigured values could overload a model service or consume quota.
`--parallel` | int (multiple) | 1 | Number of concurrent requests...
`-n, --number` | int (multiple) | 1000 | Total number of requests
Only run perf tests against systems you own or are authorized to test, and start with low concurrency/request counts.
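One way to act on "start low" is a pre-flight check on the two values before a perf command is approved. The sketch below is an assumption-laden illustration: the function name and the caps of 4 concurrent requests and 100 total requests are hypothetical starting points, not values from EvalScope or the skill. Only the `--parallel` and `--number` flag names come from the quoted table.

```python
def review_perf_settings(parallel: int, number: int,
                         max_parallel: int = 4,
                         max_number: int = 100) -> list[str]:
    """Return human-readable warnings for load settings above cautious caps."""
    if parallel < 1 or number < 1:
        raise ValueError("parallel and number must be positive integers")
    warnings = []
    if parallel > max_parallel:
        warnings.append(
            f"--parallel {parallel} exceeds the starting cap of {max_parallel}")
    if number > max_number:
        warnings.append(
            f"--number {number} exceeds the starting cap of {max_number}")
    return warnings
```

With these hypothetical caps, the documented default of 1000 total requests already produces a warning, which is a useful prompt to confirm the target service and quota first.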
API keys entered into commands may grant access to paid or private model services.
The skill supports API credentials for model and judge endpoints. This is expected for evaluating hosted models, and the artifacts show placeholders rather than hardcoded secrets.
`--api-key` | str | `EMPTY` | API authentication key
Use least-privilege keys, avoid pasting real secrets into shared logs, and prefer environment-variable or secret-manager handling if supported by your workflow.
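As a rough sketch of the environment-variable approach, a wrapper can read the key at run time and redact it before anything is logged or shown for confirmation. The variable name `EVAL_API_KEY` and both helper names are hypothetical; only the `--api-key` flag comes from the quoted table.

```python
import os


def api_key_args(env_var: str = "EVAL_API_KEY") -> list[str]:
    """Build the --api-key arguments from an environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"environment variable {env_var} is not set")
    return ["--api-key", key]


def redact_api_key(args: list[str]) -> list[str]:
    """Return a copy of args that is safe to print or log."""
    redacted = list(args)
    for i, arg in enumerate(redacted[:-1]):
        if arg == "--api-key":
            redacted[i + 1] = "***"
    return redacted
```

The redacted copy is what would be displayed in a confirmation prompt or written to a shared log, while the unredacted list is passed only to the process that needs it.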
Installing unpinned packages can pull newer dependencies than expected.
The skill provides user-directed package installation commands without version pinning. This is normal for CLI setup documentation, but package provenance and version should be reviewed.
pip install evalscope
...
pip install 'evalscope[all]'
Install from trusted package sources, consider pinning a known-good evalscope version, and use an isolated virtual environment.
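A simple guard against unpinned installs is to check the requirement string before running it. The sketch below assumes the common `name[extras]==version` form and uses a hypothetical helper name; the version number in the usage note is a placeholder, not a recommended release.

```python
import re

# Matches specs such as "pkg==1.2.3" or "pkg[extra1,extra2]==1.2.3".
_PINNED = re.compile(
    r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[A-Za-z0-9.*+!_-]+$"
)


def is_pinned(requirement: str) -> bool:
    """Return True if the requirement pins an exact version with '=='."""
    return bool(_PINNED.match(requirement.strip()))
```

For example, `is_pinned("evalscope")` and `is_pinned("evalscope[all]")` are both false, while `is_pinned("evalscope[all]==1.0.0")` is true (with `1.0.0` standing in for whatever version has been reviewed).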
Private prompts, model outputs, or evaluation details may remain on disk after the run.
Evaluation runs persist prompts, predictions, reviews, logs, and reports locally. This is expected output for benchmarking, but it may include sensitive test data or model responses.
outputs/<timestamp>/
  configs/task_config.yaml
  logs/eval_log.log
  predictions/<model>/*.jsonl
  reviews/<model>/*.jsonl
  reports/<model>/*.json
Choose an appropriate work directory, avoid sensitive datasets unless intended, and clean or protect output folders when finished.
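To protect an output folder after a run, one option on POSIX systems is to restrict it to the owning user. This is a minimal sketch with a hypothetical helper name; it assumes the outputs live under a single root directory, and `os.chmod` has only limited effect on Windows.

```python
import os
import stat
from pathlib import Path


def restrict_output_permissions(root: Path) -> int:
    """Make root and everything under it owner-only; return the entry count."""
    # Collect paths first so the walk is not affected by permission changes.
    paths = [root, *root.rglob("*")]
    for path in paths:
        if path.is_dir():
            os.chmod(path, stat.S_IRWXU)                 # rwx------
        else:
            os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # rw-------
    return len(paths)
```

Calling `restrict_output_permissions(Path("outputs"))` would cover the configs, logs, predictions, reviews, and reports listed above; deleting the folder remains the safer choice when the results are no longer needed.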
Benchmark prompts and model answers may be shared with the configured judge provider.
The skill can route evaluation judging through another LLM/provider endpoint. This is documented and purpose-aligned, but it means evaluation content may be sent to an external service.
--judge-strategy llm ... --judge-model-args '{"model": "gpt-4o", "api_url": "https://api.openai.com/v1/chat/completions", "api_key": "sk-xxx"}'
Use judge endpoints that are approved for your data, and avoid external judging for confidential datasets unless permitted.
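A lightweight way to enforce "approved endpoints only" is to inspect the JSON passed to `--judge-model-args` before the run. The sketch below assumes the JSON carries an `api_url` field as in the quoted example; the helper name and the allowlist itself are hypothetical and would come from your own data-handling policy.

```python
import json
from urllib.parse import urlparse


def judge_host_allowed(judge_model_args: str, approved_hosts: set[str]) -> bool:
    """Return True if the judge api_url points at an approved host."""
    args = json.loads(judge_model_args)
    host = urlparse(args.get("api_url", "")).hostname
    # A missing or unparseable api_url is treated as not approved.
    return host is not None and host.lower() in approved_hosts
```

With the quoted arguments, the check passes only if `api.openai.com` is in the approved set, so a confidential dataset can be blocked from external judging by leaving that host off the list.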
