Install
openclaw skills install skill-evalscope
LLM evaluation and inference performance testing via the evalscope CLI. Translates natural-language requests into evalscope commands for:
(1) Model accuracy evaluation: runs 160+ benchmarks against local checkpoints or API endpoints (OpenAI-compatible, Anthropic, LiteLLM).
(2) Performance stress testing: TTFT, TPOT, throughput, and latency under configurable concurrency.
(3) RAG evaluation: RAGAS quality metrics, MTEB embedding benchmarks, CLIP retrieval.
(4) Benchmark discovery: list, filter, and inspect benchmarks by tag.
Trigger on: evaluate / benchmark / score a model, throughput / latency / QPS / stress test, find benchmarks, view results, 评测模型 (evaluate a model), 压测 (stress test), 跑 benchmark (run a benchmark), 性能测试 (performance testing), 查看评测结果 (view evaluation results), 有哪些评测集 (which benchmarks are available), RAG 评测 (RAG evaluation), embedding 评测 (embedding evaluation).
Do NOT trigger for: model training / finetuning / deployment / serving requests.
Read only the relevant reference file for the matched workflow; don't preload all of them.
| Workflow | When | Reference |
|---|---|---|
| Eval (accuracy) | evaluate / benchmark / score | eval-reference.md |
| Perf (stress test) | throughput / latency / QPS / perf | perf-reference.md |
| RAG Evaluation | RAG / embedding / retrieval quality | rag-reference.md |
| Visualization | view results / compare / dashboard | (below) |
| Benchmark Discovery | list / find / what benchmarks | (below) |
| Troubleshooting | errors / failures / debug | troubleshooting.md |
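The routing in the table above can be sketched as a small keyword matcher. This is only an illustration: `ROUTES` and `pick_workflow` are hypothetical names, the keywords are stems of the "When" column, and the check order (most specific workflow first) is an assumption of this sketch rather than documented behavior.

```python
import re

# Hypothetical routing table derived from the workflow table above.
# Order matters: more specific workflows are checked before the generic eval one.
# A reference of None means the workflow is documented in this file ("below").
ROUTES = [
    ("Troubleshooting", ("error", "fail", "debug"), "troubleshooting.md"),
    ("Benchmark Discovery", ("list", "find", "what benchmark"), None),
    ("Visualization", ("view results", "compare", "dashboard"), None),
    ("RAG Evaluation", ("rag", "embedding", "retrieval"), "rag-reference.md"),
    ("Perf (stress test)", ("throughput", "latency", "qps", "perf"), "perf-reference.md"),
    ("Eval (accuracy)", ("evaluat", "benchmark", "score"), "eval-reference.md"),
]


def pick_workflow(request: str):
    """Return (workflow, reference_file) for the first matching route, else None."""
    text = request.lower()
    for name, keywords, reference in ROUTES:
        # \b anchors each keyword at the start of a word, so "rag" does not
        # match inside "average" but "perf" still matches "performance".
        if any(re.search(r"\b" + re.escape(kw), text) for kw in keywords):
            return name, reference
    return None


if __name__ == "__main__":
    print(pick_workflow("Evaluate qwen-plus on gsm8k"))
    print(pick_workflow("What benchmarks cover math?"))
```

Only the reference file returned for the matched workflow needs to be read; a `None` result means the request is out of scope.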
evalscope --version # verify installation
pip install evalscope # basic
pip install 'evalscope[all]' # all backends (perf, rag, service, aigc)
pip install 'evalscope[perf]' # perf only
pip install 'evalscope[rag]' # RAG only (RAGAS, MTEB, CLIP)
pip install 'evalscope[service]' # Web dashboard
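| Target | How to select |
|---|---|
| Local checkpoint | --model PATH (auto llm_ckpt) |
| OpenAI-compatible API | --model NAME --api-url URL (auto openai_api) |
| Anthropic API | --eval-type anthropic_api |
| LiteLLM | --eval-type litellm --model provider/name |
| OpenAI Responses API | --eval-type openai_responses_api |
| Mock model | --model mock --eval-type mock_llm |
| Text-to-image | --eval-type text2image |
| Text-to-speech | --eval-type text2speech |
| Image editing | --eval-type image_editing |
| Performance testing | evalscope perf workflow |
| Perf API types (--api) | openai (default), local, local_vllm, dashscope, embedding, rerank, custom |
| RAG evaluation | evalscope eval --eval-backend RAGEval with tool config |
| Visualization | evalscope service |
| Benchmark discovery | evalscope benchmark-info |
Core command pattern (accuracy evaluation):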
# Local checkpoint
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10
# API endpoint (auto-detects openai_api when --api-url is set)
evalscope eval --model qwen-plus --datasets gsm8k arc \
--api-url http://localhost:8000/v1/chat/completions --api-key sk-xxx --limit 10
# Anthropic
evalscope eval --model claude-3-5-sonnet --eval-type anthropic_api --datasets mmlu --api-key sk-ant-xxx
Key parameters: --datasets, --limit, --generation-config, --dataset-args, --eval-backend, --judge-strategy. For full parameter list → eval-reference.md.
Output: outputs/<timestamp>/reports/*.json (scores), report.html (summary).
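To inspect the scores from the most recent run, something like the following could be used. It is a minimal sketch: `load_latest_reports` is a hypothetical helper, it assumes that run directories sort chronologically by name, and it makes no assumption about the JSON schema beyond the files being valid JSON.

```python
import json
from pathlib import Path


def load_latest_reports(outputs_dir="outputs"):
    """Return {relative_report_path: parsed_json} for the newest run directory."""
    root = Path(outputs_dir)
    if not root.is_dir():
        return {}
    # Assumption: each run is a subdirectory named by timestamp, so the
    # lexicographically last name is the newest run.
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    if not runs:
        return {}
    reports_dir = runs[-1] / "reports"
    if not reports_dir.is_dir():
        return {}
    # rglob also picks up reports nested in subdirectories, if there are any.
    return {
        p.relative_to(reports_dir).as_posix(): json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(reports_dir.rglob("*.json"))
    }


if __name__ == "__main__":
    for name, report in load_latest_reports("outputs").items():
        print(name, type(report).__name__)
```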
Core command pattern (performance stress test):
# Basic throughput test
evalscope perf --model qwen-plus \
--url http://localhost:8000/v1/chat/completions --api openai \
--dataset openqa --parallel 5 --number 200 --stream
# Concurrency gradient (--parallel and --number take the same number of values, paired by position)
evalscope perf --model qwen-plus --url http://localhost:8000/v1/chat/completions \
--api openai --parallel 1 5 10 20 --number 50 250 500 1000 --stream
# Embedding model
evalscope perf --model text-embedding-v3 --url http://localhost:8000/v1/embeddings \
--api embedding --parallel 10 --number 500
# Rerank model
evalscope perf --model bge-reranker --url http://localhost:8000/v1/rerank \
--api rerank --parallel 5 --number 200
Key parameters: --parallel, --number, --dataset, --max-tokens, --sla-auto-tune. For full parameter list → perf-reference.md.
Output: console table (TTFT/TPOT/throughput p50-p99) + HTML report.
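Because each --parallel value is paired with the --number value in the same position, it can help to validate the two lists before launching a gradient run. The sketch below only assembles the argument list shown in the concurrency gradient example; `build_perf_gradient` is a hypothetical helper, not part of evalscope.

```python
import shlex


def build_perf_gradient(model, url, parallel, number, api="openai", stream=True):
    """Build an `evalscope perf` argv, pairing parallel[i] with number[i]."""
    if len(parallel) != len(number):
        raise ValueError(
            f"--parallel has {len(parallel)} values but --number has {len(number)}"
        )
    argv = [
        "evalscope", "perf",
        "--model", model,
        "--url", url,
        "--api", api,
        "--parallel", *map(str, parallel),
        "--number", *map(str, number),
    ]
    if stream:
        argv.append("--stream")
    return argv


argv = build_perf_gradient(
    "qwen-plus",
    "http://localhost:8000/v1/chat/completions",
    parallel=[1, 5, 10, 20],
    number=[50, 250, 500, 1000],  # 50 requests per concurrent worker at each step
)
print(shlex.join(argv))  # pass argv to subprocess.run(argv, check=True) to execute
```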
RAG evaluation uses --eval-backend RAGEval with a Python dict or YAML config. Three tools are supported: RAGAS, MTEB, and clip_benchmark.
from evalscope import run_task

run_task({
    'eval_backend': 'RAGEval',
    'eval_config': {
        'tool': 'MTEB',  # or 'RAGAS' or 'clip_benchmark'
        # ... tool-specific settings elided
    },
})
For config schemas and examples → rag-reference.md.
Visualization (web dashboard):
evalscope service --host 0.0.0.0 --port 9000 --outputs ./outputs
Options: --host (default 0.0.0.0), --port (default 9000), --outputs PATH (scan dir), --debug.
Requires: pip install 'evalscope[service]'.
Benchmark discovery:
evalscope benchmark-info --list # all benchmarks
evalscope benchmark-info --list --tag Math Coding # filter by tags (OR, case-insensitive)
evalscope benchmark-info gsm8k # text summary
evalscope benchmark-info gsm8k --format json # structured JSON
evalscope benchmark-info gsm8k --format markdown # full docs
For code-execution benchmarks (HumanEval, MBPP, etc.) with Docker isolation:
evalscope eval --model qwen-plus --datasets humaneval \
--api-url http://localhost:8000/v1/chat/completions \
--sandbox '{"enabled": true, "type": "docker"}'
Requires a running Docker daemon. See evalscope eval --help for the --sandbox schema.
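When the command is assembled programmatically, generating the --sandbox value with a JSON serializer avoids quoting mistakes. A minimal sketch, assuming only the two keys shown in the example above (`sandbox_flag` is a hypothetical helper):

```python
import json
import shlex


def sandbox_flag(enabled=True, sandbox_type="docker"):
    """Return the --sandbox flag and its JSON value as two argv items."""
    # Only the keys from the example above; other keys are not assumed here.
    config = {"enabled": enabled, "type": sandbox_type}
    return ["--sandbox", json.dumps(config)]


sandbox_argv = [
    "evalscope", "eval",
    "--model", "qwen-plus",
    "--datasets", "humaneval",
    "--api-url", "http://localhost:8000/v1/chat/completions",
    *sandbox_flag(),
]
# shlex.join adds the single quotes a shell needs around the JSON value.
print(shlex.join(sandbox_argv))
```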
For up-to-date results: evalscope benchmark-info --list --tag <TAG>
| User Need | Tags | Typical Benchmarks |
|---|---|---|
| Math / reasoning | Math, Reasoning | gsm8k, math_500, aime24, competition_math |
| Coding | Coding | humaneval, mbpp, live_code_bench |
| General knowledge | Knowledge, MCQ | mmlu, ceval, cmmlu, mmlu_pro |
| Chinese | Chinese | ceval, cmmlu, chinese_simpleqa |
| Multimodal / vision | MultiModal | mmmu, mm_bench, math_vista |
| Instruction following | InstructionFollowing | ifeval, multi_if |
| Function calling | FunctionCalling | bfcl_v3, bfcl_v4 |
| Long context | LongContext | needle_haystack, longbench_v2 |
| Agent | Agent | tau_bench |
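The table above can double as a lookup for building the discovery command. In the sketch below, `TAGS_BY_NEED` and `discovery_command` are hypothetical names and the keys are shortened forms of the "User Need" column. The tag values are copied from the table.

```python
# Tags per user need, copied from the table above.
TAGS_BY_NEED = {
    "math": ["Math", "Reasoning"],
    "coding": ["Coding"],
    "knowledge": ["Knowledge", "MCQ"],
    "chinese": ["Chinese"],
    "multimodal": ["MultiModal"],
    "instruction following": ["InstructionFollowing"],
    "function calling": ["FunctionCalling"],
    "long context": ["LongContext"],
    "agent": ["Agent"],
}


def discovery_command(need: str):
    """Build the benchmark-info argv that lists benchmarks for a user need."""
    tags = TAGS_BY_NEED[need.lower()]
    # Multiple tags are combined with OR, as in the --tag example earlier.
    return ["evalscope", "benchmark-info", "--list", "--tag", *tags]


print(" ".join(discovery_command("Math")))
```

Querying the CLI this way keeps results current, whereas the "Typical Benchmarks" column is only a snapshot.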
Common suites:
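| Suite | Benchmarks |
|---|---|
| General | mmlu gsm8k bbh humaneval ifeval |
| Chinese | ceval cmmlu chinese_simpleqa |
| Multimodal | mmmu mm_bench math_vista mm_star |
Use --limit 5 for first-run validation.
Output directory: ./outputs/<timestamp>/
Follow the log: tail -f outputs/<timestamp>/logs/eval_log.log
Reuse a previous run: --use-cache outputs/<previous_timestamp>
Full flag reference: evalscope eval --help / evalscope perf --help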