EvalScope

API key required
Dev Tools

LLM evaluation & inference performance testing via the evalscope CLI. Translates natural language requests into evalscope commands for: (1) Model accuracy evaluation — runs 160+ benchmarks against local checkpoints or API endpoints (OpenAI-compatible, Anthropic, LiteLLM); (2) Performance stress testing — TTFT, TPOT, throughput, latency under configurable concurrency; (3) RAG evaluation — RAGAS quality metrics, MTEB embedding benchmarks, CLIP retrieval; (4) Benchmark discovery — list/filter/inspect benchmarks by tag. Trigger on: evaluate / benchmark / score a model, throughput / latency / QPS / stress test, find benchmarks, view results, 评测模型, 压测, 跑 benchmark, 性能测试, 查看评测结果, 有哪些评测集, RAG 评测, embedding 评测. Do NOT trigger for: model training / finetuning / deployment / serving requests.

Install

openclaw skills install skill-evalscope

EvalScope

Read only the relevant reference file for the matched workflow — don't preload all of them.

WorkflowWhenReference
Eval (accuracy)evaluate / benchmark / scoreeval-reference.md
Perf (stress test)throughput / latency / QPS / perfperf-reference.md
RAG EvaluationRAG / embedding / retrieval qualityrag-reference.md
Visualizationview results / compare / dashboard(below)
Benchmark Discoverylist / find / what benchmarks(below)
Troubleshootingerrors / failures / debugtroubleshooting.md

Prerequisites

evalscope --version            # verify installation
pip install evalscope           # basic
pip install 'evalscope[all]'   # all backends (perf, rag, service, aigc)
pip install 'evalscope[perf]'  # perf only
pip install 'evalscope[rag]'   # RAG only (RAGAS, MTEB, CLIP)
pip install 'evalscope[service]' # Web dashboard

Decision Tree

  • User wants accuracy evaluation (evaluate / benchmark / score / 评测)
    • Local checkpoint path or HuggingFace/ModelScope ID → --model PATH (auto llm_ckpt)
    • API endpoint → --model NAME --api-url URL (auto openai_api)
    • Anthropic → --eval-type anthropic_api
    • LiteLLM multi-provider → --eval-type litellm --model provider/name
    • OpenAI Responses API → --eval-type openai_responses_api
    • Pipeline test → --model mock --eval-type mock_llm
    • Image generation → --eval-type text2image
    • TTS → --eval-type text2speech
    • Image editing → --eval-type image_editing
  • User wants performance test (throughput / latency / QPS / 压测)
    • evalscope perf workflow
    • API types: openai (default), local, local_vllm, dashscope, embedding, rerank, custom
  • User wants RAG evaluation (RAG / embedding quality / retrieval)
    • evalscope eval --eval-backend RAGEval with tool config
  • User wants visualization (view / compare / dashboard)
    • evalscope service
  • User wants benchmark info (list / find / what benchmarks / 有哪些评测集)
    • evalscope benchmark-info

Workflow 1: Eval (Accuracy)

Core command pattern:

# Local checkpoint
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 10

# API endpoint (auto-detects openai_api when --api-url is set)
evalscope eval --model qwen-plus --datasets gsm8k arc \
  --api-url http://localhost:8000/v1/chat/completions --api-key sk-xxx --limit 10

# Anthropic
evalscope eval --model claude-3-5-sonnet --eval-type anthropic_api --datasets mmlu --api-key sk-ant-xxx

Key parameters: --datasets, --limit, --generation-config, --dataset-args, --eval-backend, --judge-strategy. For full parameter list → eval-reference.md.

Output: outputs/<timestamp>/reports/*.json (scores), report.html (summary).

Workflow 2: Perf (Stress Test)

Core command pattern:

# Basic throughput test
evalscope perf --model qwen-plus \
  --url http://localhost:8000/v1/chat/completions --api openai \
  --dataset openqa --parallel 5 --number 200 --stream

# Concurrency gradient (--parallel and --number must pair)
evalscope perf --model qwen-plus --url http://localhost:8000/v1/chat/completions \
  --api openai --parallel 1 5 10 20 --number 50 250 500 1000 --stream

# Embedding model
evalscope perf --model text-embedding-v3 --url http://localhost:8000/v1/embeddings \
  --api embedding --parallel 10 --number 500

# Rerank model
evalscope perf --model bge-reranker --url http://localhost:8000/v1/rerank \
  --api rerank --parallel 5 --number 200

Key parameters: --parallel, --number, --dataset, --max-tokens, --sla-auto-tune. For full parameter list → perf-reference.md.

Output: console table (TTFT/TPOT/throughput p50-p99) + HTML report.

Workflow 3: RAG Evaluation

Uses --eval-backend RAGEval with a Python dict/YAML config. Three tools: RAGAS, MTEB, clip_benchmark.

from evalscope import run_task
run_task({
    'eval_backend': 'RAGEval',
    'eval_config': {
        'tool': 'MTEB',  # or 'RAGAS' or 'clip_benchmark'
        ...
    }
})

For config schemas and examples → rag-reference.md.

Workflow 4: Visualization

evalscope service --host 0.0.0.0 --port 9000 --outputs ./outputs

Options: --host (default 0.0.0.0), --port (default 9000), --outputs PATH (scan dir), --debug.

Requires: pip install 'evalscope[service]'.

Workflow 5: Benchmark Discovery

evalscope benchmark-info --list                    # all benchmarks
evalscope benchmark-info --list --tag Math Coding  # filter by tags (OR, case-insensitive)
evalscope benchmark-info gsm8k                     # text summary
evalscope benchmark-info gsm8k --format json       # structured JSON
evalscope benchmark-info gsm8k --format markdown   # full docs

Workflow 6: Sandbox Evaluation

For code-execution benchmarks (HumanEval, MBPP, etc.) with Docker isolation:

evalscope eval --model qwen-plus --datasets humaneval \
  --api-url http://localhost:8000/v1/chat/completions \
  --sandbox '{"enabled": true, "type": "docker"}'

Requires Docker daemon running. See evalscope eval --help for --sandbox schema.

Quick Lookup Table

For up-to-date results: evalscope benchmark-info --list --tag <TAG>

User NeedTagsTypical Benchmarks
Math / reasoningMath, Reasoninggsm8k, math_500, aime24, competition_math
CodingCodinghumaneval, mbpp, live_code_bench
General knowledgeKnowledge, MCQmmlu, ceval, cmmlu, mmlu_pro
ChineseChineseceval, cmmlu, chinese_simpleqa
Multimodal / visionMultiModalmmmu, mm_bench, math_vista
Instruction followingInstructionFollowingifeval, multi_if
Function callingFunctionCallingbfcl_v3, bfcl_v4
Long contextLongContextneedle_haystack, longbench_v2
AgentAgenttau_bench

Common suites:

  • General LLM: mmlu gsm8k bbh humaneval ifeval
  • Chinese: ceval cmmlu chinese_simpleqa
  • Multimodal: mmmu mm_bench math_vista mm_star

General Notes

  • Always --limit 5 for first-run validation
  • Default output: ./outputs/<timestamp>/
  • Long runs: background + tail -f outputs/<timestamp>/logs/eval_log.log
  • Resume interrupted runs: --use-cache outputs/<previous_timestamp>
  • Full parameter help: evalscope eval --help / evalscope perf --help
  • On errors → troubleshooting.md