Install
openclaw skills install benchmarking

Evaluate and compare models or providers on real-work tasks by creating, running, and expanding benchmarks that assess tool choice, failure recovery, and pro...
Use this skill when you need to:
Benchmark operator leverage, not just output prettiness.
A good benchmark should tell you:
Use when you need to create a benchmark or full suite.
Expected outputs:
README.md
tasks.json
answer-key.json or answer-key guidelines
rubric.md
judge-notes.md

Use when you need to run models through an existing benchmark.
Expected outputs:
results-raw.json
results-scored.json
README.md

Use when you want to make a benchmark harder or add new tracks. Do not reinvent baseline tasks unless needed.
Use these classes when interpreting results:
Before saying DONE, provide:
Use:
output/benchmarks/YYYY-MM-DD-<benchmark-name>/

Every serious benchmark should produce most or all of:
README.md
tasks.json
answer-key.json
rubric.md
judge-notes.md
results-raw.json
results-scored.json
league-table.png or infographic
harness.py or equivalent scorer if automation exists

A good benchmark changes routing decisions. If the result would not alter which model you use for real work, the benchmark is probably too soft.
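
The output layout described above can be sketched as a small helper. This is only an illustration, not part of the skill itself: the function names `benchmark_dir` and `missing_artifacts` are hypothetical, and the check covers only the literal file names from the list (the "or infographic" and "or equivalent scorer" alternatives are not modelled).

```python
from datetime import date
from pathlib import Path

# File names taken from the artifact list above. Alternatives such as
# "infographic" or "equivalent scorer" are not modelled in this sketch.
EXPECTED_ARTIFACTS = [
    "README.md",
    "tasks.json",
    "answer-key.json",
    "rubric.md",
    "judge-notes.md",
    "results-raw.json",
    "results-scored.json",
    "league-table.png",
    "harness.py",
]


def benchmark_dir(name: str, run_date: date,
                  root: Path = Path("output/benchmarks")) -> Path:
    """Build the path output/benchmarks/YYYY-MM-DD-<benchmark-name>."""
    return root / f"{run_date.isoformat()}-{name}"


def missing_artifacts(directory: Path) -> list[str]:
    """Return the expected artifact names that are not files in `directory`."""
    return [n for n in EXPECTED_ARTIFACTS if not (directory / n).is_file()]
```

Calling `missing_artifacts(benchmark_dir("my-suite", date.today()))` before reporting a run as finished gives a quick list of what is still absent. Since the text asks for "most or all" of the artifacts, a non-empty result is a prompt to review, not necessarily a failure.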