Install

```
openclaw skills install skill-evaluation
```

Evaluate any AI skill's quality through step-by-step diagnosis — measuring trigger accuracy, per-step execution (completion/correctness/quality), efficiency, and safety — then produce a structured report with Bad Cases highlighted and actionable fixes. Supports iterative optimization with version control. Use this skill whenever someone wants to test a skill, evaluate prompt quality, benchmark a skill, diagnose why a skill underperforms, compare skill versions, check if a skill is production-ready, or get a quality assessment for any AI skill or prompt.

A diagnostic instrument for AI skills. Feed it any skill, get back a structured report that tells you exactly what's working, what's broken (Bad Cases), and what to fix — then iterate until the skill passes.
Most skill testing today is vibes-based: run a couple of examples, eyeball the output, ship it. Skill Eval treats evaluation as diagnosis:
Input Skill -> [Phase 0] Structure Assessment -> [Phase 1] Skill Dissection ->
[Phase 2] Test Case Design -> [Phase 3] Execute & Record ->
[Phase 4] Score & Verify -> [Phase 5] Report & Iterate
Evaluate testability before testing. If the target skill's steps are not explicitly written out, infer them and mark them with "step_source": "inferred" in the profile.

Read the target skill and build a structured profile of its steps, operation types, and requirements.
Design test cases with per-step expected results. Each case has: a task prompt, input context, per-step expected results with check_types, and references to the skill's requirements.
check_type options for must_contain:

- "exact" — verified by code: value in output
- "regex" — verified by code: re.search(pattern, output)
- "semantic" — verified by LLM judgment (use only for abstract concepts)

Rule: Use exact/regex whenever possible. Only use semantic for truly conceptual checks.
How many cases:
Critical rule: Expected results written BEFORE execution. Never adjust after seeing results.
See references/schemas.md for the complete test case JSON structure.
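To make the structure concrete, here is a minimal sketch of a single test case. The field names (case_id, task_prompt, expected_steps) are assumptions for illustration; references/schemas.md remains the source of truth.

```python
# A hedged sketch of one test case; field names are illustrative assumptions.
test_case = {
    "case_id": "TC-001",
    "task_prompt": "Summarize the attached meeting notes into action items.",
    "input_context": {"file": "notes.md"},
    "expected_steps": [
        {
            "step_id": 1,
            "must_contain": "Action Items",
            "check_type": "exact",      # verified by code: value in output
        },
        {
            "step_id": 2,
            "must_contain": r"- \[ \] .+",
            "check_type": "regex",      # verified by code: re.search(pattern, output)
        },
        {
            "step_id": 3,
            "must_contain": "tone is concise and neutral",
            "check_type": "semantic",   # verified by LLM judgment; use sparingly
        },
    ],
    "skill_requirements": ["SKILL.md#output-format"],
}
```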
⚠️ Safety Boundary: Sandbox-First Execution
Before executing any test case against an untrusted skill:
If the target skill's operation types include web scraping, page manipulation, API calling,
or file output, these MUST be sandboxed or mocked. The evaluator observes behavior but does
NOT vouch for the safety of the target skill's actions.
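One way to honor this boundary, sketched under the assumptions that the step runner is a hypothetical run_skill_step() callable and that the target skill uses the requests library for network access, is to confine file output to a temporary directory and mock outbound calls:

```python
# A hedged sketch of sandbox-first execution; run_skill_step() is hypothetical,
# and patching requests.get/post assumes the requests library is the HTTP client.
import tempfile
from unittest import mock

def run_step_sandboxed(run_skill_step, step, case):
    with tempfile.TemporaryDirectory() as sandbox_dir:       # file output stays in here
        with mock.patch("requests.get") as fake_get, \
             mock.patch("requests.post") as fake_post:       # network calls never leave the box
            fake_get.return_value = mock.Mock(status_code=200, text="<mocked page>")
            fake_post.return_value = mock.Mock(status_code=200, json=lambda: {"mocked": True})
            return run_skill_step(step, case, workdir=sandbox_dir)
```

The evaluator records what the step tried to do inside the sandbox; it never grants the untrusted skill real side effects.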
Execution steps:
For each step, produce THREE independent scores: completion, correctness, and quality.
Automated checks first (check_type: exact/regex) — deterministic, code-verified.
LLM scoring (semantic checks + quality judgments):
Rules:

- For every low score, record a low_score_reason that states expected vs actual.

Scoring Stability (Deep Eval mode):
See references/scoring.md for detailed scoring definitions and references/rubrics.md
for operation-type-specific rubrics.
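For orientation, the deterministic path can be as small as the sketch below. run_automated_check is a hypothetical name; the real computation lives in scripts/score_engine.py.

```python
# A hedged sketch of code-verified checks; semantic checks go to the LLM judge instead.
import re

def run_automated_check(check_type: str, expected: str, output: str) -> bool:
    if check_type == "exact":
        return expected in output
    if check_type == "regex":
        return re.search(expected, output) is not None
    raise ValueError(f"check_type {check_type!r} is not code-verifiable")

# Example: a regex check against a step's recorded output.
passed = run_automated_check("regex", r"- \[ \] .+", "- [ ] Send follow-up email")
```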
Generate report in this priority order:
Then iterate: identify root causes, generate optimization plan, create new version, re-test.
See references/report-format.md for visual presentation formats and
references/schemas.md for report JSON structure.
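As a rough illustration of what a report might carry (field names and numbers here are illustrative assumptions, not the schema):

```python
# A hedged sketch of a report payload; see references/schemas.md for the real structure.
report = {
    "skill_version": "v1",
    "trigger_accuracy": 0.9,
    "step_scores": [
        {"step_id": 1, "completion": 2, "correctness": 2, "quality": 2},
        {"step_id": 2, "completion": 1, "correctness": 0, "quality": 1,
         "low_score_reason": "expected a checklist, actual output was free-form prose"},
    ],
    "efficiency": {"avg_tokens": 1850, "avg_turns": 3},
    "safety": {"unsafe_rate": 0.0},
    "bad_cases": [
        {"case_id": "TC-002", "step_id": 2,
         "suggested_fix": "add an explicit output-format example to the skill"},
    ],
}
```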
A step is a Bad Case if ANY of:
All evaluation artifacts are versioned:
```
{workspace}/
  skill/v1/SKILL.md
  skill/v2/SKILL.md
  test-cases/v1/cases.json
  runs/run-{date}-v1/
  reports/v1/report.json
  optimizations/OPT-001.json
  changelog.json
```
Rules:
Stop conditions: Bad Cases = 0, Correctness avg >= 1.8/2, no regressions, unsafe rate = 0%.
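Expressed as code, the stop gate could look like the sketch below; it assumes a report shaped like the hypothetical example above, and it treats "no regressions" as "correctness did not drop versus the previous run", which is an interpretation rather than a stated rule.

```python
# A hedged sketch of the stop conditions; the report shape is the hypothetical one above.
from typing import Optional

def iteration_can_stop(report: dict, previous_report: Optional[dict] = None) -> bool:
    scores = report["step_scores"]
    correctness_avg = sum(s["correctness"] for s in scores) / len(scores)
    if previous_report is not None:
        prev = previous_report["step_scores"]
        no_regressions = correctness_avg >= sum(s["correctness"] for s in prev) / len(prev)
    else:
        no_regressions = True
    return (
        len(report["bad_cases"]) == 0              # Bad Cases = 0
        and correctness_avg >= 1.8                 # Correctness avg >= 1.8/2
        and no_regressions                         # no regressions
        and report["safety"]["unsafe_rate"] == 0.0  # unsafe rate = 0%
    )
```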
- references/schemas.md — JSON schemas for all data structures
- references/scoring.md — Scoring definitions and computation methods
- references/rubrics.md — Reusable rubric templates by operation type
- references/report-format.md — Visual report presentation formats
- agents/judge.md — The scoring agent protocol
- agents/advisor.md — The diagnostic advisor protocol
- scripts/score_engine.py — Score computation engine
- scripts/safety_scanner.py — Static safety analysis
- scripts/generate_scorecard.py — HTML report generation
- scripts/run_trigger_eval.py — Trigger evaluation

This skill runs on any AI coding assistant that supports skill/prompt loading:
| Platform | Skill Location | Trigger Mechanism |
|---|---|---|
| Claude Code | .claude/commands/ | claude -p CLI |
| Cursor | .cursor/rules/ or project rules | Agent mode invocation |
| Codex | System prompt / tool config | CLI or API invocation |
| OpenClaw | .claw/skills/ | Skill activation via hub |
The core evaluation logic (scoring, reporting, safety scanning) is platform-agnostic. Only the trigger evaluation script requires platform-specific adapters.
The trigger evaluation script (scripts/run_trigger_eval.py) can launch platform-specific CLI commands to test skill activation. Pass --platform to select the backend (default: auto-detect). Supported: claude, cursor, codex, openclaw.

Generated HTML reports escape untrusted content with html.escape(). Nonetheless, review generated reports in a sandboxed browser context when evaluating skills from unknown sources.