Install
openclaw skills install skill-eval
Autonomous engine that systematically evaluates and ranks agent skills across models using rubric grading, error taxonomy, and improvement feedback loops.
An autonomous evaluation system for agent skills from ClawHub and other registries. Produces HuggingFace model-card style reports and a ranked leaderboard.
Informed by:
skill_eval/
  VERSION -- engine semver
  SKILL-EVAL.md -- this file (the brain)
  knowledge/
    lessons.md -- accumulated eval wisdom
    eval-patterns.md -- reusable test/assertion templates
    failures.md -- failure mode catalog
    skill-profiles/<slug>.md -- per-skill learned context
    references/ -- source articles and frameworks
    improve/ -- skill-improvement engine knowledge (NEW v0.4.0)
      lessons.md -- improvement-specific lessons learned
      patterns.md -- proven improvement patterns by category
      failures.md -- improvement failure modes
  skill-cards/ -- output: one .md per evaluation
  leaderboard/
    index.html -- interactive HTML leaderboard
  scripts/
    generate_skill_card.py -- skill card generator
    generate_leaderboard.py -- leaderboard builder
  evals/
    skill-registry.json -- skills to evaluate
    <slug>.json -- per-skill eval config
  workspaces/ -- per-skill eval workspaces
A skill is valuable if and only if it makes the agent produce measurably better results than the agent would produce without it. "Better" means:
A skill that produces identical results to baseline but costs 3x more tokens is a net negative. A skill that improves quality dramatically but takes 2x longer is likely worth it.
Understanding which type a skill is affects how we design assertions.
Skills should work across models, not just the one used to test them. The engine supports configuring different models for different roles.
There are three distinct model roles in the evaluation pipeline:
Model configuration lives in evals/models.json:
{
  "execution_models": [
    "anthropic/claude-opus-4-6",
    "openai/gpt-4.1",
    "google/gemini-2.5-pro"
  ],
  "judge_model": "anthropic/claude-opus-4-6",
  "improvement_model": "anthropic/claude-opus-4-6",
  "default_execution_model": "anthropic/claude-opus-4-6"
}
Individual eval configs (evals/<slug>.json) can override the global model config:
{
  "skill_slug": "explain-code",
  "models": {
    "execution": ["anthropic/claude-opus-4-6", "openai/gpt-4.1"],
    "judge": "google/gemini-2.5-pro"
  },
  "evals": [...]
}
If models is omitted, the global evals/models.json config is used.
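The override rule above can be sketched as a small resolver. This is a minimal illustration, not the engine's actual code: the function name `resolve_models` is hypothetical, and the per-key fallback (using the global value when a `models` block is present but lacks one key) is an assumption, since the text only specifies what happens when `models` is omitted entirely.

```python
# Global config, mirroring the evals/models.json example above.
GLOBAL_CONFIG = {
    "execution_models": [
        "anthropic/claude-opus-4-6",
        "openai/gpt-4.1",
        "google/gemini-2.5-pro",
    ],
    "judge_model": "anthropic/claude-opus-4-6",
    "improvement_model": "anthropic/claude-opus-4-6",
    "default_execution_model": "anthropic/claude-opus-4-6",
}


def resolve_models(eval_config: dict, global_config: dict) -> dict:
    """Pick the models for one skill eval (hypothetical helper).

    A per-skill "models" block wins; anything it does not set falls back
    to the global config. The per-key fallback is an assumption.
    """
    override = eval_config.get("models") or {}
    return {
        "execution": override.get("execution", global_config["execution_models"]),
        "judge": override.get("judge", global_config["judge_model"]),
        # The per-skill example shows no improvement override, so the
        # global improvement model is always used here.
        "improvement": global_config["improvement_model"],
    }


# Per-skill config, mirroring the evals/<slug>.json example above.
explain_code = {
    "skill_slug": "explain-code",
    "models": {
        "execution": ["anthropic/claude-opus-4-6", "openai/gpt-4.1"],
        "judge": "google/gemini-2.5-pro",
    },
}

resolved = resolve_models(explain_code, GLOBAL_CONFIG)
defaults = resolve_models({"skill_slug": "other-skill"}, GLOBAL_CONFIG)
```

Here `resolved` carries the two per-skill execution models and the Gemini judge, while `defaults` simply mirrors the global file.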
When a skill is evaluated across multiple models, the skill card includes:
The leaderboard shows the aggregate score by default, with expandable per-model details. Skills that show consistent value across models rank higher than skills that only help one model.
Not all models may be available in every environment. The engine handles this gracefully:
- default_execution_model as judge

Before generating test cases, understand the skill:
- dependency-gated in evals/<slug>.json and the benchmark. Do not run the eval -- it will produce environment failures, not skill-quality signals. Re-evaluate after credential provisioning.
- scrape_reviews.py with no actual file), flag as phantom-tooling in the skill card. The skill's framework/template value can still be evaluated, but users should know the tooling is vaporware. (Learned from review-summarizer eval, Batch 3.)
- unsubstantiated-claims in the skill card. Do not use the skill's self-reported numbers in scoring. (Learned from debug-checklist eval, Batch 3.)
- phantom-tooling: true and split evaluation into (a) framework/template value and (b) operational tooling value.
- knowledge/lessons.md, eval-patterns.md, failures.md for relevant patterns
- knowledge/skill-profiles/<slug>.md

Design 2-3 test prompts following OpenAI's four-category framework:
Success categories to check:
Prompt design principles:
Assertion design (two layers):
Layer 1: Deterministic checks
Layer 2: Rubric-based quality assessment
Assertion anti-patterns (from lessons learned):
Output-floor assertions (from failure modes):
Category-specific assertion patterns:
- keyword_absent assertions for each banned word. These are deterministic, easy to verify, and produce maximum delta. (Learned from Batch 3 -- article-writer scored 10/10 with 100% delta, the first perfect score.)
- phantom-tooling when scripts are missing.

Save test cases to evals/<slug>.json.
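A deterministic `keyword_absent` check of the kind described above might look like the following sketch. The function signature is hypothetical, and the matching policy (case-insensitive, whole-word) is an assumption; only the assertion name and the `text` / `passed` / `evidence` result shape come from this document.

```python
import re


def keyword_absent(output: str, keyword: str) -> dict:
    """Pass when `keyword` does not appear in `output` as a whole word.

    Assumption: matching is case-insensitive and bounded by word edges,
    so "delved" does not count as a hit for the banned word "delve".
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    match = re.search(pattern, output, flags=re.IGNORECASE)
    return {
        "text": f"Output does not contain the banned word '{keyword}'",
        "passed": match is None,
        "evidence": (
            "keyword not found"
            if match is None
            else f"found at character offset {match.start()}"
        ),
    }


# One assertion per banned word, as the pattern above recommends.
banned = ["delve", "tapestry"]
sample = "We delve into the details of the release."
results = [keyword_absent(sample, word) for word in banned]
```

Because each check is a plain string search, a failure can be traced to an exact offset without involving a judge model.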
For each test case, determine the execution model(s) from the eval config or evals/models.json.
Single-model mode (default): Spawn two subagents simultaneously on the same execution model:
With-skill subagent:
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save all outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
Without-skill (baseline) subagent:
[Model: <execution_model>]
Complete this task using only your built-in capabilities. Do NOT read any SKILL.md.
Task: <prompt>
Save all outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
Cross-model mode: When multiple execution models are configured, run the full with/without pair for EACH model. Organize outputs by model:
<workspace>/iteration-<N>/<test-name>/<model-slug>/with_skill/outputs/
<workspace>/iteration-<N>/<test-name>/<model-slug>/without_skill/outputs/
Capture timing data (tokens, duration, model used) from completion events into timing.json.
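The output layout for single-model and cross-model runs can be sketched as one path builder. The helper names and the rule for deriving a model slug from a model ID (replacing `/` with `-`) are assumptions for illustration; the document only shows the `<model-slug>` placeholder.

```python
from pathlib import Path
from typing import Optional


def model_slug(model_id: str) -> str:
    """Turn a model ID into a directory-safe slug (assumed convention)."""
    return model_id.replace("/", "-")


def output_dir(
    workspace: str,
    iteration: int,
    test_name: str,
    variant: str,
    model: Optional[str] = None,
) -> Path:
    """Build the outputs directory for one run.

    `variant` is "with_skill" or "without_skill". When `model` is given,
    a per-model directory is inserted, matching cross-model mode.
    """
    if variant not in ("with_skill", "without_skill"):
        raise ValueError(f"unknown variant: {variant}")
    base = Path(workspace) / f"iteration-{iteration}" / test_name
    if model is not None:
        base = base / model_slug(model)
    return base / variant / "outputs"


single = output_dir("workspaces/explain-code", 1, "basic", "with_skill")
cross = output_dir(
    "workspaces/explain-code", 1, "basic", "without_skill", model="openai/gpt-4.1"
)
```

Keeping both modes in one function means the with-skill and baseline runs for a given model always land in sibling directories.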
Grade each run against assertions. Two approaches:
Programmatic grading (preferred for deterministic checks):
LLM-based grading (for qualitative assessments):
- evals/models.json or per-skill override), NOT the execution model
- {"text": "...", "passed": bool, "evidence": "..."}
- judge_model in grading output for attribution

Save to grading.json:
{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "why this passed/failed"}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
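The `summary` block follows directly from the list of expectations. A minimal sketch, assuming `pass_rate` is rounded to two decimals and that an empty list yields a rate of 0.0 (neither is specified above); `build_grading` is a hypothetical name.

```python
import json


def build_grading(expectations: list) -> dict:
    """Assemble the grading.json payload from graded expectations.

    Each expectation is a dict with "text", "passed", and "evidence".
    """
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "expectations": expectations,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            # Assumption: two-decimal rounding; 0.0 when nothing was graded.
            "pass_rate": round(passed / total, 2) if total else 0.0,
        },
    }


grading = build_grading(
    [
        {"text": "Output is valid JSON", "passed": True, "evidence": "parsed cleanly"},
        {"text": "Mentions all inputs", "passed": False, "evidence": "input 'b' missing"},
        {"text": "Under 200 words", "passed": True, "evidence": "142 words"},
    ]
)
payload = json.dumps(grading, indent=2)  # text to write into grading.json
```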
Create benchmark.json with:
- dependency-gated so it doesn't pollute rankings with environment failures.
- phantom-tooling and report separate judgments for framework quality vs operational readiness.

python scripts/generate_skill_card.py \
--workspace workspaces/<slug>/iteration-<N> \
--skill-name "<Name>" \
--skill-slug "<slug>" \
--eval-model "claude-opus-4-6" \
--output skill-cards/<slug>-v<VERSION>.md
Each card includes:
python scripts/generate_leaderboard.py \
--cards-dir skill-cards \
--output leaderboard/index.html
After each evaluation batch, update the knowledge base:
Key questions for the learning step:
This is the critical closing step. Without it, the engine documents lessons but doesn't actually evolve.
After updating the knowledge files, review them and fold actionable improvements back into this document:
When enough knowledge accumulates, bump VERSION. The version bump signals that the methodology itself has changed, not just the knowledge base.
The loop: eval -> knowledge -> SKILL-EVAL.md -> better evals. If knowledge doesn't flow back up, the engine isn't self-evolving.
Trigger: Score < 7 (verdict = "Conditional", "Marginal", or "Not Recommended"), AND the skill is not dependency-gated.
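The trigger is a two-part condition and can be expressed directly. The record fields below (`slug`, `score`, `dependency_gated`) are hypothetical names chosen for illustration, not the engine's registry schema.

```python
def needs_improvement(score: float, dependency_gated: bool) -> bool:
    """True when a skill scored below 7 and is not dependency-gated."""
    return score < 7 and not dependency_gated


# Hypothetical evaluation results.
evaluated = [
    {"slug": "skill-a", "score": 8.2, "dependency_gated": False},
    {"slug": "skill-b", "score": 5.5, "dependency_gated": False},
    {"slug": "skill-c", "score": 2.0, "dependency_gated": True},
]

improvement_queue = [
    item["slug"]
    for item in evaluated
    if needs_improvement(item["score"], item["dependency_gated"])
]
```

Only `skill-b` is queued: `skill-a` already scores 7 or above, and `skill-c` is excluded because its low score reflects the environment rather than skill quality.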
The Skill Improvement Engine is itself a self-evolving system with its own knowledge base, learned patterns, and failure catalog. It gets better at improving skills over time.
Located at knowledge/improve/:
- lessons.md -- What improvement strategies worked? What didn't? Which root causes are hardest to fix?
- patterns.md -- Proven improvement patterns by skill category (e.g., "for reference-manual skills, delete 70%+ content and add MUST/ALWAYS/NEVER mandates")
- failures.md -- Improvement failure modes: cases where improvement was attempted but didn't produce meaningful score gains, with root cause analysis

Before improving any skill, read all three files. The improvement engine should never repeat a failed strategy or miss a proven pattern.
Read the improvement knowledge base:
- knowledge/improve/lessons.md -- proven strategies, anti-patterns
- knowledge/improve/patterns.md -- category-specific improvement playbooks
- knowledge/improve/failures.md -- what NOT to try, and why
- knowledge/lessons.md, eval-patterns.md, failures.md

Read the eval data:
- benchmark.json (what the skill got wrong)
- knowledge/skill-profiles/<slug>.md
- knowledge/eval-patterns.md

Diagnose root causes (check against known patterns):
- knowledge/improve/patterns.md for category-matched strategies

Select improvement strategy from knowledge base:
- knowledge/improve/patterns.md
- knowledge/improve/failures.md), try a different approach or document why this case is different

Rewrite SKILL.md:
- skills-under-test/<slug>/SKILL-improved.md

Update assertions to match improved skill:
Document changes:
- skills-under-test/<slug>/IMPROVEMENT-LOG.md

What NOT to improve:
- dependency-gated skills (problem is environment, not skill quality)

Model selection for improvement: Use the configured improvement_model from evals/models.json. Different models may bring different improvement perspectives -- a model that didn't write the original skill may see blind spots the original author (or model) missed.
Run the exact same eval config (evals/<slug>.json) against the improved SKILL.md, with updated assertions where applicable.
- SKILL-improved.md instead of original SKILL.md
- workspaces/<slug>/iteration-<N+1>/
- skill-cards/<slug>-v<VERSION>-improved.md:
- knowledge/improve/patterns.md)

Success criteria:
If improvement fails (score doesn't meaningfully improve):
- improvement-attempted in registry

This is the critical step that makes the improvement engine self-evolving.
After each improvement batch (Phase 10-11), update the improvement knowledge base:
Update knowledge/improve/lessons.md:
Update knowledge/improve/patterns.md:
- Category -> Root Cause -> Strategy -> Expected Gain
- Reference Manual -> Redundant content -> Delete 70%, add MUST/ALWAYS/NEVER -> +1.5 to +2.0 points

Update knowledge/improve/failures.md:
Absorb into Phase 10:
The improvement loop: improve -> re-eval -> learn -> better improvements. If improvement lessons don't flow back, the improvement engine is static.
- evals/models.json config with three model roles: execution, judge, improvement. Skills can now be evaluated across multiple models for cross-model consistency. Per-skill model overrides supported in eval configs.
- knowledge/improve/ with lessons.md, patterns.md, and failures.md. Before improving any skill, the engine reads its learned patterns, selects a strategy, and documents results. After each improvement batch, Phase 12 updates the improvement knowledge base -- the improvement engine evolves independently from the eval engine.

Overall Score: 0-10
| Component | Points | Criteria |
|---|---|---|
| Quality | 0-5 | Based on with-skill pass rate |
| Value-add | 0-3 | Delta between with-skill and without-skill pass rates |
| Efficiency | 0-2 | Time/token overhead relative to baseline |

| Score | Verdict | Meaning |
|---|---|---|
| 7-10 | Recommended | Clear value over baseline |
| 5-6.9 | Conditional | Some value with trade-offs |
| 3-4.9 | Marginal | Overhead without proportional improvement |
| 0-2.9 | Not Recommended | Baseline is comparable or better |
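The two tables can be combined into a small scoring sketch. The verdict thresholds come straight from the table above; the mapping from pass rates and overhead to component points is an assumption (linear scaling, with efficiency falling from 2 points at no overhead to 0 points at 3x overhead), since the tables give only the point ranges, not the formulas.

```python
def overall_score(with_rate: float, without_rate: float, overhead_ratio: float) -> float:
    """Combine the three components into a 0-10 score.

    ASSUMED formulas (the rubric states only the ranges):
      quality    = 5 * with-skill pass rate
      value-add  = 3 * (with-skill rate - baseline rate), floored at 0
      efficiency = 2 at overhead <= 1x, falling linearly to 0 at >= 3x
    `overhead_ratio` is with-skill cost divided by baseline cost.
    """
    quality = 5.0 * with_rate
    value_add = 3.0 * max(0.0, with_rate - without_rate)
    efficiency = 2.0 * min(1.0, max(0.0, (3.0 - overhead_ratio) / 2.0))
    return round(quality + value_add + efficiency, 1)


def verdict(score: float) -> str:
    """Map a 0-10 score to the verdict labels in the table above."""
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"


perfect = overall_score(with_rate=1.0, without_rate=0.0, overhead_ratio=1.0)
mixed = overall_score(with_rate=0.8, without_rate=0.4, overhead_ratio=2.0)
flat = overall_score(with_rate=0.5, without_rate=0.5, overhead_ratio=3.0)
```

Under these assumed formulas, a skill that passes everything the baseline fails at no extra cost reaches 10, while one that matches the baseline at 3x the cost earns only its quality points.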
- VERSION (semver)
- evals/<slug>.json are versioned implicitly through git
- evals/skill-registry.json and evals/models.json