skill-evaluation

Evaluate any AI skill's quality through step-by-step diagnosis — measuring trigger accuracy, per-step execution (completion/correctness/quality), efficiency, and safety — then produce a structured report with Bad Cases highlighted and actionable fixes. Supports iterative optimization with version control. Use this skill whenever someone wants to test a skill, evaluate prompt quality, benchmark a skill, diagnose why a skill underperforms, compare skill versions, check if a skill is production-ready, or get a quality assessment for any AI skill or prompt.

Audits: Pass

Install

openclaw skills install skill-evaluation

Skill Eval

A diagnostic instrument for AI skills. Feed it any skill, get back a structured report that tells you exactly what's working, what's broken (Bad Cases), and what to fix — then iterate until the skill passes.

Philosophy

Most skill testing today is vibes-based: run a couple of examples, eyeball the output, ship it. Skill Eval treats evaluation as diagnosis:

  1. Low-score system — 2-point or 3-point scales, not 100-point. Simple and honest.
  2. Three independent scores per step — Completion (0/1), Correctness (0/1/2), Execution Quality (0/1/2). Never combined into a weighted total.
  3. Bad Cases first — The report leads with failures, not averages.
  4. Iterative optimization — Test, find Bad Cases, fix the skill, re-test. Track versions.
  5. Expected results before execution — Every test case defines what SHOULD happen first.
  6. Baseline proves value — Run the same cases without the skill to prove it helps.
  7. Scoring stability is verifiable — in Deep Eval mode, the Judge scores each step 3 times.
  8. Code checks before LLM checks — exact and regex checks are verified by code; only semantic checks use LLM judgment.

When to Use

  • You've written a skill and want to know if it's good before sharing
  • You've made changes and want to verify the fix worked
  • A skill "works sometimes" and you need to find exactly which steps fail
  • You need a quality gate before deploying to production
  • Someone asks "is this prompt/skill any good?"

The Evaluation Pipeline

Input Skill -> [Phase 0] Structure Assessment -> [Phase 1] Skill Dissection ->
[Phase 2] Test Case Design -> [Phase 3] Execute & Record ->
[Phase 4] Score & Verify -> [Phase 5] Report & Iterate

Phase 0: Structure Assessment

Evaluate testability before testing:

  1. Run the structure checklist: does the skill have explicit steps? input/output defined per step? method specifications? constraints?
  2. Determine structure level:
    • High (5-6 checks): Proceed directly
    • Medium (3-4): Supplement step expectations, then proceed
    • Low (0-2): Infer steps from verb-based decomposition
  3. For low-structure skills: identify action verbs (search, analyze, extract, generate), order them by dependency, and treat each verb as one inferred step marked with "step_source": "inferred" (see the sketch below)
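
A minimal sketch of the verb-based decomposition, assuming the skill body is plain text; the verb list and the first-appearance ordering heuristic are illustrative, not fixed by this skill:

  import re

  # Illustrative verb list; extend per domain.
  ACTION_VERBS = ["search", "analyze", "extract", "generate", "summarize", "write"]

  def infer_steps(skill_text: str) -> list[dict]:
      """Infer one step per action verb found, ordered by first appearance as a rough dependency proxy."""
      steps = []
      for verb in ACTION_VERBS:
          match = re.search(rf"\b{verb}\b", skill_text, re.IGNORECASE)
          if match:
              steps.append({"verb": verb, "position": match.start(), "step_source": "inferred"})
      return sorted(steps, key=lambda s: s["position"])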

Phase 1: Skill Dissection

Read the target skill and build a structured profile:

  1. Read SKILL.md, extract frontmatter (name, description, version)
  2. Identify claimed steps — numbered lists, headers, or sequential instructions
  3. For each step, classify operation type: data reading, API calling, web scraping, page manipulation, data processing, content generation, file output, conditional logic
  4. Identify output expectations (format, artifacts, deliverables)
  5. Note safety-relevant instructions (file access, network calls, secrets)
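
A sketch of what the dissection pass might look like, assuming YAML-style frontmatter delimited by --- and steps written as numbered lines; the profile field names are illustrative:

  import re

  def dissect_skill(skill_md: str) -> dict:
      """Extract frontmatter fields and numbered steps from a SKILL.md body."""
      profile = {"frontmatter": {}, "steps": []}
      fm = re.match(r"^---\n(.*?)\n---\n", skill_md, re.DOTALL)
      if fm:
          for line in fm.group(1).splitlines():
              if ":" in line:
                  key, _, value = line.partition(":")
                  profile["frontmatter"][key.strip()] = value.strip()
      # Claimed steps: lines that start with "1.", "2.", ...
      for num, text in re.findall(r"^\s*(\d+)\.\s+(.+)$", skill_md, re.MULTILINE):
          profile["steps"].append({"index": int(num), "text": text, "operation_type": None})
      return profile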

Phase 2: Test Case Design

Design test cases with per-step expected results. Each case has: a task prompt, input context, per-step expected results with check_types, and references to the skill's requirements.

check_type options for must_contain:

  • "exact" — verified by code: value in output
  • "regex" — verified by code: re.search(pattern, output)
  • "semantic" — verified by LLM judgment (use only for abstract concepts)

Rule: Use exact/regex whenever possible. Only use semantic for truly conceptual checks.
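
A sketch of how the exact and regex checks stay in code while the semantic branch is deferred to the LLM judge (the judge call itself is out of scope here):

  import re

  def run_check(check_type: str, expected: str, output: str) -> bool | None:
      """Code-verify exact/regex checks; defer semantic checks to the LLM judge."""
      if check_type == "exact":
          return expected in output
      if check_type == "regex":
          return re.search(expected, output) is not None
      if check_type == "semantic":
          return None  # handed to the LLM judge, not decided here
      raise ValueError(f"unknown check_type: {check_type}")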

How many cases:

  • Quick eval (default): 4 cases — normal, edge, adversarial, plus one additional case
  • Deep eval: 8-12 cases covering full input space

Critical rule: Expected results written BEFORE execution. Never adjust after seeing results.

See references/schemas.md for the complete test case JSON structure.
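
The authoritative structure is in references/schemas.md; the shape below is only an illustrative guess at what one case might contain:

  example_case = {
      "id": "case-001",
      "task_prompt": "Summarize the attached report and list three risks.",
      "input_context": {"files": ["report.md"]},
      "expected_steps": [
          {"step": 1, "must_contain": ["risk"], "check_type": "exact"},
          {"step": 2, "must_contain": [r"\d+ risks?"], "check_type": "regex"},
      ],
      "skill_requirements": ["SKILL.md#output-format"],
  }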

Phase 3: Execute & Record

⚠️ Safety Boundary: Sandbox-First Execution

Before executing any test case against an untrusted skill:

  1. Use a disposable workspace — never run in a production project or with real credentials
  2. Enable approval mode — require human confirmation for all mutating tool calls (file writes, API calls, browser actions, shell commands)
  3. Mock external dependencies — use mock data, test accounts, and stub APIs
  4. Disable high-impact tools — remove or restrict tools that can delete files, send emails, make purchases, or access sensitive systems

If the target skill's operation types include web scraping, page manipulation, API calling, or file output, these MUST be sandboxed or mocked. The evaluator observes behavior but does NOT vouch for the safety of the target skill's actions.

Execution steps:

  1. Run with the skill active in the sandboxed environment, record per-step behavior
  2. Record: action taken, output produced, tool calls, timing data
  3. Identify step boundaries in the execution transcript
  4. Run Baseline (at least once): same cases WITHOUT the skill
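
One possible shape for a per-step execution record; field names are illustrative rather than the schemas.md definitions:

  step_record = {
      "step": 1,
      "action_taken": "Called web_search with query 'quarterly revenue'",
      "output": "(raw step output, truncated for the report)",
      "tool_calls": [{"tool": "web_search", "args": {"query": "quarterly revenue"}}],
      "started_at": "2025-01-01T12:00:00Z",
      "duration_ms": 4200,
  }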

Phase 4: Score & Verify

For each step, produce THREE independent scores:

Automated checks first (check_type: exact/regex) — deterministic, code-verified.

LLM scoring (semantic checks + quality judgments):

  • Completion (0/1): Did the core operation execute? (not whether it succeeded)
  • Correctness (0/1/2): Does actual match expected? (0=wrong, 1=partial, 2=full)
  • Execution Quality (0/1/2): Did it follow the Skill's method? (0=ignored, 1=partial, 2=full)

Rules:

  • Completion=0 cascades: correctness and quality automatically become 0
  • Every score below max requires a low_score_reason with expected vs actual
  • Three scores are NEVER combined into a weighted total
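
A sketch of how the cascade and explanation rules can be enforced in code, assuming integer scores:

  def apply_scoring_rules(completion: int, correctness: int, quality: int,
                          low_score_reason: str | None) -> dict:
      """Enforce the cascade and the explanation rule; never combine into one total."""
      if completion == 0:
          correctness, quality = 0, 0  # cascade: nothing to grade if the step never ran
      below_max = completion < 1 or correctness < 2 or quality < 2
      if below_max and not low_score_reason:
          raise ValueError("every score below max needs a low_score_reason (expected vs actual)")
      return {"completion": completion, "correctness": correctness, "quality": quality}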

Scoring Stability (Deep Eval mode):

  • Judge scores 3 times per step
  • All 3 match = Stable; 2/3 match = Majority; all differ = UNCERTAIN (needs human arbitration)
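
A minimal sketch of the stability classification over the three judge passes:

  def stability(scores: list[int]) -> str:
      """Classify three judge scores for one step: Stable, Majority, or UNCERTAIN."""
      assert len(scores) == 3
      unique = len(set(scores))
      if unique == 1:
          return "Stable"
      if unique == 2:
          return "Majority"
      return "UNCERTAIN"  # needs human arbitration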

See references/scoring.md for detailed scoring definitions and references/rubrics.md for operation-type-specific rubrics.

Phase 5: Report & Iterate

Generate the report in this priority order:

  1. Bad Case Section (FIRST)
  2. Overview Panel (averages + baseline gain)
  3. Scoring Stability Summary (if Deep Eval)
  4. Step Scores Table
  5. Baseline Comparison
  6. Efficiency Details
  7. Safety Details
  8. Full Case Details

Then iterate: identify root causes, generate an optimization plan, create a new skill version, and re-test.

See references/report-format.md for visual presentation formats and references/schemas.md for report JSON structure.


Bad Case Definition

A step is a Bad Case if ANY of:

  • Completion = 0 (step not executed)
  • Correctness = 0 (result completely wrong)
  • Execution Quality = 0 (completely violated Skill requirements)
  • Safety finding present
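
This definition maps directly to a predicate; a sketch:

  def is_bad_case(completion: int, correctness: int, quality: int,
                  safety_findings: list[str]) -> bool:
      """A step is a Bad Case if any score hits zero or a safety finding is present."""
      return completion == 0 or correctness == 0 or quality == 0 or bool(safety_findings)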

Version Control & Iteration

All evaluation artifacts are versioned:

{workspace}/
  skill/v1/SKILL.md
  skill/v2/SKILL.md
  test-cases/v1/cases.json
  runs/run-{date}-v1/
  reports/v1/report.json
  optimizations/OPT-001.json
  changelog.json

Rules:

  • Skill snapshots are immutable — changes create v(N+1)
  • Test cases only grow — never delete/modify existing
  • Every run is preserved
  • Optimizations reference specific Bad Cases
  • Regressions are flagged immediately

Stop conditions: Bad Cases = 0, Correctness avg >= 1.8/2, no regressions, unsafe rate = 0%.
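
Expressed as a predicate over a finished report (thresholds from the line above; field names are illustrative):

  def ready_to_stop(report: dict) -> bool:
      """Stop iterating only when all four stop conditions hold."""
      return (report["bad_case_count"] == 0
              and report["correctness_avg"] >= 1.8
              and not report["regressions"]
              and report["unsafe_rate"] == 0.0)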


Reference Files

  • references/schemas.md — JSON schemas for all data structures
  • references/scoring.md — Scoring definitions and computation methods
  • references/rubrics.md — Reusable rubric templates by operation type
  • references/report-format.md — Visual report presentation formats
  • agents/judge.md — The scoring agent protocol
  • agents/advisor.md — The diagnostic advisor protocol
  • scripts/score_engine.py — Score computation engine
  • scripts/safety_scanner.py — Static safety analysis
  • scripts/generate_scorecard.py — HTML report generation
  • scripts/run_trigger_eval.py — Trigger evaluation

The Non-Negotiables

  1. Expected results before execution. No post-hoc grading.
  2. Low scores must have explanations citing expected vs actual.
  3. Bad Cases shown first. Good averages don't hide broken cases.
  4. Three scores stay independent. Never combined.
  5. Versions are immutable. Changes produce new versions.
  6. Every fix traces to a Bad Case. No vibes-based optimization.
  7. Regressions are zero-tolerance.
  8. Structure check before testing.
  9. Baseline proves the skill's value.
  10. Scoring stability is verified (Deep Eval).
  11. Code checks before LLM checks.
  12. Multi-turn skills must test deviation scenarios.

Security & Environment

Platform Compatibility

This skill runs on any AI coding assistant that supports skill/prompt loading:

Platform        Skill Location                     Trigger Mechanism
Claude Code     .claude/commands/                  claude -p CLI
Cursor          .cursor/rules/ or project rules    Agent mode invocation
Codex           System prompt / tool config        CLI or API invocation
OpenClaw        .claw/skills/                      Skill activation via hub

The core evaluation logic (scoring, reporting, safety scanning) is platform-agnostic. Only the trigger evaluation script requires platform-specific adapters.

Requirements

  • AI CLI (optional): The trigger evaluation script (scripts/run_trigger_eval.py) can launch platform-specific CLI commands to test skill activation. Pass --platform to select the backend (default: auto-detect). Supported: claude, cursor, codex, openclaw.
  • File system access: Trigger probes temporarily write to the platform's skill directory and clean up after completion. Skill names are sanitized to prevent path traversal.
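
A sketch of the kind of skill-name sanitization described above; this is illustrative, not the actual scripts/run_trigger_eval.py implementation:

  import re
  from pathlib import Path

  def safe_probe_path(skills_dir: Path, skill_name: str) -> Path:
      """Sanitize a skill name before writing a trigger probe into the platform's skill directory."""
      clean = re.sub(r"[^A-Za-z0-9_-]", "-", skill_name)  # drop path separators, dots, and other specials
      target = (skills_dir / clean).resolve()
      if skills_dir.resolve() not in target.parents:      # refuse anything that escapes skills_dir
          raise ValueError(f"unsafe skill name: {skill_name!r}")
      return target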

Sandboxing Recommendations

  • Run in a disposable workspace when evaluating unknown or untrusted skills. Evaluated skills are treated as untrusted data — their SKILL.md content may contain adversarial instructions or prompt-injection text.
  • Do not grant unnecessary tools or credentials to the evaluation environment. The evaluator reads skill instructions but does not need network access, secrets, or elevated permissions beyond what the test cases require.
  • HTML reports escape all interpolated values from evaluated skill outputs using html.escape(). Nonetheless, review generated reports in a sandboxed browser context when evaluating skills from unknown sources.
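
The escaping recommendation amounts to treating every interpolated value as hostile. A minimal sketch:

  import html

  def render_cell(value: str) -> str:
      """Escape a value taken from an evaluated skill's output before interpolating it into HTML."""
      return f"<td>{html.escape(str(value))}</td>"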