Skill Eval

Skill evaluation framework. Use when: testing trigger rate, quality compare (with/without skill), or model comparison. Runs via sessions_spawn + sessions_history. Trigger words: evaluate skill, benchmark, trigger rate, A/B compare. NOT for: debugging conversations, general testing unrelated to skill evaluation.

Xiaoxing9@xiaoxing9

Install

openclaw skills install @xiaoxing9/openclaw-skill-eval

openclaw-eval-skill

Evaluation framework for any OpenClaw skill. No claude CLI dependency — all agent execution runs through sessions_spawn + sessions_history.

Scope: Works with CLI tool skills, conversational skills, and API integration skills.

Runtime Actions Disclosure

This skill performs the following actions during evaluation:

| Action | Purpose | When |
| --- | --- | --- |
| Read `~/.openclaw/openclaw.json` | Find skill directories (extraDirs) | Path resolution |
| Write to `eval-workspace/` | Store evaluation results | Every eval run |
| Call `sessions_spawn` | Run test queries in isolated sessions | Trigger & quality tests |
| Call `sessions_history` | Collect conversation data for analysis | After each spawn |
| Persist `cleanup="keep"` sessions | Required for trigger detection | Trigger rate tests |

NOT performed automatically: Gateway restart, config modification, skill installation. These require manual user action (see "Bundled Test Skill" section).

Quick Eval

Just say:

```text
evaluate weather
```

The agent will:

1. Run `scripts/resolve_paths.py weather` to find all paths
2. Execute trigger rate + quality compare with detected evals
3. Output results to `eval-workspace/weather/iter-N/`

Options:

- `evaluate weather trigger`: trigger rate only
- `evaluate weather quality`: quality compare only
- `evaluate github --mode all`: explicit mode

What gets auto-detected:

- Skill path: from OpenClaw built-in skills or registered extraDirs
- Evals: from `evals/{skill-name}/`, falling back to `evals/example-*.json`
- Output: next available `iter-N` directory
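
The "next available iter-N directory" rule can be pictured with a short sketch. This is an illustration only, not the code in `scripts/resolve_paths.py`: the function name `next_iter_dir` and its `prefix` parameter are hypothetical, and "next available" is assumed to mean one more than the highest existing number.

```python
import re
from pathlib import Path


def next_iter_dir(skill_root: Path, prefix: str = "iter-") -> Path:
    """Return <skill_root>/<prefix>N where N is one past the highest existing N.

    Hypothetical helper: the real resolver may pick N differently.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    used = []
    if skill_root.is_dir():
        for child in skill_root.iterdir():
            match = pattern.match(child.name)
            if child.is_dir() and match:
                used.append(int(match.group(1)))
    # Start at 1 when no earlier iteration exists.
    return skill_root / f"{prefix}{max(used, default=0) + 1}"
```

The prefix is a parameter because this document writes the directory both as `iter-N` and as `iteration-N`.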

First step for agent: Run the resolver to get paths:

```bash
python scripts/resolve_paths.py {skill-name} --mode {trigger|quality|all}
```

Use the JSON output to fill in paths for the workflows below.

Bundled Test Skill: fake-tool

A test skill (test-skills/fake-tool/) is included for validating trigger rate detection. It simulates a fictional "Zephyr API" that models cannot know from training.

Manual setup required: The agent will NOT automatically install fake-tool or restart your gateway. If you want to test with fake-tool:

1. Copy fake-tool to your skills directory:

   ```bash
   cp -r test-skills/fake-tool ~/.openclaw/workspace/skills/
   ```

2. Restart the OpenClaw gateway (from a terminal):

   ```bash
   openclaw gateway restart
   ```

3. Verify registration:

   ```bash
   python scripts/resolve_paths.py fake-tool
   ```

If step 3 returns a valid path, fake-tool is ready. If it reports "not found", check that your `~/.openclaw/openclaw.json` includes the skills directory in `skills.load.extraDirs`.
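
The `skills.load.extraDirs` check can also be done by hand with a few lines of Python. This is a minimal sketch under two assumptions that the text above does not confirm: that `openclaw.json` is plain JSON, and that `extraDirs` is a list of directory path strings. The function name `skill_dir_registered` is hypothetical.

```python
import json
from pathlib import Path


def skill_dir_registered(config_path: Path, skills_dir: Path) -> bool:
    """Return True if skills_dir is listed under skills.load.extraDirs.

    Assumes plain JSON and a list of path strings (not confirmed by the docs).
    """
    config = json.loads(config_path.read_text(encoding="utf-8"))
    extra_dirs = config.get("skills", {}).get("load", {}).get("extraDirs", [])
    target = skills_dir.expanduser().resolve()
    # Compare resolved paths so "~" and relative segments do not cause false misses.
    return any(Path(entry).expanduser().resolve() == target for entry in extra_dirs)
```

For example, `skill_dir_registered(Path("~/.openclaw/openclaw.json").expanduser(), Path("~/.openclaw/workspace/skills"))` would report whether the directory used in step 1 is registered.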

Evaluation Scenarios

Tier 1: Core (Always Run)

| Scenario | What It Tests | Output |
| --- | --- | --- |
| Trigger Rate | Does description trigger SKILL.md reads at the right times? Includes positive (should trigger) AND negative (should NOT trigger) cases. | recall, specificity, precision, F1 |
| Quality Compare | Does skill improve output vs no-skill baseline? | quality_score, assertion pass rate |
| Description Diagnosis | Why did triggers fail? Analyzes both false negatives AND false positives. | gap analysis, recommendations |
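
The four Trigger Rate metrics follow from the usual confusion-matrix counts over positive and negative cases. The sketch below uses the standard definitions; the names `TriggerCounts` and `trigger_metrics` are hypothetical and are not taken from the analysis scripts, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
from dataclasses import dataclass


@dataclass
class TriggerCounts:
    tp: int  # should trigger, SKILL.md was read
    fn: int  # should trigger, SKILL.md was not read
    tn: int  # should NOT trigger, SKILL.md was not read
    fp: int  # should NOT trigger, SKILL.md was read


def _ratio(numerator: float, denominator: float) -> float:
    # Convention for this sketch: an undefined ratio is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def trigger_metrics(c: TriggerCounts) -> dict:
    recall = _ratio(c.tp, c.tp + c.fn)        # share of positive cases that triggered
    specificity = _ratio(c.tn, c.tn + c.fp)   # share of negative cases that stayed quiet
    precision = _ratio(c.tp, c.tp + c.fp)     # share of triggers that were wanted
    f1 = _ratio(2 * precision * recall, precision + recall)
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "f1": f1}
```

With 8 true positives, 2 false negatives, 9 true negatives and 1 false positive, recall is 0.8, specificity is 0.9, precision is 8/9 and F1 is 16/19.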

Tier 2: Optional (Run When Needed)

| Scenario | What It Tests | When to Use |
| --- | --- | --- |
| Model Comparison | Quality + speed across haiku/sonnet/opus | Before deployment: which model is enough? |
| Efficiency Profile | Response time + retry patterns | When skill feels slow: is agent walking wrong paths? |

Tier 3: Future (Roadmap)

| Scenario | What It Tests | Status |
| --- | --- | --- |
| Cross-skill Conflict | Two skills with overlapping descriptions | Planned |
| Error Recovery | Does agent recover when CLI fails? | Planned |

How This Skill Works

Two-layer architecture:

```text
Layer 1: Agent (main OpenClaw session) — YOU ARE HERE
  → Reads evals.json
  → Calls sessions_spawn to run subagents
  → Calls sessions_history to collect results
  → Writes raw data to workspace/

Layer 2: Python analysis scripts (run via exec)
  → Read the raw data from workspace/
  → Compute statistics
  → Generate reports
```

Python scripts (analyze_*.py) are data processors — they cannot call sessions_spawn. The agent drives the workflow.
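
To make the Layer 2 role concrete, here is a rough sketch of the kind of check a data processor could run on a saved history: did the session read the skill's SKILL.md? The schema returned by `sessions_history` is not specified in this document, so the sketch assumes only that tool calls appear somewhere in the JSON as objects with `"type": "tool_use"`. The function name `read_skill_md` is hypothetical, and the real `analyze_*.py` scripts may work differently.

```python
import json


def read_skill_md(history, skill_name: str) -> bool:
    """Return True if any tool_use entry references <skill_name>/SKILL.md.

    Walks arbitrary nested JSON because the real history schema is an assumption.
    Paths are matched with forward slashes only.
    """
    needle = f"{skill_name}/SKILL.md"
    stack = [history]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            # Only tool calls count; a user merely mentioning the path does not.
            if node.get("type") == "tool_use" and needle in json.dumps(node):
                return True
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
    return False
```

A function like this only reads data that the agent has already written to disk, which matches the split described above: the script never spawns sessions itself.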

Usage

Follow USAGE.md for all workflows.

Quick reference:

| Workflow | What It Tests | USAGE.md Section |
| --- | --- | --- |
| Trigger Rate | Does `description` trigger SKILL.md reads at the right times? | Workflow 1 |
| Quality Compare | Does skill improve output vs no-skill baseline? | Workflow 2 |
| Model Comparison | Quality + Speed across haiku/sonnet/opus | Workflow 3 |
| Latency Profile | Response time p50/p90 | Workflow 4 |
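
For the Latency Profile row, p50 and p90 can be computed from a list of response times with the standard library. This is a sketch, not the framework's own script: the function name `latency_percentiles` is hypothetical, and the interpolation method is a choice made here, since percentile definitions differ slightly between tools.

```python
import statistics


def latency_percentiles(durations_s):
    """Return the p50 and p90 of response times given in seconds."""
    if len(durations_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=10) returns the nine cut points p10 through p90.
    deciles = statistics.quantiles(durations_s, n=10, method="inclusive")
    return {"p50": statistics.median(durations_s), "p90": deciles[8]}
```

A p90 well above the p50 suggests that a few runs are much slower than the typical one, which is worth checking against retry patterns.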

Each workflow follows the same pattern:

1. Agent spawns subagents using `sessions_spawn`
2. Agent collects histories using `sessions_history`
3. Agent writes raw data to `workspace/{skill}/iter-{n}/raw/`
4. Agent runs analysis script via exec

Core Principles

- Never modify the evaluated skill: observe only and give recommendations
- Keep eval records in workspace: output goes to `eval-workspace/<skill-name>/iteration-N/`
- Keep full records: save `full_history.json` (including tool_use + tool_result)

agents/ Reference

| File | Purpose | When to Use |
| --- | --- | --- |
| `grader.md` | Check assertions, record behavior anomalies, give priority recommendations | Required for every Quality Compare eval |
| `comparator.md` | Blind A/B comparison without assertions | When unbiased comparison is needed |
| `analyzer.md` | Analyze cross-eval patterns after all evals complete | Post-analysis |

Directory Structure

```text
eval-workspace/<skill-name>/
├── evals.json                    ← Eval definition (shared across iterations)
└── iteration-1/
    ├── raw/
    │   ├── histories/            ← Trigger test session histories
    │   └── transcripts/          ← Quality compare transcripts
    ├── trigger_results.json      ← analyze_triggers output
    ├── quality_results.json      ← analyze_quality output
    └── diagnostics/
        └── RECOMMENDATIONS.md
```

evals.json Format

Quality Compare (prompt + assertions):

```json
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "name": "onboarding-fresh",
      "prompt": "Check the weather in Tokyo",
      "context": "Clean machine, no prior setup. For grader only.",
      "expected_output": "Install → configure → verify profile",
      "assertions": [
        {
          "id": "a1-1",
          "description": "Install command executed",
          "type": "output_contains",
          "value": "pip install"
        },
        {
          "id": "a1-2",
          "description": "Profile verified after setup",
          "type": "output_contains",
          "value": "profile current",
          "priority": true
        }
      ]
    }
  ]
}
```

Trigger Rate (query + expected):

```json
{
  "id": 1,
  "name": "direct-weather",
  "query": "What's the weather in Singapore?",
  "expected": true,
  "category": "positive"
}
```
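
Before running a trigger-rate eval it can help to sanity-check each entry. The sketch below validates the fields shown in the example above. It is not part of the framework: the function name `validate_trigger_eval` is hypothetical, treating all four fields as required is an assumption, and so is the category value `"negative"` (only `"positive"` appears in this document).

```python
def validate_trigger_eval(entry: dict) -> list:
    """Return a list of problems found in one trigger-rate eval entry."""
    problems = []
    required = (("id", int), ("name", str), ("query", str), ("expected", bool))
    for field, kind in required:
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif type(entry[field]) is not kind:
            # Exact type check, so that True is not accepted as an integer id.
            problems.append(f"{field} should be {kind.__name__}")
    category = entry.get("category")
    expected = entry.get("expected")
    # Assumed convention: "positive" pairs with expected=true, "negative" with false.
    if category in ("positive", "negative") and type(expected) is bool:
        if (category == "positive") != expected:
            problems.append(f"category {category!r} disagrees with expected={expected}")
    return problems
```

An empty list means the entry has the expected shape; anything else is a human-readable description of what to fix.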

Assertion Types

| Type | Detection |
| --- | --- |
| `output_contains` | Value appears in conversation or tool output |
| `output_not_contains` | Value does not appear |
| `output_count_max` | Occurrences ≤ max |
| `tool_called` | Specific tool called at least once |
| `tool_not_called` | Specific tool not called |
| `conversation_contains` | Value appears anywhere in conversation |
| `conversation_contains_any` | At least one value appears |

Priority assertions (`"priority": true`): if any one fails, the overall result is FAIL. Gap assertions (`"note": "Best practice..."`): a failure indicates a gap in the skill's design.
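
The three text-based assertion types and the priority rule can be illustrated with a short sketch. It is not the grader's implementation (grading is driven by `grader.md`). The names `check_assertion` and `grade` are hypothetical, the `"max"` key for `output_count_max` is an assumed field name, and matching is assumed to be a case-sensitive substring test.

```python
def check_assertion(assertion: dict, output: str) -> bool:
    """Evaluate one text-based assertion against the collected output."""
    kind = assertion["type"]
    value = assertion["value"]
    if kind == "output_contains":
        return value in output
    if kind == "output_not_contains":
        return value not in output
    if kind == "output_count_max":
        # "max" is an assumed field name for the occurrence limit.
        return output.count(value) <= assertion["max"]
    raise ValueError(f"unsupported assertion type in this sketch: {kind}")


def grade(assertions: list, output: str) -> dict:
    results = {a["id"]: check_assertion(a, output) for a in assertions}
    passed = sum(results.values())
    # A failed priority assertion forces overall=FAIL regardless of the pass rate.
    forced_fail = any(a.get("priority") and not results[a["id"]] for a in assertions)
    return {
        "results": results,
        "pass_rate": passed / len(assertions) if assertions else 0.0,
        "forced_fail": forced_fail,
    }
```

The sketch reports the pass rate and the forced failure separately, because this document states only the priority rule and not how non-priority failures affect the overall verdict.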

Issue Priority (grader output)

```text
🔴 P0 Critical  — Core functionality broken
🟠 P1 High      — Significantly impacts usability
🟡 P2 Medium    — Room for improvement
🟢 P3 Low       — Minor polish
```

Behavior Anomaly Tracking

Grader records these signals beyond assertions:

| Field | Trigger |
| --- | --- |
| `path_corrections` | Wrong path then self-corrected |
| `retry_count` | Same command executed multiple times |
| `missing_file_reads` | Attempted to read non-existent files |
| `skipped_steps` | Steps required by skill were not executed |
| `hallucinations` | Fabricated non-existent commands/APIs |
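
As an illustration of one of these signals, `retry_count` ("same command executed multiple times") could be derived from the list of commands a session ran. This is one possible reading of the definition, not the grader's actual logic. The function name and the choice to count every repeat after the first run are assumptions.

```python
from collections import Counter


def retry_count(commands: list) -> int:
    """Count repeated executions: each run of a command after its first counts once."""
    counts = Counter(cmd.strip() for cmd in commands)
    return sum(n - 1 for n in counts.values() if n > 1)
```

For `["ls", "pip install x", "pip install x", "ls ", "pip install x"]` this yields 3: one repeat of `ls` and two repeats of `pip install x`.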

Key Constraints

- `sandbox="inherit"`: subagents inherit the skill registration environment
- `cleanup="keep"`: history must be retained for trigger detection
- Skill must be in a real directory under `skills.load.extraDirs` (symlinks rejected)
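
The last constraint can be checked before running an eval. The sketch below is an assumption-laden illustration: the function name `is_real_skill_dir` is hypothetical, and "under" is read here as "a direct child of one of the extraDirs entries", which this document does not confirm.

```python
from pathlib import Path


def is_real_skill_dir(skill_dir: Path, extra_dirs: list) -> bool:
    """True if skill_dir is a non-symlinked directory directly inside one of extra_dirs."""
    # is_symlink() inspects the entry itself, so a link to a valid directory is still rejected.
    if skill_dir.is_symlink() or not skill_dir.is_dir():
        return False
    parent = skill_dir.parent.resolve()
    return any(Path(d).expanduser().resolve() == parent for d in extra_dirs)
```

This explains why copying a skill with `cp -r` works where linking it into the skills directory with `ln -s` would not.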
