Install
openclaw skills install ab-test-eval
Run A/B evaluation tests for any OpenClaw skill, script, hook, or cron job. Make sure to use this skill whenever the user mentions testing, benchmarking, comparing, or evaluating a skill, script, hook, or cron job, even if they don't explicitly ask for 'AB testing'. Supports 10 eval modes: baseline comparison, regression testing, model-swap, prompt variants, trigger accuracy, adversarial robustness, script validation, hook dry-run, cron dry-run, and integration testing.
Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.
Pick the mode that matches the user's intent:
| Mode | Question | Arms |
|---|---|---|
| baseline | Does the skill help at all? | with-skill vs without-skill |
| regression | Did changes break anything? | skill-v2 vs skill-v1 |
| model-swap | Works on another model? | model-A vs model-B |
| prompt-variant | Which description works better? | variant-A vs variant-B |
| trigger-accuracy | Dispatches correctly? | should-trigger vs should-not |
| adversarial | Robust against bad inputs? | clean vs perturbed |
| script-test | Script produces correct output? | script-A vs script-B |
| hook-dryrun | Hook responds correctly? | with-hook vs without |
| cron-dryrun | Cron payload does the right thing? | cron-run vs baseline |
| integration | Full stack works together? | full vs missing-component |
Default to baseline if unclear.
Create the eval workspace as a sibling to the skill directory:
<skill-dir>/evals/evals.json
<skill-dir>/<skill-name>-workspace/
  iteration-1/
    <eval-name>/
      <arm-a>/
        outputs/commands.md
        timing.json
        grading.json
      <arm-b>/
        outputs/commands.md
        timing.json
        grading.json
      eval_metadata.json
    benchmark.json
    benchmark.md
  iteration-2/
    ...
  history.jsonl
Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).
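The layout above can also be created from a script. The following is a minimal sketch, not part of the skill itself: the function name `create_eval_dirs` and its parameters are hypothetical, and it assumes the path pattern `<workspace>/iteration-N/<eval-name>/<arm>/outputs` used elsewhere in this document.

```python
from pathlib import Path


def create_eval_dirs(workspace: Path, iteration: int, eval_name: str, arms: list[str]) -> list[Path]:
    """Create <workspace>/iteration-N/<eval-name>/<arm>/outputs for every arm.

    Hypothetical helper: the name and signature are illustrative only.
    """
    created = []
    for arm in arms:
        outputs = workspace / f"iteration-{iteration}" / eval_name / arm / "outputs"
        # parents=True plus exist_ok=True mirrors the behavior of `mkdir -p`.
        outputs.mkdir(parents=True, exist_ok=True)
        created.append(outputs)
    return created
```

Calling it twice with the same arguments is safe, because existing directories are left untouched.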
If <skill-dir>/evals/evals.json exists, read it and present the cases to the user for confirmation before running. Do not auto-run without sign-off.
If it does not exist, generate the cases by reading the skill's SKILL.md and creating 4-6 realistic eval cases.
Write to <skill-dir>/evals/evals.json:
{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Realistic user request",
      "expected_output": "What correct behavior looks like",
      "files": []
    }
  ]
}
Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"
Wait for approval before spawning subagents.
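Before showing the cases to the user, it may help to sanity-check the file. The sketch below assumes that the four per-case fields shown in the example (`id`, `prompt`, `expected_output`, `files`) are all required; the source does not state this, and `validate_evals` is a hypothetical helper name.

```python
import json

# Assumption: every case needs the same fields as the example above.
REQUIRED_FIELDS = ("id", "prompt", "expected_output", "files")


def validate_evals(raw: str) -> list[str]:
    """Return the problems found in an evals.json document (empty list if none)."""
    data = json.loads(raw)
    problems = []
    if not isinstance(data.get("skill_name"), str):
        problems.append("skill_name must be a string")
    evals = data.get("evals", [])
    if not evals:
        problems.append("evals must contain at least one case")
    seen_ids = set()
    for index, case in enumerate(evals):
        for field in REQUIRED_FIELDS:
            if field not in case:
                problems.append(f"evals[{index}] is missing '{field}'")
        case_id = case.get("id")
        if case_id is not None:
            # Duplicate ids would make per-eval results ambiguous later on.
            if case_id in seen_ids:
                problems.append(f"evals[{index}] reuses id {case_id}")
            seen_ids.add(case_id)
    return problems
```

An empty result means the file matches the assumed shape; it says nothing about whether the cases are realistic, which is still the user's call.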
Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).
--dry Preview
Generate a preview report that lists exactly what will run, without spawning any subagents:
# Eval Preview Report
- Mode: baseline
- Evals: 4
- Arms per eval: 2 (with-skill, without-skill)
- Model: current
- Estimated subagent calls: 8
Evals:
1. happy-path-basic — 2 arms, 3 assertions
2. ambiguous-request — 2 arms, 3 assertions
Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"
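The preview report can be assembled without spawning anything. This sketch is one possible way to do it, assuming the estimate counts one subagent per arm per eval (which matches the sample: 4 evals x 2 arms = 8) and does not count grading subagents. `render_preview` and its parameters are hypothetical names.

```python
def render_preview(mode: str, model: str, arms: list[str], evals: list[tuple[str, int]]) -> str:
    """Build the preview text; `evals` holds (eval_name, assertion_count) pairs."""
    lines = [
        "# Eval Preview Report",
        f"- Mode: {mode}",
        f"- Evals: {len(evals)}",
        f"- Arms per eval: {len(arms)} ({', '.join(arms)})",
        f"- Model: {model}",
        # Assumption: one subagent call per arm per eval, graders excluded.
        f"- Estimated subagent calls: {len(evals) * len(arms)}",
        "Evals:",
    ]
    for number, (name, assertion_count) in enumerate(evals, start=1):
        # \u2014 is the dash used in the sample report's per-eval lines.
        lines.append(f"{number}. {name} \u2014 {len(arms)} arms, {assertion_count} assertions")
    return "\n".join(lines)
```

Because the function only formats text, it costs no tokens and can be re-run after the user trims the list.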
--smoke Smoke Test
If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.
After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"
While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.
Save to <workspace>/iteration-N/<eval-name>/eval_metadata.json:
{
  "eval_id": 1,
  "eval_name": "happy-path-basic",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Uses the --force flag",
      "expected": true
    },
    {
      "text": "Warns about OAuth timeout gotcha",
      "expected": true
    }
  ]
}
Assertions use text and expected fields. These are the basis for grading.
For each eval, spawn all arms in the same turn. Launch as many as the environment allows concurrently.
SKILL.md, execute prompt, save outputs
SKILL.md
cp -r snapshot before editing)
SKILL.md
SKILL.md
Each prompt gets ONE subagent tasked as the dispatcher:
"You are the dispatcher. Given this user prompt, would you load <skill-path>/SKILL.md before responding? Answer yes/no and explain why."
Save yes/no explanations, then grade TP/FP/TN/FN.
<hook-type> event with this payload. Given this hook's SKILL.md or config, what would you do?"
jobs.json or cron config).
exec dry-run context.
Task template for standard arms:
Execute this task:
- Arm: <arm-name>
- Skill path: <absolute-path> or "none"
- Model override: <model> or "default"
- Task: <eval prompt>
- Input files: <files or "none">
- Save outputs to: <workspace>/iteration-N/<eval-name>/<arm>/outputs/commands.md
- Execute the task using available tools — if the subagent has tool access, run commands for real; if not, document what would be done.
When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture them, so record both immediately.
Save to <arm>/timing.json:
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
Process each notification as it arrives rather than batching.
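A small helper keeps the three timing fields consistent. The sketch below assumes `total_duration_seconds` is simply `duration_ms` converted to seconds and rounded to one decimal, which agrees with the sample values (23332 ms and 23.3 s). The function names are hypothetical.

```python
import json
from pathlib import Path


def build_timing(total_tokens: int, duration_ms: int) -> dict:
    """Derive the timing.json payload from the two values in the notification."""
    return {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Assumption: seconds are duration_ms / 1000, rounded to one decimal.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }


def save_timing(arm_dir: Path, total_tokens: int, duration_ms: int) -> Path:
    """Write <arm>/timing.json as soon as the notification arrives."""
    arm_dir.mkdir(parents=True, exist_ok=True)
    path = arm_dir / "timing.json"
    payload = build_timing(total_tokens, duration_ms)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Calling `save_timing` from the notification handler for each arm, rather than after all arms finish, follows the advice above about not batching.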
Spawn a grading subagent per eval to compare all arms against the assertions:
Read the following files:
- <workspace>/iteration-N/<eval-name>/<arm-a>/outputs/commands.md
- <workspace>/iteration-N/<eval-name>/<arm-b>/outputs/commands.md
Eval prompt: <prompt>
Expected output: <expected_output>
Grade each arm against these assertions:
<assertions from eval_metadata.json>
For each arm, save a separate grading.json with:
- A top-level "expectations" array with text/passed/evidence
- A "summary" with passed/failed/total/pass_rate
Save arm A results to: <workspace>/iteration-N/<eval-name>/<arm-a>/grading.json
Save arm B results to: <workspace>/iteration-N/<eval-name>/<arm-b>/grading.json
Each grading.json schema:
{
  "expectations": [
    {
      "text": "Uses the --force flag",
      "passed": true,
      "evidence": "Output contains 'clawhub update --force'"
    }
  ],
  "summary": {
    "passed": 3,
    "failed": 0,
    "total": 3,
    "pass_rate": 1.0
  }
}
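The summary block can be derived from the expectations array rather than written by hand, which avoids counts that disagree with the evidence. This is a sketch under one assumption: pass rates are rounded to two decimals, as the 0.67 value in the benchmark example suggests. `summarize` is a hypothetical name.

```python
def summarize(expectations: list[dict]) -> dict:
    """Compute the grading.json summary from graded expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: two-decimal rounding; an empty list yields 0.0.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```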
For trigger-accuracy runs, save a separate trigger_grading.json with tp, fp, tn, fn tallies at the eval level.
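The four tallies follow the usual confusion-matrix definitions. The sketch below assumes each dispatcher answer has already been reduced to a pair of booleans: whether the prompt should trigger the skill, and whether the dispatcher said yes. `tally_triggers` is a hypothetical name.

```python
def tally_triggers(results: list[tuple[bool, bool]]) -> dict:
    """Count tp/fp/tn/fn from (should_trigger, did_trigger) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for should_trigger, did_trigger in results:
        if did_trigger:
            # Dispatcher said yes: correct only if the prompt should trigger.
            counts["tp" if should_trigger else "fp"] += 1
        else:
            # Dispatcher said no: a miss if the prompt should have triggered.
            counts["fn" if should_trigger else "tn"] += 1
    return counts
```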
Write benchmark.json:
{
  "metadata": {
    "skill_name": "my-skill",
    "mode": "baseline",
    "model": "current-model-id",
    "timestamp": "2026-04-11T05:30:00+08:00",
    "evals_run": [1, 2, 3],
    "arms": ["with_skill", "without_skill"]
  },
  "evals": [
    {
      "eval_id": 1,
      "eval_name": "happy-path-basic",
      "arms": {
        "with_skill": {
          "pass_rate": 1.0,
          "passed": 3,
          "total": 3,
          "tokens": 12345,
          "duration_seconds": 17.6
        },
        "without_skill": {
          "pass_rate": 0.67,
          "passed": 2,
          "total": 3,
          "tokens": 18590,
          "duration_seconds": 29.0
        }
      }
    }
  ],
  "totals": {
    "with_skill": { "pass_rate": 0.85, "passed": 17, "total": 20 },
    "without_skill": { "pass_rate": 0.45, "passed": 9, "total": 20 }
  },
  "delta": {
    "pass_rate": "+0.40"
  },
  "notes": [
    "Arm with_skill consistently better on safety assertions",
    "eval-3 edge-case shows no difference; consider strengthening skill"
  ]
}
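The totals and delta fields can be computed from the per-eval entries. The sketch below assumes the totals pool passed and total counts across evals (consistent with the sample, where 17/20 gives 0.85 and 9/20 gives 0.45) and that the delta is the first arm minus the second, formatted as a signed two-decimal string. `aggregate` is a hypothetical name.

```python
def aggregate(evals: list[dict], arm_a: str, arm_b: str) -> dict:
    """Compute benchmark.json totals and delta for two arms."""
    totals = {}
    for arm in (arm_a, arm_b):
        # Pool raw counts rather than averaging per-eval pass rates.
        passed = sum(entry["arms"][arm]["passed"] for entry in evals)
        total = sum(entry["arms"][arm]["total"] for entry in evals)
        totals[arm] = {
            "pass_rate": round(passed / total, 2) if total else 0.0,
            "passed": passed,
            "total": total,
        }
    delta = totals[arm_a]["pass_rate"] - totals[arm_b]["pass_rate"]
    # The "+.2f" format keeps the explicit sign shown in the example.
    return {"totals": totals, "delta": {"pass_rate": f"{delta:+.2f}"}}
```

Pooling counts weights each assertion equally; averaging per-eval rates would instead weight each eval equally, so the choice should be stated in benchmark.md if it matters for the comparison.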
Append a compact line to history.jsonl for regression tracking.
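The source does not define the shape of a history line, so the entry below is purely an assumed layout: iteration number, timestamp, mode, per-arm pass rates, and delta. `append_history` and every key in `entry` are hypothetical; the sketch only illustrates the append-one-JSON-object-per-line pattern.

```python
import json
from pathlib import Path


def append_history(history_path: Path, benchmark: dict, iteration: int) -> dict:
    """Append one compact JSON line summarizing a benchmark run."""
    meta = benchmark["metadata"]
    # Assumed entry layout; adjust keys to whatever trend view is needed.
    entry = {
        "iteration": iteration,
        "timestamp": meta["timestamp"],
        "mode": meta["mode"],
        "pass_rates": {arm: t["pass_rate"] for arm, t in benchmark["totals"].items()},
        "delta": benchmark["delta"]["pass_rate"],
    }
    # Mode "a" appends, so earlier iterations are never overwritten.
    with history_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, separators=(",", ":")) + "\n")
    return entry
```

Keeping one object per line means a regression check can read the file line by line and compare the latest pass rates with the previous entry.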
Then write benchmark.md with:
Present the summary to the user directly in chat.
iteration-(N+1)/
history.jsonl entries for trend
cp -r). Use previous version as baseline.
sessions_spawn with model override.
exec for deterministic results unless script invokes LLM. Check happy-path AND error handling.
--dry and --smoke: offer preview / smoke-test paths to improve UX and reduce wasted tokens