AB Test Eval

Dev Tools

Run A/B evaluation tests for any OpenClaw skill, script, hook, or cron job. Make sure to use this skill whenever the user mentions testing, benchmarking, comparing, or evaluating a skill, script, hook, or cron job — even if they don't explicitly ask for 'AB testing'. Supports 10 eval modes: baseline comparison, regression testing, model-swap, prompt variants, trigger accuracy, adversarial robustness, script validation, hook dry-run, cron dry-run, and integration testing.

Install

openclaw skills install ab-test-eval

AB Test Eval — Automated Component Benchmarking via Subagents

Evaluate any OpenClaw component (skill, script, hook, cron job) by spawning parallel subagents and comparing arms. Supports multiple eval modes, auto-grading, and regression tracking.

Step 1: Choose the Eval Mode

Pick the mode that matches the user's intent:

Mode	Question	Arms
baseline	Does the skill help at all?	with-skill vs without-skill
regression	Did changes break anything?	skill-v2 vs skill-v1
model-swap	Works on another model?	model-A vs model-B
prompt-variant	Which description works better?	variant-A vs variant-B
trigger-accuracy	Dispatches correctly?	should-trigger vs should-not
adversarial	Robust against bad inputs?	clean vs perturbed
script-test	Script produces correct output?	script-A vs script-B
hook-dryrun	Hook responds correctly?	with-hook vs without
cron-dryrun	Cron payload does the right thing?	cron-run vs baseline
integration	Full stack works together?	full vs missing-component

Default to baseline if unclear.

Step 2: Prepare Directory Structure

Create the eval workspace as a sibling to the skill directory:

<skill-dir>/evals/evals.json
<skill-dir>/<skill-name>-workspace/
  iteration-1/
    <eval-name>/
      <arm-a>/
        outputs/commands.md
        timing.json
        grading.json
      <arm-b>/
        outputs/commands.md
        timing.json
        grading.json
      eval_metadata.json
    benchmark.json
    benchmark.md
  iteration-2/
    ...
  history.jsonl

Create directories with mkdir -p. Use descriptive arm names (e.g. with_skill, without_skill, new_version, old_version).

Step 3: Define or Generate Evals

If evals already exist

Read <skill-dir>/evals/evals.json and present the cases to the user for confirmation before running. Do not auto-run without sign-off.

If evals are missing

Generate them by reading the skill's SKILL.md and creating 4-6 realistic eval cases:

Happy path — clear request the skill should nail
Ambiguous request — could go multiple ways
Edge case — unusual params or corner case
Negative case — similar but should NOT trigger this skill
Multi-step case — complex multi-tool request
Adversarial case (if mode=adversarial) — misleading / typo / injected junk

Write to <skill-dir>/evals/evals.json:

{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "Realistic user request",
      "expected_output": "What correct behavior looks like",
      "files": []
    }
  ]
}

Then show them to the user: "Here are the test cases I plan to run. Do these look right, or do you want to add more?"

Wait for approval before spawning subagents.

Step 4: Efficiency Controls — Dry-Run Preview & Smoke Test

Before spawning expensive subagents, offer the user two efficiency controls (especially useful when eval count > 3 or arms > 2).

`--dry` Preview

Generate a preview report that lists exactly what will run, without spawning any subagents:

# Eval Preview Report
- Mode: baseline
- Evals: 4
- Arms per eval: 2 (with-skill, without-skill)
- Model: current
- Estimated subagent calls: 8

Evals:
1. happy-path-basic — 2 arms, 3 assertions
2. ambiguous-request — 2 arms, 3 assertions

Present this to the user and ask: "This looks like X evals across Y arms. Should I proceed, or do you want to trim the list?"

`--smoke` Smoke Test

If the user wants a quick confidence check, run only the first eval end-to-end (all arms + grading). This verifies the pipeline works before committing to the full run.

After a successful smoke test, ask: "Smoke test passed. Should I run the remaining N evals now?"

Step 5: Write Assertions

While waiting for user approval (or while subagents run), draft assertions in eval_metadata.json for each eval.

Save to <workspace>/iteration-N/<eval-name>/eval_metadata.json:

{
  "eval_id": 1,
  "eval_name": "happy-path-basic",
  "prompt": "The user's task prompt",
  "assertions": [
    {
      "text": "Uses the --force flag",
      "expected": true
    },
    {
      "text": "Warns about OAuth timeout gotcha",
      "expected": true
    }
  ]
}

Assertions use text and expected fields. These are the basis for grading.

Step 6: Spawn Subagents in Parallel

For each eval, spawn all arms in the same turn. Launch as many as the environment allows concurrently.

Baseline mode

with_skill: load SKILL.md, execute prompt, save outputs
without_skill: same prompt, no skill, save outputs

Regression mode

new_skill: load updated SKILL.md
old_skill: load a snapshot of the previous version (make a cp -r snapshot before editing)

Model-swap mode

model-a: run with skill + model A override
model-b: run with skill + model B override

Prompt-variant mode

variant-a: load skill variant A's SKILL.md
variant-b: load skill variant B's SKILL.md

Trigger-accuracy mode

Each prompt gets ONE subagent tasked as the dispatcher:

"You are the dispatcher. Given this user prompt, would you load <skill-path>/SKILL.md before responding? Answer yes/no and explain why." Save yes/no explanations, then grade TP/FP/TN/FN.

Adversarial mode

clean: normal prompt + skill
perturbed: prompt with typos / injected irrelevance / misleading framing + skill

Script-test mode

Run the bundled script with controlled inputs and assert on stdout, exit code, and generated files.
Arms can be: current-script vs previous-script, or script-with-skill-guidance vs naive-approach.
Assertions focus on correctness, idempotency, and edge-case handling.

Hook-dryrun mode

Simulate a hook event by spawning a subagent and telling it: "Pretend you are an OpenClaw agent receiving a <hook-type> event with this payload. Given this hook's SKILL.md or config, what would you do?"
Do NOT modify actual system hook registrations. This is a read-only simulation.

Cron-dryrun mode

Extract the cron job's payload (task command or script path from jobs.json or cron config).
Run the payload in an isolated subagent or exec dry-run context.
Assert on expected side effects, file outputs, or command sequence.
Also verify the cron expression is valid and produces expected schedule times.

Integration mode

Test the full stack: user prompt → skill dispatch → script execution → hook response.
Arms: full-stack vs missing-script vs missing-hook vs skill-only.

Task template for standard arms:

Execute this task:
- Arm: <arm-name>
- Skill path: <absolute-path> or "none"
- Model override: <model> or "default"
- Task: <eval prompt>
- Input files: <files or "none">
- Save outputs to: <workspace>/iteration-N/<eval-name>/<arm>/outputs/commands.md
- Execute the task using available tools — if the subagent has tool access, run commands for real; if not, document what would be done.

Step 7: Capture Timing from Notifications

When each subagent completes, its notification includes total_tokens and duration_ms. This is the only chance to capture it.

Save to <arm>/timing.json:

{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}

Process each notification as it arrives rather than batching.

Step 8: Auto-Grade with LLM-as-Judge

Spawn a grading subagent per eval to compare all arms against the assertions:

Read the following files:
- <workspace>/iteration-N/<eval-name>/<arm-a>/outputs/commands.md
- <workspace>/iteration-N/<eval-name>/<arm-b>/outputs/commands.md

Eval prompt: <prompt>
Expected output: <expected_output>

Grade each arm against these assertions:
<assertions from eval_metadata.json>

For each arm, save a separate grading.json with:
- A top-level "expectations" array with text/passed/evidence
- A "summary" with passed/failed/total/pass_rate

Save arm A results to: <workspace>/iteration-N/<eval-name>/<arm-a>/grading.json
Save arm B results to: <workspace>/iteration-N/<eval-name>/<arm-b>/grading.json

Each grading.json schema:

{
  "expectations": [
    {
      "text": "Uses the --force flag",
      "passed": true,
      "evidence": "Output contains 'clawhub update --force'"
    }
  ],
  "summary": {
    "passed": 3,
    "failed": 0,
    "total": 3,
    "pass_rate": 1.0
  }
}

For trigger-accuracy runs, save a separate trigger_grading.json with tp, fp, tn, fn tallies at the eval level.

Step 9: Aggregate and Generate Report

Write benchmark.json:

{
  "metadata": {
    "skill_name": "my-skill",
    "mode": "baseline",
    "model": "current-model-id",
    "timestamp": "2026-04-11T05:30:00+08:00",
    "evals_run": [1, 2, 3],
    "arms": ["with_skill", "without_skill"]
  },
  "evals": [
    {
      "eval_id": 1,
      "eval_name": "happy-path-basic",
      "arms": {
        "with_skill": {
          "pass_rate": 1.0,
          "passed": 3,
          "total": 3,
          "tokens": 12345,
          "duration_seconds": 17.6
        },
        "without_skill": {
          "pass_rate": 0.67,
          "passed": 2,
          "total": 3,
          "tokens": 18590,
          "duration_seconds": 29.0
        }
      }
    }
  ],
  "totals": {
    "with_skill": { "pass_rate": 0.85, "passed": 17, "total": 20 },
    "without_skill": { "pass_rate": 0.45, "passed": 9, "total": 20 }
  },
  "delta": {
    "pass_rate": "+0.40"
  },
  "notes": [
    "Arm with_skill consistently better on safety assertions",
    "eval-3 edge-case shows no difference — consider strengthening skill"
  ]
}

Append a compact line to history.jsonl for regression tracking.

Then write benchmark.md with:

Executive summary (delta, winner, biggest weaknesses)
Per-eval breakdown table
Notable failures with quotes
Recommendations for improving the skill

Present the summary to the user directly in chat.

Step 10: Iterate Based on Feedback

Discuss results with the user
Improve the skill based on failed assertions
Rerun into iteration-(N+1)/
Compare history.jsonl entries for trend
Repeat until satisfied

Mode-Specific Notes

Regression: Snapshot old skill before editing (cp -r). Use previous version as baseline.
Model-swap: Use sessions_spawn with model override.
Prompt-variant: Create two temp skill copies with different descriptions.
Trigger-accuracy: Generate 10 queries (5 should-trigger, 5 near-miss should-not). Grade precision/recall/F1.
Adversarial: Perturbations include typos, irrelevant context injection, misleading framing. Report degradation score = clean_avg - perturbed_avg.
Script-test: Run via exec for deterministic results unless script invokes LLM. Check happy-path AND error handling.
Hook-dryrun: Simulate event via subagent with exact payload JSON. Do NOT modify actual hook registrations.
Cron-dryrun: Validate cron expression and list next N execution times. If payload sends messages, use dry-run constraint.
Integration: For missing-component arms, tell subagent: "You do NOT have access to <component X>."

Hard Constraints

Do not auto-run evals without user sign-off — present evals and wait for approval before spawning
Respect --dry and --smoke — offer preview / smoke-test paths to improve UX and reduce wasted tokens