Install
openclaw skills install skillprobe
A/B evaluates any AI agent skill's real impact through three-role isolation (orchestrator + two sub-agents). It generates skill profiles and synthetic test tasks, runs a baseline vs. with-skill comparison, performs attribution analysis, and produces structured reports. Use it when deciding whether to install a skill, comparing skill versions, investigating performance changes after adding a skill, optimizing an existing skill, or building a skill quality leaderboard.
A/B evaluate whether a skill actually helps or just adds complexity.
Runs inside the current agent runtime (Cursor, OpenClaw, Claude Code). No extra API key required.
Copy this checklist and track progress:
Evaluation Progress:
- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design eval plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + boundary + adversarial)
- [ ] Step 4: Dispatch baseline to Sub-Agent A (no skill content!)
- [ ] Step 5: Dispatch with-skill to Sub-Agent B (include full skill)
- [ ] Step 6: Score both runs (rule + result + optional LLM judge)
- [ ] Step 7: Attribute differences and generate report
Steps 1-3 and 6-7: You (orchestrator) do these. Steps 4-5: Dispatch to isolated sub-agents. NEVER execute tasks yourself.
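The isolation in Steps 4-5 can be sketched as follows. This is a minimal illustration, not the template from DISPATCH_PROTOCOL.md: the `Dispatch` class, the `build_dispatches` helper, the prompt layout, and the sample task and skill text are all hypothetical. The only properties taken from the checklist are that the baseline prompt carries no skill content and the with-skill prompt carries the full skill.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispatch:
    """One sub-agent assignment (hypothetical structure, for illustration only)."""
    session_id: str
    role: str    # "baseline" or "with_skill"
    prompt: str


def build_dispatches(task: str, skill_text: str) -> tuple:
    """Build the Step 4 and Step 5 assignments for a single task."""
    # Sub-Agent A receives only the task, so no skill content reaches the baseline.
    baseline = Dispatch(session_id=uuid.uuid4().hex, role="baseline", prompt=task)
    # Sub-Agent B receives the full skill text followed by the same task.
    with_skill = Dispatch(
        session_id=uuid.uuid4().hex,
        role="with_skill",
        prompt=f"{skill_text}\n\n{task}",
    )
    return baseline, with_skill


# Hypothetical task and skill text.
a, b = build_dispatches("Summarize this changelog.", "SKILL: always cite line numbers.")
print(a.role, "sees skill:", "SKILL" in a.prompt)
print(b.role, "sees skill:", "SKILL" in b.prompt)
```

Building both prompts from the same `task` string keeps the skill content as the only difference between the two runs, which is what makes the later attribution step meaningful.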
Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.
Key rules:
session_id for each sub-agent
Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.
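As a rough illustration of how six dimension scores might roll up into a 100-point total, here is a weighted-sum sketch. The dimension names and weights are placeholders invented for this example; the real dimensions, weights, and thresholds are defined in SCORING_REFERENCE.md, and the actual aggregation may differ.

```python
# Placeholder dimension names and weights (assumptions, not from SCORING_REFERENCE.md).
WEIGHTS = {
    "dimension_1": 0.25,
    "dimension_2": 0.20,
    "dimension_3": 0.20,
    "dimension_4": 0.15,
    "dimension_5": 0.10,
    "dimension_6": 0.10,
}


def weighted_score(scores: dict) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 total."""
    if scores.keys() != WEIGHTS.keys():
        raise ValueError("scores must cover exactly the six dimensions")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def skill_delta(baseline: dict, with_skill: dict) -> float:
    """Positive when the with-skill run outscored the baseline run."""
    return weighted_score(with_skill) - weighted_score(baseline)
```

Because the weights sum to 1, the total stays on the same 0-100 scale as the individual dimensions, and `skill_delta` reads directly as points gained or lost by adding the skill.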
Inconclusive only after real attempted execution.
For local runs outside an agent:
skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db
Add --llm-judge [--judge-model <model>] for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
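For scripted runs, the command line above can be assembled programmatically. The sketch below uses only the subcommand and flags shown in this section; the `build_command` helper, its defaults, and the `skills/my-skill` path are hypothetical. The check that `--judge-model` only makes sense alongside `--llm-judge` is my reading of the bracketed syntax, not documented behavior.

```python
import shlex
from typing import List, Optional


def build_command(
    skill_path: str,
    tasks: int = 30,
    repeats: int = 2,
    db: str = "outputs/evaluations.db",
    llm_judge: bool = False,
    judge_model: Optional[str] = None,
) -> List[str]:
    """Return an argv list for `skillprobe evaluate` (hypothetical helper)."""
    argv = [
        "skillprobe", "evaluate", skill_path,
        "--tasks", str(tasks),
        "--repeats", str(repeats),
        "--db", db,
    ]
    if llm_judge:
        argv.append("--llm-judge")
        if judge_model is not None:
            argv += ["--judge-model", judge_model]
    elif judge_model is not None:
        # Assumption: a judge model is meaningless without pairwise judging enabled.
        raise ValueError("judge_model requires llm_judge=True")
    return argv


# Hypothetical skill path; prints the shell-quoted command without running it.
print(shlex.join(build_command("skills/my-skill", llm_judge=True)))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the CLI is installed; building a list rather than a string avoids shell-quoting problems with unusual skill paths.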
Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry.