Ab Test Eval

v1.0.2

Run A/B evaluation tests for any skill using subagents. Use when the user wants to test, benchmark, or compare a skill's effectiveness — e.g. 'test this skil...

by Siyuan Huang (@cyrushuang1995-cmyk)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
VirusTotal: Pending
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (A/B evaluation of a skill) match the instructions: the runtime steps spawn paired subagents, compare their outputs, grade assertions, and write benchmark artifacts. The only external requirement is jq, which is plausible for JSON handling in this workflow.
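As a rough illustration of the "compare outputs, grade assertions" step described above, the following is a minimal sketch and not the skill's actual implementation. The names BranchResult, grade, and compare are hypothetical, and the rule that an assertion passes when its text appears in the output is an assumption made only for this example; the real grading criteria are defined in the skill's SKILL.md.

```python
from dataclasses import dataclass


@dataclass
class BranchResult:
    """Output from one branch of a paired run (hypothetical structure)."""
    name: str    # e.g. "with_skill" or "baseline"
    output: str  # text the subagent produced


def grade(result: BranchResult, assertions: list[str]) -> dict:
    """Grade one branch. Assumption: an assertion passes if its text
    appears (case-insensitively) in the branch output."""
    passed = [a for a in assertions if a.lower() in result.output.lower()]
    total = len(assertions)
    return {
        "branch": result.name,
        "passed": len(passed),
        "total": total,
        "pass_rate": len(passed) / total if total else 0.0,
    }


def compare(with_skill: BranchResult, baseline: BranchResult,
            assertions: list[str]) -> dict:
    """Grade both branches against the same assertions and report the
    difference in pass rate (with-skill minus baseline)."""
    a = grade(with_skill, assertions)
    b = grade(baseline, assertions)
    return {"with_skill": a, "baseline": b,
            "delta": a["pass_rate"] - b["pass_rate"]}


report = compare(
    BranchResult("with_skill", "Use jq to filter the JSON, then write results."),
    BranchResult("baseline", "Write results."),
    ["jq", "write results"],
)
print(report["delta"])  # 0.5
```

A positive delta here would suggest the with-skill branch satisfied more assertions than the baseline; a real evaluation would likely need richer checks than substring matching.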
Instruction Scope
SKILL.md explicitly defines the files, workspace layout, and exactly what subagents should do (read skill file for the with-skill branch, do not read it for the baseline). It instructs subagents to document commands rather than execute them. The instructions do not request unrelated files, system credentials, or external endpoints.
Install Mechanism
There is no install spec and no code to write to disk (instruction-only skill). This minimizes risk and is proportionate to the described functionality.
Credentials
The skill declares no environment variables, no credentials, and no config paths. That aligns with an instruction-only benchmarking tool which only needs filesystem write/read for workspace artifacts.
Persistence & Privilege
The skill sets always:false and makes no request to modify other skills or system-wide settings. Autonomous invocation is allowed (the platform default), but the skill does not request elevated or persistent privileges.
Assessment
This skill is coherent for benchmarking other skills, but take these precautions before using it: (1) Only point the evaluator at SKILL.md files you trust — the with-skill subagent will read that file and could surface any sensitive content it contains. (2) Keep eval workspaces on non-production hosts and avoid placing secrets in eval files. (3) Although the instructions say 'do not run commands', a misbehaving subagent could propose commands; review outputs before running anything. (4) Ensure jq is present on the runner if you rely on the provided tooling. If you plan to test third‑party skills, inspect their SKILL.md first and run the evals in an isolated environment.
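Precaution (4) can be checked before starting an eval run. The sketch below is one possible preflight check, not part of the skill itself; the function name missing_binaries is hypothetical. It relies only on shutil.which from the Python standard library, which returns None when a name is not found on PATH.

```python
import shutil


def missing_binaries(required: list[str]) -> list[str]:
    """Return the required binaries that cannot be found on PATH."""
    return [name for name in required if shutil.which(name) is None]


if __name__ == "__main__":
    # jq is the only runtime requirement this listing declares.
    missing = missing_binaries(["jq"])
    if missing:
        print("Missing required binaries: " + ", ".join(missing))
    else:
        print("All required binaries found.")
```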


Latest: vk97159jycc99wwgcn29k95t3s183ybsw


Runtime requirements

Bins: jq
