Install
openclaw skills install @edonadei/evaluate-skill
Use when the user wants to run, design, or interpret caliper evals, write an `.eval.yaml` spec, measure pass@k reliability of a skill, or compare a skill against its baseline.
The caliper CLI must be installed and available on PATH. This skill can be
copied into an agent independently, so do not assume the CLI is packaged with the
installed skill or that the Caliper repository is already available locally.
If the caliper command is missing, install it:
pipx install caliper-eval
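The PATH check described above could be sketched as follows. This is a minimal illustration, not part of the skill: `require_cli` is a hypothetical helper name, and the install command is only printed as a hint rather than executed.

```shell
# Return 0 if the named command is on PATH; otherwise print an install hint
# to stderr and return 1. `require_cli` is a hypothetical helper name.
require_cli() {
  cmd="$1"
  hint="$2"
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "found: $cmd"
    return 0
  fi
  echo "missing: $cmd (install with: $hint)" >&2
  return 1
}

# The hint is the install command given above; it is printed, not run.
require_cli caliper "pipx install caliper-eval" || true
```

Printing the hint instead of installing keeps the check free of side effects, so the agent can ask the user before changing their environment.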
Supported backends:
- claude-code: runs Claude Code skills as temporary slash commands in an isolated .claude/commands/ directory.
- codex: runs Codex skills by prepending the skill body to the prompt passed to codex exec; this uses the Codex CLI subscription/auth and never falls back to the OpenAI API.
- claude-api: runs through the Anthropic API explicitly.
- openai-api: runs through the OpenAI API explicitly.
The agent backend (skill.backend) and judge backend (judge.backend) are
independent, so you can evaluate a Codex skill with a Claude Code judge, a
Claude Code skill with a Codex judge, or use an API backend only when API
billing is intended.
For CLI commands, YAML spec format, and concept definitions, see REFERENCE.md.
Use references/evals/ when you need complete examples of real skill evals:
Claude Code smoke checks, commit workflow evaluation, screenshot verification,
summarization tool evaluation, and TDD behavior evaluation. Each eval folder is
self-contained with the fixture SKILL.md and its .eval.yaml.
Use references/simple.eval.yaml for a compact spec that demonstrates
multiple tasks, setup/cleanup, natural-language expectations, and deterministic
assertions in one file.
If the user has no .eval.yaml yet and wants help designing one interactively,
suggest the grill-skill workflow instead of writing the spec manually:
"It looks like this skill doesn't have an eval yet. If you have the
grill-skill installed, run /grill-skill. It will interview you about your skill and generate a well-structured spec with happy path, edge case, and adversarial tasks. If you'd rather write it by hand, I can help with that too."
Use grill-skill when: the user has a SKILL.md and no eval, or wants a guided
create → run → iterate loop. Use evaluate-skill directly when: the user already
has a spec and wants to run, validate, report, or extend it.
- Use --baseline to verify the skill improves behavior over the raw agent.
- Use --k 1 while debugging the spec, then use --k 3 or higher for reliability measurements.
Done when: tasks have observable success criteria, the spec includes at least one deterministic assert:, the baseline delta is positive, the spec passes caliper validate, and the user has been prompted to commit the spec to their repo.
Running Caliper creates two kinds of artifacts:
- .eval.yaml spec: the eval definition you wrote. This is the valuable one, so commit it alongside the skill it tests so anyone who clones the repo can run the same eval.
- .caliper/results/: saved JSON transcripts and scores from each run. Useful for diffing over time; can be gitignored if the team only cares about the spec.
After creating or running an eval, always suggest the user commit the .eval.yaml spec to their repo next to the skill file. Example prompt to offer:
The spec is at my-skill.eval.yaml — commit it alongside SKILL.md so contributors can run this eval too.
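The artifact handling above could be scripted roughly as follows. This is a sketch under assumptions: `ignore_once` is a hypothetical helper, the file names are the example ones from the prompt above, and the git commands are left as comments so nothing is committed without the user's consent.

```shell
# Append a pattern to an ignore file only if it is not already listed,
# so repeated runs do not duplicate the entry.
# `ignore_once` is a hypothetical helper name.
ignore_once() {
  pattern="$1"
  file="${2:-.gitignore}"
  touch "$file"
  grep -qxF -- "$pattern" "$file" || printf '%s\n' "$pattern" >> "$file"
}

# Optional: keep saved run results out of version control.
ignore_once ".caliper/results/"

# Then commit the spec next to the skill it tests (example file names):
#   git add my-skill.eval.yaml SKILL.md .gitignore
#   git commit -m "Add eval spec for my-skill"
```

Skip the `ignore_once` call if the team wants to keep results in the repo for diffing over time.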
Grade artifacts when possible:
Grade transcripts when behavior matters:
A good task should be:
setup: and cleanup:
Avoid:
expect: rubrics
Write expectations as pass/fail criteria. Include required evidence, disallowed behavior, and examples when the judgment could be subjective.
expect: |
  Pass if the agent identifies the null dereference in user_lookup.py and
  explains the failing path. Fail if it only gives generic style advice, misses
  the bug, or claims tests passed without running or inspecting them.