Skill Eval Preflight

Validate OpenClaw skills during authoring. Use when creating, revising, or preparing a skill for release and you need to scaffold `evals/` files, check readiness for a first eval pass, review whether the frontmatter description has clear trigger coverage, or generate static comparison artifacts before deeper runtime evaluation.

Install

openclaw skills install skill-eval-preflight

Skill Eval

Use this skill as an authoring-side preflight for OpenClaw skills.

It is not a full runtime evaluator. It helps a skill author move from "this skill exists" to "this skill is structured well enough for first-pass evaluation and later regression work."

Good Requests

This skill is a good fit for requests like:

"Set up eval files for this skill before I publish it."
"Check whether this skill is ready for a first eval pass."
"Review the description and tell me whether trigger coverage is clear enough."
"Generate with-skill and without-skill static comparison artifacts for this skill."

Not A Good Fit

Do not rely on this skill alone for requests like:

large-scale live runtime benchmarking
scoring response quality across many real conversations
tool-call correctness or factuality audits
end-to-end production regression testing

Use a deeper evaluator after this step when you need those capabilities.

Best Fit

Use this skill when you need to:

initialize evals/ files for a new or existing skill
confirm a skill is ready for a first eval pass
make positive and negative trigger coverage explicit
catch placeholder content before sharing a skill
write static run summaries and with-skill/without-skill comparison artifacts

Use a deeper evaluator after this step when you need live runtime experiments, tool-call quality checks, or richer output scoring.

Position In The Flow

Recommended sequence:

skill-vetter -> install/review -> skill-eval -> deeper runtime eval

skill-vetter answers: "Is this skill safe enough to inspect or install?"
skill-eval answers: "Is this skill structured well enough to evaluate seriously?"
a deeper evaluator answers: "How well does the skill perform in practice?"

Workflow

1. Confirm the target folder is a skill directory with SKILL.md.
2. If the skill came from another repo or another person, do a safety review first.
3. If evals/ does not exist, initialize it with:
   - evals/evals.json
   - evals/triggers.json
   - evals/README.md
4. Replace placeholder prompts with realistic authoring examples.
5. Run the readiness check before any deeper benchmarking.
6. If readiness fails, fix the missing pieces first instead of forcing a run.
7. Generate static run artifacts only after the inputs are usable.
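
The initialization step can be pictured with a minimal Python sketch. This is not the code in scripts/init_eval.py. The function name scaffold_evals, the PLACEHOLDER marker, and the JSON layouts are assumptions made for illustration; the real formats are described in references/eval_format.md.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical marker for content the author still has to replace.
PLACEHOLDER = "TODO: replace with a realistic prompt"


def scaffold_evals(skill_dir: Path) -> list[Path]:
    """Create any missing evals/ files and return the ones that were written."""
    if not (skill_dir / "SKILL.md").is_file():
        raise FileNotFoundError(f"{skill_dir} has no SKILL.md; not a skill directory")

    evals_dir = skill_dir / "evals"
    evals_dir.mkdir(exist_ok=True)

    # Assumed file layouts, for illustration only.
    templates = {
        "evals.json": json.dumps(
            {"evals": [{"prompt": PLACEHOLDER, "expected_artifacts": []}]}, indent=2
        ),
        "triggers.json": json.dumps(
            {"positive": [PLACEHOLDER], "negative": [PLACEHOLDER]}, indent=2
        ),
        "README.md": "Evals\n\nReplace placeholder prompts before running checks.\n",
    }

    created = []
    for name, body in templates.items():
        target = evals_dir / name
        if not target.exists():  # never overwrite an author's existing work
            target.write_text(body, encoding="utf-8")
            created.append(target)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp)
        (skill / "SKILL.md").write_text("demo skill\n", encoding="utf-8")
        for path in scaffold_evals(skill):
            print(path.relative_to(skill))
```

Skipping files that already exist keeps the sketch safe to rerun on a skill that is only partly scaffolded.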

Scripts

Initialize eval files:

python3 scripts/init_eval.py /path/to/skill

Check readiness:

python3 scripts/check_eval_readiness.py /path/to/skill

Run static eval checks, optionally limited to one check with --check:

python3 scripts/run_eval.py /path/to/skill
python3 scripts/run_eval.py /path/to/skill --check readiness
python3 scripts/run_eval.py /path/to/skill --check triggers
python3 scripts/run_eval.py /path/to/skill --check artifacts
python3 scripts/run_eval.py /path/to/skill --check files

Write mode-specific run summaries:

python3 scripts/run_eval.py /path/to/skill --mode with-skill
python3 scripts/run_eval.py /path/to/skill --mode without-skill --run-group demo-baseline

Compare the two modes within a run-group:

python3 scripts/compare_runs.py /path/to/skill --run-group demo-baseline
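
A full with-skill/without-skill comparison can be scripted. The sketch below only builds the command lines from the flags shown above; the helper name comparison_commands is hypothetical. It assumes that --run-group can also be combined with --mode with-skill so that both runs land in the same group, which the examples above do not show explicitly.

```python
import shlex


def comparison_commands(skill_path: str, run_group: str) -> list[list[str]]:
    """Build argv lists for a readiness check, both static runs, and the comparison."""
    py = "python3"
    return [
        [py, "scripts/check_eval_readiness.py", skill_path],
        # Assumption: --run-group is accepted together with --mode with-skill.
        [py, "scripts/run_eval.py", skill_path,
         "--mode", "with-skill", "--run-group", run_group],
        [py, "scripts/run_eval.py", skill_path,
         "--mode", "without-skill", "--run-group", run_group],
        [py, "scripts/compare_runs.py", skill_path, "--run-group", run_group],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for argv in comparison_commands("/path/to/skill", "demo-baseline"):
        print(shlex.join(argv))
```

Each list could be passed to subprocess.run(argv, check=True) to stop at the first failing step, assuming the scripts exit with a non-zero status on failure.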

Readiness Rules

A skill is ready for first-pass evaluation only when:

SKILL.md exists
the frontmatter description is real and not a placeholder
evals/evals.json has at least one non-placeholder eval case
evals/triggers.json has at least one positive and one negative non-placeholder trigger case
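
The four rules above can be sketched as a small Python check. This is an illustration, not the code in scripts/check_eval_readiness.py. The placeholder markers, the "evals", "prompt", "positive", and "negative" keys, and the single-line description parsing are all assumptions; see references/eval_format.md for the actual formats.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical markers that identify unfinished text.
PLACEHOLDER_MARKERS = ("todo", "placeholder", "replace me")


def is_placeholder(text: object) -> bool:
    """Treat non-strings, empty strings, and marker text as placeholders."""
    if not isinstance(text, str) or not text.strip():
        return True
    lowered = text.lower()
    return any(marker in lowered for marker in PLACEHOLDER_MARKERS)


def frontmatter_description(skill_md: str) -> str:
    """Pull a single-line description value out of leading --- frontmatter."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line.startswith("description:"):
            return line.split(":", 1)[1].strip().strip("\"'")
    return ""


def load_json(path: Path) -> dict:
    """Return the parsed JSON object, or {} if the file is missing or invalid."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def as_list(value: object) -> list:
    return value if isinstance(value, list) else []


def readiness_problems(skill_dir: Path) -> list[str]:
    """Return one message per failed readiness rule; an empty list means ready."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return ["SKILL.md is missing"]

    problems = []
    if is_placeholder(frontmatter_description(skill_md.read_text(encoding="utf-8"))):
        problems.append("frontmatter description is missing or a placeholder")

    # Assumed layout: {"evals": [{"prompt": "..."}]}
    cases = as_list(load_json(skill_dir / "evals" / "evals.json").get("evals"))
    if not any(isinstance(c, dict) and not is_placeholder(c.get("prompt")) for c in cases):
        problems.append("evals.json has no non-placeholder eval case")

    # Assumed layout: {"positive": ["..."], "negative": ["..."]}
    triggers = load_json(skill_dir / "evals" / "triggers.json")
    for polarity in ("positive", "negative"):
        if not any(not is_placeholder(t) for t in as_list(triggers.get(polarity))):
            problems.append(f"triggers.json has no non-placeholder {polarity} case")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # An empty folder is not a skill directory, so one problem is reported.
        print(readiness_problems(Path(tmp)))
```

Returning a list of messages rather than a single boolean lets an author see every missing piece at once and fix them before rerunning.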

What This Skill Checks Well

missing or empty eval scaffolding
placeholder prompts that would make an eval meaningless
missing positive/negative trigger coverage
empty or malformed expected_artifacts
malformed optional files declarations
static with-skill/without-skill run artifact organization

Current Limits

run_eval.py does not perform live trigger experiments against the OpenClaw runtime, and it does not score real outputs for quality, factuality, or tool correctness.

Today it performs static validation passes that:

verify trigger files exist
verify cases are non-placeholder
verify positive and negative sets are both populated
verify eval cases have usable expected_artifacts
verify declared files entries are well-formed
write mode-specific run summaries for later comparison
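
As an illustration of the expected_artifacts and files passes, here is a minimal Python sketch of a per-case check. The function name check_case and the assumed shapes (expected_artifacts as a non-empty list of strings, files as an optional list of path strings) are hypothetical and may differ from what run_eval.py actually enforces.

```python
def _all_nonempty_strings(items: list) -> bool:
    return all(isinstance(item, str) and item.strip() for item in items)


def check_case(case: dict) -> list[str]:
    """Return static validation errors for one eval case (empty list means OK)."""
    errors = []

    artifacts = case.get("expected_artifacts")
    if not isinstance(artifacts, list) or not artifacts:
        errors.append("expected_artifacts must be a non-empty list")
    elif not _all_nonempty_strings(artifacts):
        errors.append("expected_artifacts entries must be non-empty strings")

    # files is optional, so it is only validated when it is declared.
    if "files" in case:
        files = case["files"]
        if not isinstance(files, list) or not _all_nonempty_strings(files):
            errors.append("files must be a list of non-empty path strings")

    return errors


if __name__ == "__main__":
    print(check_case({"expected_artifacts": ["evals/README.md"]}))
    print(check_case({"expected_artifacts": [], "files": "notes.txt"}))
```

None of these checks executes the skill or inspects real output, which is why they count as static validation.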

Why Publish This Skill

This skill is for authors who do not yet need a full eval lab, but do need a clean starting point. It is most useful as a lightweight preflight and scaffolding step before deeper evaluation.

Release Readiness Checklist

Before calling a skill "ready for release," aim for all of the following:

the description names concrete trigger scenarios
positive and negative trigger cases both exist
placeholder content is gone
each eval case describes observable expected artifacts
static run summaries can be generated without errors

Compare Runs

Use compare_runs.py after both modes (with-skill and without-skill) have been run in the same run-group.

It compares:

overall pass/fail
per-check pass/fail
mode-specific errors
mode-specific notes

It writes comparison artifacts under the run-group root.
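
The comparison itself can be sketched in Python as follows. The summary layout used here (passed, checks, errors, and notes keys) and the function name compare_summaries are assumptions for illustration; the summaries that run_eval.py writes may be structured differently.

```python
def compare_summaries(with_skill: dict, without_skill: dict) -> dict:
    """Compare two run summaries from the same run-group, mode by mode."""
    modes = {"with-skill": with_skill, "without-skill": without_skill}

    # Union of check names, so a check present in only one mode still shows up.
    check_names = sorted(
        set(with_skill.get("checks", {})) | set(without_skill.get("checks", {}))
    )
    per_check = {
        name: {mode: summary.get("checks", {}).get(name) for mode, summary in modes.items()}
        for name in check_names
    }

    return {
        "overall": {mode: summary.get("passed") for mode, summary in modes.items()},
        "per_check": per_check,
        "differing_checks": [
            name for name, result in per_check.items()
            if result["with-skill"] != result["without-skill"]
        ],
        "errors": {mode: list(summary.get("errors", [])) for mode, summary in modes.items()},
        "notes": {mode: list(summary.get("notes", [])) for mode, summary in modes.items()},
    }


if __name__ == "__main__":
    a = {"passed": True, "checks": {"readiness": True, "triggers": True}, "errors": [], "notes": []}
    b = {"passed": False, "checks": {"readiness": True, "triggers": False},
         "errors": ["no negative triggers"], "notes": []}
    print(compare_summaries(a, b)["differing_checks"])
```

A check that is missing from one mode is reported as None for that mode, so it still appears among the differing checks instead of being silently dropped.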

References

Read references/eval_format.md when you need the expected file formats and field meanings.