Install
openclaw skills install skillopt
Train, evaluate, and improve agent skill files as reusable external capabilities. Use when a user wants to optimize SKILL.md, prompt procedures, OpenClaw/Hermes/Codex/Claude Code skills, agent workflows, skill factories, benchmark-driven skill iteration, rollout analysis, validation gates, best_skill.md export, or controlled self-evolving skills inspired by Microsoft SkillOpt.
Treat a skill document as trainable external state. Keep the target model, tools, and runtime fixed; optimize only the skill text through measured task rollouts, failure reflection, small edits, validation gating, and versioned export.
Default output is a deployable best_skill.md plus a short optimization report. Training may use many traces and candidate files; deployment should require only the final skill file.
Create a run directory near the skill being optimized unless the user specifies another path:
skillopt_runs/<target-skill-slug>/
  source_skill.md
  candidates/
    candidate_000.md
    candidate_001.md
  tasks/
    train.jsonl
    val.jsonl
  rollouts/
    train/
    val/
  rejected_edits.md
  best_skill.md
  report.md
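The layout above can be created by hand when the helper script is not available. The following is a minimal sketch, not the code of scripts/skillopt.py; the function name `init_run_dir` and its parameters are hypothetical, and only the directory and file names come from the layout itself.

```python
from pathlib import Path
import shutil

# Subdirectories taken from the run layout described above.
SUBDIRS = ("candidates", "tasks", "rollouts/train", "rollouts/val")

def init_run_dir(skill_path, runs_root, slug):
    """Create <runs_root>/<slug>/ and copy the unmodified skill to source_skill.md."""
    run_dir = Path(runs_root) / slug
    for sub in SUBDIRS:
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    shutil.copyfile(skill_path, run_dir / "source_skill.md")
    # Start an empty rejection log so later appends never hit a missing file.
    (run_dir / "rejected_edits.md").touch()
    return run_dir
```

Keeping source_skill.md as an untouched copy means every candidate can be diffed against the exact text that was baselined.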
Use scripts/skillopt.py for deterministic run setup, JSONL validation, simple command-backed rollouts, score aggregation, validation gates, and report generation. Read references/evaluation.md when defining task schemas or scorers.
Identify:
If no task suite exists, create a small proxy suite first, label it as proxy data, and tell the user that real production traces are needed for stronger conclusions.
Represent each task as JSONL with an id, prompt, optional inputs, and a scorer. Keep validation examples independent and representative.
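As an illustration of that task format, here is a small sketch of a JSONL check. The field names id, prompt, inputs, and scorer come from the description above; the shape of the scorer object (`type`, `expected`), the file name memo.docx, and the function `validate_task_lines` are assumptions made for the example, so consult references/evaluation.md for the actual schema.

```python
import json

REQUIRED_FIELDS = ("id", "prompt", "scorer")  # "inputs" is optional

def validate_task_lines(lines):
    """Return (line_number, message) problems; an empty list means the file is valid."""
    problems = []
    seen_ids = set()
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # tolerate blank lines
        try:
            task = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(task, dict):
            problems.append((lineno, "task must be a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in task:
                problems.append((lineno, f"missing field: {field}"))
        task_id = task.get("id")
        if isinstance(task_id, str):
            if task_id in seen_ids:
                problems.append((lineno, f"duplicate id: {task_id}"))
            seen_ids.add(task_id)
    return problems

# Hypothetical tasks: the scorer layout is an assumption, not the documented schema.
example = [
    json.dumps({"id": "val_docx_04", "prompt": "Summarize the attached memo.",
                "inputs": {"file": "memo.docx"},
                "scorer": {"type": "contains", "expected": "summary"}}),
    json.dumps({"id": "val_docx_05", "prompt": "List the action items."}),
]
print(validate_task_lines(example))  # → [(2, 'missing field: scorer')]
```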
Minimum split:
train.jsonl: failure discovery and edit proposal
val.jsonl: accept/reject gate
For fragile or high-stakes skills, add a hidden or holdout split outside the optimization loop and use it only for final reporting.
Evaluate the unmodified skill on train and validation tasks using the same target agent that will later deploy it.
Examples:
python3 scripts/skillopt.py init --skill path/to/SKILL.md --out skillopt_runs/my-skill
python3 scripts/skillopt.py validate-tasks skillopt_runs/my-skill/tasks/train.jsonl
python3 scripts/skillopt.py run --tasks skillopt_runs/my-skill/tasks/val.jsonl --skill skillopt_runs/my-skill/source_skill.md --out skillopt_runs/my-skill/rollouts/val_baseline --agent-command "hermes -s {skill_path} -z {prompt}"
For OpenClaw or any other agent, replace --agent-command with a command template that accepts {skill_path}, {prompt}, {task_id}, and optionally {output_path}.
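One way to picture how such a command template gets filled in is the sketch below. It assumes plain `str.format` substitution with shell quoting of each value; whether scripts/skillopt.py quotes values itself is not stated here, and `render_agent_command` is a hypothetical helper, not part of the script.

```python
import shlex

def render_agent_command(template, *, skill_path, prompt, task_id, output_path=""):
    """Fill the placeholders, shell-quoting each value so prompts with spaces or quotes survive."""
    return template.format(
        skill_path=shlex.quote(skill_path),
        prompt=shlex.quote(prompt),
        task_id=shlex.quote(task_id),
        output_path=shlex.quote(output_path),
    )

cmd = render_agent_command(
    "hermes -s {skill_path} -z {prompt}",
    skill_path="skillopt_runs/my-skill/source_skill.md",
    prompt="Summarize the user's memo",
    task_id="val_docx_04",
)
# Splitting the rendered string back into arguments recovers the prompt as one argument.
print(shlex.split(cmd))
```

Placeholders that a template does not use, such as {task_id} here, are simply ignored, which is why the same call works for templates of different shapes.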
Analyze successful and failed rollouts separately.
For each failure, classify the root cause:
Extract patterns across failures before editing. Do not chase one-off errors unless they reveal a generalizable instruction.
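The "patterns before edits" rule can be sketched as a simple grouping step. The record fields `task_id` and `root_cause`, the label names, and the `min_count` cutoff are all hypothetical; they stand in for whatever root-cause taxonomy the run actually uses.

```python
from collections import defaultdict

def recurring_patterns(failures, min_count=2):
    """Group failed rollouts by root-cause label and keep labels seen at least min_count times."""
    by_cause = defaultdict(list)
    for failure in failures:
        by_cause[failure["root_cause"]].append(failure["task_id"])
    return {cause: ids for cause, ids in by_cause.items() if len(ids) >= min_count}

# Hypothetical failure records produced by manual or automated rollout review.
failures = [
    {"task_id": "train_01", "root_cause": "format_drift"},
    {"task_id": "train_04", "root_cause": "format_drift"},
    {"task_id": "train_07", "root_cause": "tool_misuse"},
]
print(recurring_patterns(failures))  # → {'format_drift': ['train_01', 'train_04']}
```

A cause that appears only once, like the single tool_misuse case here, is set aside rather than turned into an edit, unless review shows it points at a general instruction gap.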
Generate one candidate skill with a concise edit rationale:
Keep the candidate deployable as a normal skill. Avoid embedding run logs, benchmark answers, private traces, or optimizer notes in the final skill text.
Run the same validation set on the candidate. Accept only when the candidate beats the baseline by the configured threshold and does not introduce unacceptable regressions.
Default acceptance:
0.02
If rejected, append a short note to rejected_edits.md:
## candidate_003
Rejected because validation avg +0.00 and task val_docx_04 regressed.
Avoid adding broad "always rewrite" instructions; they caused format drift.
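The gate and the rejection note can be sketched as follows. This reads the 0.02 default as a minimum gain in average validation score, which is an assumption, as are the `max_task_drop` tolerance, the per-task score dictionaries, and both function names. It is not the gate implemented in scripts/skillopt.py.

```python
from statistics import mean

def validation_gate(baseline, candidate, min_gain=0.02, max_task_drop=0.0):
    """Compare per-task scores (task_id -> float) measured on the same validation set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same tasks")
    gain = mean(candidate.values()) - mean(baseline.values())
    # A task regresses when its score falls by more than the allowed tolerance.
    regressed = sorted(t for t in baseline if baseline[t] - candidate[t] > max_task_drop)
    accepted = gain >= min_gain and not regressed
    return accepted, gain, regressed

def rejection_note(name, gain, regressed):
    """Format an entry for rejected_edits.md in the style of the example above."""
    reason = f"validation avg {gain:+.2f}"
    if regressed:
        label = "task" if len(regressed) == 1 else "tasks"
        reason += f" and {label} {', '.join(regressed)} regressed"
    return f"## {name}\nRejected because {reason}.\n"

baseline = {"val_docx_03": 0.5, "val_docx_04": 1.0}
candidate = {"val_docx_03": 1.0, "val_docx_04": 0.5}
accepted, gain, regressed = validation_gate(baseline, candidate)
if not accepted:
    print(rejection_note("candidate_003", gain, regressed))
```

Checking per-task regressions separately from the average matters here: the candidate above ties the baseline on average while breaking a task that used to pass.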
Repeat rollout, reflection, candidate edit, and validation gate until:
Track the best candidate, not merely the latest candidate.
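A minimal sketch of tracking the best rather than the latest candidate is shown below. It assumes each iteration records a validation average and whether the gate accepted it; the tuple layout and the function name `best_candidate` are hypothetical.

```python
def best_candidate(history):
    """history: list of (candidate_name, validation_avg, accepted) tuples in run order."""
    accepted = [(name, score) for name, score, ok in history if ok]
    if not accepted:
        return None  # nothing passed the gate; keep the source skill
    # max() returns the earliest entry on ties, so an older candidate wins a draw.
    return max(accepted, key=lambda item: item[1])[0]

history = [
    ("candidate_001", 0.62, True),
    ("candidate_002", 0.70, True),
    ("candidate_003", 0.66, True),   # latest accepted, but not the best
    ("candidate_004", 0.71, False),  # higher average, rejected by the gate
]
print(best_candidate(history))  # → candidate_002
```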
Copy the best accepted candidate to best_skill.md. If the user wants installation, replace or install the deployed skill only after showing the report summary.
The final report should include:
name and description.
SKILL.md folders and invoke with hermes -s <skill-or-path> when testing locally.
A good SkillOpt run feels like engineering, not vibes: