Install
openclaw skills install @lanyasheng/auto-skill-forge

Use this skill when you need to automatically generate a task_suite.yaml test suite for an existing Skill, generate a complete SKILL.md plus task_suite from a skill_spec.yaml, or run the full "generate → evaluate → improve" pipeline in one step. It reads five kinds of source in the SKILL.md (frontmatter, When to Use, example, anti-example, Output), automatically derives 5-10 test tasks, and selects ContainsJudge, LLMRubricJudge, or PytestJudge for each. Do not use it to hand-write a SKILL.md (use skill-creator), to evaluate an existing task_suite on its own (use improvement-evaluator), or to drive the improvement loop (use improvement-orchestrator).

Generate Skills from requirements AND generate task_suite.yaml for evaluation.
The primary value of this skill is task_suite generation -- turning a SKILL.md into a structured test harness that improvement-evaluator can run. Secondary value is generating SKILL.md from a structured skill_spec.yaml.
Key differentiator: Skill Forge does not merely scaffold a skeleton; it performs static analysis of the SKILL.md to extract testable claims (from five distinct sources) and assigns the appropriate judge type per task, producing a suite that is immediately runnable by improvement-evaluator.
When to Use

- An existing skill has a SKILL.md but no task_suite.yaml. Run --from-skill to analyze the SKILL.md and generate a test suite automatically. The generator extracts scenarios from five sections of the document and assigns the right judge per task.
- You have a skill_spec.yaml describing what the skill should do. Run --from-spec to generate both a complete SKILL.md (with frontmatter, sections, examples) and a matching task_suite.yaml.
- You want to generate a skill, evaluate it with --evaluate, and optionally improve it with --auto-improve in one pass. This combines skill-forge, improvement-evaluator, and improvement-orchestrator into a single command.

When NOT to Use

- Evaluating an existing task_suite.yaml → use improvement-evaluator
- Driving the improvement loop → use improvement-orchestrator
- Hand-writing a SKILL.md → use skill-creator
- Judging which candidate patch is best → use improvement-discriminator
- Consolidating redundant skills → use skill-distill

--from-skill (Generate task suite for existing SKILL.md)

# Generate test suite for an existing skill
python3 scripts/forge.py --from-skill /path/to/skill-dir --output /path/to/output
# Generate and immediately evaluate
python3 scripts/forge.py --from-skill /path/to/skill-dir --output /path/to/output --evaluate
Reads the SKILL.md and extracts scenarios from five sources (frontmatter, When to Use, <example> tags, <anti-example> tags, Output) in priority order, including:

- <example> tags → up to 2 keyword-match tests (uses ContainsJudge)
- <anti-example> tags → up to 2 negative tests (should avoid bad patterns)

Produces task_suite.yaml in the output directory. Example generated output:
skill_id: release-notes-generator
version: "1.0"
generated_by: skill-forge
tasks:
  - id: release-notes-generator-core-capability
    description: "Test core capability described in skill description"
    prompt: "You are an AI assistant with this skill loaded..."
    judge:
      type: llm-rubric
      rubric: "The output should demonstrate the capability..."
      pass_threshold: 0.7
    timeout_seconds: 120
    source: frontmatter.description
  - id: release-notes-generator-use-case-01
    description: "Use case: Generate notes from git log between tags"
    prompt: "Scenario: Generate notes from git log..."
    judge:
      type: llm-rubric
      rubric: "The output should address this use case..."
      pass_threshold: 0.6
    timeout_seconds: 120
    source: when_to_use
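
The <example>-tag extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/forge.py: the function name `example_tasks`, the `contains` judge type string, the `expected` field, the `-example-NN` id suffix, and the "longest words become keywords" rule are all assumptions made for the sake of the example.

```python
import re

# Hypothetical helper; the real extraction logic in forge.py may differ.
EXAMPLE_RE = re.compile(r"<example>(.*?)</example>", re.DOTALL | re.IGNORECASE)


def example_tasks(skill_id: str, skill_md: str, limit: int = 2) -> list[dict]:
    """Build up to `limit` keyword-match tasks from <example> blocks."""
    tasks = []
    for index, match in enumerate(EXAMPLE_RE.finditer(skill_md), start=1):
        if len(tasks) >= limit:
            break
        body = match.group(1).strip()
        if not body:
            continue
        # Assumption: the longest words of the example act as expected keywords.
        words = re.findall(r"[A-Za-z][A-Za-z0-9_-]{3,}", body)
        keywords = sorted(set(words), key=lambda w: (-len(w), w))[:3]
        tasks.append({
            "id": f"{skill_id}-example-{index:02d}",
            "prompt": body,
            # Assumed field names for a ContainsJudge-style task.
            "judge": {"type": "contains", "expected": keywords},
            "source": "example",
        })
    return tasks
```

Because the pattern requires a literal `<example>` opening tag, `<anti-example>` blocks are not picked up here and would need their own pass.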
--from-spec (Generate skill + task suite from spec)

# Generate complete skill from a spec file
python3 scripts/forge.py --from-spec spec.yaml --output /path/to/output
# Generate, evaluate, and auto-improve if below SOLID grade
python3 scripts/forge.py --from-spec spec.yaml --output /path/to/output --auto-improve
Reads a skill_spec.yaml (see references/spec-format.md) and generates:

- A complete SKILL.md (with frontmatter, sections, examples)
- task_suite.yaml derived from the generated SKILL.md (same five-source extraction)

The spec format requires only name and purpose; optional fields (inputs, outputs, quality_criteria, domain_knowledge, reference_skills) enrich the generated SKILL.md. Example minimal spec:
name: release-notes-generator
purpose: Generate structured release notes from git commit history
inputs:
  - name: commits
    type: git-log
    description: "Git commit log between two tags"
outputs:
  - name: release-notes
    format: markdown
    description: "Structured release notes with sections"
quality_criteria:
  - name: completeness
    description: "All commits accounted for in the notes"
    weight: 0.3
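
To make the "only name and purpose are required" rule concrete, here is a small sketch of a pre-flight check on an already-parsed spec. The function names `validate_spec` and `optional_fields_used` are hypothetical, and the authoritative rules live in references/spec-format.md, which may be stricter than this.

```python
# Field names come from the text above; the checks themselves are assumptions.
REQUIRED_FIELDS = ("name", "purpose")
OPTIONAL_FIELDS = (
    "inputs",
    "outputs",
    "quality_criteria",
    "domain_knowledge",
    "reference_skills",
)


def validate_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the required fields are set."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = spec.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing required field: {field}")
    return problems


def optional_fields_used(spec: dict) -> list[str]:
    """List which optional enrichment fields the spec actually provides."""
    return [field for field in OPTIONAL_FIELDS if spec.get(field)]
```

Running a check like this before calling --from-spec would surface a missing purpose early, rather than after generation has started.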
| Flag | Effect |
|---|---|
| --mock | Use mock LLM (for testing without API calls) |
| --evaluate | Run improvement-evaluator after generation (requires it installed) |
| --auto-improve | Run improvement-orchestrator if score below SOLID (requires it installed) |
| Request | Deliverable | Location |
|---|---|---|
| --from-skill | task_suite.yaml with 5-10 test tasks | <output>/task_suite.yaml |
| --from-spec | SKILL.md + task_suite.yaml | <output>/<name>/SKILL.md, <output>/<name>/task_suite.yaml |
| --evaluate | Evaluation report (pass/fail per task, aggregate pass rate) | stdout + <output>/evaluation_report.json |
| --auto-improve | Improved SKILL.md (if score was below SOLID) | in-place update of SKILL.md |
All generated YAML files use allow_unicode: True and default_flow_style: False for human readability. Files are written atomically (write-then-rename) to prevent corruption on crash.
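
The write-then-rename pattern mentioned above can be sketched with the standard library alone. This is an illustration under assumptions, not the helper in lib/common: `write_text_atomic` is a hypothetical name, and the text passed in is assumed to be already serialized (for YAML, PyYAML's `yaml.safe_dump` accepts the `allow_unicode` and `default_flow_style` options named above).

```python
import os
import tempfile


def write_text_atomic(path: str, text: str) -> None:
    """Write text to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must sit on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file so a crash leaves no debris behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader of the target path sees either the old file or the complete new one, never a half-written file.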
The generator extracts test scenarios from 5 sources in the SKILL.md, including:

- <example> tags → up to 2 keyword-match tests
- <anti-example> tags → up to 2 anti-pattern avoidance tests

Tasks are deduplicated and capped at 10 per suite.
Each task is assigned one of three judge types:

- ContainsJudge
- LLMRubricJudge
- PytestJudge (if test script can be generated)

When the target skill has a scripts/ directory, forge auto-generates additional test tasks checking execution-harness pattern adoption:

- Does the source reference subprocess.TimeoutExpired?
- Does it use write_json/write_text from lib/common (atomic write-then-rename)?

These tasks use ContainsJudge to grep the skill's Python source code. They only apply to orchestration/tool-type skills; pure-text knowledge skills skip this category.
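
A grep-style check of this kind might look like the sketch below. The function name `check_harness_patterns` and the dictionary keys are hypothetical labels, and plain substring matching is an assumption about how a ContainsJudge-style check behaves; only the searched strings come from the text above.

```python
from pathlib import Path

# Searched strings are named in the text; the keys are hypothetical task labels.
HARNESS_PATTERNS = {
    "handles-timeout": ["subprocess.TimeoutExpired"],
    "atomic-write": ["write_json", "write_text"],
}


def check_harness_patterns(skill_dir: str) -> dict[str, bool]:
    """Report which harness patterns appear anywhere under <skill_dir>/scripts."""
    scripts = Path(skill_dir) / "scripts"
    if not scripts.is_dir():
        # Pure-text knowledge skills have no scripts/, so the category is skipped.
        return {}
    source = "\n".join(
        path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(scripts.rglob("*.py"))
    )
    return {
        label: any(needle in source for needle in needles)
        for label, needles in HARNESS_PATTERNS.items()
    }
```

Substring matching is cheap but coarse: it would also match a pattern that appears only in a comment, which is a known limit of this style of check.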
A generated task suite is only useful if it measures the skill's contribution, not the base LLM's general ability. Without calibration, a naive suite can report 80%+ pass rates even when the SKILL.md adds zero value -- because the LLM already knows how to answer those questions.
Null-skill calibration addresses this by running every candidate task against a "null skill" (empty context, no SKILL.md loaded). Any task the null skill passes trivially is filtered out before the final suite is emitted. This ensures that every surviving task genuinely requires the knowledge or structure encoded in the SKILL.md.
Tradeoff: Null-skill calibration adds one extra LLM call per candidate task (or a heuristic keyword check in --mock mode). For a typical 10-task candidate set, this means ~10 additional calls during suite generation. The cost is justified because an uncalibrated suite gives false confidence: a skill that scores 9/10 on easy tasks looks "SOLID" but may add no value over a bare model. Calibrated suites reliably distinguish genuine skill contributions from baseline LLM capability.
When --mock is used, calibration falls back to a heuristic: tasks whose prompt contains only generic verbs ("explain", "describe", "list") without skill-specific terminology are filtered. This is less precise than LLM-based calibration but costs zero API calls.
The calibration step runs after deduplication and before the final cap of 10 tasks per suite.
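
The ordering described here (deduplicate, calibrate, then cap) together with the --mock heuristic can be sketched as follows. This is an assumption-laden illustration: `passes_null_skill_heuristic` and `finalize_suite` are hypothetical names, deduplicating by normalized prompt text is a guess at what counts as a duplicate, and the caller is assumed to supply the skill-specific terms.

```python
import re

# Generic verbs quoted in the text above.
GENERIC_VERBS = {"explain", "describe", "list"}
# Tokens may contain internal dots or hyphens, e.g. "task_suite.yaml".
TOKEN_RE = re.compile(r"[a-z0-9_]+(?:[.-][a-z0-9_]+)*")


def passes_null_skill_heuristic(prompt: str, skill_terms: set[str]) -> bool:
    """True when a bare model would plausibly pass: generic verb, no skill terms."""
    words = set(TOKEN_RE.findall(prompt.lower()))
    has_generic_verb = bool(words & GENERIC_VERBS)
    has_skill_term = bool(words & {term.lower() for term in skill_terms})
    return has_generic_verb and not has_skill_term


def finalize_suite(tasks: list[dict], skill_terms: set[str], cap: int = 10) -> list[dict]:
    """Deduplicate, drop tasks the null skill would pass, then cap the suite."""
    seen, unique = set(), []
    for task in tasks:
        key = task["prompt"].strip().lower()  # assumed duplicate criterion
        if key not in seen:
            seen.add(key)
            unique.append(task)
    calibrated = [
        task for task in unique
        if not passes_null_skill_heuristic(task["prompt"], skill_terms)
    ]
    return calibrated[:cap]
```

Capping last means the filter cannot shrink an already-capped suite below what the candidates could have supported.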
| Skill | Relationship | When to prefer over skill-forge |
|---|---|---|
| improvement-evaluator | Downstream consumer: runs the generated task_suite.yaml and reports pass/fail per task | You already have both SKILL.md and task_suite.yaml, just need to run them |
| improvement-orchestrator | Drives the full generate-evaluate-improve loop; skill-forge is one step in this loop | You want automatic multi-round improvement, not just test generation |
| improvement-generator | Generates improvement candidates (patches) for a SKILL.md | You want to improve an existing skill's prose/structure, not generate tests |
| improvement-discriminator | Scores improvement candidates via multi-reviewer blind panel | You need to judge which candidate patch is best |
| skill-creator | Manual SKILL.md authoring guide with templates | You prefer hand-writing the SKILL.md rather than generating it |
| skill-distill | Merges multiple overlapping skills into one distilled skill | You have redundant skills to consolidate, not a new skill to create |