Skill Forge

v1.0.1

Use when you need to auto-generate a task_suite.yaml test suite for an existing Skill, generate a complete SKILL.md + task_suite from a skill_spec.yaml, or run the full "generate → evaluate → improve" pipeline in one pass. Reads the frontmatter/When to Use/exampl...

by @lanyasheng

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for lanyasheng/auto-skill-forge.

Prompt Preview: Install & Setup
Install the skill "Skill Forge" (lanyasheng/auto-skill-forge) from ClawHub.
Skill page: https://clawhub.ai/lanyasheng/auto-skill-forge
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install auto-skill-forge

ClawHub CLI


npx clawhub@latest install auto-skill-forge
Security Scan

VirusTotal: Benign (view report →)
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description match the code: scripts parse a SKILL.md or a skill_spec.yaml and generate task_suite.yaml and SKILL.md. No required binaries, environment variables, or paths are declared, which is appropriate for a pure generator. The ability to call improvement-evaluator and improvement-orchestrator (if present) is consistent with the documented --evaluate/--auto-improve modes.
Instruction Scope
Runtime behavior stays within the stated scope: reading SKILL.md or a spec, generating SKILL.md and task_suite.yaml, and writing them to the specified output directory. One noteworthy behavior: when the user passes --evaluate or --auto-improve the CLI will look for and execute scripts under Path.home()/.claude/skills/improvement-evaluator and improvement-orchestrator. Executing those scripts is expected for evaluation/orchestration but means the skill can cause arbitrary local code to run if those scripts exist.
Install Mechanism
No install spec — instruction-only with included Python source. Nothing is downloaded or written by an installer. Code is plain Python (no obfuscated downloads).
Credentials
The skill declares no environment variables, no credentials, and no config paths. The code also doesn't read secrets from env vars. It only reads/writes files in the provided skill/output directories and conditionally checks for evaluator/orchestrator scripts under the user's home; those checks are proportional to the documented features.
Persistence & Privilege
The always flag is false and the skill does not request permanent system presence. It writes generated files into the output directory (expected). The only elevated action is launching local evaluator/orchestrator scripts, and that happens only when the user supplies --evaluate or --auto-improve.
Assessment
This skill appears coherent with its description and has no secret or network access. Before using it: (1) Review any SKILL.md or skill_spec.yaml you pass in to ensure they don't contain secrets (those files' content will be embedded into generated prompts/tasks). (2) Only use --evaluate or --auto-improve if you trust any installed improvement-evaluator and improvement-orchestrator scripts at ~/.claude/skills/ because the CLI will execute those scripts if present. (3) Inspect the generated task_suite.yaml before running it with other tools. If you want extra assurance, run the generator with --mock to avoid LLM calls and avoid the evaluate/auto-improve flags until you vet the local evaluator/orchestrator code.

Like a lobster shell, security has layers — review code before you run it.

latest: vk97141rcqgexzq0wbv4bkfjhxh84cbd6
101 downloads · 0 stars · 2 versions
Updated 3w ago · v1.0.1 · MIT-0

Skill Forge

Generate Skills from requirements AND generate task_suite.yaml for evaluation.

The primary value of this skill is task_suite generation -- turning a SKILL.md into a structured test harness that improvement-evaluator can run. Secondary value is generating SKILL.md from a structured skill_spec.yaml.

Key differentiator: Skill Forge does not merely scaffold a skeleton; it performs static analysis of the SKILL.md to extract testable claims (from five distinct sources) and assigns the appropriate judge type per task, producing a suite that is immediately runnable by improvement-evaluator.

When to Use

  1. Add tests to an existing Skill -- You have a SKILL.md but no task_suite.yaml. Run --from-skill to analyze the SKILL.md and generate a test suite automatically. The generator extracts scenarios from five sections of the document and assigns the right judge per task.
  2. Create a new Skill from scratch -- You have a skill_spec.yaml describing what the skill should do. Run --from-spec to generate both a complete SKILL.md (with frontmatter, sections, examples) and a matching task_suite.yaml.
  3. Full lifecycle -- Generate a skill, test it with --evaluate, and optionally improve it with --auto-improve in one pass. Combines skill-forge, improvement-evaluator, and improvement-orchestrator into a single command.
  4. Validate coverage of an existing suite -- Compare the generated task_suite against a hand-written one to find untested scenarios.

When NOT to Use

  • You just want to evaluate an existing skill with an existing task_suite.yaml → use improvement-evaluator
  • You just want to improve a skill that already has tests → use improvement-orchestrator
  • You want to manually write a SKILL.md with templates → use skill-creator
  • You want to score improvement candidates (not generate tests) → use improvement-discriminator
  • You want to merge overlapping skills into one consolidated skill → use skill-distill

Modes

Mode A: --from-skill (Generate task suite for existing SKILL.md)

# Generate test suite for an existing skill
python3 scripts/forge.py --from-skill /path/to/skill-dir --output /path/to/output

# Generate and immediately evaluate
python3 scripts/forge.py --from-skill /path/to/skill-dir --output /path/to/output --evaluate

Reads the SKILL.md, extracts scenarios from five sources in priority order:

  1. Frontmatter (description, triggers) → 1 core-capability test
  2. "When to Use" bullets → up to 3 positive use-case tests
  3. <example> tags → up to 2 keyword-match tests (uses ContainsJudge)
  4. <anti-example> tags → up to 2 negative tests (should avoid bad patterns)
  5. Output format/CLI sections → 1 format compliance test
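A minimal sketch of what this five-source extraction could look like, assuming markdown-style headings (e.g. "When to Use") and <example>/<anti-example> tags inside the SKILL.md; the function name and regexes below are illustrative, not forge.py's actual internals:

# Illustrative sketch of five-source scenario extraction (not the actual forge.py code).
import re

def extract_scenarios(skill_md: str) -> list[dict]:
    scenarios = []

    # 1. Frontmatter description -> one core-capability scenario
    fm = re.search(r"^---\n(.*?)\n---", skill_md, re.DOTALL)
    if fm:
        desc = re.search(r"^description:\s*(.+)$", fm.group(1), re.MULTILINE)
        if desc:
            scenarios.append({"source": "frontmatter.description",
                              "text": desc.group(1).strip(), "judge": "llm-rubric"})

    # 2. "When to Use" bullets -> up to 3 positive use-case scenarios
    wtu = re.search(r"## When to Use\n(.*?)(?=\n## |\Z)", skill_md, re.DOTALL)
    if wtu:
        bullets = re.findall(r"^\s*(?:[-*]|\d+\.)\s+(.+)$", wtu.group(1), re.MULTILINE)
        for b in bullets[:3]:
            scenarios.append({"source": "when_to_use", "text": b, "judge": "llm-rubric"})

    # 3. <example> tags -> up to 2 keyword-match scenarios (ContainsJudge)
    for ex in re.findall(r"<example>(.*?)</example>", skill_md, re.DOTALL)[:2]:
        scenarios.append({"source": "example", "text": ex.strip(), "judge": "contains"})

    # 4. <anti-example> tags -> up to 2 negative scenarios
    for ax in re.findall(r"<anti-example>(.*?)</anti-example>", skill_md, re.DOTALL)[:2]:
        scenarios.append({"source": "anti_example", "text": ax.strip(), "judge": "llm-rubric"})

    # 5. Output format section -> one format-compliance scenario
    if re.search(r"## Output", skill_md):
        scenarios.append({"source": "output_format",
                          "text": "Output must follow the documented format",
                          "judge": "llm-rubric"})
    return scenarios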

Produces task_suite.yaml in the output directory. Example generated output:

skill_id: release-notes-generator
version: "1.0"
generated_by: skill-forge
tasks:
  - id: release-notes-generator-core-capability
    description: "Test core capability described in skill description"
    prompt: "You are an AI assistant with this skill loaded..."
    judge:
      type: llm-rubric
      rubric: "The output should demonstrate the capability..."
      pass_threshold: 0.7
    timeout_seconds: 120
    source: frontmatter.description
  - id: release-notes-generator-use-case-01
    description: "Use case: Generate notes from git log between tags"
    prompt: "Scenario: Generate notes from git log..."
    judge:
      type: llm-rubric
      rubric: "The output should address this use case..."
      pass_threshold: 0.6
    timeout_seconds: 120
    source: when_to_use

Mode B: --from-spec (Generate skill + task suite from spec)

# Generate complete skill from a spec file
python3 scripts/forge.py --from-spec spec.yaml --output /path/to/output

# Generate, evaluate, and auto-improve if below SOLID grade
python3 scripts/forge.py --from-spec spec.yaml --output /path/to/output --auto-improve

Reads a skill_spec.yaml (see references/spec-format.md) and generates:

  1. A complete SKILL.md with frontmatter, When to Use / When NOT to Use, examples, output format
  2. A task_suite.yaml derived from the generated SKILL.md (same five-source extraction)

The spec format requires only name and purpose; optional fields (inputs, outputs, quality_criteria, domain_knowledge, reference_skills) enrich the generated SKILL.md. Example minimal spec:

name: release-notes-generator
purpose: Generate structured release notes from git commit history

inputs:
  - name: commits
    type: git-log
    description: "Git commit log between two tags"

outputs:
  - name: release-notes
    format: markdown
    description: "Structured release notes with sections"

quality_criteria:
  - name: completeness
    description: "All commits accounted for in the notes"
    weight: 0.3
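As a rough illustration of how such a spec might be loaded and validated (the field names follow the example above; the real forge.py may differ):

# Illustrative loader for skill_spec.yaml; only name and purpose are required.
import yaml

REQUIRED = ("name", "purpose")
OPTIONAL = ("inputs", "outputs", "quality_criteria", "domain_knowledge", "reference_skills")

def load_spec(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        spec = yaml.safe_load(f) or {}
    missing = [k for k in REQUIRED if not spec.get(k)]
    if missing:
        raise ValueError(f"skill_spec.yaml missing required fields: {missing}")
    # Optional fields default to empty lists so generation code can iterate safely.
    for key in OPTIONAL:
        spec.setdefault(key, [])
    return spec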

Common Flags

Flag            Effect
--mock          Use mock LLM (for testing without API calls)
--evaluate      Run improvement-evaluator after generation (requires it installed)
--auto-improve  Run improvement-orchestrator if score is below SOLID (requires it installed)
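A hypothetical sketch of how these flags could be wired with argparse; the flag names come from the documentation above, while the defaults and help text are assumptions:

# Illustrative CLI wiring for the documented flags (not the actual forge.py argument parser).
import argparse

parser = argparse.ArgumentParser(prog="forge.py")
mode = parser.add_mutually_exclusive_group(required=True)
mode.add_argument("--from-skill", metavar="SKILL_DIR", help="generate task_suite.yaml for an existing SKILL.md")
mode.add_argument("--from-spec", metavar="SPEC_YAML", help="generate SKILL.md + task_suite.yaml from a spec")
parser.add_argument("--output", required=True, help="output directory")
parser.add_argument("--mock", action="store_true", help="use a mock LLM; no API calls")
parser.add_argument("--evaluate", action="store_true", help="run improvement-evaluator after generation")
parser.add_argument("--auto-improve", action="store_true", help="run improvement-orchestrator if score is below SOLID")
args = parser.parse_args()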

Output Artifacts

Request         Deliverable                                                   Location
--from-skill    task_suite.yaml with 5-10 test tasks                          <output>/task_suite.yaml
--from-spec     SKILL.md + task_suite.yaml                                    <output>/<name>/SKILL.md, <output>/<name>/task_suite.yaml
--evaluate      Evaluation report (pass/fail per task, aggregate pass rate)   stdout + <output>/evaluation_report.json
--auto-improve  Improved SKILL.md (if score was below SOLID)                  in-place update of SKILL.md

All generated YAML files use allow_unicode: True and default_flow_style: False for human readability. Files are written atomically (write-then-rename) to prevent corruption on crash.
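A minimal sketch of that write-then-rename pattern; the helper name write_yaml_atomic is illustrative and not necessarily the skill's lib/common API:

# Sketch of an atomic YAML write: dump to a temp file in the same directory, then rename.
import os
import tempfile
import yaml

def write_yaml_atomic(path: str, data: dict) -> None:
    text = yaml.safe_dump(data, allow_unicode=True, default_flow_style=False, sort_keys=False)
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)  # clean up the temp file if anything failed before the rename
        raise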

Task Generation Strategy

The generator extracts test scenarios from 5 sources in the SKILL.md:

  1. Frontmatter description → 1 core-capability test
  2. "When to Use" section → up to 3 positive use-case tests
  3. <example> tags → up to 2 keyword-match tests
  4. <anti-example> tags → up to 2 anti-pattern avoidance tests
  5. Output format section → 1 format compliance test

Tasks are deduplicated and capped at 10 per suite.
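A small sketch of the dedup-and-cap step, assuming tasks are dicts with "id" and "prompt" keys as in the task_suite.yaml example above:

# Deduplicate by normalized prompt text, then cap the suite at 10 tasks.
MAX_TASKS = 10

def dedupe_and_cap(tasks: list[dict], cap: int = MAX_TASKS) -> list[dict]:
    seen, unique = set(), []
    for task in tasks:
        key = task["prompt"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(task)
    return unique[:cap]  # keep at most `cap` tasks, preserving source priority order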

Judge Selection

  • Scenario with specific keywords/outputs → ContainsJudge
  • Scenario requiring quality/style assessment → LLMRubricJudge
  • Scenario with structured output → PytestJudge (if test script can be generated)
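One way the selection heuristic could be expressed, following the rules above (the signals checked here are assumptions, not the actual forge.py logic):

# Illustrative judge selection: keyword match -> ContainsJudge, structured output with a
# generated test script -> PytestJudge, everything else -> LLMRubricJudge.
def select_judge(scenario: dict) -> dict:
    if scenario.get("expected_keywords"):
        return {"type": "contains", "keywords": scenario["expected_keywords"]}
    if scenario.get("structured_output") and scenario.get("test_script"):
        return {"type": "pytest", "script": scenario["test_script"]}
    return {"type": "llm-rubric",
            "rubric": f"The output should address: {scenario['text']}",
            "pass_threshold": 0.6}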

Harness Pattern Tasks (for scripted skills)

When the target skill has a scripts/ directory, forge auto-generates additional test tasks that check whether the skill adopts the execution-harness pattern:

  • Timeout handling: Does the skill handle subprocess.TimeoutExpired?
  • Atomic writes: Does it use write_json/write_text from lib/common (atomic write-then-rename)?
  • Backup/rollback: Does it create backups before file modifications?
  • Error escalation: Does it have graduated error handling (not just crash-on-first-failure)?
  • State persistence: Does it write recoverable state for crash recovery?

These tasks use ContainsJudge to grep the skill's Python source code. They only apply to orchestration/tool-type skills — pure-text knowledge skills skip this category.
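A sketch of how such harness-pattern tasks might be generated; apart from subprocess.TimeoutExpired and the write_json/write_text helpers named above, the marker strings are assumptions:

# Each check becomes a ContainsJudge task that greps the target skill's Python source.
HARNESS_CHECKS = {
    "timeout-handling":  ["TimeoutExpired"],
    "atomic-writes":     ["write_json", "write_text"],
    "backup-rollback":   ["backup"],          # assumed marker
    "error-escalation":  ["except", "retry"], # assumed markers
    "state-persistence": ["state"],           # assumed marker
}

def harness_tasks(skill_id: str, has_scripts_dir: bool) -> list[dict]:
    if not has_scripts_dir:  # pure-text knowledge skills skip this category
        return []
    return [
        {"id": f"{skill_id}-harness-{name}",
         "description": f"Check {name} in the skill's Python source",
         "judge": {"type": "contains", "keywords": markers},
         "source": "harness_pattern"}
        for name, markers in HARNESS_CHECKS.items()
    ]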

Why Null-Skill Calibration

A generated task suite is only useful if it measures the skill's contribution, not the base LLM's general ability. Without calibration, a naive suite can report 80%+ pass rates even when the SKILL.md adds zero value -- because the LLM already knows how to answer those questions.

Null-skill calibration addresses this by running every candidate task against a "null skill" (empty context, no SKILL.md loaded). Any task the null skill passes trivially is filtered out before the final suite is emitted. This ensures that every surviving task genuinely requires the knowledge or structure encoded in the SKILL.md.

Tradeoff: Null-skill calibration adds one extra LLM call per candidate task (or a heuristic keyword check in --mock mode). For a typical 10-task candidate set, this means ~10 additional calls during suite generation. The cost is justified because an uncalibrated suite gives false confidence: a skill that scores 9/10 on easy tasks looks "SOLID" but may add no value over a bare model. Calibrated suites reliably distinguish genuine skill contributions from baseline LLM capability.

When --mock is used, calibration falls back to a heuristic: tasks whose prompt contains only generic verbs ("explain", "describe", "list") without skill-specific terminology are filtered. This is less precise than LLM-based calibration but costs zero API calls.

The calibration step runs after deduplication and before the final cap of 10 tasks per suite.
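A sketch of the --mock calibration heuristic described above; the generic-verb list and the term-matching rule are illustrative:

# Drop tasks whose prompts use only generic verbs and never mention skill-specific terms.
GENERIC_VERBS = {"explain", "describe", "list", "summarize"}

def passes_mock_calibration(prompt: str, skill_terms: set[str]) -> bool:
    words = set(prompt.lower().split())
    uses_generic_verb = bool(words & GENERIC_VERBS)
    mentions_skill_terms = bool(words & {t.lower() for t in skill_terms})
    # Keep the task unless it is generic AND never references the skill's own vocabulary.
    return mentions_skill_terms or not uses_generic_verb

def calibrate(tasks: list[dict], skill_terms: set[str]) -> list[dict]:
    return [t for t in tasks if passes_mock_calibration(t["prompt"], skill_terms)]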

Related Skills

improvement-evaluator
  Relationship: downstream consumer; runs the generated task_suite.yaml and reports pass/fail per task.
  Prefer it when: you already have both SKILL.md and task_suite.yaml and just need to run them.

improvement-orchestrator
  Relationship: drives the full generate-evaluate-improve loop; skill-forge is one step in that loop.
  Prefer it when: you want automatic multi-round improvement, not just test generation.

improvement-generator
  Relationship: generates improvement candidates (patches) for a SKILL.md.
  Prefer it when: you want to improve an existing skill's prose/structure, not generate tests.

improvement-discriminator
  Relationship: scores improvement candidates via a multi-reviewer blind panel.
  Prefer it when: you need to judge which candidate patch is best.

skill-creator
  Relationship: manual SKILL.md authoring guide with templates.
  Prefer it when: you prefer hand-writing the SKILL.md rather than generating it.

skill-distill
  Relationship: merges multiple overlapping skills into one distilled skill.
  Prefer it when: you have redundant skills to consolidate, not a new skill to create.
