Install

openclaw skills install skill-engineer

Design, test, review, and maintain agent skills for OpenClaw systems using multi-agent iterative refinement. Orchestrates Designer, Reviewer, and Tester subagents.

Own the full lifecycle of agent skills in your OpenClaw agent kit. The entire multi-agent workflow depends on skill quality: a weak skill produces weak results across every run.
Core principle: Builders don't evaluate their own work. This skill enforces separation of concerns through a multi-agent architecture where design, review, and testing are performed by independent subagents.
Source: Anthropic "Improving skill-creator" (2026-03-03)
Skills fall into two categories. This distinction drives design decisions, testing strategy, and lifecycle management.
Capability uplift: the model can't do it well alone; the skill injects techniques, patterns, or constraints that produce better output than prompting alone.
Examples: Document creation skills (PDF generation), complex formatting, specialized analysis pipelines.
Testing focus: Monitor whether the base model has caught up. If the base model passes your evals without the skill loaded, the skill's techniques have been incorporated into model default behavior. The skill isn't broken — it's no longer necessary.
Lifecycle: These skills may "retire" as models improve. Build evals that can detect when retirement is appropriate.
Encoded preference: the model can already do each step; the skill sequences operations according to your team's specific process.
Examples: NDA review against set criteria, weekly report generation from specific data sources, brand compliance checks.
Testing focus: Verify the skill faithfully reproduces your actual workflow, not the model's "free improvisation." Fidelity to process is the metric.
Lifecycle: These skills are durable — they encode organizational knowledge that doesn't change with model capability. They need maintenance when processes change, not when models change.
When the Designer begins work, classify the skill:
| Classification | Design priority | Test priority | Retirement risk |
|---|---|---|---|
| Capability uplift | Technique precision | Base model comparison | High — monitor model progress |
| Encoded preference | Process fidelity | Workflow reproduction | Low — tied to org process |
This skill requires the following to be installed and available:
| Dependency | Type | Purpose | Install from |
|---|---|---|---|
| deepwiki | Skill | Query OpenClaw source for current API behavior | liaosvcaf/openclaw-skill-deepwiki |
| Vector memory DB | OpenClaw feature | Semantic search across session history, notes, and memory files | Enable in openclaw.json (memory.enabled: true) |
Before starting any skill design or update session, verify both are available:
# Check deepwiki
ls ~/.openclaw/skills/deepwiki/deepwiki.sh || ls ~/.openclaw/workspace-*/skills/deepwiki/deepwiki.sh
# Check vector memory (should return results, not empty)
# Use the memory_search tool with a known topic from recent sessions
If deepwiki is missing, install from liaosvcaf/openclaw-skill-deepwiki.
If vector memory returns no results on known topics, check that memory.enabled is true in openclaw.json and that indexing has run.
DeepWiki: OpenClaw APIs are version-specific. Without DeepWiki, skills are written against memory of past behavior — which drifts as OpenClaw updates. DeepWiki grounds skill content in actual source code. A skill engineer without DeepWiki is working blind.
Vector memory DB: Session history, Obsidian notes, and past decisions are indexed here. Without it, the agent falls back to manual file search — slower, less accurate, and misses cross-document connections. Critical context from past sessions (installation guides, design decisions, pitfalls) lives in this index.
Before searching files manually, always query the vector memory database first. It indexes session history, Obsidian notes, and memory files — and finds cross-document connections that manual search misses.
When to query vector memory: at the start of any skill design or update session, and whenever you need past decisions, installation guides, or pitfalls recorded in earlier sessions.
How to query correctly:
memory_search("your query here", maxResults=5)
Critical rule: try multiple queries before giving up.
If the first query returns empty, do NOT fall back to manual file search immediately. Try at least 3 different phrasings:
| First query fails | Try instead |
|---|---|
| "Docker OpenClaw installation" | "dockerized openclaw Titan" |
| "dockerized openclaw Titan" | "openclaw isolation install guide" |
| Still empty | Then fall back to manual file search |
Lesson learned (2026-03-03): When asked to find Docker/OpenClaw installation notes, memory_search returned empty on the first query and the agent immediately switched to manual SQLite/file search. The correct approach was to try different query phrasings — the second attempt ("dockerized OpenClaw installation Titan setup") returned 5 relevant results directly from indexed Obsidian notes. Manual file search is a last resort, not a first response.
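A minimal sketch of that retry rule, assuming the memory_search tool shown above; the query list and the manual_file_search fallback are illustrative:

```python
# Try several phrasings before falling back to manual file search.
# memory_search is the tool shown above; manual_file_search is a hypothetical last-resort helper.
queries = [
    "Docker OpenClaw installation",
    "dockerized openclaw Titan",
    "openclaw isolation install guide",
]

results = []
for q in queries:
    results = memory_search(q, maxResults=5)
    if results:
        break

if not results:
    results = manual_file_search()  # last resort only
```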
OpenClaw APIs, skill loading behavior, subagent mechanics, and frontmatter fields are version-specific. Information in this skill or any skill referencing OpenClaw internals may be outdated.
ALWAYS query DeepWiki when writing or updating skill content that touches sessions_spawn, tool calls, or other OpenClaw-specific APIs.

How to check:
# Check current OpenClaw version
openclaw --version
# Query DeepWiki for current behavior
~/.openclaw/skills/deepwiki/deepwiki.sh ask openclaw/openclaw "YOUR QUESTION"
Do NOT rely on memory or this skill's documented behavior without verifying when the topic is OpenClaw internals. DeepWiki is grounded in the actual source code. This skill's documentation is not.
Verification checklist before shipping any skill that references OpenClaw internals:
- Compare openclaw --version against version tags in the skill

This skill produces validated skill artifacts (SKILL.md, skill.yml, README.md, tests, scripts). Once artifacts pass quality gates, responsibility transfers to whatever system handles publishing and deployment.
A skill development cycle is considered successful when the Reviewer verdict is PASS, the Tester verdict is PASS, and both are reached within the 3-iteration limit.
If any criterion fails, the skill returns to the Designer for revision.
When invoking this skill, the orchestrator must gather:
| Input | Description | Required | Source |
|---|---|---|---|
| Problem description | What capability or workflow needs to be enabled | Yes | User conversation |
| Target audience | Which agent(s) will use this skill | Yes | User or inferred |
| Expected interactions | With users, APIs, files, MCP servers, other skills | Yes | Requirements discussion |
| Inputs/Outputs | What data the skill receives and produces | Yes | Requirements discussion |
| Constraints | Performance limits, security requirements, dependencies | No | User or system |
| Prior feedback | Review or test reports from previous iterations | No | Previous Reviewer/Tester |
| Existing artifacts | If refactoring/maintaining an existing skill | No | File system |
Example requirements gathering:
User: "I need a skill for analyzing competitor websites"
Orchestrator gathers:
- Problem: Automate competitor analysis with structured output
- Audience: research-agent
- Interactions: web_fetch, browser tool, writes markdown reports
- Inputs: competitor URLs, analysis criteria
- Outputs: comparison table, insights markdown
- Constraints: must complete in <60s per site
These inputs are then passed to the Designer to begin the design process.
The skill-engineer uses a three-role iterative architecture. The orchestrator spawns subagents for each role and never does creative or evaluation work directly.
Two architecture modes are available. Choose based on complexity:
Mode A: Director-Controlled (simple/short skill work)
Use when: ≤2 phases, <10 minutes total, user interaction needed between phases (e.g., quick fixes, single-skill reviews).
Director/Orchestrator (main agent, depth 0)
├─ Spawn ──→ Designer (depth 1)
├─ Spawn ──→ Reviewer (depth 1)
└─ Spawn ──→ Tester (depth 1)
Risk: announce-to-action gap. If the user sends a message while the main agent is waiting for a subagent, the main agent may handle that message instead of chaining the next phase. Mitigate with the cron safety net (see below).
Mode B: Orchestrator Subagent Pattern (complex/long skill work)
Use when: 3+ phases, >10 minutes, pipeline must not stall, parallel workers needed.
Director (user-facing, depth 0)
└── Orchestrator (pipeline owner, depth 1)
├─ Spawn ──→ Designer (depth 2)
├─ Spawn ──→ Reviewer (depth 2)
└─ Spawn ──→ Tester (depth 2)
The Director spawns a single Orchestrator subagent with the full task description. The Orchestrator owns the entire Design→Review→Test loop without yielding control between phases. User messages go to the Director; the pipeline runs uninterrupted.
Required config for Mode B:
{
"agents": { "defaults": { "subagents": { "maxSpawnDepth": 2 } } }
}
Why Mode B is superior for complex work: the Orchestrator owns the full loop in a single session, so worker announces return to it rather than to the user-facing Director, and the announce-to-action gap, context loss, and user-interruption failures described below cannot stall the pipeline.
Reference: orchestrator-subagent-pattern-2026-02-28.md (Obsidian notes) — documented after a real 70-minute pipeline stall incident.
When using Mode A, set a cron safety net after each spawn to catch announce-to-action failures:
"Check if [designer/reviewer/tester] subagent has completed. If so and next phase not started, resume pipeline."
(fires 15 min after spawn)
Designer → Reviewer ──pass──→ Tester ──pass──→ Ship
│ │
fail fail
│ │
▼ ▼
Designer revises Designer revises
│ │
▼ ▼
Reviewer Reviewer + Tester
│
(max 3 iterations, then fail)
Exit conditions: the loop ends when the skill passes both the Reviewer and Tester gates (ship), or after 3 failed iterations (fail).
After 3 failed iterations, the orchestrator must stop the pipeline, report the outstanding issues to the user, and recommend rescoping or redesigning the skill rather than continuing to iterate.
Never: Continue past 3 iterations or ship a skill that hasn't passed quality gates.
Version note: Verified against OpenClaw v2026.2.26. API may change.
In OpenClaw, subagents are spawned using the sessions_spawn tool (not a CLI command). Subagents run in isolated sessions, announce results back to the requester's channel when complete, and are auto-archived after 60 minutes by default.
Key constraints on subagents:
- Subagents cannot spawn their own subagents by default (only possible when maxSpawnDepth: 2 is configured)
- Subagents do not load SOUL.md, IDENTITY.md, or USER.md; they receive only AGENTS.md and TOOLS.md
- Set runTimeoutSeconds to prevent hanging (900s for Designer, 600s for Reviewer/Tester)
- Completion announces can be suppressed with ANNOUNCE_SKIP

This is the most important architectural decision. Understand it before proceeding.
The natural instinct is to have the main agent (you) directly manage the Design→Review→Test loop:
Main agent
├── spawns Designer → waits for announce → spawns Reviewer → waits → spawns Tester
This breaks in three ways:
Announce-to-action gap: When a subagent finishes, OpenClaw sends a completion announce that triggers a new LLM turn. The LLM may report results to the user and stop — treating the announce as informational rather than a pipeline trigger. There is no mechanism that forces the next action.
Context loss: Each new turn is a fresh LLM call. Between subagent completion and the next turn, there is no persistent state machine tracking "we're in iteration 2, reviewer passed, now run Tester." The agent must re-derive this from files every time — fragile over 3+ iterations.
User message interruption: If the user sends a message while the pipeline is between phases, the main agent handles that message instead of continuing. The pipeline stalls silently until the user notices.
Real incident: A book-writer pipeline stalled for 70 minutes because a research subagent completed and announced back, but the Director reported results to the user and stopped — never spawning the writing phase. (2026-02-28)
Add an intermediate Orchestrator subagent that owns the pipeline. The main agent becomes the Director — it talks to the user. The Orchestrator does the pipeline work. They don't share context.
Director (main agent, depth 0) ←→ User
│
└── Orchestrator (subagent, depth 1) — owns Design→Review→Test loop
├── Designer (depth 2)
├── Reviewer (depth 2)
└── Tester (depth 2)
Why this works: announces from the Designer, Reviewer, and Tester return to the Orchestrator, which runs the whole loop in one session without yielding between phases, while user messages go to the Director and never interrupt the pipeline.
Required config (add to openclaw.json before using this pattern):
{
"agents": { "defaults": { "subagents": { "maxSpawnDepth": 2 } } }
}
| Situation | Use | Why |
|---|---|---|
| Quick fix, single skill review, <10 min | Director-only (depth 1 subagents) | Simpler, fewer spawns |
| Full design cycle (Design+Review+Test) | Director + Orchestrator (depth 2) | Pipeline cannot afford to stall |
| Any pipeline with 3+ sequential phases | Director + Orchestrator (depth 2) | Announce-to-action gap becomes critical |
| maxSpawnDepth not set to 2 | Director-only with cron safety net | No choice — see fallback below |
If maxSpawnDepth: 2 is not configured, use Director-only mode but add a cron safety net after each subagent spawn:
After spawning Designer, register a cron job:
"Check if Designer has completed (look for output at /path/to/skill/SKILL.md).
If completed and Reviewer not yet started, spawn Reviewer now."
(fires 15 minutes after spawn)
This mitigates but does not eliminate the announce-to-action gap.
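A sketch of what the safety-net check does when it fires; file_exists and phase_started are hypothetical helpers, and how the check is registered depends on the cron mechanism available in your OpenClaw setup:

```python
# Fired ~15 minutes after spawning the Designer (Mode A fallback).
# Checks for the Designer's output and resumes the pipeline if it stalled.
def designer_safety_net():
    if file_exists("/path/to/skill/SKILL.md") and not phase_started("reviewer"):
        sessions_spawn(
            task="Act as Reviewer. Evaluate skill at /path/to/skill/ using rubric: [...].",
            label="skill-v1-reviewer",
            runTimeoutSeconds=600,
        )
```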
The Director (main agent) talks to the user and kicks off the pipeline. It does NOT do design, review, or testing work.
If needed, the Director grounds requirements in current API behavior via DeepWiki before handing off:
~/.openclaw/skills/deepwiki/deepwiki.sh ask openclaw/openclaw "RELEVANT QUESTION"
The Orchestrator (depth-1 subagent in Mode B, or main agent in fallback mode) owns the Design→Review→Test loop. It does NOT write skill content or evaluate quality — it only coordinates.
sessions_spawn(
task="Act as Designer. Requirements: [...]. Write artifacts to /path/to/skill/",
label="skill-v1-designer",
runTimeoutSeconds=900
)
sessions_spawn(
task="Act as Reviewer. Evaluate skill at /path/to/skill/ using rubric: [...]. Score all 33 checks.",
label="skill-v1-reviewer",
runTimeoutSeconds=600
)
sessions_spawn(
task="Act as Tester. Run self-play on skill at /path/to/skill/. Test triggers, functional steps, edge cases.",
label="skill-v1-tester",
runTimeoutSeconds=600
)
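Putting the three spawns together, a minimal sketch of the Orchestrator's iteration loop; wait_for, verdict_of, and report_failure are hypothetical helpers standing in for however your Orchestrator collects subagent results:

```python
MAX_ITERATIONS = 3
feedback = None

for i in range(1, MAX_ITERATIONS + 1):
    # Design (or revise, using feedback from the previous iteration)
    sessions_spawn(task=f"Act as Designer. Requirements: [...]. Prior feedback: {feedback}",
                   label=f"skill-v{i}-designer", runTimeoutSeconds=900)
    wait_for(f"skill-v{i}-designer")

    # Independent review
    sessions_spawn(task="Act as Reviewer. Evaluate skill at /path/to/skill/ using rubric: [...].",
                   label=f"skill-v{i}-reviewer", runTimeoutSeconds=600)
    review = wait_for(f"skill-v{i}-reviewer")
    if verdict_of(review) != "PASS":
        feedback = review
        continue  # back to the Designer

    # Empirical testing
    sessions_spawn(task="Act as Tester. Run self-play on skill at /path/to/skill/.",
                   label=f"skill-v{i}-tester", runTimeoutSeconds=600)
    test = wait_for(f"skill-v{i}-tester")
    if verdict_of(test) == "PASS":
        break  # ship
    feedback = test
else:
    report_failure("3 iterations exhausted without passing quality gates")
```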
Every shipped skill must include a quality scorecard in its README.md. These are the Reviewer's final scores, added by the Orchestrator before delivery:
## Quality Scorecard
| Category | Score | Details |
|----------|-------|---------|
| Completeness (SQ-A) | 7/7 | All checks pass |
| Clarity (SQ-B) | 4/5 | Minor ambiguity in edge case handling |
| Balance (SQ-C) | 4/4 | AI/script split appropriate |
| Integration (SQ-D) | 4/4 | Compatible with standard agent kit |
| Scope (SCOPE) | 3/3 | Clean boundaries, no leaks |
| OPSEC | 2/2 | No violations |
| References (REF) | 3/3 | All sources cited |
| Architecture (ARCH) | 2/2 | Separation of concerns maintained |
| **Total** | **29/30** | |
*Scored by skill-engineer Reviewer (iteration 2)*
This scorecard serves as a quality certificate. Users can assess skill quality before installing.
The orchestrator manages git commits throughout the workflow:
When to commit:
- After the initial design: git add . && git commit -m "feat: initial design for <skill-name>"
- After review fixes: git add . && git commit -m "fix: address review issues (iteration N)"
- After adding the quality scorecard: git add README.md && git commit -m "docs: add quality scorecard for <skill-name>"

When to push:
- Once the skill passes quality gates: git push origin main

Branch strategy:
The orchestrator must handle technical failures gracefully:
| Failure Type | Detection | Response |
|---|---|---|
| Git push fails | Exit code ≠ 0 | Retry once. If fails again, report to user: "Cannot push to remote. Check network/permissions." |
| OPSEC scan script missing | File not found | Skip OPSEC automated check, but flag in review: "Manual OPSEC review required — script not found." |
| File write errors | Permission denied | Report: "Cannot write to [path]. Check file permissions." Fail workflow. |
| Subagent crashes | Timeout or error | Log the error, attempt retry once. If fails again, report: "Subagent failed. Manual intervention required." |
| Review score = 0 | All checks fail | Report: "Skill failed all quality checks. Requirements may be unclear or skill design is fundamentally flawed. Recommend starting over." |
Retry logic: transient failures (git push, subagent crashes) are retried once; a second failure is reported to the user for manual intervention.
Fail-fast rules: file write errors and an all-checks-failed review end the workflow immediately instead of retrying.
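One way to encode the retry-once rule, as a sketch; run-step callables and the report function are placeholders for however the orchestrator executes a step and reports back:

```python
def run_with_single_retry(step, report):
    """Run one pipeline step; retry once on failure, then escalate to the user."""
    last_error = None
    for attempt in (1, 2):
        try:
            return step()
        except Exception as err:
            last_error = err
    report(f"Step failed twice ({last_error}). Manual intervention required.")
    return None
```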
Orchestrator workload: Coordinating Designer/Reviewer/Tester across 1-3 iterations can be complex, especially for large skills (1000+ lines). The orchestrator manages subagent spawns and timeouts, iteration state, feedback routing between roles, and git commits.
Token considerations: A full 3-iteration cycle can consume 50k-150k tokens depending on skill complexity.
If the orchestrator feels overwhelmed or the cycle becomes extremely token-heavy: this is a signal that the skill being designed may be too complex. Revisit the scope definition and consider decomposition.
Each subagent receives only what it needs:
| Role | Receives | Does NOT Receive |
|---|---|---|
| Designer | Requirements, prior feedback (if any), design principles | Reviewer rubric internals |
| Reviewer | Skill artifacts, quality rubric, scope boundaries | Requirements discussion |
| Tester | Skill artifacts, test protocol | Review scores |
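A sketch of how the Orchestrator might assemble each role's task payload so the isolation rules above hold; the field names are illustrative, not an OpenClaw schema:

```python
def build_payloads(requirements, rubric, test_protocol, prior_feedback=None):
    designer = {
        "role": "Designer",
        "requirements": requirements,
        "prior_feedback": prior_feedback,   # iterations 2+ only
        # no reviewer rubric internals
    }
    reviewer = {
        "role": "Reviewer",
        "artifacts_path": "/path/to/skill/",
        "rubric": rubric,
        # no requirements discussion
    }
    tester = {
        "role": "Tester",
        "artifacts_path": "/path/to/skill/",
        "protocol": test_protocol,
        # no review scores
    }
    return designer, reviewer, tester
```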
Purpose: Generate and revise skill content.
For complete Designer instructions, see: references/designer-guide.md
Inputs: Requirements, design principles, feedback (on iterations 2+)
Outputs: SKILL.md, skill.yml, README.md, tests/, scripts/, references/
Key constraints and design principles: detailed in references/designer-guide.md.
Purpose: Independent quality evaluation. The Reviewer has never seen the requirements discussion — it evaluates artifacts on their own merits.
For complete Reviewer rubric and scoring guide, see: references/reviewer-rubric.md
Inputs: Skill artifacts, quality rubric, scope boundaries
Outputs: Review report with scores, verdict (PASS/REVISE/FAIL), issues, strengths
Quality rubric (33 checks total): covers completeness (SQ-A), clarity (SQ-B), balance (SQ-C), integration (SQ-D), scope (SCOPE), OPSEC, references (REF), and architecture (ARCH).
Scoring thresholds: see references/reviewer-rubric.md.
Pre-review: Run deterministic validation scripts before manual evaluation
Purpose: Empirical validation via self-play. The Tester loads the skill and attempts realistic tasks.
For complete Tester protocol, see: references/tester-protocol.md
Inputs: Skill artifacts, test protocol
Outputs: Test report with trigger accuracy, functional test results, edge cases, blocking/non-blocking issues, verdict (PASS/FAIL)
Test protocol and issue classification (blocking vs. non-blocking): see references/tester-protocol.md.
Pass criteria: No blocking issues + ≥90% trigger accuracy
The agent that DESIGNS a skill must NOT be the same agent that AUDITS it in the same session.
This is a hard architectural rule, not a guideline. When the same agent designs and audits in one session, it creates structural circularity: the designer unconsciously frames evaluation in terms of their own intentions, missing gaps that a fresh reader would catch.
Enforcement:
- Spawn a fresh session (openclaw agent --session-id <unique-id>, Option 2 spawning) when auditing a skill the current session has designed.

Why this matters: as explained above, a designer auditing its own work frames the evaluation around its own intentions and misses the gaps a fresh reader would catch.
Example — correct:
# Session A: Designer work
sessions_spawn(
task="Design a skill for X. Write artifacts to /path/to/skill/",
label="skill-v1-designer",
runTimeoutSeconds=900
)
# Session B: Audit (fresh session, no context from Session A)
sessions_spawn(
task="Audit the skill at /path/to/skill/ using the reviewer rubric.",
label="skill-v1-auditor",
runTimeoutSeconds=600
)
Example — incorrect:
[Session A]
1. Design the skill...
2. Now let me review the skill I just designed... ← VIOLATION
Source: Anthropic "Improving skill-creator" (2026-03-03). Adapted for OpenClaw skill-engineer.
Evals turn "seems to work" into "verified to work." Every shipped skill should have persistent evals that can be re-run after model updates, skill edits, or environment changes.
An eval consists of a task prompt, the expected outcome or success criteria, and a repeatable pass/fail check.
Store evals in the skill's tests/ directory:
tests/
├── evals.json # Eval definitions
├── benchmarks/ # Benchmark run results (timestamped)
└── comparisons/ # A/B comparison results
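A minimal sketch of what tests/evals.json might hold; the field names are illustrative and should be adapted to whatever harness runs your evals:

```json
{
  "skill": "my-skill",
  "evals": [
    {
      "id": "trigger-01",
      "type": "trigger",
      "prompt": "Generate a PDF summary of this quarterly report",
      "expect": "skill loads"
    },
    {
      "id": "regression-01",
      "type": "regression",
      "prompt": "Create a two-page PDF with a table of contents",
      "pass_criteria": "output is a valid PDF containing a table of contents"
    },
    {
      "id": "capability-01",
      "type": "capability",
      "prompt": "Create a two-page PDF with a table of contents",
      "run_without_skill": true,
      "pass_criteria": "same as regression-01, used to detect retirement candidates"
    }
  ]
}
```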
| Type | Purpose | When to run |
|---|---|---|
| Regression eval | Catch quality drops after changes | After every skill edit or model update |
| Capability eval | Detect if base model has outgrown the skill | Monthly, or after major model releases |
| Trigger eval | Verify skill fires correctly | After description changes |
Run standardized assessments tracking pass rate, average completion time, and average token usage.
Store benchmark results with timestamps for trend tracking:
{
"timestamp": "2026-03-04T12:00:00Z",
"skill": "my-skill",
"model": "claude-sonnet-4-5",
"pass_rate": 0.85,
"avg_time_s": 12.3,
"avg_tokens": 4200,
"evals_run": 10
}
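A sketch of how stored benchmarks could be compared to spot regressions, assuming benchmark files are named so they sort chronologically; the paths and the 5-point threshold are illustrative:

```python
import json
from pathlib import Path

# Compare the two most recent benchmark runs stored under tests/benchmarks/.
runs = sorted(Path("tests/benchmarks").glob("*.json"))
if len(runs) >= 2:
    prev = json.loads(runs[-2].read_text())
    curr = json.loads(runs[-1].read_text())
    drop = prev["pass_rate"] - curr["pass_rate"]
    if drop > 0.05:  # flag anything worse than a 5-point drop
        print(f"Regression: pass rate fell from {prev['pass_rate']:.0%} "
              f"to {curr['pass_rate']:.0%}")
```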
Compare two skill versions — or skill vs. no skill — using a blind judge:
When to use: before retiring a capability uplift skill (skill vs. no skill), or when deciding between two revisions of the same skill.
Spawning a Comparator:
sessions_spawn(
task="You are a blind comparator. You will receive Output A and Output B for the same task. Score each on [dimensions]. You do NOT know which version produced which output. Be objective.",
label="skill-comparator",
runTimeoutSeconds=300
)
Skill descriptions determine trigger accuracy. As skill count grows, description precision becomes critical:
Tuning protocol: run a set of sample prompts that should and should not trigger the skill, measure trigger accuracy, then revise the description field to be more precise and re-test.
Target: ≥90% trigger accuracy on sample prompts. Anthropic's internal testing improved 5 out of 6 public skills using this method.
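A sketch of the accuracy measurement behind that protocol; skill_triggered is a hypothetical check for whether the model chose to load the skill on a given prompt, and the sample prompts are illustrative:

```python
samples = [
    ("Analyze these competitor websites and produce a comparison table", True),
    ("What's the weather in Berlin tomorrow?", False),
]

hits = sum(1 for prompt, should_trigger in samples
           if skill_triggered(prompt) == should_trigger)
accuracy = hits / len(samples)
print(f"trigger accuracy: {accuracy:.0%}")  # target: >= 90%
```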
Skills are not forever. Capability uplift skills may become unnecessary as models improve.
Retirement signal: Base model passes ≥80% of the skill's evals without the skill loaded.
Retirement process: run the skill's capability evals against the base model without the skill loaded, confirm with a blind Comparator run (no significant difference), then archive the skill and record the retirement in the next audit report.
Track in audit reports:
## Retirement Candidates
| Skill | Capability Eval (no skill) | Comparator Result | Recommendation |
|-------|---------------------------|-------------------|----------------|
| pdf-creator | 85% pass | No significant difference | Retire |
Periodic full audit of the agent kit:
# Agent Kit Audit Report
**Date:** [date]
**Skills audited:** [count]
## Skill Inventory
| # | Skill | Agent | Quality Score | Status |
|---|-------|-------|--------------|--------|
| 1 | [name] | [agent] | X/33 | Deploy/Revise/Redesign |
## Issues Found
1. ...
## Recommendations
1. ...
## Action Items
| # | Action | Priority | Owner |
|---|--------|----------|-------|
Maintain a map of how skills interact:
orchestrator-agent (coordinates workflow)
├── content-creator (writes content)
│ └── consumes: research outputs, review feedback
├── content-reviewer (reviews content)
│ └── produces: review reports
├── research-analyst (researches topics)
│ └── produces: research consumed by content-creator
├── validator (validates outputs)
└── skill-engineer (this skill — meta)
└── consumes: all skills for audit
Adapt this to your specific agent architecture.
Version note: This section is based on OpenClaw v2026.2.26. Skill system behavior (frontmatter fields, loading precedence, subagent APIs) may change across versions. Verify against source or DeepWiki when upgrading.
A skill is a directory containing at minimum a SKILL.md file:
my-skill/
├── SKILL.md # Required: frontmatter + instructions
├── skill.yml # Optional: ClawhHub publish metadata
├── README.md # Optional: human-facing documentation
├── scripts/ # Optional: deterministic helper scripts
├── tests/ # Optional: test cases and fixtures
└── references/ # Optional: detailed linked documentation
Required fields:
---
name: skill-name # kebab-case, no spaces/capitals/underscores
description: | # What it does + when to use it + trigger phrases
[What it does]. Use when user [trigger phrases]. [Key capabilities].
---
Full supported fields:
---
name: skill-name
description: ...
homepage: https://... # URL for Skills UI
user-invocable: true # Expose as slash command (default: true)
disable-model-invocation: false # Exclude from model prompt (default: false)
command-dispatch: tool # Bypass model, dispatch to tool directly
command-tool: tool-name # Tool to invoke when command-dispatch is set
command-arg-mode: raw # Argument forwarding mode (default: raw)
metadata: {"openclaw": {"always": true, "emoji": "🔧", "os": ["darwin","linux"], "requires": {"bins": ["curl","python3"]}, "primaryEnv": "MY_API_KEY"}}
---
metadata.openclaw load-time gates:
| Field | Purpose |
|---|---|
| always: true | Always include, skip all other gates |
| emoji | Emoji shown in macOS Skills UI |
| os | Limit to platforms: darwin, linux, win32 |
| requires.bins | All binaries must exist on PATH |
| requires.anyBins | At least one binary must exist |
| requires.env | Environment variables must exist |
| requires.config | openclaw.json paths must be truthy |
| primaryEnv | Links to skills.entries.<name>.apiKey in config |
Skills are loaded from these locations (highest → lowest priority):
1. <workspace>/skills/ — agent-specific, highest precedence
2. ~/.openclaw/skills/ — shared across all agents on machine
3. skills.load.extraDirs in openclaw.json — additional directories
4. Skills bundled by plugins via openclaw.plugin.json

| Location | Use when |
|---|---|
| <workspace>/skills/ | Skill is specific to one agent's role; under active development |
| ~/.openclaw/skills/ | Skill should be available to all agents on this machine |
OpenClaw builds a system prompt with a compact XML list of available skills (name, description, location). The model reads this list and decides which skills to load. Skills are NOT auto-injected — the model must explicitly read the SKILL.md when needed.
Trigger accuracy goal: ≥90% (skill loads when relevant, does NOT load when irrelevant).
To inventory all skills on a machine:
find ~/.openclaw/ -name "SKILL.md" -not -path "*/node_modules/*" | sort
No persistent configuration required. The skill uses tools available in the agent's environment.
| Requirement | Description |
|---|---|
| deepwiki skill | Query OpenClaw source for current API behavior (liaosvcaf/openclaw-skill-deepwiki) |
| Vector memory | Semantic search across session history (memory.enabled: true in openclaw.json) |
| gh CLI | GitHub repo creation and visibility changes for release pipeline |
| clawhub CLI | Publish skills to ClawhHub registry (npm i -g clawhub) |
See references/designer-guide.md for full environment setup.