🌿 Skill Garden — Skill Evolution Engine
"Every skill should get better the more you use it."
Philosophy
Skill Garden treats skill improvement as a continuous, invisible process — not a special operation. It runs passively in the background, accumulating observations from every skill invocation, then periodically synthesizes them into concrete improvements.
Key design principles:
- Token-efficient: Lightweight structured logs, batch processing, no real-time overhead
- User-in-control: High-confidence changes auto-apply; uncertain ones ask first
- Transparent: Every change is explained; nothing happens silently
- Self-contained: Manages its own memory, dashboard, proposals, and cron schedule
Three-Layer Architecture
Layer 1 — Passive Observation (near-zero token cost)
- Every skill invocation → 1-line structured log entry
- Abnormal outcomes (FAIL/SLOW) → detailed log with evidence
Layer 2 — Weekly Batch Analysis (isolated agent, ~5-15 min)
- Read all accumulated logs
- Run the evaluation engine across the five dimensions
- Generate specific improvement proposals
Layer 3 — Targeted Modification (low frequency, high precision)
- Confidence ≥ 90% → apply immediately, notify user
- Confidence 70–89% → apply with [experimental] tag
- Confidence 50–69% → write to proposals, ask user
- Confidence < 50% → log as observation only
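As a minimal sketch of this routing (the function and action names here are hypothetical, not part of the shipped scripts):

```python
# Hypothetical sketch of the Layer 3 routing; thresholds match the tiers above.
def route_proposal(confidence: int) -> str:
    """Map a proposal's confidence (0-100) to the action Skill Garden takes."""
    if confidence >= 90:
        return "apply_and_notify"        # edit SKILL.md, then tell the user
    if confidence >= 70:
        return "apply_experimental"      # edit, tagged [experimental]
    if confidence >= 50:
        return "queue_for_review"        # write to improvement_proposals.md, ask user
    return "log_observation"             # record the signal, change nothing
```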
When This Skill Activates
Trigger 1 — After every skill use (automatic, passive):
When any skill finishes executing, immediately log the outcome using the format in references/usage_tracker.md. This is the most important layer — it costs almost nothing and feeds everything else.
Trigger 2 — On user request:
- "grow this skill" / "improve skill" / "optimize skill"
- "why did this fail" / "analyze this skill"
- "run skill analysis" / "check my skills"
- "skill health" / "skill dashboard"
Trigger 3 — On schedule (automatic, every Sunday 20:00):
An isolated agent runs batch_analyze.py and generate_report.py, applies high-confidence improvements, and sends you a summary.
Layer 1: Passive Observation
After any skill finishes (any outcome: OK, FAIL, PARTIAL, SLOW, SKIP), immediately write a structured log. Use log_insight.py or write directly to the skill's references/usage_log.md.
Log Entry Format
For OK outcomes with nothing notable (minimal tokens):
```
## YYYY-MM-DD HH:MM
Trigger: [trigger in ≤10 words]
Outcome: OK
Signal: [one-line finding or "No issues"]
```
For PARTIAL, FAIL, SLOW outcomes (always log all fields):
```
## YYYY-MM-DD HH:MM

### Trigger
[What the user asked for, ≤10 words]

### Outcome
OK | PARTIAL | FAIL | SLOW | SKIP

### Signal
[One specific phrase: what this tells us about the skill]

Examples:
- "Covered: standard use case works perfectly"
- "Missing: error handling for network timeouts"
- "Ambiguous: step 3 could be interpreted two ways"
- "Outdated: API version in skill doesn't match current"

### Evidence
[1-2 sentences. Quote or paraphrase exact output/error. Be specific.]

### Flags
[One or more tags: [new_trigger] [missing_coverage] [confusing_step]
[outdated_info] [token_heavy] [edge_case] [user_workaround_used]
[config_stale] [api_change] [Covered] [success_boost]]
```
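For illustration, the append logic behind entries like these can be as simple as the following sketch; `write_log_entry` is a hypothetical helper that mirrors the schema above, not necessarily how log_insight.py is implemented:

```python
from datetime import datetime
from pathlib import Path

def write_log_entry(skill_dir: str, trigger: str, outcome: str,
                    signal: str, evidence: str = "", flags: str = "") -> None:
    """Append a structured entry to the skill's references/usage_log.md."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    if outcome == "OK" and not evidence:
        # Minimal form for unremarkable successes.
        entry = f"\n## {stamp}\nTrigger: {trigger}\nOutcome: OK\nSignal: {signal}\n"
    else:
        # Full form for PARTIAL / FAIL / SLOW (and notable OKs).
        entry = (f"\n## {stamp}\n### Trigger\n{trigger}\n### Outcome\n{outcome}\n"
                 f"### Signal\n{signal}\n### Evidence\n{evidence}\n### Flags\n{flags}\n")
    log = Path(skill_dir) / "references" / "usage_log.md"
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a", encoding="utf-8") as f:
        f.write(entry)
```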
Using log_insight.py
```bash
# Quick OK log (minimal)
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/log_insight.py \
  --skill github-trending-summary \
  --trigger "daily top 5 repos" \
  --outcome OK \
  --signal "Covered: standard case"

# Detailed failure log
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/log_insight.py \
  --skill banxuebang-helper \
  --trigger "check homework" \
  --outcome FAIL \
  --signal "Missing: semester selector not dynamic" \
  --evidence "Config hardcoded to 2024-2025 but API shows 2025-2026 is current." \
  --flags "missing_coverage,config_stale" \
  --mark-landmark "SkillImproved"
```
Rule of thumb: If you had to pause, reconsider, or work around something — log it with full detail. If it just worked perfectly — log minimally. The goal is signal, not noise.
Layer 2: Weekly Batch Analysis
Run manually or wait for the Sunday cron trigger.
Manual Trigger
Say: "run skill analysis" or "grow all skills"
The analysis does the following in order:
- Scan all skills — read every `references/usage_log.md` (see the sketch after this list)
- Evaluate each skill across the five dimensions (see `references/evaluation_engine.md`)
- Generate proposals — one for each skill scoring below threshold
- Apply high-confidence changes — auto-edit SKILL.md for confident improvements
- Update dashboard — rewrite `references/dashboard.md`
- Notify user — send a summary message
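A minimal sketch of the scanning step, assuming skills live under ~/.openclaw/workspace/skills; the real scoring and proposal generation in batch_analyze.py build on tallies like these:

```python
import re
from pathlib import Path

SKILLS_ROOT = Path.home() / ".openclaw" / "workspace" / "skills"  # assumed layout

def scan_usage_logs() -> dict[str, dict[str, int]]:
    """Tally outcome counts per skill from each references/usage_log.md."""
    tallies: dict[str, dict[str, int]] = {}
    for skill_dir in sorted(SKILLS_ROOT.iterdir()):
        log = skill_dir / "references" / "usage_log.md"
        if not log.is_file():
            continue  # skill has no observations yet
        text = log.read_text(encoding="utf-8")
        counts: dict[str, int] = {}
        # Matches both "Outcome: OK" (minimal form) and "### Outcome\nFAIL" (full form).
        for outcome in re.findall(r"Outcome:?\s*(OK|PARTIAL|FAIL|SLOW|SKIP)", text):
            counts[outcome] = counts.get(outcome, 0) + 1
        tallies[skill_dir.name] = counts
    return tallies
```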
Running Scripts Directly
```bash
# Full batch analysis (evaluate all skills, generate proposals)
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/batch_analyze.py

# Analyze one skill only
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/batch_analyze.py --skill github-trending-summary

# Dry run (proposals only, don't apply)
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/batch_analyze.py --dry-run --min-confidence 70

# Generate/refresh dashboard
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/generate_report.py

# Output as JSON (for integrations)
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/generate_report.py --output json
```
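If another tool consumes the JSON output, something like the following works; note that the field names in the parsed report are assumptions here, so inspect the real output first:

```python
import json
import subprocess
from pathlib import Path

SCRIPT = (Path.home() / ".openclaw" / "workspace" / "skills"
          / "skill-garden" / "scripts" / "generate_report.py")

# Run the generator in JSON mode and parse its stdout.
result = subprocess.run(
    ["python3", str(SCRIPT), "--output", "json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

# "skills" is an assumed top-level key; adapt to the actual schema.
for name, info in report.get("skills", {}).items():
    print(name, info)
```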
Layer 3: Applying Improvements
The Five Evaluation Dimensions
| Dimension | Weight | What It Measures |
|---|---|---|
| Coverage | 30% | Does the skill's description match how it's actually used? |
| Completeness | 25% | Are all necessary steps present? Do FAIL events reveal missing coverage? |
| Clarity | 20% | Are steps unambiguous? Are there [confusing_step] or [user_workaround_used] flags? |
| Currency | 15% | Is the information still accurate? Are there [outdated_info] or [config_stale] flags? |
| Efficiency | 10% | Is it unnecessarily verbose or token-heavy? |
See references/evaluation_engine.md for the full evaluation algorithm, scoring thresholds, and confidence calibration guide.
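As a worked sketch of how the weights combine, assuming each dimension is scored 0-100 and the overall score is their weighted sum (the reference file holds the authoritative algorithm):

```python
# Weights from the table above; per-dimension scores are assumed to be 0-100.
WEIGHTS = {
    "coverage": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "currency": 0.15,
    "efficiency": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of the per-dimension scores."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Example: a skill that is strong everywhere except currency.
print(overall_score({"coverage": 90, "completeness": 85, "clarity": 80,
                     "currency": 40, "efficiency": 95}))  # 79.75
```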
Applying an Edit to SKILL.md
When a proposal meets the confidence threshold:
- Read the current SKILL.md
- Identify the exact text to replace using the `edit` tool
- Write the improved version
- Add a brief changelog note at the top of the edit:
```
<!-- Auto-improved by Skill Garden: YYYY-MM-DD
Reason: [confidence]% confidence — [evidence summary] -->
```
- Update `references/improvement_proposals.md` to mark the proposal as applied
- Notify the user with a summary of what changed
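A small sketch of building that changelog note in code; the helper name and the sample evidence string are illustrative:

```python
from datetime import date

def changelog_note(confidence: int, evidence: str) -> str:
    """Build the comment that precedes an auto-applied edit in SKILL.md."""
    return (f"<!-- Auto-improved by Skill Garden: {date.today():%Y-%m-%d}\n"
            f"Reason: {confidence}% confidence — {evidence} -->\n")

# Usage: prepend the note to the improved text before writing it back.
improved_section = "Resolve the current semester from the API instead of config.\n"
print(changelog_note(92, "3 FAIL logs traced to a stale semester config") + improved_section)
```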
Editing Checklist
Before applying any edit:
- Confirm the proposal's confidence meets the threshold for the intended action (auto-apply, [experimental], or ask first)
- Re-read the current SKILL.md so the replacement targets the exact existing text
- Add the changelog note with date, confidence, and evidence summary
- Mark the proposal as applied in `references/improvement_proposals.md`
- Notify the user with a summary of what changed
Dashboard
The dashboard (references/dashboard.md) shows:
- Overall skill ecosystem health
- Per-skill scores across all five dimensions
- Recent signals and flags
- Pending proposals
- Weekly outcome distribution
- Recent landmark events
Regenerate with:
```bash
python3 ~/.openclaw/workspace/skills/skill-garden/scripts/generate_report.py
```
Cron Setup
Set up a weekly Sunday 20:00 analysis run:
```bash
openclaw cron add \
  --name "Skill Garden Weekly Analysis" \
  --schedule '{"kind":"cron","expr":"0 20 * * 0","tz":"Asia/Shanghai"}' \
  --sessionTarget isolated \
  --payload '{"kind":"agentTurn","message":"Run full skill analysis: execute batch_analyze.py then generate_report.py and notify me of any improvements applied.","timeoutSeconds":900}'
```
Or trigger it manually by saying "grow all skills".
Reference Files
| File | Purpose |
|---|---|
| `references/usage_tracker.md` | Full logging schema, log rotation, landmark events |
| `references/evaluation_engine.md` | Scoring algorithm, thresholds, confidence calibration |
| `references/improvement_examples.md` | Real improvement examples with full context |
| `references/dashboard.md` | Auto-generated growth dashboard (don't edit manually) |
| `references/master_log.md` | Skill Garden's own lightweight master log |
| `scripts/log_insight.py` | CLI for writing log entries |
| `scripts/batch_analyze.py` | Core analysis engine |
| `scripts/generate_report.py` | Dashboard generator |
Examples in Context
Example 1: Passive Log — Skill Works Perfectly
User: "给我推荐一首今天听的歌"
Skill: daily-song-recommender — succeeds, email sent
Your action: Write one-line abbreviated log:
## 2026-04-22 09:15
Trigger: daily song recommendation
Outcome: OK
Signal: Covered: standard use case, no issues
Example 2: Passive Log — Skill Fails Due to Missing Coverage
User: "查一下这周的数学作业"
Skill: banxuebang-helper — API error: semester field stale
Your action: Write full structured log:
## 2026-04-22 21:30
### Trigger
banxuebang: check math homework
### Outcome
FAIL
### Signal
Missing: semester config hardcoded, not dynamic
### Evidence
API returned {"code": 500, "msg": "semester not found"}.
Config in skill has "2024-2025下学期" but actual is "2025-2026下学期".
### Flags
[missing_coverage] [config_stale]
Example 3: Passive Log — New Trigger Not in Description
User: "爬一下GitHub今日热榜发到我邮箱"
Skill: github-trending-summary — works, but user's phrasing not in description
Your action: Write log with [new_trigger] flag:
## 2026-04-22 10:00
### Trigger
github-trending-summary: crawl and email
### Outcome
OK
### Signal
Covered: but user said "爬" not in description triggers
### Evidence
Skill handled it fine, but description doesn't mention "爬" as a trigger phrase.
### Flags
[new_trigger]
Example 4: User Requests Analysis
User: "run skill analysis"
Your action:
- Run `batch_analyze.py --dry-run`
- Read the proposals from the output
- Apply high-confidence changes (≥90%) via the `edit` tool
- Run `generate_report.py` to refresh the dashboard
- Message the user: "Found N improvement(s) — applied X automatically, Y need your review"
Example 5: Weekly Cron Fires (Sunday 20:00)
Isolated agent runs the full cycle:
- `batch_analyze.py` scans all 20 installed skills
- Finds `github-trending-summary`: 1 [new_trigger] flag, coverage 66%
- Generates a proposal with 65% confidence → written to proposals
- Updates dashboard
- You receive: "🌿 Weekly analysis done. github-trending-summary needs description update (65% confidence — needs more data to auto-apply). Review?"