Install
openclaw skills install autoresearch-skill-optimizer
Auto-improve any OpenClaw skill using Karpathy's autoresearch loop. Runs skill repeatedly against test inputs, scores against a yes/no checklist, makes one targeted change, keeps if better, reverts if worse. Also audits skill structure against Anthropic's best practices (progressive disclosure, gotchas section, trigger-phrase description). Use when asked to "optimize this skill", "improve my skill", "run autoresearch on", "audit this skill", or before running any skill at scale (e.g., cold outreach). Based on Ole Lehmann's autoresearch method + Anthropic internal skill patterns (@trq212).
Two-phase improvement system: (1) structural audit against Anthropic best practices, (2) iterative output quality loop.
Before optimizing output quality, audit the skill's architecture. Score against these 5 structural checks:
Structural Checklist:
- Does it have a `## Gotchas` section with at least one real failure case? (Highest-signal content per Anthropic)
- Does the description field say when to use the skill, not just what it does? Must include "Use when..." or equivalent trigger condition.

Score each: ✅ pass | ❌ fail | ⚠️ partial
For each failure: propose a concrete fix and apply if approved.
Quick wins to apply immediately:
- `## Gotchas\n- [Placeholder: add real failures here as they're discovered]`
- references/ folder structure

After the structural audit, run the iterative improvement loop on the skill's actual outputs.
See references/checklist-examples.md for starter checklists by skill type (cold outreach, content, research, extraction, process/meta-skills).
Binary mode (default for simple skills): Yes/no per checklist item. Pass rate = total yes / (items × runs).
Dimensional mode (use for complex skills or when binary plateaus): Score each dimension 0-10. Identify the weakest dimension (lowest average across runs). Target that dimension for revision — do NOT rewrite everything.
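The two scoring modes can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names `binary_pass_rate` and `weakest_dimension` and the input shapes (lists of booleans, dicts of scores) are assumptions made for the example.

```python
from statistics import mean


def binary_pass_rate(runs):
    """Binary mode: runs is a list of runs, each a list of booleans
    (one yes/no answer per checklist item).

    Returns total yes / (items x runs).
    """
    total_yes = sum(sum(1 for answer in run if answer) for run in runs)
    total_checks = sum(len(run) for run in runs)
    return total_yes / total_checks if total_checks else 0.0


def weakest_dimension(runs):
    """Dimensional mode: runs is a list of dicts mapping a dimension
    name to its 0-10 score for that run.

    Returns (name, average) for the dimension with the lowest average.
    """
    averages = {dim: mean(run[dim] for run in runs) for dim in runs[0]}
    weakest = min(averages, key=averages.get)
    return weakest, averages[weakest]
```

With two runs of a two-item checklist scoring `[True, False]` and `[True, True]`, the pass rate is 3 / 4 = 0.75. In dimensional mode, only the returned dimension would be targeted in the next revision.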
Use dimensional mode when:
Round N:
1. Run skill against each test input
2. Score each output (binary: 1 per yes | dimensional: 0-10 per dimension)
3. Calculate score:
- Binary: pass rate = (total yes) / (items × runs)
- Dimensional: avg score per dimension across runs
4. Identify the weakest item/dimension (most failures or lowest avg score)
5. Make ONE targeted change to SKILL.md addressing ONLY that weakness
6. Re-run and re-score
7. If new score > old score: KEEP. Else: REVERT.
8. Log: score before/after, change made, dimension targeted, kept/reverted
Stop when: binary ≥ 95% (3 consecutive rounds) OR dimensional weakest ≥ 8/10 (3 consecutive) OR 20 rounds reached.
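The round structure and stop rule above can be sketched as a keep-if-better loop. This is a hedged sketch for binary mode only: `evaluate` and `propose_change` are hypothetical callbacks standing in for "run and score the skill" and "make one targeted change", and counting the three consecutive rounds from the kept score is one reading of the stop condition, not the only possible one.

```python
def optimize(skill_text, evaluate, propose_change,
             target=0.95, patience=3, max_rounds=20):
    """Keep-if-better loop (binary mode).

    evaluate(skill_text) -> pass rate in [0, 1]       (hypothetical callback)
    propose_change(skill_text) -> revised skill text  (hypothetical callback)
    """
    best_score = evaluate(skill_text)
    log = []
    streak = 0  # consecutive rounds with the kept score at or above target
    for round_no in range(1, max_rounds + 1):
        candidate = propose_change(skill_text)
        new_score = evaluate(candidate)
        kept = new_score > best_score  # strictly better, otherwise revert
        log.append({"round": round_no, "before": best_score,
                    "after": new_score, "kept": kept})
        if kept:
            skill_text, best_score = candidate, new_score
        streak = streak + 1 if best_score >= target else 0
        if streak >= patience:
            break
    return skill_text, log
```

A reverted round leaves `skill_text` unchanged, so a bad change costs one round and nothing else. The log entries carry the before/after score and the decision, which is the minimum needed to write the changelog.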
- skills/{skill-name}/SKILL-optimized.md: improved version (original untouched)
- skills/{skill-name}/optimization-changelog.md: full round log

## Structural Audit
- Gotchas section: ❌ → Added placeholder
- Description: ❌ → Rewritten as trigger condition
- Progressive disclosure: ⚠️ → Noted, deferred
## Round 1 (binary mode)
- Score: 4/10 (40%)
- Weakest item: "Does it mention business name?"
- Change: Added rule "Always open with [Business Name],"
- New score: 7/10 (70%)
- Decision: KEPT
## Round 2 (dimensional mode)
- Scores: Accuracy 8/10 | Tone 5/10 | Brevity 9/10 | Relevance 7/10
- Weakest dimension: Tone (5/10)
- Change: Added "Match prospect's industry language, not generic sales speak"
- New scores: Accuracy 8/10 | Tone 7/10 | Brevity 9/10 | Relevance 7/10
- Decision: KEPT (Tone +2)
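For consistency between rounds, a changelog entry in the format shown above could be generated rather than typed by hand. The helper below is a hypothetical sketch for binary-mode rounds only; the name `format_binary_round` and its parameters are assumptions, not part of the skill.

```python
def format_binary_round(round_no, before, after, total, weakest_item, change):
    """Render one binary-mode round as a changelog entry.

    before/after are counts of "yes" answers out of total checks.
    """
    decision = "KEPT" if after > before else "REVERTED"
    return "\n".join([
        f"## Round {round_no} (binary mode)",
        f"- Score: {before}/{total} ({before / total:.0%})",
        f'- Weakest item: "{weakest_item}"',
        f"- Change: {change}",
        f"- New score: {after}/{total} ({after / total:.0%})",
        f"- Decision: {decision}",
    ])
```

Calling it with the Round 1 values from the example (4 and 7 out of 10) reproduces that entry, including the KEPT decision.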
Some skills don't produce text — they drive a process (e.g., this skill itself, planning workflows, research pipelines). For these:
What to score: Score the experience of following the process, not a text artifact.
How to test: Run the skill on 2-3 real tasks (not hypothetical). Score after each real use. The test inputs ARE the tasks you're applying the skill to.
Dimensional scoring for process skills: