Install
openclaw skills install karpathy-autoresearchAutonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.
openclaw skills install karpathy-autoresearchAutonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.
Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.
Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ 1. BASELINE │────▶│ 2. MUTATE │────▶│ 3. EVALUATE │────▶│ 4. DECIDE │
│ Score the │ │ Change one │ │ Run scoring │ │ Better? │
│ current │ │ thing │ │ function │ │ Keep : Revert│
│ version │ │ │ │ │ │ │
└─────────────┘ └─────────────┘ └─────────────┘ └──────┬───────┘
│
Loop back to 2
The mutable file is the thing you're optimizing. It can be:
Create or identify this file. Example:
my-skill/
├── SKILL.md ← this is your mutable file
├── eval/
│ ├── test_cases.json
│ └── score.py
Your eval function must:
The eval can be anything:
Template eval function (customize for your domain):
# eval/score.py
import json
import sys
def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
"""
Score the current version of the mutable file.
Returns a float — higher is better.
"""
with open(mutable_file_path) as f:
current_version = f.read()
with open(test_cases_path) as f:
test_cases = json.load(f)
scores = []
for case in test_cases:
# YOUR SCORING LOGIC HERE
# Example: run the prompt, compare output to expected
score = run_and_score(current_version, case)
scores.append(score)
return sum(scores) / len(scores)
if __name__ == "__main__":
score = evaluate(sys.argv[1], sys.argv[2])
print(f"SCORE: {score}")
The loop follows this exact pattern:
1. Git init (if not already) — every experiment is a commit
2. Run eval on current version → get BASELINE score
3. For each experiment (1..N):
a. Read the current mutable file
b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
c. Write the mutated version
d. Run eval → get NEW score
e. If NEW > BASELINE:
- Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
- Update BASELINE = NEW
- Log: "✅ KEPT — improvement"
f. If NEW <= BASELINE:
- Git checkout the mutable file (revert)
- Log: "❌ REVERTED — no improvement"
4. Print final summary: experiments run, improvements found, final score
When the user says "run autoresearch on X", follow this procedure:
Good mutations change ONE thing at a time:
Bad mutations change everything at once — you can't learn what worked.
Every experiment MUST be tracked in git:
# Before starting
git init
git add -A
git commit -m "baseline: score {X}"
# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"
# After each failed mutation
git checkout -- {mutable_file}
This gives you:
references/gold-results.mdreferences/youtube-results.mdUser: "Run autoresearch on my email subject line skill"
Agent workflow:
User: "Optimize my trading strategy config"
Agent workflow: