# autoresearch

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against an evaluation function, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.
## Triggers

Use when: "optimize this skill", "improve this skill", "run autoresearch on", "make this skill better", "self-improve skill", "benchmark skill", "eval my skill", "run evals on".
## Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern: mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.
## How It Works

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│ 1. BASELINE │────▶│ 2. MUTATE   │────▶│ 3. EVALUATE │────▶│ 4. DECIDE    │
│ Score the   │     │ Change one  │     │ Run scoring │     │ Better?      │
│ current     │     │ thing       │     │ function    │     │ Keep : Revert│
│ version     │     │             │     │             │     │              │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                           ▲                                       │
                           └────────── Loop back to 2 ─────────────┘
```
## Instructions

### Step 1: Identify the Mutable File
The mutable file is the thing you're optimizing. It can be:
- A SKILL.md prompt/instructions
- A trading strategy config (thresholds, parameters)
- A content template (YouTube script format, ad copy structure)
- Any text file where changes produce measurable differences
Create or identify this file. Example:
```
my-skill/
├── SKILL.md            ← this is your mutable file
└── eval/
    ├── test_cases.json
    └── score.py
```
### Step 2: Create an Evaluation Function
Your eval function must:
- Take the current mutable file as input
- Run it against test cases
- Return a numeric score (higher = better)
The eval can be anything:
- LLM-as-judge: Send output to an LLM, ask it to score 1-100
- Backtest: Run a strategy against historical data, measure Sharpe/returns
- A/B metrics: CTR, engagement, conversion rate
- Binary pass/fail: Count how many test cases pass out of N
Template eval function (customize for your domain):

```python
# eval/score.py
import json
import sys

def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float — higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()
    with open(test_cases_path) as f:
        test_cases = json.load(f)

    scores = []
    for case in test_cases:
        # YOUR SCORING LOGIC HERE — define run_and_score for your domain.
        # Example: run the prompt, compare output to expected
        score = run_and_score(current_version, case)
        scores.append(score)
    return sum(scores) / len(scores)

if __name__ == "__main__":
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")
```
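As one concrete instance of the binary pass/fail eval, here is a possible `run_and_score` (a sketch, not a fixed API: it treats a test case as passing when every expected keyword appears in the output, and the model/strategy call is stubbed because it is domain-specific):

```python
def run_and_score(current_version: str, case: dict) -> float:
    """Binary pass/fail: 1.0 if every expected keyword appears in the output."""
    # Placeholder: in a real eval, run `current_version` (the prompt or
    # strategy) against the case's input and capture the actual output here.
    output = case.get("output", current_version)
    return 1.0 if all(kw.lower() in output.lower()
                      for kw in case["expected_keywords"]) else 0.0
```

The average of these 0/1 scores in `evaluate` is then the pass rate across N test cases.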
### Step 3: Run the Autoresearch Loop

The loop follows this exact pattern:

1. Git init (if not already) — every experiment is a commit
2. Run eval on the current version → get the BASELINE score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get the NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print a final summary: experiments run, improvements found, final score
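The keep/revert core of step 3 can be sketched as a small driver. This is illustrative, not a fixed API: `evaluate`, `mutate`, `commit`, and `revert` are injected callables (e.g. wrappers around `eval/score.py` and git), and `read_score` assumes the eval prints a `SCORE: x` line as in the template above:

```python
def read_score(stdout: str) -> float:
    """Parse the 'SCORE: x' line printed by eval/score.py."""
    for line in stdout.splitlines():
        if line.startswith("SCORE:"):
            return float(line.split(":", 1)[1])
    raise ValueError("eval printed no SCORE line")

def autoresearch_loop(baseline, mutations, evaluate, mutate, commit, revert):
    """Run mutate -> evaluate -> keep/revert; return (final_score, kept_count)."""
    kept = 0
    for n, description in enumerate(mutations, start=1):
        mutate(description)          # 3b-3c: write the mutated version
        new = evaluate()             # 3d: run the eval function
        if new > baseline:           # 3e: strict improvement -> keep it
            commit(f"exp-{n}: {description} | score: {baseline} -> {new}")
            baseline, kept = new, kept + 1
        else:                        # 3f: revert the mutable file
            revert()
    return baseline, kept
```

Requiring a strict improvement (`new > baseline`) means ties are reverted, which keeps the history free of no-op commits.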
### Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

1. Locate the mutable file — ask the user or infer from context
2. Locate or create the eval function — the user must have a way to score outputs
3. Initialize git tracking in the project directory
4. Run the baseline eval — record the starting score
5. Begin the experiment loop:
   - Read the mutable file
   - Think about what single change might improve the score
   - Make the change (be specific — change ONE thing per experiment)
   - Run the eval
   - Keep or revert based on score
   - Log the result
6. Continue for N experiments (default: 20, or until the user stops)
7. Report results:
   - Starting score → final score
   - Number of experiments run
   - Number of improvements kept
   - Summary of which changes worked
### Mutation Strategy
Good mutations change ONE thing at a time:
- Numeric parameters: Adjust thresholds, weights, window sizes
- Prompt wording: Rephrase instructions, add/remove constraints
- Structure: Reorder sections, add examples, remove redundancy
- Rules: Add a new rule, tighten an existing one, relax a constraint
Bad mutations change everything at once — you can't learn what worked.
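For the numeric-parameter case, a single-change mutation can be as small as nudging one value in a config file. A minimal sketch (the config keys and the ±20% range are hypothetical, chosen only for illustration):

```python
import json
import random

def mutate_one_number(config_path: str, key: str, scale: float = 0.2) -> str:
    """Nudge exactly ONE numeric parameter by up to ±scale, in place.

    Returns a human-readable description for the git commit message.
    """
    with open(config_path) as f:
        config = json.load(f)
    old = config[key]
    # Multiply by a random factor in [1 - scale, 1 + scale]; all other
    # parameters are left untouched, so the experiment changes one thing.
    config[key] = round(old * (1 + random.uniform(-scale, scale)), 6)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return f"{key}: {old} -> {config[key]}"
```

Because the function touches only one key per call, a score change in the next eval can be attributed to that key alone.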
### Step 4: Git Tracking
Every experiment MUST be tracked in git:
```bash
# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}
```
This gives you:
- Full history of every experiment
- Ability to diff any two versions
- Easy rollback if something breaks
- A log of what mutations worked vs didn't
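The same bookkeeping can be driven from Python via `subprocess`. This is a sketch mirroring the shell commands above; the `keep`/`revert` names are illustrative, and a real run also needs a git identity configured:

```python
import subprocess

def git(*args: str, cwd: str = ".") -> str:
    """Run a git command and return its stdout, raising on failure."""
    return subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    ).stdout

def keep(n: int, description: str, old: float, new: float, cwd: str = ".") -> None:
    """Commit a successful mutation with the score delta in the message."""
    git("add", "-A", cwd=cwd)
    git("commit", "-m", f"exp-{n}: {description} | {old} -> {new}", cwd=cwd)

def revert(mutable_file: str, cwd: str = ".") -> None:
    """Throw away a failed mutation by restoring the committed version."""
    git("checkout", "--", mutable_file, cwd=cwd)
```

Using `check=True` makes a failed git command raise immediately, so a broken revert never silently leaves a bad mutation in place.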
## Proven Results

### Case Study 1: Gold Trading Strategy
- Task: Optimize XAUUSD trading parameters
- Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
- Eval function: Backtest on historical data → Sharpe ratio
- Baseline: Sharpe 5.80
- Experiments: 86 in 25 minutes
- Final: Sharpe 12.23 (+111%)
- Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
- See: `references/gold-results.md`
### Case Study 2: YouTube Shorts Scripts
- Task: Optimize script-writing prompt for higher quality scores
- Mutable file: SKILL.md prompt instructions
- Eval function: LLM judge scoring 1-100
- Baseline: 94.3/100
- Experiments: 11
- Final: 96.7/100 (+2.5%)
- Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
- See: `references/youtube-results.md`
## Example Usage
User: "Run autoresearch on my email subject line skill"
Agent workflow:
- Read the skill's SKILL.md (mutable file)
- Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
- Baseline: 72.4/100
- Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
- Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
- Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
- ... continue for 20 experiments
- Final: 79.2/100 (+9.4%)
User: "Optimize my trading strategy config"
Agent workflow:
- Read strategy.json (mutable file)
- Eval: run backtest script → Sharpe ratio
- Baseline: Sharpe 2.1
- Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
- Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
- ... continue
- Final: Sharpe 3.8 (+81%)