Karpathy Autoresearch

Automation

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Install

openclaw skills install karpathy-autoresearch

autoresearch

Triggers

Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.

How It Works

┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  1. BASELINE  │────▶│  2. MUTATE    │────▶│  3. EVALUATE  │────▶│  4. DECIDE    │
│  Score the    │     │  Change one   │     │  Run scoring  │     │  Better?      │
│  current      │     │  thing        │     │  function     │     │  Keep : Revert│
│  version      │     │               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────────────┘     └───────┬───────┘
                                                                          │
                                                                   Loop back to 2

Instructions

Step 1: Identify the Mutable File

The mutable file is the thing you're optimizing. It can be:

A SKILL.md prompt/instructions
A trading strategy config (thresholds, parameters)
A content template (YouTube script format, ad copy structure)
Any text file where changes produce measurable differences

Create or identify this file. Example:

my-skill/
├── SKILL.md          ← this is your mutable file
└── eval/
    ├── test_cases.json
    └── score.py

Step 2: Create an Evaluation Function

Your eval function must:

Take the current mutable file as input
Run it against test cases
Return a numeric score (higher = better)

The eval can be anything:

LLM-as-judge: Send output to an LLM, ask it to score 1-100
Backtest: Run a strategy against historical data, measure Sharpe/returns
A/B metrics: CTR, engagement, conversion rate
Binary pass/fail: Count how many test cases pass out of N
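
As one concrete instance of the last option, a binary pass/fail eval might look like the sketch below. The test-case fields `input` and `must_contain`, the `generate` callable, and the `echo_generate` stand-in are all assumptions made for illustration; a real eval would run the skill or call an LLM instead.

```python
from typing import Callable


def passes(output: str, case: dict) -> bool:
    """A case passes if every required phrase appears in the output."""
    # `must_contain` is a hypothetical field name chosen for this sketch.
    return all(phrase.lower() in output.lower() for phrase in case["must_contain"])


def binary_score(prompt: str, cases: list[dict], generate: Callable[[str, str], str]) -> float:
    """Return the fraction of test cases that pass (0.0 to 1.0, higher is better)."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(passes(generate(prompt, case["input"]), case) for case in cases)
    return passed / len(cases)


# Stand-in generator for demonstration; a real one would run the skill or call an LLM.
def echo_generate(prompt: str, user_input: str) -> str:
    return f"{prompt} {user_input}"


cases = [
    {"input": "spring sale", "must_contain": ["sale"]},
    {"input": "weekly digest", "must_contain": ["discount"]},
]
print(binary_score("Subject:", cases, echo_generate))  # 1 of 2 cases pass -> 0.5
```

Because each case contributes either 0 or 1, the score moves in steps of 1/N, so a larger test set gives the loop finer-grained feedback.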

Template eval function (customize for your domain):

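# eval/score.py
import json
import sys


def run_and_score(current_version: str, case: dict) -> float:
    """Score one test case. Placeholder: replace with your domain logic."""
    # YOUR SCORING LOGIC HERE
    # Example: run the prompt, compare output to expected
    raise NotImplementedError("implement run_and_score for your domain")


def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float; higher is better.
    """
    with open(mutable_file_path, encoding="utf-8") as f:
        current_version = f.read()

    with open(test_cases_path, encoding="utf-8") as f:
        test_cases = json.load(f)

    # Guard against an empty test set, which would divide by zero below
    if not test_cases:
        raise ValueError("test cases file must contain at least one case")

    # Average the per-case scores so runs are comparable
    scores = [run_and_score(current_version, case) for case in test_cases]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: python score.py <mutable_file> <test_cases.json>")
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")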
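
The template prints its result as a `SCORE:` line, so whatever drives the loop has to read that line back. A minimal sketch of one way to do this follows; `parse_score`, `run_eval`, and the default script path are hypothetical names chosen here, not part of the skill.

```python
import re
import subprocess
import sys

# Matches a line such as "SCORE: 72.4" or "SCORE: 1e-05" anywhere in the output.
SCORE_PATTERN = re.compile(r"^SCORE:\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$", re.MULTILINE)


def parse_score(stdout: str) -> float:
    """Extract the number from a 'SCORE: <value>' line; fail loudly if absent."""
    match = SCORE_PATTERN.search(stdout)
    if match is None:
        raise ValueError("eval output contained no 'SCORE: <number>' line")
    return float(match.group(1))


def run_eval(mutable_file: str, test_cases: str, script: str = "eval/score.py") -> float:
    """Run the eval script as a subprocess and return its score."""
    result = subprocess.run(
        [sys.executable, script, mutable_file, test_cases],
        capture_output=True, text=True, check=True,
    )
    return parse_score(result.stdout)


print(parse_score("loading cases...\nSCORE: 72.4\n"))  # 72.4
```

Raising on a missing score, rather than returning 0, keeps a crashed eval from being mistaken for a bad mutation.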

Step 3: Run the Autoresearch Loop

The loop follows this exact pattern:

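1. Git init (if not already); every kept experiment becomes a commit
2. Run eval on current version → get BASELINE score, then commit the starting state ("baseline: score {X}") so later reverts have something to restore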
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print final summary: experiments run, improvements found, final score
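
The pattern above can be sketched in Python with the eval and the mutation step passed in as callables. This is a simplified in-memory version under stated assumptions: the function names and the log record fields are invented for illustration, and the git commit and checkout steps are replaced by keeping or discarding a string.

```python
from typing import Callable


def autoresearch(
    initial: str,
    evaluate: Callable[[str], float],
    mutate: Callable[[str, int], tuple[str, str]],
    experiments: int = 20,
) -> tuple[str, float, list[dict]]:
    """Greedy keep-or-revert loop: accept a mutation only if it beats the best score."""
    current = initial
    baseline = evaluate(current)
    log = []
    for n in range(1, experiments + 1):
        candidate, description = mutate(current, n)
        new_score = evaluate(candidate)
        kept = new_score > baseline  # ties are reverted, matching "NEW <= BASELINE"
        log.append({"exp": n, "change": description, "old": baseline, "new": new_score, "kept": kept})
        if kept:
            current, baseline = candidate, new_score  # where the git commit would go
        # otherwise `current` is left untouched, which plays the role of git checkout
    return current, baseline, log


# Toy demonstration: the "file" is a number written as text, and the score peaks at 7.
def toy_eval(text: str) -> float:
    return -abs(int(text) - 7)


def toy_mutate(text: str, n: int) -> tuple[str, str]:
    step = 2 if n % 2 else -1  # alternate +2 / -1 so that some experiments fail
    return str(int(text) + step), f"add {step}"


best, score, log = autoresearch("0", toy_eval, toy_mutate, experiments=10)
print(best, score, sum(entry["kept"] for entry in log))  # prints: 6 -1 3
```

The toy run stalls at 6 because the only available moves (+2, -1) cannot improve on it, which illustrates that a greedy loop can only find improvements its mutations are able to reach.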

Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

1. Locate the mutable file: ask the user or infer from context
2. Locate or create the eval function: the user must have a way to score
3. Initialize git tracking in the project directory
4. Run baseline eval and record the starting score
5. Begin experiment loop:
   - Read the mutable file
   - Think about what single change might improve the score
   - Make the change (be specific: change ONE thing per experiment)
   - Run eval
   - Keep or revert based on score
   - Log the result
6. Continue for N experiments (default: 20, or until user stops)
7. Report results:
   - Starting score → Final score
   - Number of experiments run
   - Number of improvements kept
   - Summary of what changes worked
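
The final report can be assembled from a list of per-experiment records. The sketch below assumes each record has `exp`, `change`, `new`, and `kept` fields; that schema and the `summarize` name are illustrative choices, and the sample records reuse the email subject line example from this document.

```python
def summarize(start_score: float, log: list[dict]) -> str:
    """Build the end-of-run report from a list of experiment records."""
    kept = [entry for entry in log if entry["kept"]]
    # The last kept experiment holds the final score; with none kept, nothing changed.
    final_score = kept[-1]["new"] if kept else start_score
    lines = [
        f"Score: {start_score} -> {final_score}",
        f"Experiments run: {len(log)}",
        f"Improvements kept: {len(kept)}",
        "Changes that worked:",
    ]
    lines += [f"  exp-{entry['exp']}: {entry['change']} ({entry['new']})" for entry in kept]
    return "\n".join(lines)


log = [
    {"exp": 1, "change": "use numbers in subject lines", "new": 74.1, "kept": True},
    {"exp": 2, "change": "max 6 words", "new": 71.8, "kept": False},
    {"exp": 3, "change": "start with a verb", "new": 75.3, "kept": True},
]
report = summarize(72.4, log)
print(report)
```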

Mutation Strategy

Good mutations change ONE thing at a time:

Numeric parameters: Adjust thresholds, weights, window sizes
Prompt wording: Rephrase instructions, add/remove constraints
Structure: Reorder sections, add examples, remove redundancy
Rules: Add a new rule, tighten an existing one, relax a constraint

Bad mutations change everything at once — you can't learn what worked.
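
For numeric parameters, the one-change rule can be enforced mechanically. The following is a rough sketch, not the skill's own mutation logic: `mutate_one_parameter`, the 20% step size, and the config keys are all assumptions chosen for illustration.

```python
import copy
import random


def mutate_one_parameter(config: dict, rng: random.Random, scale: float = 0.2) -> tuple[dict, str]:
    """Return a copy of `config` with exactly one numeric value nudged up or down."""
    numeric_keys = sorted(
        key for key, value in config.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    )
    if not numeric_keys:
        raise ValueError("config has no numeric parameters to mutate")
    key = rng.choice(numeric_keys)
    old = config[key]
    direction = rng.choice([-1, 1])
    if isinstance(old, int):
        # Integers move by at least 1 so the mutation is never a no-op.
        new = old + direction * max(1, round(abs(old) * scale))
    else:
        # Floats scale multiplicatively; note that 0.0 stays 0.0 under this rule.
        new = old * (1 + direction * scale)
    mutated = copy.deepcopy(config)
    mutated[key] = new
    return mutated, f"{key}: {old} -> {new}"


# Hypothetical strategy config; the keys are illustrative only.
config = {"ema_fast": 8, "ema_slow": 24, "momentum_threshold": 0.003, "symbol": "XAUUSD"}
mutated, description = mutate_one_parameter(config, random.Random(0))
changed = [key for key in config if config[key] != mutated[key]]
print(description, changed)
```

Returning a description alongside the mutated copy gives the loop something to put in the commit message and the log.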

Step 4: Git Tracking

Every experiment MUST be tracked in git:

# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}

This gives you:

Full history of every kept change
Ability to diff any two versions
Easy rollback if something breaks
A record of which mutations worked (reverted experiments are never committed, so they appear only in the run log)
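
If the loop is driven from Python, the same git commands can be wrapped with `subprocess`. This is a minimal sketch; the helper names `commit_message`, `git`, `keep`, and `revert` are hypothetical, and the commit message follows the format shown above.

```python
import subprocess


def commit_message(n: int, change: str, old_score: float, new_score: float) -> str:
    """Format the commit message used for a kept experiment."""
    return f"exp-{n}: {change} | {old_score} → {new_score}"


def git(*args: str, cwd: str = ".") -> None:
    """Run a git command and raise CalledProcessError if it fails."""
    subprocess.run(["git", *args], cwd=cwd, check=True)


def keep(n: int, change: str, old_score: float, new_score: float, cwd: str = ".") -> None:
    """Commit the working tree after a successful mutation."""
    git("add", "-A", cwd=cwd)
    git("commit", "-m", commit_message(n, change, old_score, new_score), cwd=cwd)


def revert(mutable_file: str, cwd: str = ".") -> None:
    """Discard uncommitted changes to the mutable file after a failed mutation."""
    git("checkout", "--", mutable_file, cwd=cwd)


print(commit_message(3, "start with a verb", 74.1, 75.3))
# exp-3: start with a verb | 74.1 → 75.3
```

Passing arguments as a list rather than a shell string avoids quoting problems when a change description contains spaces or punctuation.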

Proven Results

Case Study 1: Gold Trading Strategy

Task: Optimize XAUUSD trading parameters
Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
Eval function: Backtest on historical data → Sharpe ratio
Baseline: Sharpe 5.80
Experiments: 86 in 25 minutes
Final: Sharpe 12.23 (+111%)
Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
See: references/gold-results.md

Case Study 2: YouTube Shorts Scripts

Task: Optimize script-writing prompt for higher quality scores
Mutable file: SKILL.md prompt instructions
Eval function: LLM judge scoring 1-100
Baseline: 94.3/100
Experiments: 11
Final: 96.7/100 (+2.5%)
Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
See: references/youtube-results.md
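
The percentages in both case studies are relative to the baseline, which can be checked with a short calculation. The function name is an illustrative choice.

```python
def relative_improvement(baseline: float, final: float) -> float:
    """Percentage change from baseline to final, relative to the baseline."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (final - baseline) / abs(baseline) * 100


print(f"{relative_improvement(5.80, 12.23):+.0f}%")  # gold strategy: +111%
print(f"{relative_improvement(94.3, 96.7):+.1f}%")   # YouTube Shorts scripts: +2.5%
```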

Example Usage

User: "Run autoresearch on my email subject line skill"

Agent workflow:

Read the skill's SKILL.md (mutable file)
Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
Baseline: 72.4/100
Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
... continue for 20 experiments
Final: 79.2/100 (+9.4%)

User: "Optimize my trading strategy config"