Karpathy Autoresearch

Automation

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology. Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Install

openclaw skills install karpathy-autoresearch

autoresearch

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.

Triggers

Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  1. BASELINE │────▶│  2. MUTATE   │────▶│  3. EVALUATE │────▶│  4. DECIDE   │
│  Score the   │     │  Change one  │     │  Run scoring │     │  Better?     │
│  current     │     │  thing       │     │  function    │     │  Keep : Revert│
│  version     │     │              │     │              │     │              │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                                                                    │
                                                              Loop back to 2

Instructions

Step 1: Identify the Mutable File

The mutable file is the thing you're optimizing. It can be:

  • A SKILL.md prompt/instructions
  • A trading strategy config (thresholds, parameters)
  • A content template (YouTube script format, ad copy structure)
  • Any text file where changes produce measurable differences

Create or identify this file. Example:

my-skill/
├── SKILL.md          ← this is your mutable file
├── eval/
│   ├── test_cases.json
│   └── score.py

Step 2: Create an Evaluation Function

Your eval function must:

  1. Take the current mutable file as input
  2. Run it against test cases
  3. Return a numeric score (higher = better)

The eval can be anything:

  • LLM-as-judge: Send output to an LLM, ask it to score 1-100
  • Backtest: Run a strategy against historical data, measure Sharpe/returns
  • A/B metrics: CTR, engagement, conversion rate
  • Binary pass/fail: Count how many test cases pass out of N

Template eval function (customize for your domain):

# eval/score.py
import json
import sys

def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float — higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()
    
    with open(test_cases_path) as f:
        test_cases = json.load(f)
    
    scores = []
    for case in test_cases:
        # YOUR SCORING LOGIC HERE
        # Example: run the prompt, compare output to expected
        score = run_and_score(current_version, case)
        scores.append(score)
    
    return sum(scores) / len(scores)

if __name__ == "__main__":
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")

Step 3: Run the Autoresearch Loop

The loop follows this exact pattern:

1. Git init (if not already) — every experiment is a commit
2. Run eval on current version → get BASELINE score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print final summary: experiments run, improvements found, final score

Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

  1. Locate the mutable file — ask the user or infer from context
  2. Locate or create the eval function — the user must have a way to score
  3. Initialize git tracking in the project directory
  4. Run baseline eval — record the starting score
  5. Begin experiment loop:
    • Read the mutable file
    • Think about what single change might improve the score
    • Make the change (be specific — change ONE thing per experiment)
    • Run eval
    • Keep or revert based on score
    • Log the result
  6. Continue for N experiments (default: 20, or until user stops)
  7. Report results:
    • Starting score → Final score
    • Number of experiments run
    • Number of improvements kept
    • Summary of what changes worked

Mutation Strategy

Good mutations change ONE thing at a time:

  • Numeric parameters: Adjust thresholds, weights, window sizes
  • Prompt wording: Rephrase instructions, add/remove constraints
  • Structure: Reorder sections, add examples, remove redundancy
  • Rules: Add a new rule, tighten an existing one, relax a constraint

Bad mutations change everything at once — you can't learn what worked.

Step 4: Git Tracking

Every experiment MUST be tracked in git:

# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}

This gives you:

  • Full history of every experiment
  • Ability to diff any two versions
  • Easy rollback if something breaks
  • A log of what mutations worked vs didn't

Proven Results

Case Study 1: Gold Trading Strategy

  • Task: Optimize XAUUSD trading parameters
  • Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
  • Eval function: Backtest on historical data → Sharpe ratio
  • Baseline: Sharpe 5.80
  • Experiments: 86 in 25 minutes
  • Final: Sharpe 12.23 (+111%)
  • Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
  • See: references/gold-results.md

Case Study 2: YouTube Shorts Scripts

  • Task: Optimize script-writing prompt for higher quality scores
  • Mutable file: SKILL.md prompt instructions
  • Eval function: LLM judge scoring 1-100
  • Baseline: 94.3/100
  • Experiments: 11
  • Final: 96.7/100 (+2.5%)
  • Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
  • See: references/youtube-results.md

Example Usage

User: "Run autoresearch on my email subject line skill"

Agent workflow:

  1. Read the skill's SKILL.md (mutable file)
  2. Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
  3. Baseline: 72.4/100
  4. Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
  5. Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
  6. Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
  7. ... continue for 20 experiments
  8. Final: 79.2/100 (+9.4%)

User: "Optimize my trading strategy config"

Agent workflow:

  1. Read strategy.json (mutable file)
  2. Eval: run backtest script → Sharpe ratio
  3. Baseline: Sharpe 2.1
  4. Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
  5. Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
  6. ... continue
  7. Final: Sharpe 3.8 (+81%)