# autoresearch

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against an evaluation function, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.
## Triggers

Use when: "optimize this skill", "improve this skill", "run autoresearch on", "make this skill better", "self-improve skill", "benchmark skill", "eval my skill", "run evals on".
## Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern: mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.
## How It Works

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌──────────────┐
│ 1. BASELINE │────▶│ 2. MUTATE   │────▶│ 3. EVALUATE │────▶│ 4. DECIDE    │
│ Score the   │     │ Change one  │     │ Run scoring │     │ Better?      │
│ current     │     │ thing       │     │ function    │     │ Keep : Revert│
│ version     │     │             │     │             │     │              │
└─────────────┘     └─────────────┘     └─────────────┘     └──────┬───────┘
                           ▲                                       │
                           └────────── Loop back to 2 ─────────────┘
```
## Instructions

### Step 1: Identify the Mutable File
The mutable file is the thing you're optimizing. It can be:
- A SKILL.md prompt/instructions
- A trading strategy config (thresholds, parameters)
- A content template (YouTube script format, ad copy structure)
- Any text file where changes produce measurable differences
Create or identify this file. Example:
```
my-skill/
├── SKILL.md            ← this is your mutable file
└── eval/
    ├── test_cases.json
    └── score.py
```
### Step 2: Create an Evaluation Function
Your eval function must:
- Take the current mutable file as input
- Run it against test cases
- Return a numeric score (higher = better)
The eval can be anything:
- LLM-as-judge: Send output to an LLM, ask it to score 1-100
- Backtest: Run a strategy against historical data, measure Sharpe/returns
- A/B metrics: CTR, engagement, conversion rate
- Binary pass/fail: Count how many test cases pass out of N
Template eval function (customize for your domain):

```python
# eval/score.py
import json
import sys

def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float — higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()
    with open(test_cases_path) as f:
        test_cases = json.load(f)

    scores = []
    for case in test_cases:
        # YOUR SCORING LOGIC HERE — define run_and_score for your domain.
        # Example: run the prompt, compare output to expected
        score = run_and_score(current_version, case)
        scores.append(score)
    return sum(scores) / len(scores)

if __name__ == "__main__":
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")
```
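As one concrete instance of the binary pass/fail eval, here is a possible `run_and_score` (a sketch, not a fixed API: it treats a test case as passing when every expected keyword appears in the output, and the model/strategy call is stubbed because it is domain-specific):

```python
def run_and_score(current_version: str, case: dict) -> float:
    """Binary pass/fail: 1.0 if every expected keyword appears in the output."""
    # Placeholder: in a real eval, run `current_version` (the prompt or
    # strategy) against the case's input and capture the actual output here.
    output = case.get("output", current_version)
    return 1.0 if all(kw.lower() in output.lower()
                      for kw in case["expected_keywords"]) else 0.0
```

The average of these 0/1 scores in `evaluate` is then the pass rate across N test cases.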
### Step 3: Run the Autoresearch Loop

The loop follows this exact pattern:

1. Git init (if not already) — every experiment is a commit
2. Run eval on the current version → get the BASELINE score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get the NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print a final summary: experiments run, improvements found, final score
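The keep/revert core of step 3 can be sketched as a small driver. This is illustrative, not a fixed API: `evaluate`, `mutate`, `commit`, and `revert` are injected callables (e.g. wrappers around `eval/score.py` and git), and `read_score` assumes the eval prints a `SCORE: x` line as in the template above:

```python
def read_score(stdout: str) -> float:
    """Parse the 'SCORE: x' line printed by eval/score.py."""
    for line in stdout.splitlines():
        if line.startswith("SCORE:"):
            return float(line.split(":", 1)[1])
    raise ValueError("eval printed no SCORE line")

def autoresearch_loop(baseline, mutations, evaluate, mutate, commit, revert):
    """Run mutate -> evaluate -> keep/revert; return (final_score, kept_count)."""
    kept = 0
    for n, description in enumerate(mutations, start=1):
        mutate(description)          # 3b-3c: write the mutated version
        new = evaluate()             # 3d: run the eval function
        if new > baseline:           # 3e: strict improvement -> keep it
            commit(f"exp-{n}: {description} | score: {baseline} -> {new}")
            baseline, kept = new, kept + 1
        else:                        # 3f: revert the mutable file
            revert()
    return baseline, kept
```

Requiring a strict improvement (`new > baseline`) means ties are reverted, which keeps the history free of no-op commits.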
### Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

1. Locate the mutable file — ask the user or infer from context
2. Locate or create the eval function — the user must have a way to score outputs
3. Initialize git tracking in the project directory
4. Run the baseline eval — record the starting score
5. Begin the experiment loop:
   - Read the mutable file
   - Think about what single change might improve the score
   - Make the change (be specific — change ONE thing per experiment)
   - Run the eval
   - Keep or revert based on score
   - Log the result
6. Continue for N experiments (default: 20, or until the user stops)
7. Report results:
   - Starting score → final score
   - Number of experiments run
   - Number of improvements kept
   - Summary of which changes worked
### Mutation Strategy
Good mutations change ONE thing at a time:
- Numeric parameters: Adjust thresholds, weights, window sizes
- Prompt wording: Rephrase instructions, add/remove constraints
- Structure: Reorder sections, add examples, remove redundancy
- Rules: Add a new rule, tighten an existing one, relax a constraint
Bad mutations change everything at once — you can't learn what worked.
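For the numeric-parameter case, a single-change mutation can be as small as nudging one value in a config file. A minimal sketch (the config keys and the ±20% range are hypothetical, chosen only for illustration):

```python
import json
import random

def mutate_one_number(config_path: str, key: str, scale: float = 0.2) -> str:
    """Nudge exactly ONE numeric parameter by up to ±scale, in place.

    Returns a human-readable description for the git commit message.
    """
    with open(config_path) as f:
        config = json.load(f)
    old = config[key]
    # Multiply by a random factor in [1 - scale, 1 + scale]; all other
    # parameters are left untouched, so the experiment changes one thing.
    config[key] = round(old * (1 + random.uniform(-scale, scale)), 6)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return f"{key}: {old} -> {config[key]}"
```

Because the function touches only one key per call, a score change in the next eval can be attributed to that key alone.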
### Step 4: Git Tracking
Every experiment MUST be tracked in git:
```bash
# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}
```
This gives you:
- Full history of every experiment
- Ability to diff any two versions
- Easy rollback if something breaks
- A log of what mutations worked vs didn't
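The same bookkeeping can be driven from Python via `subprocess`. This is a sketch mirroring the shell commands above; the `keep`/`revert` names are illustrative, and a real run also needs a git identity configured:

```python
import subprocess

def git(*args: str, cwd: str = ".") -> str:
    """Run a git command and return its stdout, raising on failure."""
    return subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    ).stdout

def keep(n: int, description: str, old: float, new: float, cwd: str = ".") -> None:
    """Commit a successful mutation with the score delta in the message."""
    git("add", "-A", cwd=cwd)
    git("commit", "-m", f"exp-{n}: {description} | {old} -> {new}", cwd=cwd)

def revert(mutable_file: str, cwd: str = ".") -> None:
    """Throw away a failed mutation by restoring the committed version."""
    git("checkout", "--", mutable_file, cwd=cwd)
```

Using `check=True` makes a failed git command raise immediately, so a broken revert never silently leaves a bad mutation in place.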
## Proven Results

### Case Study 1: Gold Trading Strategy
- Task: Optimize XAUUSD trading parameters
- Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
- Eval function: Backtest on historical data → Sharpe ratio
- Baseline: Sharpe 5.80
- Experiments: 86 in 25 minutes
- Final: Sharpe 12.23 (+111%)
- Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
- See: `references/gold-results.md`
### Case Study 2: YouTube Shorts Scripts
- Task: Optimize script-writing prompt for higher quality scores
- Mutable file: SKILL.md prompt instructions
- Eval function: LLM judge scoring 1-100
- Baseline: 94.3/100
- Experiments: 11
- Final: 96.7/100 (+2.5%)
- Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
- See: `references/youtube-results.md`
## Example Usage
User: "Run autoresearch on my email subject line skill"
Agent workflow:
- Read the skill's SKILL.md (mutable file)
- Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
- Baseline: 72.4/100
- Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
- Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
- Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
- ... continue for 20 experiments
- Final: 79.2/100 (+9.4%)
User: "Optimize my trading strategy config"
Agent workflow:
- Read strategy.json (mutable file)
- Eval: run backtest script → Sharpe ratio
- Baseline: Sharpe 2.1
- Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
- Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
- ... continue
- Final: Sharpe 3.8 (+81%)