Skill flagged — suspicious patterns detected

ClawHub Security flagged this skill as suspicious. Review the scan results before using.

Autoresearch

v1.0.0

Autonomous AI research skill for running automated neural network experiments. This skill should be used when the user wants to set up autonomous AI research...


Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for baiyunrei2025/autoresearch-karpathy.

Prompt preview (Install & Setup):
Install the skill "Autoresearch" (baiyunrei2025/autoresearch-karpathy) from ClawHub.
Skill page: https://clawhub.ai/baiyunrei2025/autoresearch-karpathy
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install autoresearch-karpathy

ClawHub CLI


npx clawhub@latest install autoresearch-karpathy
Security Scan
VirusTotal
Suspicious
OpenClaw
Suspicious (medium confidence)
Purpose & Capability
The files (prepare.py, train.py, program.md, SKILL.md) implement exactly the claimed functionality: an agent-modifiable training script with a 5-minute experiment loop. However, the package metadata provides no homepage and lists 'Source: unknown', while SKILL.md and the README claim this is 'based on Andrej Karpathy's autoresearch' and instruct cloning https://github.com/karpathy/autoresearch; the provenance is unclear (possible impersonation or a fork). Aside from provenance, the files and operations are consistent with an autonomous research skill.
Instruction Scope
Runtime instructions explicitly direct autonomous modification of train.py, committing changes, and running experiments in an infinite loop, with the mandate 'DO NOT pause… Continue working indefinitely until manually stopped.' The skill instructs network activity (git clone, downloading data shards from Hugging Face) and running arbitrary Python code that the agent edits. While these are coherent with the stated purpose, the never-ask-the-human, run-indefinitely behavior and the ability to make and execute arbitrary code changes are operationally high-risk and scope-expanding.
Install Mechanism
There is no install spec embedded in the skill bundle (instruction-only), but the docs direct use of external installers and package managers: running 'curl ... | sh' to install 'uv', then 'uv sync' to pull dependencies including torch from a custom PyTorch index. These are standard for ML projects but will fetch and install large packages and potentially repo-sourced kernels (the 'kernels' dependency can select flash-attn backends). The external install scripts and heavy dependencies increase operational risk and should be reviewed before execution.
Credentials
The skill declares no required environment variables or credentials. It does read/write to user cache paths (~/.cache/autoresearch) which is expected for dataset/tokenizer storage. Network access is required to download datasets and potentially kernel code, but requested environment/credential access is proportionate to the stated training task.
Persistence & Privilege
The skill does not set 'always: true', but SKILL.md explicitly instructs the agent to run autonomously and 'NEVER STOP' once the loop begins. Autonomous invocation combined with an instruction to loop forever and to modify-and-execute code increases blast radius: the agent could make many successive code changes and run them repeatedly. This is permitted by platform defaults but is a significant operational risk and should be limited by human controls (timeouts, resource quotas, manual approval).
What to consider before installing
This skill appears to implement an autonomous experiment loop as claimed, but take these precautions before installing or running it:

  1. Verify provenance: the registry metadata lacks a homepage and lists source 'unknown', while the README references Karpathy's repo. Confirm you have the authentic upstream repository (check the commit history on GitHub) or treat this as an untrusted fork.
  2. Review the code locally first: read prepare.py and train.py in full and search for network endpoints, telemetry, or unexpected system calls. Pay attention to any code that would send data off-host or read unrelated files.
  3. Run in isolation: execute inside a disposable VM, container, or dedicated machine (not your laptop or a production host). Limit GPU access and disk usage, and start with a small test run (limited num-shards) to avoid massive downloads and long, expensive runs. A sketch follows this list.
  4. Restrict autonomy: do not allow the agent to run forever. Add explicit limits (max iterations, a wall-clock timeout, human-in-the-loop approval steps) before letting it auto-commit and re-run code.
  5. Monitor resource usage: the skill downloads datasets, installs heavy packages (PyTorch), and runs GPU jobs. Set caps on bandwidth, disk, and GPU time to prevent abuse or accidental large bills.
  6. Validate external installers: the README suggests running curl | sh to install uv. Avoid blind execution of arbitrary install scripts; inspect them first or install via your platform's package manager.
  7. If you intend to proceed, run a single manual baseline first (not autonomous) to confirm behavior, then enable limited automated experimentation with oversight.

If you want, I can: (a) summarize all external network endpoints the code touches, (b) point out the exact locations in the code where the agent can inject changes, or (c) suggest safe modifications to program.md/prepare.py to add human approval checkpoints and resource limits.
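On points 3 and 4 above, a minimal sketch of one way to sandbox and time-bound the whole run. The image tag, resource caps, and the run_experiments.sh wrapper are illustrative assumptions, not part of the skill:

# Illustrative only: disposable container, one GPU, capped RAM/CPUs,
# and a hard wall-clock limit on the entire experiment session.
# run_experiments.sh is a hypothetical wrapper around the Phase 3 loop below.
docker run --rm -it --gpus 1 \
  --memory 64g --cpus 16 \
  -v "$PWD":/work -w /work \
  nvidia/cuda:12.4.1-runtime-ubuntu22.04 \
  bash -c "timeout 6h ./run_experiments.sh"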
Patterns worth reviewing

These patterns may indicate risky behavior. Check the VirusTotal and OpenClaw results above for context-aware analysis before installing.

train.py:611 — Dynamic code execution detected.


latest: vk972p8xfzjxj7pze94cxzt7ak1839dtb
233 downloads · 0 stars · 1 version · updated 11h ago
v1.0.0 · MIT-0

Autoresearch Skill

This skill enables autonomous AI research experiments based on Andrej Karpathy's autoresearch project. It allows AI agents to autonomously modify neural network training code, run experiments, evaluate results, and iteratively improve models.

Core Concept

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously. The agent modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You can leave it running overnight and wake up to a log of experiments and (hopefully) a better model.

Key Files

The project has three core files:

  1. prepare.py — Fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
  2. train.py — The single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. This file is edited and iterated on by the agent.
  3. program.md — Baseline instructions for the agent. This file is edited and iterated on by the human.

Requirements

  • Single NVIDIA GPU (tested on H100)
  • Python 3.10+
  • uv package manager
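The README assumes uv is already installed. It can be installed with the published curl | sh one-liner, but per the security notes above, a more cautious pattern (a general precaution, not a step from this project) is to fetch the script and read it before executing:

curl -LsSf https://astral.sh/uv/install.sh -o uv-install.sh
less uv-install.sh   # inspect before executing
sh uv-install.sh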

Quick Start Workflow

Phase 1: Initial Setup

  1. Clone the repository (if not already done):

    git clone https://github.com/karpathy/autoresearch.git
    cd autoresearch
    
  2. Install dependencies:

    uv sync
    
  3. Prepare data (one-time setup):

    uv run prepare.py
    

Phase 2: Experiment Setup

  1. Agree on a run tag (e.g., based on date like mar20)
  2. Create a new branch:
    git checkout -b autoresearch/<tag>
    
  3. Initialize results file:
    echo -e "commit\tval_bpb\tmemory_gb\tstatus\tdescription" > results.tsv
    

Phase 3: Autonomous Experimentation Loop

The agent follows this loop indefinitely:

LOOP FOREVER:
  1. Look at current git state
  2. Modify train.py with experimental idea
  3. git commit
  4. Run experiment: uv run train.py > run.log 2>&1
  5. Extract results: grep "^val_bpb:\|^peak_vram_mb:" run.log
  6. If crash → analyze logs and fix or mark as crash
  7. Record results in results.tsv
  8. If improved → keep commit
  9. If not improved → git reset
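As a concrete illustration, here is a minimal shell sketch of this loop with a human-imposed iteration cap. The metric names match the grep in step 5; MAX_ITERS, the 30-minute timeout, and the commit-message convention are assumptions, and a real run would follow program.md rather than this script:

MAX_ITERS=20   # human-imposed cap; the skill itself says to loop forever
for i in $(seq 1 "$MAX_ITERS"); do
  status=keep
  # the agent's edit to train.py happens before this commit
  git commit -am "experiment $i"
  # one experiment under a hard wall-clock cap (the 5-minute budget is internal)
  timeout 30m uv run train.py > run.log 2>&1 || status=crash
  # pull the two metrics the loop keys on (peak VRAM logged here in MB)
  val_bpb=$(grep '^val_bpb:' run.log | awk '{print $2}')
  vram_mb=$(grep '^peak_vram_mb:' run.log | awk '{print $2}')
  printf '%s\t%s\t%s\t%s\t%s\n' "$(git rev-parse --short HEAD)" \
    "${val_bpb:-0}" "${vram_mb:-0}" "$status" "experiment $i" >> results.tsv
  # keep-or-reset on improvement (steps 8-9) is left to the agent's judgment
done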

Key Metrics

  • val_bpb (validation bits per byte) — Lower is better, vocab-size-independent
  • Training time — Fixed 5-minute budget per experiment
  • Peak VRAM — Memory usage in GB
  • Status — keep, discard, or crash
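For context, bits per byte is the usual nats-per-token cross-entropy converted to bits and normalized by bytes rather than tokens; this is the standard conversion, stated as background rather than taken from this skill's code:

    val_bpb ≈ (loss_nats / ln 2) × (num_tokens / num_bytes)

Normalizing by bytes is what makes the metric comparable across tokenizers with different vocabulary sizes.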

Constraints

What the agent CAN do:

  • Modify train.py (architecture, optimizer, hyperparameters, training loop, etc.)
  • Experiment with different model configurations
  • Run training experiments autonomously

What the agent CANNOT do:

  • Modify prepare.py (read-only)
  • Install new packages or add dependencies
  • Modify the evaluation harness
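One lightweight way to enforce the prepare.py constraint from outside the agent (a suggestion, not part of the skill; it assumes the experiment branch was cut from main):

chmod a-w prepare.py                      # accidental edits now fail loudly
git diff --exit-code main -- prepare.py   # non-zero exit if it was modified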

Quality Criteria

  1. Simplicity: Simpler solutions are preferred over complex ones
  2. Performance: Lower val_bpb is better
  3. Memory: VRAM usage should be reasonable
  4. Stability: Code must run without crashing

Output Format

Each experiment produces a summary:

---
val_bpb:          0.997900
training_seconds: 300.1
total_seconds:    325.9
peak_vram_mb:     45060.2
mfu_percent:      39.80
total_tokens_M:   499.6
num_steps:        953
num_params_M:     50.3
depth:            8

Results Logging

Results are logged to results.tsv (tab-separated):

commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU activation
d4e5f6g	0.000000	0.0	crash	double model width (OOM)
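Because results.tsv is plain TSV, the best kept run can be pulled with standard tools; a small sketch (column positions match the header above):

# lowest val_bpb among kept experiments, skipping the header row
tail -n +2 results.tsv | awk -F'\t' '$4 == "keep"' | sort -t$'\t' -k2,2g | head -n 1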

Autonomous Operation

CRITICAL: Once the experiment loop begins, the agent operates autonomously:

  • Do NOT pause to ask the human if you should continue
  • Do NOT ask "should I keep going?" or "is this a good stopping point?"
  • Continue working indefinitely until manually stopped
  • If out of ideas, think harder: read papers, re-analyze code, try radical changes

Use Cases

  1. Overnight experiments: Leave running while sleeping, wake up to results
  2. Architecture search: Automatically explore model architectures
  3. Hyperparameter optimization: Find optimal training parameters
  4. Research automation: Reduce manual experimentation effort

Troubleshooting

Common Issues:

  1. GPU not available: Check CUDA installation and GPU drivers
  2. uv not installed: Install uv package manager
  3. Data not prepared: Run uv run prepare.py
  4. Out of memory: Reduce model size or batch size
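For the GPU issue specifically, a quick sanity check (standard commands, not specific to this skill):

nvidia-smi                                    # driver loaded and GPU visible?
uv run python -c "import torch; print(torch.cuda.is_available())"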

Error Handling:

  • Crashes are logged as crash status
  • Analyze logs with tail -n 50 run.log
  • Fix simple issues and retry; skip fundamentally broken ideas

Best Practices

  1. Start with baseline: Always run unmodified code first
  2. Incremental changes: Make small, focused modifications
  3. Document experiments: Clear descriptions in results.tsv
  4. Monitor progress: Regularly check results and trends
  5. Balance exploration/exploitation: Mix radical ideas with incremental improvements
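For item 4, two simple ways to watch a run from another terminal (generic commands, not part of the skill):

watch -n 60 "tail -n 5 results.tsv"   # latest experiments, refreshed every minute
tail -f run.log                       # follow the current training run live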

Integration with Agent Teams

This skill can be combined with the agent-teams-playbook skill for:

  • Multi-agent research coordination
  • Parallel experimentation
  • Specialized roles (architect, optimizer, evaluator)
  • Distributed research workflows
