Openclaw Autoresearch
v0.2.4
Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "...
by ZHANG Ning (@zning1994)
License: MIT-0 · Free to use, modify, and redistribute. No attribution required.
Security Scan
OpenClaw
Benign · high confidence
Purpose & Capability
Name/description (autoresearch / prompt optimization) match the included code and instructions. Required binaries (python3) and the primary env (OPENAI_API_KEY) are expected for a Python script that calls OpenAI-compatible APIs. The script implements MiniMax, OpenAI, and Anthropic providers as declared.
Instruction Scope
SKILL.md and autoresearch.py clearly instruct the agent to read a target SKILL.md (or similar prompt file), run experiments, and send test inputs and prompt text to an LLM provider for mutation/judging. This is consistent with the stated purpose, but it necessarily transmits the target prompt and test inputs to remote LLM endpoints (or whatever OPENAI_BASE_URL you set). Ensure target files do not contain secrets or credentials you don't want sent to a third party.
Install Mechanism
No install spec — the project is instruction+script only. There are no third-party downloads, no archive extraction, and the Python script uses only the standard library. Running the script will create output files (results.tsv, results.json, dashboard.html, SKILL.md.baseline) in the working directory as documented.
Credentials
Only LLM API keys are requested (MINIMAX_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY) and an optional OPENAI_BASE_URL; these map directly to the declared provider integrations. No unrelated credentials or config paths are required. Note: OPENAI_BASE_URL allows pointing to a custom endpoint — if set to an untrusted host it could redirect all prompts to that host.
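The OPENAI_BASE_URL caveat can be made concrete. A minimal sketch, assuming the script resolves its endpoint the way typical OpenAI-compatible clients do (the exact handling inside autoresearch.py may differ):

```python
import os

# Sketch of how an OpenAI-compatible client typically resolves its endpoint.
# Whatever host OPENAI_BASE_URL names receives the full prompt text, so an
# untrusted value silently redirects every request there.
DEFAULT_BASE = "https://api.openai.com/v1"

def resolve_endpoint(path: str = "/chat/completions") -> str:
    base = os.environ.get("OPENAI_BASE_URL", DEFAULT_BASE)
    return base.rstrip("/") + path
```

With the variable unset, requests go to the official API; with it set, every prompt goes to the configured host instead, which is why the endpoint must be trusted.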
Persistence & Privilege
The skill is not always-enabled (always: false) and uses the normal autonomous-invocation default. It does not request system-wide configuration or modify other skills. Be aware that an autonomously-invoking skill with access to an LLM API key can run experiments whenever invoked — grant API keys with appropriate limits and monitoring.
Assessment
This skill appears to do what it says: mutate and evaluate skill prompts by calling LLM provider APIs. Before installing or running it:
(1) Don't point it at files that contain secrets (API keys, tokens, private data); the tool will read and send prompt text and test inputs to your configured LLM endpoint.
(2) Prefer giving it a provider/API key that has appropriate billing/usage limits, or use a test key.
(3) If you set OPENAI_BASE_URL, ensure the endpoint is trusted; it controls where your prompts go.
(4) Run the script in a sandbox or non-production workspace first, and review generated artifacts (results.tsv, SKILL.md.baseline) and mutation history (changelog.md).
(5) Review the included autoresearch.py source (already present) if you want to audit exactly what is sent to providers.
If you need to optimize prompts that include sensitive content, consider masking or redacting secrets before running autoresearch.
Like a lobster shell, security has layers: review code before you run it.
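The masking advice above can be sketched as a hypothetical pre-processing step, not part of autoresearch itself; the patterns here are illustrative, not exhaustive:

```python
import re

# Hypothetical pre-processing step: mask key-shaped strings in a prompt file
# before handing it to autoresearch, so they are never sent to the endpoint.
# These patterns are illustrative only and will not catch every secret format.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                # OpenAI-style keys
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._\-]{20,}"),  # bearer tokens
]

def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Redaction is lossy by design; if the optimizer genuinely needs the secret-bearing lines to evaluate the prompt, run it against a sanitized copy instead.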
latest: vk979zh8zg7w60fg9vexjpb9mkx83sp4t
Runtime requirements
OS: macOS · Linux
Binaries (any of): python3, python
Primary env: OPENAI_API_KEY
