Brainforge Autoresearch

v0.2.5

Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "...

by ZHANG Ning (@zning1994)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zning1994/brainforge-autoresearch.

Prompt preview: Install & Setup
Install the skill "Brainforge Autoresearch" (zning1994/brainforge-autoresearch) from ClawHub.
Skill page: https://clawhub.ai/zning1994/brainforge-autoresearch
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install brainforge-autoresearch

ClawHub CLI


npx clawhub@latest install brainforge-autoresearch
Security Scan
Capability signals
Crypto · Can make purchases · Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (prompt optimizer) match the included script and docs. Requesting an LLM API key (OPENAI_API_KEY / MINIMAX_API_KEY / ANTHROPIC_API_KEY) and Python is appropriate for a tool that runs experiments against LLM providers.
Instruction Scope
SKILL.md and autoresearch.py instruct the agent to read a target SKILL.md (or prompt file), generate mutations, run them against test inputs, and upload those requests to the configured LLM provider. This is within scope, but important to note: the tool will transmit the target prompt, test inputs, and generated mutations to third-party LLM endpoints as part of evaluation (expected behavior for this tool).
Install Mechanism
Instruction-only skill with a bundled Python script; no external install downloads or package managers are used. The script uses only stdlib urllib/ssl for network calls — no high-risk remote install steps observed.
Credentials
Metadata declares any of MINIMAX_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY (with OPENAI_API_KEY as primary) and optional OPENAI_BASE_URL — these align with the providers implemented in the script. No unrelated secrets or excessive environment access are requested.
Persistence & Privilege
The always flag is false, and the skill does not request elevated platform privileges. It writes output artifacts (results.tsv, dashboard.html, SKILL.md.baseline) to the working directory, which is expected for its function and scoped to its own outputs.
Assessment
This skill appears to do what it claims: mutate and test prompts by calling LLM APIs. Before installing or running:

  • The target prompt, your test inputs, and generated variants will be sent to the LLM provider you supply (OpenAI/Minimax/Anthropic or a compatible endpoint); do NOT point it at prompts containing secrets, credentials, or private data you don't want transmitted.
  • Limit which API key you provide and monitor usage/costs, because the tool runs many model calls.
  • Review autoresearch.py (network endpoints and logging) if you need to verify that no unexpected external hosts are used; you can set OPENAI_BASE_URL to a self-hosted compatible endpoint if you prefer.
  • The script backs up the original SKILL.md to SKILL.md.baseline, but still treat file writes as local modifications and run in a controlled workspace.

Overall, the components are coherent and proportionate to the stated purpose.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

OS: macOS · Linux
Binaries (any of): python3, python
Primary env: OPENAI_API_KEY
Latest version: vk97eyhrpg94b5r68jjmwx5h2yx85de0v
90 downloads · 0 stars · 1 version · updated 5d ago
v0.2.5 · MIT-0 · macOS, Linux

brainforge-autoresearch

Previously published as autoresearch / openclaw-autoresearch. Renamed for the brainforge marketplace rollout — functionality unchanged.

Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.
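
The loop below is a minimal sketch of that pattern. The helper callables (score, propose_mutation) are hypothetical placeholders; this is an illustration of the idea, not the actual autoresearch.py implementation.

# A minimal sketch of the hypothesis -> mutate -> evaluate loop. The helper
# callables are hypothetical placeholders, not autoresearch.py internals.
def autoresearch_loop(baseline_prompt, score, propose_mutation, max_experiments=30):
    """score(prompt) -> float in [0, 1]; propose_mutation(prompt) -> new prompt text."""
    best_prompt = baseline_prompt
    best_score = score(best_prompt)                 # establish the baseline
    history = [("baseline", best_score)]
    for i in range(max_experiments):
        candidate = propose_mutation(best_prompt)   # hypothesis-driven variant
        candidate_score = score(candidate)          # evaluate against the evals
        history.append((f"experiment_{i}", candidate_score))
        if candidate_score > best_score:            # keep only improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score, history

Keeping the per-experiment history mirrors the results.tsv output described in Step 4 below.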

When to use

  • User says "optimize this skill" or "optimize this skill's prompt"
  • User wants to benchmark or compare prompt variants
  • User says "run autoresearch on X", "eval skill X", or "improve skill X"
  • User is unhappy with a skill's output quality and wants systematic improvement

Do not use:

  • One-off prompt tweaks: just edit the prompt directly
  • Debugging a specific failure case: investigate the root cause instead
  • The skill's script itself has a bug (a code-logic problem, not a prompt problem): fix the code, not the prompt

Requirements

  • Python 3.10+
  • autoresearch.py script in the skill directory
  • LLM API access (MiniMax, OpenAI, or Anthropic); a small pre-flight check is sketched after this list
  • Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)
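
As an optional pre-flight check (not part of the skill itself), you can confirm that one of the provider keys named above is set before the first run:

import os
import sys

# Optional pre-flight check: verify that one of the provider keys declared in
# the skill metadata is available before running autoresearch.py.
PROVIDER_KEYS = ("OPENAI_API_KEY", "MINIMAX_API_KEY", "ANTHROPIC_API_KEY")

if not any(os.environ.get(key) for key in PROVIDER_KEYS):
    sys.exit("Set one of: " + ", ".join(PROVIDER_KEYS))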

Procedure

Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review the results and apply the best prompt.

Step 1: Gather context

Before running, you need:

  • --target: path to the skill directory or prompt file to optimize. Example: ../workspace/skills/brain-search/SKILL.md
  • --evals: path to the eval definition JSON file. Example: eval.json
  • --provider: LLM provider for running experiments. Values: minimax (default), openai, anthropic
  • --runs: number of runs per experiment, for statistical significance. Default: 5
  • --max-experiments: maximum number of experiments before stopping. Default: 30
  • --dashboard: open the live results dashboard in a browser. Flag, takes no value

Step 2: Create eval.json

Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.

{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}

Rule types:

  • regex (pattern): pass if the regex matches the output
  • banned_phrases (phrases, a list): pass if NONE of the phrases appear
  • word_count (min, max; optional): pass if the word count is within range
  • contains (values, a list; optional match: "any" (default) or "all"): pass if any/all of the values appear in the output (case-insensitive)
  • not_contains (values, a list): pass if NONE of the values appear in the output (case-insensitive)
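
The snippet below is a plain-Python sketch of what each rule type checks, based on the descriptions above; the actual logic lives in autoresearch.py and may differ in detail.

import re

# Illustrative sketch of the rule checks described above; the real
# implementation is in autoresearch.py and may differ.
def check_rule(eval_def, output):
    rule = eval_def["rule"]
    text = output.lower()
    if rule == "regex":
        return re.search(eval_def["pattern"], output) is not None
    if rule == "banned_phrases":
        return not any(p.lower() in text for p in eval_def["phrases"])
    if rule == "word_count":
        words = len(output.split())
        return eval_def.get("min", 0) <= words <= eval_def.get("max", float("inf"))
    if rule == "contains":
        op = all if eval_def.get("match", "any") == "all" else any
        return op(v.lower() in text for v in eval_def["values"])
    if rule == "not_contains":
        return not any(v.lower() in text for v in eval_def["values"])
    raise ValueError(f"unknown rule: {rule}")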

LLM eval type:

  • type: must be "llm"
  • name: unique name for this eval
  • question: what to ask the judge LLM about the output
  • pass_description: what a passing output looks like
  • fail_description: what a failing output looks like
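
One plausible way a judge call could use these fields is sketched below; this is an assumption for illustration, not the exact prompt that autoresearch.py sends.

# Hypothetical judge-prompt construction from an LLM eval definition
# (an assumption for illustration, not the script's actual prompt).
def build_judge_prompt(eval_def, output):
    return (
        f"{eval_def['question']}\n\n"
        f"PASS means: {eval_def['pass_description']}\n"
        f"FAIL means: {eval_def['fail_description']}\n\n"
        f"Output to judge:\n{output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )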

See eval-guide.md for detailed guidance on writing effective evals.

Step 3: Run autoresearch

python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard

Step 4: Review results and apply changes

The script writes results to results.tsv in the working directory. Each row is one experiment:

experiment_id  parent_id  mutation_description  avg_score  pass_rate  evals_detail  prompt_diff

Find the best performing variant:

sort -t$'\t' -k4 -nr results.tsv | head -5

Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.
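
If you prefer a tab-aware helper to the sort pipeline above, a hypothetical script like the one below (not shipped with the skill, and assuming results.tsv carries the header row shown above) lists the top experiments:

import csv

# Hypothetical helper (not shipped with the skill): print the top experiments
# from results.tsv, assuming the header row shown above is present.
def top_experiments(path="results.tsv", n=5):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    rows.sort(key=lambda r: float(r["avg_score"]), reverse=True)
    for row in rows[:n]:
        print(row["experiment_id"], row["avg_score"], row["mutation_description"])
    return rows[:n]

if __name__ == "__main__":
    top_experiments()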

Example: optimizing brain-search

User: brain-search results often lack source links; help me optimize it.

Full workflow:

1. Create eval.json:
   {
     "test_inputs": [
       "search for latest news on OpenAI",
       "搜一下最新的 AI 芯片进展",
       "find recent papers on RAG optimization",
       "what happened with Anthropic this week",
       "查查 GPU 价格趋势"
     ],
     "evals": [
       {
         "name": "has_urls",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+"
       },
       {
         "name": "min_2_sources",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+.*https?://[^\\s]+"
       },
       {
         "name": "structured_output",
         "type": "llm",
         "question": "Is the output well-structured with clear sections?",
         "pass_description": "Output uses clear structure like bullets or headers",
         "fail_description": "Output is a wall of text without clear structure"
       }
     ]
   }

2. Run the command:
   python autoresearch.py \
     --target ../workspace/skills/brain-search/SKILL.md \
     --evals eval.json \
     --runs 5 \
     --max-experiments 20

3. Review and apply the results:
   - Check results.tsv for the highest-scoring variant
   - Read mutation_description to understand the key changes
   - Apply the best prompt to the original SKILL.md

Failure handling

  • LLM API rate limit: the script auto-retries with backoff; if the problem persists, reduce --runs
  • Target file not found: check the path; it must point to a readable prompt/skill file
  • All experiments score 0: the evals may be too strict; review the eval definitions and loosen the criteria
  • Script crashes mid-run: results already written to results.tsv are preserved; a re-run continues from there

Gotchas

  • Each experiment makes many LLM calls (runs × test_inputs × llm_evals); watch API usage (a worked example follows this list)
  • LLM evals are noisy; set --runs to 5 or higher for statistical significance
  • Rule evals are more stable and cheaper than LLM evals; prefer them where possible
  • A very low baseline score (< 20%) usually means the eval definitions are off; fix the evals first
  • Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results)
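
As a rough cost estimate using the Step 2 example (3 test inputs, 1 LLM eval) with the default --runs 5: each experiment makes about 5 × 3 = 15 generation calls plus up to 15 judge calls, so a full run of 30 experiments is on the order of 900 API calls. Exact counts depend on how autoresearch.py batches and retries, so treat this as an order-of-magnitude figure.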
