Brainforge Autoresearch

v0.2.5

Use when user wants to optimize, improve, benchmark, or evaluate a skill's prompt. Triggers on "optimize skill", "improve skill prompt", "benchmark skill", "...

by ZHANG Ning (@zning1994)

Install

OpenClaw Prompt Flow

Install with OpenClaw

Best for remote or guided setup. Copy the exact prompt, then paste it into OpenClaw for zning1994/brainforge-autoresearch.

Prompt preview: Install & Setup
Install the skill "Brainforge Autoresearch" (zning1994/brainforge-autoresearch) from ClawHub.
Skill page: https://clawhub.ai/zning1994/brainforge-autoresearch
Keep the work scoped to this skill only.
After install, inspect the skill metadata and help me finish setup.
Use only the metadata you can verify from ClawHub; do not invent missing requirements.
Ask before making any broader environment changes.

Command Line

CLI Commands

Use the direct CLI path if you want to install manually and keep every step visible.

OpenClaw CLI

Bare skill slug

openclaw skills install brainforge-autoresearch

ClawHub CLI


npx clawhub@latest install brainforge-autoresearch
Security Scan
Capability signals
Crypto · Can make purchases · Requires sensitive credentials
These labels describe what authority the skill may exercise. They are separate from suspicious or malicious moderation verdicts.
VirusTotal: Benign
OpenClaw: Benign (high confidence)
Purpose & Capability
The name and description (prompt optimizer) match the included script and docs. Requesting an LLM API key (OPENAI_API_KEY / MINIMAX_API_KEY / ANTHROPIC_API_KEY) and Python is appropriate for a tool that runs experiments against LLM providers.
Instruction Scope
SKILL.md and autoresearch.py instruct the agent to read a target SKILL.md (or prompt file), generate mutations, run them against test inputs, and upload those requests to the configured LLM provider. This is within scope, but important to note: the tool will transmit the target prompt, test inputs, and generated mutations to third-party LLM endpoints as part of evaluation (expected behavior for this tool).
Install Mechanism
Instruction-only skill with a bundled Python script; no external install downloads or package managers are used. The script uses only stdlib urllib/ssl for network calls — no high-risk remote install steps observed.
Credentials
Metadata declares any of MINIMAX_API_KEY, OPENAI_API_KEY, or ANTHROPIC_API_KEY (with OPENAI_API_KEY as primary) and optional OPENAI_BASE_URL — these align with the providers implemented in the script. No unrelated secrets or excessive environment access are requested.
Persistence & Privilege
The always flag is false, and the skill does not request elevated platform privileges. It writes output artifacts (results.tsv, dashboard.html, SKILL.md.baseline) to the working directory, which is expected for its function and scoped to its own outputs.
Assessment
This skill appears to do what it claims: mutate and test prompts by calling LLM APIs. Before installing or running:

  • The target prompt, your test inputs, and generated variants will be sent to the LLM provider you supply (OpenAI/Minimax/Anthropic or a compatible endpoint); do NOT point it at prompts containing secrets, credentials, or private data you don't want transmitted.
  • Limit which API key you provide and monitor usage/costs, because the tool runs many model calls.
  • Review autoresearch.py (network endpoints and logging) if you need to verify that no unexpected external hosts are used; you can set OPENAI_BASE_URL to a self-hosted compatible endpoint if you prefer.
  • The script backs up the original SKILL.md to SKILL.md.baseline, but still treat file writes as local modifications and run in a controlled workspace.

Overall, the components are coherent and proportionate to the stated purpose.

Like a lobster shell, security has layers — review code before you run it.

Runtime requirements

OS: macOS · Linux
Binaries (any of): python3, python
Primary env: OPENAI_API_KEY
Latest version: vk97eyhrpg94b5r68jjmwx5h2yx85de0v
90 downloads · 0 stars · 1 version · updated 5d ago
v0.2.5 · MIT-0 · macOS, Linux

brainforge-autoresearch

Previously published as autoresearch / openclaw-autoresearch. Renamed for the brainforge marketplace rollout — functionality unchanged.

Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.
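
The loop below is a minimal sketch of that pattern. The helper callables (score, propose_mutation) are hypothetical placeholders; this is an illustration of the idea, not the actual autoresearch.py implementation.

# A minimal sketch of the hypothesis -> mutate -> evaluate loop. The helper
# callables are hypothetical placeholders, not autoresearch.py internals.
def autoresearch_loop(baseline_prompt, score, propose_mutation, max_experiments=30):
    """score(prompt) -> float in [0, 1]; propose_mutation(prompt) -> new prompt text."""
    best_prompt = baseline_prompt
    best_score = score(best_prompt)                 # establish the baseline
    history = [("baseline", best_score)]
    for i in range(max_experiments):
        candidate = propose_mutation(best_prompt)   # hypothesis-driven variant
        candidate_score = score(candidate)          # evaluate against the evals
        history.append((f"experiment_{i}", candidate_score))
        if candidate_score > best_score:            # keep only improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score, history

Keeping the per-experiment history mirrors the results.tsv output described in Step 4 below.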

When to use

  • User says "optimize this skill" or "optimize this skill's prompt"
  • User wants to benchmark or compare prompt variants
  • User says "run autoresearch on X", "eval skill X", or "improve skill X"
  • User is unhappy with a skill's output quality and wants systematic improvement

Do not use:

  • One-off prompt tweaks: just edit the prompt directly
  • Debugging a specific failure case: investigate the root cause instead
  • The skill's script itself has a bug (a code-logic problem, not a prompt problem): fix the code, not the prompt

Requirements

  • Python 3.10+
  • autoresearch.py script in the skill directory
  • LLM API access (MiniMax, OpenAI, or Anthropic); a small pre-flight check is sketched after this list
  • Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)
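
As an optional pre-flight check (not part of the skill itself), you can confirm that one of the provider keys named above is set before the first run:

import os
import sys

# Optional pre-flight check: verify that one of the provider keys declared in
# the skill metadata is available before running autoresearch.py.
PROVIDER_KEYS = ("OPENAI_API_KEY", "MINIMAX_API_KEY", "ANTHROPIC_API_KEY")

if not any(os.environ.get(key) for key in PROVIDER_KEYS):
    sys.exit("Set one of: " + ", ".join(PROVIDER_KEYS))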

Procedure

Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review the results and apply the best prompt.

Step 1: Gather context

Before running, you need:

  • --target: path to the skill directory or prompt file to optimize. Example: ../workspace/skills/brain-search/SKILL.md
  • --evals: path to the eval definition JSON file. Example: eval.json
  • --provider: LLM provider for running experiments. Values: minimax (default), openai, anthropic
  • --runs: number of runs per experiment, for statistical significance. Default: 5
  • --max-experiments: maximum number of experiments before stopping. Default: 30
  • --dashboard: open the live results dashboard in a browser. Flag, takes no value

Step 2: Create eval.json

Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.

{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}

Rule types:

  • regex (pattern): pass if the regex matches the output
  • banned_phrases (phrases, a list): pass if NONE of the phrases appear
  • word_count (min, max; optional): pass if the word count is within range
  • contains (values, a list; optional match: "any" (default) or "all"): pass if any/all of the values appear in the output (case-insensitive)
  • not_contains (values, a list): pass if NONE of the values appear in the output (case-insensitive)
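
The snippet below is a plain-Python sketch of what each rule type checks, based on the descriptions above; the actual logic lives in autoresearch.py and may differ in detail.

import re

# Illustrative sketch of the rule checks described above; the real
# implementation is in autoresearch.py and may differ.
def check_rule(eval_def, output):
    rule = eval_def["rule"]
    text = output.lower()
    if rule == "regex":
        return re.search(eval_def["pattern"], output) is not None
    if rule == "banned_phrases":
        return not any(p.lower() in text for p in eval_def["phrases"])
    if rule == "word_count":
        words = len(output.split())
        return eval_def.get("min", 0) <= words <= eval_def.get("max", float("inf"))
    if rule == "contains":
        op = all if eval_def.get("match", "any") == "all" else any
        return op(v.lower() in text for v in eval_def["values"])
    if rule == "not_contains":
        return not any(v.lower() in text for v in eval_def["values"])
    raise ValueError(f"unknown rule: {rule}")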

LLM eval type:

  • type: must be "llm"
  • name: unique name for this eval
  • question: what to ask the judge LLM about the output
  • pass_description: what a passing output looks like
  • fail_description: what a failing output looks like
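
One plausible way a judge call could use these fields is sketched below; this is an assumption for illustration, not the exact prompt that autoresearch.py sends.

# Hypothetical judge-prompt construction from an LLM eval definition
# (an assumption for illustration, not the script's actual prompt).
def build_judge_prompt(eval_def, output):
    return (
        f"{eval_def['question']}\n\n"
        f"PASS means: {eval_def['pass_description']}\n"
        f"FAIL means: {eval_def['fail_description']}\n\n"
        f"Output to judge:\n{output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )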

See eval-guide.md for detailed guidance on writing effective evals.

Step 3: Run autoresearch

python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard

Step 4: Review results and apply changes

The script writes results to results.tsv in the working directory. Each row is one experiment:

experiment_id  parent_id  mutation_description  avg_score  pass_rate  evals_detail  prompt_diff

Find the best performing variant:

sort -t$'\t' -k4 -nr results.tsv | head -5

Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.
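
If you prefer a tab-aware helper to the sort pipeline above, a hypothetical script like the one below (not shipped with the skill, and assuming results.tsv carries the header row shown above) lists the top experiments:

import csv

# Hypothetical helper (not shipped with the skill): print the top experiments
# from results.tsv, assuming the header row shown above is present.
def top_experiments(path="results.tsv", n=5):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    rows.sort(key=lambda r: float(r["avg_score"]), reverse=True)
    for row in rows[:n]:
        print(row["experiment_id"], row["avg_score"], row["mutation_description"])
    return rows[:n]

if __name__ == "__main__":
    top_experiments()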

Example: optimizing brain-search

User: brain-search results often lack source links; help me optimize it.

Full workflow:

1. Create eval.json:
   {
     "test_inputs": [
       "search for latest news on OpenAI",
       "搜一下最新的 AI 芯片进展",
       "find recent papers on RAG optimization",
       "what happened with Anthropic this week",
       "查查 GPU 价格趋势"
     ],
     "evals": [
       {
         "name": "has_urls",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+"
       },
       {
         "name": "min_2_sources",
         "type": "rule",
         "rule": "regex",
         "pattern": "https?://[^\\s]+.*https?://[^\\s]+"
       },
       {
         "name": "structured_output",
         "type": "llm",
         "question": "Is the output well-structured with clear sections?",
         "pass_description": "Output uses clear structure like bullets or headers",
         "fail_description": "Output is a wall of text without clear structure"
       }
     ]
   }

2. Run the command:
   python autoresearch.py \
     --target ../workspace/skills/brain-search/SKILL.md \
     --evals eval.json \
     --runs 5 \
     --max-experiments 20

3. Review and apply the results:
   - Check results.tsv for the highest-scoring variant
   - Read mutation_description to understand the key changes
   - Apply the best prompt to the original SKILL.md

Failure handling

  • LLM API rate limit: the script auto-retries with backoff; if the problem persists, reduce --runs
  • Target file not found: check the path; it must point to a readable prompt/skill file
  • All experiments score 0: the evals may be too strict; review the eval definitions and loosen the criteria
  • Script crashes mid-run: results already written to results.tsv are preserved; a re-run continues from there

Gotchas

  • Each experiment makes many LLM calls (runs × test_inputs × llm_evals); watch API usage (a worked example follows this list)
  • LLM evals are noisy; set --runs to 5 or higher for statistical significance
  • Rule evals are more stable and cheaper than LLM evals; prefer them where possible
  • A very low baseline score (< 20%) usually means the eval definitions are off; fix the evals first
  • Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results)
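
As a rough cost estimate using the Step 2 example (3 test inputs, 1 LLM eval) with the default --runs 5: each experiment makes about 5 × 3 = 15 generation calls plus up to 15 judge calls, so a full run of 30 experiments is on the order of 900 API calls. Exact counts depend on how autoresearch.py batches and retries, so treat this as an order-of-magnitude figure.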
