## Install

```
openclaw skills install meta-harness-evolver
```

End-to-end Meta-Harness evolution for Hoss (OpenClaw agent). Runs nightly at 3 AM via OpenClaw cron. Reads Hoss's current workspace configs (SOUL.md, IDENTITY.md, AGENTS.md, TOOLS.md, MEMORY.md), proposes harness modifications via a coding-agent proposer, evaluates against a benchmark, logs results to ~/hoss-evolution/, and posts a summary to the #research channel.
Implements the Meta-Harness paper's outer-loop optimization for Hoss, your OpenClaw agent. Each night at 3 AM CDT, this skill runs the loop below:
```
Proposer Agent ──(filesystem access)──► Hoss Workspace
      ▲                     │
      │                     │ propose harness
      │                     ▼
      │           Evaluate on benchmark
      │                     ▼
     log ───┴── store: code + scores + traces ──► ~/hoss-evolution/
```
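The control flow above can be sketched as a generic outer loop. This is an illustration only: `propose`, `evaluate`, and `log` are injected callables standing in for the skill's actual steps, not its real functions.

```python
def evolve(propose, evaluate, log, generations=1):
    """Outer loop: propose a candidate harness, score it, record the result.

    propose(n) -> candidate, evaluate(candidate) -> score, log(candidate, score).
    Returns the best (candidate, score) pair seen so far.
    """
    best = None
    for n in range(generations):
        candidate = propose(n)       # proposer agent writes a new harness
        score = evaluate(candidate)  # run the benchmark scenarios
        log(candidate, score)        # append to the run history
        if best is None or score > best[1]:
            best = (candidate, score)
    return best
```

The strictly-greater comparison keeps the earliest candidate on score ties, a simplification of the Pareto tie-breaking described later in this README.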
Scheduled via `openclaw cron`. Equivalent manual invocations:

```bash
SKILL=meta-harness-evolution TASK=run_evolution openclaw run
# or
openclaw run --skill meta-harness-evolver --task run_evolution
```
```
~/hoss-evolution/
├── best/                   # Best harness found so far
│   └── current/
├── candidates/             # All evaluated harnesses
│   └── candidate_N/        # One dir per candidate
│       ├── harness/        # The proposed config files (SOUL.md, etc.)
│       ├── eval_scores.json
│       └── traces/         # Execution traces
├── benchmark/              # Evaluation tasks + scorer
│   └── scenarios/          # ~20 diverse task scenarios
├── proposer/               # Proposer's workspace
│   └── logs/               # Proposer's own reasoning traces
└── evolution_log.jsonl     # Full run history
```
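For example, evolution_log.jsonl can be scanned for the top run. The record fields used here (`candidate`, `score`) are assumptions for illustration; the actual schema is whatever the evaluation scripts write.

```python
import json

def best_run(log_path):
    """Return the highest-scoring record in a JSONL run history.

    Assumes one JSON object per line with "candidate" and "score" keys.
    """
    best = None
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            if best is None or rec["score"] > best["score"]:
                best = rec
    return best
```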
Hoss's "harness" = the configs that wrap the LLM brain:
| File | What It Controls |
|---|---|
| SOUL.md | Core identity, personality, decision-making style |
| IDENTITY.md | Role, voice, tone, signature patterns |
| AGENTS.md | Sub-agent architecture, coordination protocol |
| TOOLS.md | Tool configurations, credentials, key hosts |
| MEMORY.md | Long-term memory structure, what to persist |
| HEARTBEAT.md | Active hours, check priorities, alert thresholds |
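A candidate is only worth evaluating if its harness/ dir carries these files. A minimal completeness check, mirroring the table above; whether all six files are strictly mandatory is an assumption:

```python
from pathlib import Path

REQUIRED = ["SOUL.md", "IDENTITY.md", "AGENTS.md",
            "TOOLS.md", "MEMORY.md", "HEARTBEAT.md"]

def missing_files(harness_dir):
    """Return the required harness files absent from a candidate's harness/ dir."""
    d = Path(harness_dir)
    return [name for name in REQUIRED if not (d / name).is_file()]
```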
Constraints (do NOT modify):
The skill text is the strongest lever — it steers the proposer. Iterating on the proposer's prompt/role description had more effect than changing iteration count or population size.
The benchmark lives at ~/hoss-evolution/benchmark/. See references/benchmark-design.md for how to design scenarios and references/harness-spec.md for the full harness spec.
The default benchmark has 20 scenarios across several categories:
Each scenario has:
The proposer is a coding-agent sub-agent (default: coder) that:
- Outputs to ~/hoss-evolution/candidates/ via filesystem ops

The proposer's role is defined by the task prompt in scripts/propose_harness.py. Key constraints:
```bash
# List all prior candidates
ls ~/hoss-evolution/candidates/

# Read best candidate
cat ~/hoss-evolution/best/current/eval_scores.json

# Read history log
tail -20 ~/hoss-evolution/evolution_log.jsonl
```
The sub-agent proposer reads ~/hoss-evolution/ and proposes a new harness; this step is triggered by `openclaw run` with this skill loaded.
```bash
# Quick syntax check
bash ~/hoss-evolution/scripts/validate.sh <candidate_dir>
```
```bash
# Evaluate candidate against all 20 scenarios
python3 ~/hoss-evolution/scripts/evaluate.py <candidate_dir>
# Scores + traces written to candidate dir automatically
# Evolution log updated

# Posts summary to #research
python3 ~/hoss-evolution/scripts/post_to_research.py <candidate_dir>
```
Final score = weighted average across scenarios:
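A sketch of the aggregation; the per-scenario weights here are placeholders, since this README does not list the actual weighting:

```python
def final_score(scenario_scores, weights):
    """Weighted average of per-scenario scores.

    scenario_scores: {scenario_id: score in [0, 1]}
    weights:         {scenario_id: relative weight}
    """
    total = sum(weights[s] for s in scenario_scores)
    return sum(score * weights[s] for s, score in scenario_scores.items()) / total
```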
Results are tracked as a Pareto frontier: for each candidate, log both score and "complexity" (size/diff of changes). Simpler harnesses that score equally get priority.
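The frontier test can be stated directly: a candidate survives unless some other candidate scores at least as well with no more complexity, and is strictly better on at least one axis. A sketch over hypothetical (name, score, complexity) tuples:

```python
def pareto_frontier(candidates):
    """Filter (name, score, complexity) tuples down to the non-dominated set.

    A candidate is dominated if another scores >= it with <= complexity
    and is strictly better on at least one of the two axes.
    """
    frontier = []
    for name, score, comp in candidates:
        dominated = any(
            s >= score and c <= comp and (s > score or c < comp)
            for other, s, c in candidates
            if other != name
        )
        if not dominated:
            frontier.append((name, score, comp))
    return frontier
```

Under this rule an equally-scoring but simpler harness dominates a more complex one, which is exactly the priority stated above.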
The proposer must use runtime=subagent, not ACP, because it needs filesystem access to ~/hoss-evolution/.