Install

```
openclaw skills install agent-research-harness
```

Version: 1.0.0

Cognitive discipline for AI-native scientific experimentation — guardrails, not recipes. Trigger when setting up controlled experiments with LLM agents, designing reproducible evaluation pipelines, or structuring research workspaces for long-running agent collaboration. This skill teaches agents how to reason about experiments, not which commands to run.
Trigger this skill when the user:

- is setting up controlled experiments with LLM agents
- is designing a reproducible evaluation pipeline
- is structuring a research workspace for long-running agent collaboration
This skill does not prescribe what experiment to run. It prescribes how to think while running it.
Research agents fail not because they lack capability, but because they:

- scale up before the smallest loop produces a distinguishable signal
- change several variables at once, leaving results unattributable
- trust a single scoring system as ground truth
- chase p-values while ignoring effect size
- question the hypothesis before verifying the execution chain
The antidote is cognitive discipline — a set of non-negotiable mental habits enforced by repo structure, not by prompt reminders. Detailed reasoning for each discipline is in references/scientific-thinking.md.
| # | Discipline | Core Question | Deep dive |
|---|---|---|---|
| 1 | Minimum Closed Loop Before Scale | Can the smallest version produce distinguishable signals? | references/experiment-design.md |
| 2 | Isolated Variables & Attributable Baselines | Does each group add exactly one variable? | references/experiment-design.md |
| 3 | Dual-Track Validation | Do two independent scoring systems agree? | references/scoring-statistics.md |
| 4 | Effect Size Over Significance | What is the magnitude, not just the p-value? | references/scoring-statistics.md |
| 5 | Pipeline Before Interpretation | Was the execution chain verified before the hypothesis was questioned? | references/scientific-thinking.md |
Disciplines 1-2: experiment design. 3-4: scoring & statistics. 5: critical reasoning.
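As an illustration of Discipline 4, a minimal sketch of reporting magnitude alongside significance. The data are invented for the example; the standardized effect size here is Cohen's d with a pooled standard deviation:

```python
# Hypothetical scores from a baseline group and one treatment group.
baseline = [0.62, 0.58, 0.71, 0.64, 0.60, 0.66, 0.59, 0.63]
treatment = [0.68, 0.72, 0.65, 0.74, 0.70, 0.69, 0.73, 0.71]

def mean(xs):
    return sum(xs) / len(xs)

def pooled_sd(a, b):
    # Pooled standard deviation for two independent samples.
    va = sum((x - mean(a)) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mean(b)) ** 2 for x in b) / (len(b) - 1)
    n = len(a) + len(b) - 2
    return (((len(a) - 1) * va + (len(b) - 1) * vb) / n) ** 0.5

def cohens_d(a, b):
    # Standardized magnitude of the difference, independent of sample size.
    return (mean(b) - mean(a)) / pooled_sd(a, b)

d = cohens_d(baseline, treatment)
print(f"Cohen's d = {d:.2f}")
```

A p-value alone can look impressive on a large sample even when d is negligible; reporting both keeps the magnitude question in view.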
| # | Rule | Principle |
|---|---|---|
| 1 | Human Owns Direction; Agent Owns Execution | Agent cannot change research questions, promote evidence without review, or make academic decisions |
| 2 | Evidence Has Status; AI Output Is Not Fact | All AI-generated evidence starts as candidate; only back-to-source verification promotes to verified |
| 3 | Failed Runs Are Data, Not Trash | Register every run in the manifest; failures are process evidence against survivorship bias |
| 4 | Protected Surfaces Change Only By Proposal | Baselines, rubrics, raw results, and schema require version bump + documented proposal |
| 5 | Every Handoff Needs an Alignment Doc | Short doc replaces long chat history for agent onboarding |
Details in references/agent-collaboration.md.
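Rule 2 can be sketched as a tiny state machine. The class and function names below are illustrative assumptions, not an API this skill ships; the invariant is that evidence is born `candidate` and only an explicit back-to-source check promotes it:

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    claim: str
    source: str                # where the claim can be re-checked
    status: str = "candidate"  # AI output never starts as "verified"

def promote(ev: Evidence, source_checked: bool) -> Evidence:
    """Promote only after the claim was re-read against its source."""
    if not source_checked:
        raise ValueError("back-to-source verification required before promotion")
    ev.status = "verified"
    return ev

ev = Evidence(claim="Group B beats baseline by 7 points",
              source="experiments/results/manifest.csv")
promote(ev, source_checked=True)
print(ev.status)
```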
Goal: Set up the three-layer repo and root entry files.
- `thinking-space/` — research direction, claims, decisions (human)
- `execution-layer/` — briefs, logs, results, drafts (agent)
- `code-workshop/` — runnable artifacts, packages

Root files: `AGENTS.md` (workspace map), `PLAN.md` (phase panel), `WORKFLOW.md` (procedure), `harness/README.md` (governance).
Directory skeleton and rationale: references/repo-architecture.md.
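A minimal scaffolding sketch for this phase. Only the layer and root-file names come from this README; everything else (the workspace name, the idempotent `touch`) is an assumption — the authoritative skeleton is in references/repo-architecture.md:

```python
from pathlib import Path

LAYERS = ["thinking-space", "execution-layer", "code-workshop"]
ROOT_FILES = ["AGENTS.md", "PLAN.md", "WORKFLOW.md", "harness/README.md"]

def scaffold(root: Path) -> None:
    # Create the three layers, then the root entry files (empty stubs).
    for layer in LAYERS:
        (root / layer).mkdir(parents=True, exist_ok=True)
    for name in ROOT_FILES:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)

scaffold(Path("my-research-workspace"))  # workspace name is hypothetical
```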
Goal: Make the repo self-checking before formal execution.
- Give every module a `CONTRACT.md` (purpose, inputs, outputs, invariants, local validator). Template in references/repo-architecture.md.
- Add `scripts/validate_<module>.py` per module, with `scripts/validate_repo_state.py` as aggregator. Gate rule: 0 FAIL before any formal run.
- Keep `experiments/results/manifest.csv` as the run-level provenance ledger (run_id, wave, task_id, group, model, version metadata, status, retry_of, git_commit).

Goal: Design attributable controlled experiments.
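A hedged sketch of the aggregator gate from the self-checking phase above. The discovery pattern (globbing `validate_*.py` under `scripts/`) is an assumption about layout, not a prescribed interface; the invariant it enforces is the gate rule, 0 FAIL before a formal run:

```python
import subprocess
import sys
from pathlib import Path

def run_validators(scripts_dir: Path = Path("scripts")) -> int:
    """Run every per-module validator; return the number of failures."""
    failures = 0
    for script in sorted(scripts_dir.glob("validate_*.py")):
        if script.name == "validate_repo_state.py":
            continue  # skip the aggregator itself
        result = subprocess.run([sys.executable, str(script)])
        status = "PASS" if result.returncode == 0 else "FAIL"
        print(f"{status}  {script.name}")
        failures += result.returncode != 0
    return failures

failures = run_validators()
print("gate:", "0 FAIL" if failures == 0 else f"{failures} FAIL")
```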
- Design groups per references/experiment-design.md so that each group adds exactly one variable over an attributable baseline.
- Write gold checklists with `must_include`, `forbidden`, and `scoring_notes`.

Goal: Run experiments, score, compute statistics, analyze errors.
Preflight gate: local validators must pass. Then:
- Score and analyze per references/scoring-statistics.md (dual-track validation, effect size).
- Provide a `--reproduce` flag for one-click reproducibility.

Goal: Package results for the next phase or agent.
- Route changes to protected surfaces through `sync/upstream_proposals/` first. Template in references/agent-collaboration.md.
- Mark unresolved items with `[REF-MISSING]`, `[CRITICAL-CHECK]`, `[TODO]`. Never use AI numbers without verification.

References:

- references/repo-architecture.md — three-layer repo, module contracts, manifest, validators
- references/experiment-design.md — progressive building, controlled groups, gold checklists
- references/scoring-statistics.md — dual-track validation, effect size, reproducibility
- references/scientific-thinking.md — cognitive disciplines for agent-led research
- references/agent-collaboration.md — governance, evidence status, alignment docs