Install
openclaw skills install agentic-research-harness
Version: 1.3.0
Cognitive discipline for AI-native scientific experimentation. Trigger when setting up controlled experiments with LLM agents, designing reproducible evaluation pipelines, or structuring research workspaces for long-running agent collaboration. Provides guardrails, not recipes: it teaches agents how to reason about experiments, not which commands to run.
Trigger this skill when the user:
This skill does not prescribe what experiment to run. It prescribes how to think while running it.
Research agents fail not because they lack capability, but because they:
The antidote is cognitive discipline — a set of non-negotiable mental habits enforced by repo structure, not by prompt reminders. Detailed reasoning for each discipline is in references/scientific-thinking.md.
A critical addition from real-world practice: governance before scale. When a closed loop produces strong signals, the most tempting mistake is to expand immediately. The correct move is to lock down claim boundaries, audit artifacts, fix provenance gaps, and only then scale. A dedicated calibration phase between closed-loop and full-scale expansion is a sign of maturity, not delay.
| # | Discipline | Core Question | Deep dive |
|---|---|---|---|
| 1 | Minimum Closed Loop Before Scale | Can the smallest version produce distinguishable signals? | references/experiment-design.md |
| 2 | Isolated Variables & Attributable Baselines | Does each group add exactly one variable? | references/experiment-design.md |
| 3 | Dual-Track Validation | Do two independent scoring systems agree? | references/scoring-statistics.md |
| 4 | Effect Size Over Significance | What is the magnitude, not just the p-value? | references/scoring-statistics.md |
| 5 | Pipeline Before Interpretation | Was the execution chain verified before the hypothesis was questioned? | references/scientific-thinking.md |
| 6 | Theoretical Grounding Before Design | Can every design decision trace to a published precedent or an explicitly stated hypothesis? | references/methodology-grounding.md |
Disciplines 1-2: experiment design. 3-4: scoring & statistics. 5: critical reasoning. 6: methodology accountability.
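Disciplines 3 and 4 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function names `cohens_d` and `track_agreement` are hypothetical, and the choice of Cohen's d and a pairwise ordering check is one reasonable option, not necessarily the statistics prescribed in references/scoring-statistics.md.

```python
import math
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation.

    Assumption: Cohen's d is used here as the effect size measure.
    """
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled


def track_agreement(track_a, track_b):
    """Fraction of item pairs that both scoring tracks rank in the same order.

    Pairs tied in either track are skipped. Returns NaN if no pair is comparable.
    """
    concordant = 0
    total = 0
    for i in range(len(track_a)):
        for j in range(i + 1, len(track_a)):
            diff_a = track_a[i] - track_a[j]
            diff_b = track_b[i] - track_b[j]
            if diff_a == 0 or diff_b == 0:
                continue
            total += 1
            if (diff_a > 0) == (diff_b > 0):
                concordant += 1
    return concordant / total if total else float("nan")
```

For example, `cohens_d([2, 4, 6], [1, 3, 5])` evaluates to 0.5: the means differ by 1 and the pooled standard deviation is 2. Reporting that magnitude next to any p-value, and checking that two independent scoring tracks order the runs the same way, is the habit the two disciplines ask for.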
| # | Rule | Principle |
|---|---|---|
| 1 | Human Owns Direction; Agent Owns Execution | Agent cannot change research questions, promote evidence without review, or make academic decisions |
| 2 | Evidence Has Status; AI Output Is Not Fact | All AI-generated evidence starts as candidate; only back-to-source verification promotes to verified |
| 3 | Failed Runs Are Data, Not Trash | Register every run in the manifest; failures are process evidence against survivorship bias |
| 4 | Protected Surfaces Change Only By Proposal | Baselines, rubrics, raw results, and schema require version bump + documented proposal |
| 5 | Every Handoff Needs an Alignment Doc | Short doc replaces long chat history for agent onboarding |
| 6 | Calibrate Before Scaling | After a closed loop produces strong signals, lock down claim boundaries, audit artifacts, and fix provenance before expanding to full scale |
| 7 | Methodology Review Gate | Any "novel" design (no direct precedent) must have a Design Justification Document with ≥2 published precedents or explicit assumption declarations before entering experiment execution |
Details in references/agent-collaboration.md.
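Rule 2 (evidence has status) lends itself to a small data model. The sketch below is an assumption about how it might look: the `Evidence` class, the `promote` function, and its parameters are hypothetical names, not part of the skill.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    CANDIDATE = "candidate"
    VERIFIED = "verified"


@dataclass
class Evidence:
    claim: str
    source: str  # where the claim can be checked against the original material
    status: Status = Status.CANDIDATE  # all AI-generated evidence starts here
    history: list = field(default_factory=list)


def promote(evidence, reviewer, source_checked):
    """Promote evidence to verified only after a back-to-source check.

    Hypothetical helper: `reviewer` names the human who performed the check.
    """
    if not reviewer:
        raise ValueError("promotion requires a named human reviewer")
    if not source_checked:
        raise ValueError("back-to-source verification is required before promotion")
    evidence.history.append((evidence.status.value, Status.VERIFIED.value, reviewer))
    evidence.status = Status.VERIFIED
    return evidence
```

Keeping the promotion path in one function means an agent cannot mark its own output as fact by editing a field in passing; the transition is always recorded with a reviewer.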
Goal: Set up the three-layer repo and root entry files.
- thinking-space/: research direction, claims, decisions (human)
- execution-layer/: briefs, logs, results, drafts (agent)
- code-workshop/: runnable artifacts, packages

Root files: AGENTS.md (workspace map), PLAN.md (phase panel), WORKFLOW.md (procedure), harness/README.md (governance).
Directory skeleton and rationale: references/repo-architecture.md.
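The three layers and root files named above can be scaffolded with a short script. This is a sketch only: the `scaffold` function is a hypothetical helper, it creates empty files, and the authoritative skeleton is the one in references/repo-architecture.md.

```python
from pathlib import Path

# Directory and file names are taken from the text above.
LAYERS = ["thinking-space", "execution-layer", "code-workshop"]
ROOT_FILES = ["AGENTS.md", "PLAN.md", "WORKFLOW.md", "harness/README.md"]


def scaffold(root):
    """Create the three layers and empty root files; never overwrite existing files.

    Returns the list of root files that were newly created.
    """
    root = Path(root)
    created = []
    for layer in LAYERS:
        (root / layer).mkdir(parents=True, exist_ok=True)
    for rel in ROOT_FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
            created.append(rel)
    return created
```

Running `scaffold(".")` twice is safe: the second call creates nothing and returns an empty list.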
Goal: Make the repo self-checking before formal execution.
- CONTRACT.md (purpose, inputs, outputs, invariants, local validator). Template in references/repo-architecture.md.
- scripts/validate_<module>.py per module; scripts/validate_repo_state.py as aggregator. Gate rule: 0 FAIL before any formal run.
- experiments/results/manifest.csv as run-level provenance ledger (run_id, wave, task_id, group, model, version metadata, status, retry_of, git_commit).

Goal: Design attributable controlled experiments with grounded methodology.
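Looking back at the self-checking setup, the manifest ledger and the 0 FAIL gate can be sketched in Python. The column list follows the fields named for experiments/results/manifest.csv; the single `version` column standing in for "version metadata", the function names `register_run` and `gate`, and the PASS/FAIL result dictionary are assumptions made for illustration.

```python
import csv
from pathlib import Path

# "version" is an assumed single column for the "version metadata" field.
FIELDS = [
    "run_id", "wave", "task_id", "group", "model",
    "version", "status", "retry_of", "git_commit",
]


def register_run(manifest_path, row):
    """Append one run to the ledger. Failed runs are registered like any other."""
    missing = [name for name in FIELDS if name not in row]
    if missing:
        raise ValueError(f"manifest row is missing fields: {missing}")
    path = Path(manifest_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({name: row[name] for name in FIELDS})


def gate(validator_results):
    """Gate rule: True only when no validator reported FAIL.

    `validator_results` maps a validator name to "PASS" or "FAIL" (assumed shape).
    """
    return all(result != "FAIL" for result in validator_results.values())
```

A retry is recorded as a new row whose `retry_of` points at the failed `run_id`, so the failed attempt stays in the ledger as process evidence.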
- references/experiment-design.md.
- must_include, forbidden, and scoring_notes. For multi-group experiments, separate gold checklists into planning_gold (items all groups can achieve) and evidence_gold (items only augmented groups can access).
- evidence_access_required (true/false). Score families with different access requirements must be reported separately, never mixed into a single total. Details in references/scoring-statistics.md.
- references/repo-architecture.md.
- references/methodology-grounding.md. This is required, not optional.

Goal: Run experiments, score, compute statistics, analyze errors.
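The gold checklist fields from the design step (must_include, forbidden, planning_gold, evidence_gold, evidence_access_required) can be pictured as plain dictionaries. The sketch below is an assumption about their layout, and it uses naive case-insensitive substring matching purely for illustration; the function names `score_family` and `score_answer` are hypothetical.

```python
def score_family(answer, gold):
    """Score one answer against one gold checklist.

    Assumed layout: gold has "must_include" and "forbidden" lists of strings
    and an "evidence_access_required" boolean.
    """
    text = answer.lower()
    required = gold["must_include"]
    hits = [item for item in required if item.lower() in text]
    violations = [item for item in gold["forbidden"] if item.lower() in text]
    coverage = len(hits) / len(required) if required else 0.0
    return {
        "coverage": coverage,
        "violations": violations,
        "evidence_access_required": gold["evidence_access_required"],
    }


def score_answer(answer, planning_gold, evidence_gold):
    """Report the two score families side by side; they are never summed."""
    return {
        "planning": score_family(answer, planning_gold),
        "evidence": score_family(answer, evidence_gold),
    }
```

Returning two separate entries instead of one total keeps a baseline group from being penalized for evidence items it had no way to access.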
Preflight gate: local validators must pass. Then:
- references/scoring-statistics.md.
- --reproduce flag for one-click reproducibility.

Goal: Package results for the next phase or agent.
- sync/upstream_proposals/ first. Template in references/agent-collaboration.md.
- [REF-MISSING], [CRITICAL-CHECK], [TODO]. Never use AI numbers without verification.

- references/repo-architecture.md: three-layer repo, module contracts, manifest, validators
- references/experiment-design.md: progressive building, controlled groups, gold checklists
- references/scoring-statistics.md: dual-track validation, effect size, reproducibility
- references/scientific-thinking.md: cognitive disciplines for agent-led research
- references/agent-collaboration.md: governance, evidence status, alignment docs
- references/methodology-grounding.md: theoretical grounding for schema, scoring, and experiment design