Install
openclaw skills install @snake-fan/07-experiment-design
Use when the user has a method idea or Committed Method Design and reads papers to design tasks, datasets, baselines, metrics, ablations, human evaluation, user studies, or a claim-to-evidence experiment plan.
Use this skill to turn a stable method's claims into an experiment plan. Treat the skill as a Claim -> Evidence -> Protocol -> Reviewer Defense converter: it does not complete the method, and it does not begin from benchmark, baseline, or metric lists. Baseline selection and metric selection are part of this workflow, not separate downstream skills.
If the Method Thesis, Mechanistic Claim, target failure, or intervention point is still unclear, route the user back to method commitment instead of inventing experiments for an unstable method.
Set {workspace-root} before creating, scanning, or updating artifacts:
- {workspace-root} to workspace (the repo-local workspace/ directory).
- workspace/ layer.
- {workspace-root} from that path and keep related artifacts under the same root.
Create or resume one experiment folder at:
{workspace-root}/experiment-designs/{field-slug}/{method-or-question-slug}/
Use references/workspace-structure.md for the artifact layout.
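The folder path above can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function names `slugify` and `experiment_folder`, the slug rule (lowercase, hyphen-joined alphanumeric runs), and the example field and method names are all assumptions.

```python
import re
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens (assumed slug rule)."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def experiment_folder(workspace_root: str, field: str, method_or_question: str) -> Path:
    """Build {workspace-root}/experiment-designs/{field-slug}/{method-or-question-slug}/."""
    return (
        Path(workspace_root)
        / "experiment-designs"
        / slugify(field)
        / slugify(method_or_question)
    )


# Hypothetical field and method names, used only to show the resulting layout.
folder = experiment_folder("workspace", "Dialogue Systems", "Belief-State Repair")
print(folder.as_posix())  # workspace/experiment-designs/dialogue-systems/belief-state-repair
```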
- {workspace-root}/experiment-designs/{field-slug}/{method-or-question-slug}/.
- source_experiment_context.md.
- claim_evidence_map.md.
- experiment_design.md with the proposed tasks, data, protocol, comparisons, human evaluation or user study plan when needed, expected evidence, reviewer-objection coverage, and result interpretation contract.
- baseline_pressure_matrix.md, choosing baselines that fairly pressure-test the method rather than merely filling a list.
- claim_metric_map.md, mapping each key claim to observables, metrics, measurement protocol, validity risks, and failure interpretation.
- ablation_and_controls.md, including component ablations, confound checks, stress tests, and negative or sanity checks.
If an experiment folder already exists, read the current artifacts first. Preserve user edits and update existing files instead of overwriting them blindly.
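One way to honor the "do not overwrite blindly" rule is to create only the artifacts that are missing and leave existing files for an in-place update. The sketch below assumes a hypothetical helper `ensure_artifacts` and a hypothetical placeholder line; the skill itself does not prescribe either.

```python
from pathlib import Path

ARTIFACTS = [
    "source_experiment_context.md",
    "claim_evidence_map.md",
    "experiment_design.md",
    "baseline_pressure_matrix.md",
    "claim_metric_map.md",
    "ablation_and_controls.md",
]


def ensure_artifacts(folder: Path, placeholder: str = "Status: provisional\n") -> dict[str, str]:
    """Create missing artifacts; leave existing files (and user edits) untouched."""
    folder.mkdir(parents=True, exist_ok=True)
    actions = {}
    for name in ARTIFACTS:
        path = folder / name
        if path.exists():
            actions[name] = "kept"  # read it first, then update in place later
        else:
            path.write_text(placeholder, encoding="utf-8")
            actions[name] = "created"
    return actions
```

The returned mapping makes it easy to see which files must be read before any update is attempted.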
This is a stepwise workflow, not a one-shot artifact generator. Ask one bounded decision packet at a time, provide the recommended answer, wait for the user's confirmation, revision, or explicit delegation, then update the relevant artifact before moving to dependent decisions.
Use concise decision packets. A packet should show only the decision now needed, why it matters, the recommended answer, and the consequence of accepting it. Do not ask the user to approve a whole artifact when one unresolved decision would redirect downstream work.
Do not treat silence, time pressure, or a broad opening instruction such as "you decide" as approval for every later gate. Delegation counts only after the user has seen the specific gate or packet recommendation. Delegation is scoped to that gate or packet and must be recorded in the relevant artifact.
If the user explicitly asks for a fully automatic run, produce only a provisional experiment sketch. The artifacts may organize likely claims, routes, baselines, metrics, and ablations, but they must not be presented as a final experiment plan. Record which gates were not reviewed and which high-risk choices remain delegated or unresolved.
Artifacts may be drafted before a gate passes, but downstream artifacts must not look final or drive later choices until the relevant gate record is updated. In gate records, keep the distinction lightweight but explicit: passed means the gate can continue; confirmed, explicitly delegated, mixed, or not reviewed describe the user's decision mode.
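The distinction between gate status and decision mode can be kept explicit with a small record type. The following is a sketch under assumptions: `GateRecord`, `DecisionMode`, and `may_drive_downstream` are hypothetical names, and the actual gate record lives as text in the relevant artifact.

```python
from dataclasses import dataclass
from enum import Enum


class DecisionMode(Enum):
    CONFIRMED = "confirmed"
    EXPLICITLY_DELEGATED = "explicitly delegated"
    MIXED = "mixed"
    NOT_REVIEWED = "not reviewed"


@dataclass
class GateRecord:
    gate: str
    passed: bool                 # whether the workflow can continue past this gate
    decision_mode: DecisionMode  # how the user reached the decision
    note: str = ""


def may_drive_downstream(record: GateRecord) -> bool:
    """Downstream artifacts may rely on a gate only once it is passed and reviewed."""
    return record.passed and record.decision_mode is not DecisionMode.NOT_REVIEWED
```

Keeping `passed` and `decision_mode` as separate fields prevents a delegated gate from being silently rewritten as confirmed.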
Prefer a committed_method_design.md from paper-reading-method-commitment. It is the normal source for experiment design.
Allowed sources:
The Source Gate uses a small source decision packet. If there is one clear Committed Method Design, show only:
If there are multiple possible sources, ask the user to choose exactly one. If the source lacks a stable Method Thesis, Mechanistic Claim, target failure, intervention point, or target outcome, stop and recommend method commitment rather than starting experiment design from a rough natural-language method description.
The gate is passed only after the user confirms or explicitly delegates the source decision packet and source_experiment_context.md records:
If the source method is not committed, mark the experiment design as provisional and do not present it as final validation evidence.
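The stability check on the source can be expressed as a lookup for missing fields. This is an illustrative sketch: the dictionary representation of the source and the function names `source_gaps` and `next_step` are assumptions, while the five field names come from the text above.

```python
REQUIRED_FIELDS = (
    "Method Thesis",
    "Mechanistic Claim",
    "target failure",
    "intervention point",
    "target outcome",
)


def source_gaps(source: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or blank in the source."""
    return [f for f in REQUIRED_FIELDS if not source.get(f, "").strip()]


def next_step(source: dict[str, str]) -> str:
    """Route back to method commitment when any required field is missing."""
    gaps = source_gaps(source)
    if gaps:
        return "route back to method commitment; missing: " + ", ".join(gaps)
    return "continue to the source decision packet"
```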
Before writing claim_evidence_map.md, inspect the available local context. Default to local artifacts; do not start a broad web search until the source claims and evidence routes are explicit.
For a Committed Method Design source, read relevant available artifacts:
- committed_method_design.md
- method_commitment_summary.md
- method_reconstruction.md
- method_attack_transcript.md
- method_decision_log.md
For a Research Question Card, Source Problem Brief, rough method, or user-provided source, read relevant available artifacts:
- source_problem.md
- method_need_decomposition.md
- candidate_methods.md
- method_candidate_library.md
Preserve inherited fragilities. If the source lacks a stable Method Thesis, Mechanistic Claim, target failure, intervention point, or target outcome, record the gap in source_experiment_context.md and route back to method commitment instead of fabricating an experiment plan.
Use this section in source_experiment_context.md when the source is a Research Question Card, Source Problem Brief, rough method, or otherwise not a Committed Method Design.
Record:
If the Method Thesis, Mechanistic Claim, target failure, and intervention point cannot be stated without invention, stop and recommend method commitment before experiment design continues.
Do not start from a list of datasets or metrics. First map what must be proven or could be refuted.
For each key claim, record:
Every later task, baseline, metric, and ablation should trace back to at least one claim.
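The traceability rule can be checked mechanically once claims have identifiers. The sketch below assumes a hypothetical representation in which each task, baseline, metric, or ablation carries the set of claim IDs it supports; the function name and the example items are illustrative only.

```python
def untraced_items(claim_ids: set[str], items: dict[str, set[str]]) -> list[str]:
    """Return items (tasks, baselines, metrics, ablations) not linked to any known claim."""
    return sorted(name for name, linked in items.items() if not (linked & claim_ids))


# Hypothetical example: C9 is not a claim in the map, so that ablation is untraced.
claims = {"C1", "C2"}
items = {
    "task:held-out split": {"C1"},
    "metric:accuracy": set(),
    "ablation:no-memory": {"C9"},
}
print(untraced_items(claims, items))  # ['ablation:no-memory', 'metric:accuracy']
```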
Before searching for benchmarks, baselines, or metrics, ask the user to confirm, revise, or explicitly delegate the Claim-Evidence Map. The gate is passed only when the key claims and evidence routes are explicit enough that later choices can be traced back to them.
Do not show the whole Claim-Evidence Map and ask for one approval. Present one core claim decision packet at a time. Each packet should include:
Low-risk claims may be grouped only when they share the same evidence route and have no distinct proxy or target-failure risk.
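The grouping rule can be sketched as follows, assuming each claim has already been labeled with its evidence route and risk flags. The `Claim` type, its field names, and `build_packets` are hypothetical; deciding what counts as high risk remains a judgment call made with the user.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    evidence_route: str
    high_risk: bool = False
    distinct_risk: bool = False  # distinct proxy or target-failure risk


def build_packets(claims: list[Claim]) -> list[list[str]]:
    """One packet per high-risk or distinct-risk claim; group the rest by evidence route."""
    packets: list[list[str]] = []
    grouped: dict[str, list[str]] = defaultdict(list)
    for c in claims:
        if c.high_risk or c.distinct_risk:
            packets.append([c.claim_id])  # never merged with other claims
        else:
            grouped[c.evidence_route].append(c.claim_id)
    packets.extend(grouped.values())
    return packets
```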
Use live challenge questions for high-risk claim decisions, especially when:
Ask one high-risk challenge at a time. Each challenge should state the skeptical claim, hidden assumption, why the current route may fail, the recommended answer, and the consequence of accepting or revising the route. Wait for the user's response or explicit delegation before marking that claim's route accepted.
Record the user's response or explicit delegation in claim_evidence_map.md.
Choose evidence routes claim by claim. Not every claim needs a large experiment, but every core claim needs at least one explicit evidence route.
Common evidence routes:
Use an Experiment Stack to avoid proving only a narrow main effect while making broader claims. Include only layers that match the source method's actual claims:
Read papers to identify reviewer-recognizable evaluation settings and reusable protocols. Prioritize tasks and datasets that pressure-test the Mechanistic Claim, not just convenient benchmarks.
For each task or dataset, record:
If no existing task fits the claim, propose a new task or data construction route and mark the missing benchmark evidence explicitly.
Baseline selection belongs here.
Choose baselines by pressure type:
For each baseline, record:
Do not include a baseline only because it is popular. Do not exclude close work because it is hard unless the limitation is recorded.
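The two baseline rules above can be turned into a simple consistency check over the baseline decisions. This is a sketch under assumptions: `BaselineDecision`, its fields, and `baseline_issues` are hypothetical names, not the layout of baseline_pressure_matrix.md.

```python
from dataclasses import dataclass


@dataclass
class BaselineDecision:
    name: str
    included: bool
    close_work: bool = False
    pressure_type: str = ""    # why this baseline pressures the method
    limitation_note: str = ""  # required when close work is excluded


def baseline_issues(decisions: list[BaselineDecision]) -> list[str]:
    """Flag popular-only inclusions and unexplained exclusions of close work."""
    issues = []
    for d in decisions:
        if d.included and not d.pressure_type:
            issues.append(f"{d.name}: included without a stated pressure type")
        if d.close_work and not d.included and not d.limitation_note:
            issues.append(f"{d.name}: close work excluded without a recorded limitation")
    return issues
```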
Map baseline pressure to reviewer objections:
Metric selection belongs here.
For each claim, map:
Avoid standalone metric banks. A metric is useful only when it measures a claim in a specific protocol.
Metric selection should make a claim observable. If the metric only measures a proxy, record whether the proxy is strong, weak, anecdotal, or speculative evidence and name the construct mismatch.
If the claim involves understanding, learning gain, personalization, Theory of Mind accuracy, dialogue quality, safety, or human trust, prefer validated metrics or established human-evaluation instruments from prior papers when available. Record construct mismatch rather than pretending a proxy is direct evidence.
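A claim-to-metric entry can carry its proxy status explicitly so that a proxy is never mistaken for direct evidence. The sketch below is illustrative: `ClaimMetric`, `EvidenceStrength`, and the `problems` method are assumed names, and the four strength labels are taken from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EvidenceStrength(Enum):
    STRONG = "strong"
    WEAK = "weak"
    ANECDOTAL = "anecdotal"
    SPECULATIVE = "speculative"


@dataclass
class ClaimMetric:
    claim_id: str
    metric: str
    protocol: str
    is_proxy: bool = False
    proxy_strength: Optional[EvidenceStrength] = None
    construct_mismatch: str = ""

    def problems(self) -> list[str]:
        """List what is missing before this metric can count as evidence for the claim."""
        out = []
        if not self.protocol:
            out.append("metric has no measurement protocol")
        if self.is_proxy and self.proxy_strength is None:
            out.append("proxy metric has no recorded evidence strength")
        if self.is_proxy and not self.construct_mismatch:
            out.append("proxy metric does not name the construct mismatch")
        return out
```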
Use ablations to test the Mechanistic Claim.
Include:
Each ablation must explain which mechanism or assumption it tests.
For each ablation or control, record:
Before finalizing experiment_design.md, write a result interpretation contract that states how major result patterns should change the claim, method, or paper story.
At minimum, cover:
Before presenting the artifacts as a final experiment plan, ask the user to confirm, revise, or explicitly delegate the experiment design decisions. This is a workflow validation gate, not a human evaluation experiment.
Do not wait until all artifacts are finished and ask for one approval. Review the plan through small packets in dependency order:
The gate is passed only when the user accepts or delegates:
If the user rejects a key decision, update the affected artifact immediately before continuing. If the user delegates, record the delegation and keep the design status provisional unless the source method is committed and no unresolved gate issue remains.
Do not skip these gates:
If a gate is delegated, record it as the user's decision mode rather than rewriting it as confirmed. If a gate is unreviewed because the user requested a fully automatic run, keep the output provisional.
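The status rules can be summarized in one function. This is a sketch under assumptions: the function name `design_status`, the string labels "final" and "provisional", and the dictionary of gate decision modes are hypothetical; the conditions follow the text above.

```python
def design_status(
    source_committed: bool,
    gate_modes: dict[str, str],
    unresolved_issues: list[str],
) -> str:
    """Return 'final' only when the source is committed, every gate was reviewed,
    and no gate issue remains; otherwise return 'provisional'."""
    if not source_committed or unresolved_issues:
        return "provisional"
    if any(mode == "not reviewed" for mode in gate_modes.values()):
        return "provisional"
    # Confirmed, explicitly delegated, or mixed gates all count as reviewed.
    return "final"
```

Delegated gates stay recorded as delegated in `gate_modes`; the function never rewrites them as confirmed.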
Create or update these files in the experiment folder:
- source_experiment_context.md
- claim_evidence_map.md
- experiment_design.md
- baseline_pressure_matrix.md
- claim_metric_map.md
- ablation_and_controls.md
Artifact responsibilities:
- source_experiment_context.md: source status, Minimal Experiment Brief when needed, Method Thesis, Mechanistic Claim, target failure, intervention point, scope, non-goals, weakest link, and inherited warnings
- claim_evidence_map.md: claim decomposition, evidence routes, evidence strength, reviewer-objection links, and Claim-Evidence Review Gate record
- experiment_design.md: Experiment Stack, tasks, data, protocols, human evaluation or user study plan when needed, expected evidence, failure analysis, and Result Interpretation Contract
- baseline_pressure_matrix.md: baselines organized by pressure type, claim attacked, fairness, reproducibility, reviewer objection, and remaining weakness
- claim_metric_map.md: observable constructs, primary and diagnostic metrics, measurement protocols, success criteria, validity risks, and failure interpretations
- ablation_and_controls.md: mechanism-linked ablations, controls, stress tests, sanity checks, confound checks, and what each result would mean
Use the reference templates in this directory when creating these artifacts.
Use the reference templates in this directory:
- references/workspace-structure.md
- references/source-experiment-context-template.md
- references/claim-evidence-map-template.md
- references/experiment-design-template.md
- references/baseline-pressure-matrix-template.md
- references/claim-metric-map-template.md
- references/ablation-and-controls-template.md
Stop when every key claim has:
The user should be able to draft the experiment section as Claim -> Setting -> Baseline -> Metric -> Expected Evidence -> Failure Interpretation.
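The final chain can be represented as one record per key claim, which makes missing links easy to spot before drafting. The sketch below assumes hypothetical names (`EvidenceChain`, `missing_links`, `render`); only the six link names come from the text above.

```python
from dataclasses import dataclass, fields


@dataclass
class EvidenceChain:
    claim: str
    setting: str
    baseline: str
    metric: str
    expected_evidence: str
    failure_interpretation: str


def missing_links(chain: EvidenceChain) -> list[str]:
    """Return the names of links that are still blank for this claim."""
    return [f.name for f in fields(chain) if not getattr(chain, f.name).strip()]


def render(chain: EvidenceChain) -> str:
    """Render the chain in the order Claim -> ... -> Failure Interpretation."""
    return " -> ".join(getattr(chain, f.name) for f in fields(chain))
```

A claim whose `missing_links` result is non-empty is not yet ready for the experiment section.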