Install
openclaw skills install @vincentjiang06/test-driven-developmentTest-driven development for NON-TRIVIAL behavior — write a failing test FIRST, watch it fail, then write minimal code to pass; the suite is a LIVING SPEC you EDIT/MERGE/DELETE as the target changes. Use when implementing real logic, fixing a bug, or changing tested behavior. Do NOT use for trivial edits or prototypes.
openclaw skills install @vincentjiang06/test-driven-developmentThe test suite is a living spec of the current target's behavior. TDD keeps the spec and the code in sync: adjust the test first, watch it fail, write the minimal code to pass. When the target changes, you update the spec — edit, merge, or delete tests — you do not pile new tests on top of stale ones.
Guard against both failure modes, equally:
Right-size every time: engage on behavior worth pinning down, stay out of mechanical churn. The old "TDD on everything, one test per micro-behavior" stance is failure mode (b) — don't.
Classify the change first. ENGAGE (write/update a failing test first) when it is:
DO NOT engage full TDD — just make the change — when it is:
timeout 30→60, a
log level, a UI label. But a constant that feeds logic is the opposite —
a tax rate in a price calc, a validation regex, a flag that gates a branch —
changing it is a behavior change: engage and update the test that asserts the
old value (modify mode). The qualifier is "no behavior depends on it," not "it
looks like a constant."One question is the authority — it outranks the lists above when they seem to conflict: "could this break in a way a test would catch?" Yes → engage; no → skip, state the one-line reason, move on. Don't rationalize in either direction — not "skip TDD just this once" on real logic, not "better add a test" on a rename.
Behavior-preserving work doesn't fit "write a failing test first": refactor with green tests covering it → don't write a new failing test, keep the existing ones green; refactor untested / "add tests to legacy" code → write characterization tests pinning current behavior first, then change. Full handling (the characterization "red" equivalent, why it fires on change not on mere lack of tests): references/refactor-and-legacy.md.
A feature-group is one coherent behavior with its edges (e.g. "email validation: empty / whitespace / valid"), not each individual assertion. Run one cycle per group:
The suite tracks the current target. Before adding anything, check what's there and prefer to change it: edit/strengthen an existing test, merge related behavior into one parametrized test, update or delete a test the target moved past, consolidate duplicates — add ONE group test only when nothing covers it. Net rule: for the same feature-group, the test count should not grow just because the code changed. The full situation→action table, worked examples + the per-stack consolidation patterns: references/modify-mode.md.
Keep the main thread fast and uncluttered. Dispatch these to subagents — in parallel when independent — and consume only their summary:
Don't do all of this inline and serially — that's the slow, redundant process this skill exists to replace.
NO production code for a behavior without first seeing its test FAIL — once per feature-group.
NO "red/green/done" claim without the actual command + its output + exit status.
If you didn't watch it fail, you don't know the test tests the right thing. For new or changed behavior, a test that passes the moment you write it is testing existing behavior or nothing — so it must fail first. (The sole exception is a deliberate characterization test pinning existing behavior before a refactor; there you confirm it exercises the target instead — see "Two special cases" above.) This gate is per feature-group (not per assertion, not per micro-behavior), and it is never skipped — not even when "it obviously fails." Confirm the failure is the expected one (feature missing), not a typo or import error.
Evidence, not honor. "Red confirmed" / "tests pass" / "done" are claims you must back with the run: the command you ran and its real output/exit code (the subagent you delegated the run to returns exactly this). Banned without evidence: "should pass", "this probably fails", "looks correct", "seems to work". An unverified claim is a defect. See enforcement-gates.md.
A passing regression test is worthless if it would pass without the fix. For every bug fix, after green, prove the test is real:
write failing test → watch it RED (right reason) → fix → GREEN
→ REVERT the fix → re-run: the test MUST go RED again → restore the fix → GREEN
If the test stays green with the fix reverted, it doesn't test the bug — it's
vacuous. Strengthen it until reverting the fix turns it red. This is the single
strongest defense against green-but-useless tests, and the eval harness checks it
mechanically (evals/ auto-reverts and asserts red). Full pattern + the optional
context-isolated test-author and independent-verifier gates:
enforcement-gates.md.
Test what the code does, not what a mock does. Use real code; mock only what's genuinely unavoidable (true external I/O), and never assert on the mock itself. When adding mocks or test-only helpers, load references/testing-anti-patterns.md.
Can't check a box? You skipped a step — fix it before claiming done.
| File | Load when |
|---|---|
| references/enforcement-gates.md | The anti-gaming core: verification-evidence, revert-to-red, Beck's GREEN strategies, optional context-isolated test-author + independent-verifier subagents. |
| references/modify-mode.md | Once a suite exists: native collectors, edit/merge/delete decision, consolidation patterns. |
| references/refactor-and-legacy.md | The two special cases — behavior-preserving refactor, or "add tests to legacy / untested code" (characterization tests). |
| references/testing-anti-patterns.md | When adding mocks / test-only helpers — the over-mock and assert-on-mock traps. |
evals/ holds small real fixture repos (pytest / vitest) with scenarios
(must-engage, must-skip, modify-mode traps, anti-gaming traps). The grader runs
on the agent's resulting diff and auto-reverts the production change to confirm
each new test goes red (vacuity check), counts net test growth on modify traps
(proliferation), and checks zero new tests on negatives (right-size precision).
See evals/README.md.