Install
```
openclaw skills install bookforge-build-refactoring-test-suite
```

Build a sufficient automated test suite before refactoring existing code by applying a 6-step sequential construction workflow: test class → fixture → normal behavior → boundaries → error paths → green gate.
You are about to refactor code — extracting methods, moving fields, changing conditionals — and the code lacks a test suite you can trust to tell you when something breaks.
This is the Level 0 foundation skill: every other refactoring mechanic in Fowler's catalog assumes this suite exists. Without it, you are refactoring blind. With it, every subsequent step is reversible — if a test turns red, you revert and try smaller steps.
The core pattern: build a self-checking test suite that runs in seconds, covers the code you are about to change, and can answer one question without human inspection: "Did I break anything?"
Before starting, confirm you have:
- A test framework already configured in the project (check pyproject.toml, package.json, pom.xml, go.mod). Use the framework already in place — do not introduce a new one.

Scan the target code for:

- An existing testdata/ or fixtures/ directory you can reuse

You are ready to start when you can state what the code under change is supposed to do.
If you cannot determine what the code is supposed to do (no comments, no documentation, unclear naming), read the calling code or integration tests first to reconstruct the intended behavior before writing unit tests.
Create a dedicated test file for the code under refactoring. Place it where the project's test convention dictates (e.g., tests/test_order.py, src/__tests__/Order.test.ts, OrderTest.java).
Why: Each class under test needs its own test container. Mixing multiple classes into one test file makes isolation harder and failure messages harder to read. Using the project's existing naming convention ensures the test runner discovers the file automatically.
Minimal structure:
```
# Python
class TestOrder:
    pass

// Java
class OrderTest extends TestCase { }

// TypeScript
describe('Order', () => { })

// Go
func TestOrder(t *testing.T) { }
```
Run the empty test file immediately to confirm the runner finds and executes it without errors.
Before writing any test methods, define the shared state that every test will need. The test framework's setUp/beforeEach/setup hook runs before each test; tearDown/afterEach/cleanup runs after.
Why: Each test must be fully isolated — it must not depend on execution order, and it must not leave side effects that corrupt the next test. Setup creates a fresh environment; teardown cleans up resources (open files, database connections, temp files). Without this isolation, a failure in test 3 can cause test 4 to fail for unrelated reasons, making debugging misleading.
Guidelines:
- Ensure cleanup always runs, even when a test fails: use finally blocks or the framework's guaranteed cleanup mechanism.

```python
# Python example
class TestFileProcessor:
    def setup_method(self):
        self.input_file = open("testdata/sample.txt", "r")

    def teardown_method(self):
        self.input_file.close()
```
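If the project runs on pytest, a yield fixture is an equivalent way to get guaranteed cleanup: everything after the yield runs even when the test body fails. A minimal sketch, assuming the same testdata/sample.txt file:

```python
import pytest

class TestFileProcessor:
    @pytest.fixture
    def input_file(self):
        f = open("testdata/sample.txt", "r")
        yield f    # the test body runs at this point
        f.close()  # cleanup runs even if the test failed

    def test_read_returns_a_character(self, input_file):
        assert input_file.read(1) != ""
```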
For each public method, test the central, intended behavior first — the happy path. Ask: "What is this method supposed to do when given valid, typical input?"
Why: Start with normal behavior so you confirm the code works correctly before probing its edges. If normal behavior tests fail, the code is broken before you even touch it — that is useful information and must be resolved before any refactoring begins.
Rules:
- Name each test for the behavior it checks: test_read_returns_correct_character, not test1. Descriptive names are the failure message.
- Prove each test can fail: temporarily assert a wrong value ('x' == result instead of 'd' == result). If it does not fail, the test is not testing what you think.

```python
def test_read_returns_correct_character(self):
    # advance past the first three characters
    for _ in range(3):
        self.input_file.read(1)
    ch = self.input_file.read(1)
    assert ch == 'd'  # fourth character in the test file
```
After normal behavior is covered, identify the boundaries where behavior could change or break. Boundary conditions are the most productive place to find bugs.
Why: Most bugs hide at the edges — the first item, the last item, the empty collection, the zero value, the maximum value. Fowler calls this "playing the part of an enemy to your own code" — actively trying to find the conditions under which the code will fail, rather than confirming it works for typical input.
Common boundary categories:
| Category | Examples |
|---|---|
| Sequence edges | First element, last element, element after the last |
| Empty inputs | Empty string, empty list, empty file, zero-length collection |
| Zero / null values | Zero quantity, null reference, None, empty optional |
| Maximum / minimum values | Integer overflow boundary, max string length, single-item list |
| Repeated calls | Reading past end-of-file twice, calling close twice |
For each boundary, write a separate test method. Add a descriptive message to assertions so that when a boundary test fails, the output tells you which boundary broke.
```python
def test_read_at_end_of_file_returns_empty_string(self):
    # consume all 141 characters
    for _ in range(141):
        self.input_file.read(1)
    result = self.input_file.read(1)
    # Python file objects return "" at end of file
    assert result == "", "read at end of file should return an empty string"

def test_read_from_empty_file_returns_empty_string(self):
    with open("testdata/empty.txt", "r") as empty:
        result = empty.read(1)
    assert result == "", "read from empty file should return an empty string"
```
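Under pytest, parametrization keeps one boundary per case without repeating the read loop. A sketch against the same 141-character testdata/sample.txt; the expected first and last characters are placeholders, not known values:

```python
import pytest

@pytest.mark.parametrize(
    "offset, expected",
    [
        (0, "T"),    # first character (placeholder value)
        (140, "."),  # last character of a 141-character file (placeholder value)
        (141, ""),   # one past the end: read() returns ""
    ],
)
def test_read_boundary(offset, expected):
    with open("testdata/sample.txt", "r") as f:
        for _ in range(offset):
            f.read(1)
        assert f.read(1) == expected, f"boundary at offset {offset}"
```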
Test that error conditions produce the correct error, not just that they do not crash silently. If the code's contract says "raises ValueError on negative input" or "rejects reads once the stream is closed," write a test that verifies exactly that.
Why: Errors are part of the public contract. Failing to raise the expected error — or raising the wrong one — is a bug. These tests also protect against future refactoring silently swallowing exceptions.
Pattern:
- Use the framework's exception-assertion helper: pytest.raises, assertRaises, or the expect { }.to raise_error idiom.
- If the framework has no such helper, call the code in a try block and end with fail("expected error was not raised") if no exception arrives.

```python
import pytest

def test_read_after_close_raises_value_error(self):
    self.input_file.close()
    # Python file objects raise ValueError on any read after close
    with pytest.raises(ValueError):
        self.input_file.read(1)
    # if no ValueError is raised, pytest.raises fails the test automatically
```
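Where the framework offers no helper, the manual pattern from the list above looks like this (a framework-agnostic sketch):

```python
def test_read_after_close_raises_value_error_manual(self):
    self.input_file.close()
    try:
        self.input_file.read(1)
    except ValueError:
        return  # the expected error arrived: test passes
    raise AssertionError("expected error was not raised")
```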
Run the entire test suite. All tests must pass — green — before any refactoring step begins.
Why: This is the precondition that makes refactoring safe. If the suite is red before you start, you do not know whether a subsequent red result was caused by your change or by a pre-existing bug. You must start from a known-good baseline.
What to do if tests are red before you start: fix the failures first. If a failing test exposes a pre-existing bug, use the bug-fix workflow below; if the test itself is stale or wrong, repair the test. Never begin refactoring from a red baseline.
The compile-and-test gate (applies to every subsequent step): Once the suite is green and refactoring begins, apply this gate after every single atomic change — not after a batch of changes:
```
make one atomic change → compile/lint → run test suite
  green → continue to next change
  red → revert immediately, try a smaller step
```
"Atomic" means the smallest possible change that can be independently compiled and tested: extract one method, rename one variable, move one field. Never accumulate multiple changes before testing. Small steps mean small reverting cost.
If the language has a compiler, compile first — compilation errors caught before test execution are faster feedback than test failures.
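The gate can even be wrapped in a script. A toy sketch in Python, assuming pytest and git; the `git checkout -- .` revert and the overall automation are illustrative, not a prescribed tool:

```python
import subprocess
import sys

# One pass through the gate: assumes one atomic change is already made and unstaged.
suite = subprocess.run(["pytest", "-q"])        # run the full test suite
if suite.returncode != 0:                       # red: the change broke something
    subprocess.run(["git", "checkout", "--", "."])  # revert the working tree
    sys.exit("suite red: change reverted, try a smaller step")
print("suite green: continue to the next atomic change")
```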
When fixing a bug rather than refactoring, use this variant workflow:
Why test first for bugs: Writing the test first forces you to understand exactly what the bug is, not approximately what it is. It also prevents you from accidentally fixing a different problem and convincing yourself the bug is gone. And the test permanently guards against the same bug recurring.
When a bug report arrives:

1. Write a test that exposes the bug — it should fail for exactly the reason the report describes.
2. Run it and confirm it fails. If it passes, you have not reproduced the bug yet.
3. Fix the code.
4. Run the full suite. All green means the bug is fixed and permanently regression-protected.
A test suite is sufficient for refactoring when it satisfies all four of these criteria:
| Criterion | What to Check |
|---|---|
| Normal behavior covered | Every public method has at least one test for its primary intended behavior |
| Boundaries covered | Each method has tests for: empty input, first/last element, value after the last, zero/null values |
| Error paths covered | Every documented error condition or exception has a test that verifies it is raised correctly |
| Fast enough to run after every step | The full suite completes in under 30 seconds. If it takes longer, it will not be run frequently enough |
What you do not need: exhaustive coverage of every method and every path. Fowler's practical rule: test the areas you are most worried about going wrong. Concentrate effort where complexity is highest and where bugs would be hardest to find manually. It is better to run incomplete tests than to have no tests because a complete suite felt impossible to write.
1. Tests must be self-checking. Tests that print output to the console for a human to inspect are not self-checking. Every assertion must be evaluated by the framework automatically. The only acceptable output is a pass/fail signal — ideally a progress bar that turns red on failure.
2. Tests must be fast. Slow tests do not get run. If the suite takes more than 30 seconds, developers will batch changes and run tests infrequently. Infrequent testing means bugs accumulate between runs, making them harder to isolate. For refactoring specifically, tests must be fast enough to run after every single atomic step.
3. Each test must be isolated. A test must not depend on the results of any other test. Execution order must not matter. Use setup/teardown to ensure each test starts from an identical, known state.
4. Verify that tests can fail. When you write a test, temporarily insert a wrong value into the assertion. If the test does not turn red, it is not exercising what you think. A test that cannot fail is not a test — it is false confidence.
5. Incomplete tests beat no tests. The most common failure mode is paralysis: "I can't test everything perfectly, so I won't test anything." Write the tests for the risky areas first. Run them. An imperfect suite that runs frequently is vastly more valuable than a theoretically complete suite that never gets written.
6. The compile-and-test gate is non-negotiable. Every atomic refactoring step ends with: compile + run suite. Red = revert. No exceptions. This is what makes refactoring safe to do in a production codebase.
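A quick illustration of principle 4, reusing the earlier read test: plant a wrong expected value, confirm the suite turns red, then restore the real one.

```python
def test_read_returns_correct_character(self):
    for _ in range(3):
        self.input_file.read(1)
    ch = self.input_file.read(1)
    assert ch == 'x'  # deliberately wrong: must turn red; restore 'd' once verified
```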
Situation: You want to decompose a 200-line calculate_invoice() method into smaller methods but there are no tests.
1. Setup fixture: Create an Invoice object with known line items and tax rates.
2. Normal behavior tests: Assert that calculate_invoice() returns the correct total for a standard order.
3. Boundary tests: Empty order (zero line items), single item, order with a discount applied to zero-priced items.
4. Error tests: Negative quantity raises ValueError, unknown product code raises KeyError.
5. Green gate: All pass. Now decompose calculate_invoice() one extracted method at a time, running after each extraction.
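To make this example concrete, the normal-behavior test might look like the sketch below. The Invoice and LineItem constructors and the tax_rate parameter are hypothetical stand-ins for whatever API the real code exposes:

```python
import pytest

def test_calculate_invoice_returns_correct_total_for_standard_order(self):
    # hypothetical constructors: adapt to the real signatures in your codebase
    invoice = Invoice(
        line_items=[LineItem("widget", quantity=2, unit_price=10.00)],
        tax_rate=0.10,
    )
    # subtotal 2 x $10.00 = $20.00, plus 10% tax = $22.00
    assert invoice.calculate_invoice() == pytest.approx(22.00)
```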
Situation: A bug report says orders over $1,000 are applying the discount twice.
Step 1 — Write failing test:
```python
def test_discount_applied_once_for_large_order(self):
    order = Order(items=[Item("product-A", quantity=10, unit_price=150)])  # total = $1,500
    assert order.total_price() == 1350.00  # 10% discount applied once
```
Step 2 — Run it. It fails (returns 1215.00 — discount applied twice). Good. The test reproduces the bug.
Step 3 — Fix the discount logic.
Step 4 — Run full suite. All green including the new test. Bug is fixed and regression-protected.
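The fix might take a shape like this. The Order internals shown are hypothetical; the point is only that the discount is computed once, from the subtotal:

```python
# Hypothetical shape of the fix: derive the discount from the subtotal in
# exactly one place, so no second code path can apply it again.
def total_price(self):
    subtotal = sum(item.quantity * item.unit_price for item in self.items)
    discount = 0.10 if subtotal > 1000 else 0.0
    return subtotal * (1 - discount)
```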
Situation: A module has 12 tests. You want to refactor its data model.
Audit checklist: walk the 12 existing tests through the four sufficiency criteria — normal behavior, boundaries, error paths, speed. In this example the audit finds the happy paths covered but boundaries and error paths untested.
Action: Add boundary and error path tests. Run. Green. Now proceed with the refactoring.
This skill is licensed under CC-BY-SA-4.0. Source: BookForge — Refactoring: Improving the Design of Existing Code by Martin Fowler.
- refactoring-readiness-assessment — Assess whether code is ready to refactor
- code-smell-diagnosis — Identify which smells to address first
- method-decomposition-refactoring — Apply once this test suite is green

Browse more BookForge skills: bookforge-skills