Install
```
openclaw skills install bookforge-unit-test-quality-checker
```

Evaluate a test suite against rigorous unit-test criteria, classify test types, and choose between fake and mock objects. Use whenever a developer asks 'are these unit tests?', 'why is my test suite slow', 'should I use mocks or fakes', 'fake vs mock', 'what's wrong with my tests', 'my tests hit the database / network / filesystem', 'how do I speed up tests', 'unit vs integration', or when reviewing a codebase's test quality. Activates for 'unit test', 'fake object', 'mock object', 'test double', 'test speed', 'test isolation', 'xUnit', 'JUnit / NUnit / CppUnit', 'slow test suite', 'flaky tests', 'test pyramid'.
Evaluate whether existing tests are true unit tests per Feathers' criteria, classify each test by type, assess test double choices, and produce a scored quality report with a prioritized improvement plan.
Use this skill for the trigger situations listed above: classifying existing tests, diagnosing a slow suite, or deciding between test doubles.
Do not use this skill to write brand-new tests from scratch. For writing characterization tests, use characterization-test-writing. For breaking the dependencies that make tests slow, use dependency-breaking-technique-executor.
Before beginning, confirm:
- Where the tests live (test/, src/test/, spec/, etc.)
- The language and test framework in use
- Whether the test suite can be run via Bash

If the user cannot run tests, timing analysis will be static (pattern-based). Flag this explicitly in the report.
Use Glob and Grep to enumerate test files and count individual test cases.
Glob: test/**/*.{java,py,ts,js,rb,cs,go} (adapt to language)
Grep: pattern for test method/function declarations
Record: total test count, test framework detected, test file structure.
Why: You cannot classify tests you cannot see. A count also reveals whether the suite is at a scale where slow tests compound into minutes of build lag.
For each test file, grep for patterns that signal disqualification from unit-test status.
The four disqualifiers — a test is NOT a unit test if it:
1. Talks to a database
2. Communicates across the network
3. Touches the filesystem
4. Requires special environment setup to run
Grep patterns to run (adapt to language):
| Disqualifier | Patterns to search |
|---|---|
| Database | DataSource, EntityManager, @Repository, sqlite3, psycopg2, mongoose, ActiveRecord, db.execute, cursor.execute |
| Network | HttpClient, requests.get, fetch(, RestTemplate, socket., urllib, axios, WebClient |
| Filesystem | File(, open(, fs.readFile, FileInputStream, Path.of, os.path, tmpfile, shutil |
| Env setup | System.getenv, os.environ, process.env, dotenv, application.properties loaded at test level, @SpringBootTest, @IntegrationTest |
For each test flagged, record: file, test name, disqualifier type, line reference.
Why: These four disqualifiers are concrete and checkable. A test can be logically small while still violating all four. The disqualifiers are the reason Feathers' definition diverges from the informal "small test" meaning — the exclusions exist because DB/network/FS/env dependencies make tests slow, flaky, and unable to localize failures.
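A sketch of this scan in Python, using a trimmed version of the pattern table above. Literal substring matching is an assumption; a real pass would use language-aware regexes and exclude comments.

```python
from pathlib import Path

# Trimmed pattern table from the step above (assumption: literal substring
# matching is good enough for a first pass).
DISQUALIFIERS = {
    "database":   ["DataSource", "EntityManager", "sqlite3", "psycopg2",
                   "db.execute", "cursor.execute"],
    "network":    ["HttpClient", "requests.get", "fetch(", "urllib", "axios"],
    "filesystem": ["fs.readFile", "FileInputStream", "open(", "shutil"],
    "env":        ["System.getenv", "os.environ", "process.env", "@SpringBootTest"],
}

def scan_file(path: str) -> list:
    """Return (disqualifier, line_number, line_text) hits for one test file."""
    hits = []
    lines = Path(path).read_text(errors="ignore").splitlines()
    for lineno, line in enumerate(lines, 1):
        for kind, needles in DISQUALIFIERS.items():
            if any(n in line for n in needles):
                hits.append((kind, lineno, line.strip()))
    return hits
```

Each hit carries exactly the fields the report template needs: disqualifier type, line reference, and the offending line for context.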
If Bash access is available, run the test suite (or a representative subset) and measure per-test timing.
```sh
# JUnit / Maven
mvn test -pl module 2>&1 | grep "Tests run"

# pytest with timing
pytest --tb=no -q --durations=20

# Jest
npx jest --verbose --testPathPattern="..."

# Go
go test ./... -v -timeout 60s 2>&1 | grep -E "^--- (PASS|FAIL)"
```
The threshold: any individual test taking longer than 1/10 second (100ms) is a slow unit test. This is not a guideline; it is the definition.
Flag every test exceeding 100ms. For suites where aggregate time exceeds 60 seconds, flag the top-10 slowest tests as priority targets.
Why: Speed is not a nice-to-have quality — it is definitional. A slow test suite stops giving developers fast feedback. When tests take minutes, developers stop running them. The entire feedback loop breaks.
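Applying the 100ms rule and the 60-second aggregate check can be sketched as below. The input is assumed to be a per-test timing map in milliseconds (e.g. parsed from `pytest --durations` output or a Surefire report); the function names are illustrative.

```python
def flag_slow_tests(timings_ms: dict, threshold_ms: float = 100.0):
    """Apply the 100ms rule: return (slow_tests, top10), slowest first."""
    slow = sorted(
        ((name, ms) for name, ms in timings_ms.items() if ms > threshold_ms),
        key=lambda item: item[1],
        reverse=True,
    )
    return slow, slow[:10]

def needs_priority_pass(timings_ms: dict) -> bool:
    """Aggregate check: suites over 60s get top-10 flagged as priority targets."""
    return sum(timings_ms.values()) > 60_000
```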
Using the exclusion results and timing data, classify each test into one of three categories:
Unit test: Passes all four exclusion criteria AND runs in under 100ms. Failure points to a specific, small scope of code.
Higher-level test: Intentionally covers interactions across multiple classes or components. May cross boundaries (DB, network, FS) by design. Slower. Useful for pinning down integration behavior when unit tests are hard to write. These are legitimate — do not demand they meet unit-test criteria. They serve a different purpose.
Characterization test: Written to document actual current behavior, not to prove correctness. Typically found in legacy codebases as safety nets before refactoring. Key marker: the assertion matches what the code currently does (possibly including bugs), not what it should do. Purpose: hold behavior stable while nearby code changes. Legitimate category — do not reclassify as bugs.
Edge/unclear: Tests that do not clearly fit the above — include in report for manual review.
Why: Treating all slow tests as "bad unit tests" is wrong. A higher-level test covering a checkout workflow is supposed to hit the database — that is its job. Misclassification leads to either abandoning useful tests or writing pointless unit tests for integration behavior. The classification enables the right improvement per category.
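A decision-rule sketch for this classification step, assuming the exclusion scan and timing data from the earlier steps. The characterization marker is caller-supplied (detecting "asserts current behavior, not intended behavior" needs human judgment); the signature is illustrative.

```python
def classify_test(hits: list, duration_ms: float,
                  is_characterization: bool = False) -> str:
    """Classify one test from its disqualifier hits and timing.

    hits: disqualifier kinds found ("database", "network", ...); empty if clean.
    is_characterization: caller-supplied flag (assertions pin current behavior).
    """
    if is_characterization:
        return "characterization"
    if hits:                 # crosses a boundary: higher-level by definition
        return "higher-level"
    if duration_ms <= 100:   # clean AND fast: a true unit test
        return "unit"
    return "unclear"         # clean but slow: flag for manual review
```

Note the last branch: a test with no boundary hits that is still slow does not get silently promoted to "unit"; it lands in the report's manual-review bucket.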
Grep for existing fakes and mocks in the test codebase.
Grep: "class Fake", "Mock", "Stub", "Spy", "when(", "verify(", "mock(", "thenReturn"
For each test double found, assess:
Is it a fake object or a mock object?
A fake object impersonates a collaborator. It has two sides:
- The interface side: it implements the same interface as the real collaborator, so production code can call it without knowing the difference.
- The sensing side: it exposes query methods (e.g., getLastLine(), getLastCall()) that reveal what the code under test did.

Sensing is external: the test calls a query method to inspect what happened.
A mock object asserts conditions internally. Expectations are set before the operation, and verify() is called after. The assertions live inside the mock, not the test body. Most mock frameworks (Mockito, unittest.mock, RSpec doubles) implement this pattern.
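The two anatomies can be contrasted in a short Python sketch. The Display/FakeDisplay names echo the fake-object description above; everything else is illustrative.

```python
from unittest.mock import Mock

class Display:
    """Collaborator interface (illustrative)."""
    def show_line(self, text: str) -> None: ...

# Fake: impersonates the collaborator. Side 1 implements the interface the
# production code calls; side 2 offers query methods for sensing.
class FakeDisplay(Display):
    def __init__(self):
        self._last = None
    def show_line(self, text: str) -> None:   # interface side
        self._last = text
    def get_last_line(self) -> str:           # sensing side
        return self._last

# Sensing is external: the TEST inspects the fake via a query method.
fake = FakeDisplay()
fake.show_line("$1.99")
assert fake.get_last_line() == "$1.99"

# Mock: the verification lives inside the mock framework, not the test body.
mock = Mock(spec=Display)
mock.show_line("$1.99")
mock.show_line.assert_called_once_with("$1.99")  # internal verification
```

The fake needs no framework at all, which is part of why it is the default recommendation.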
Selection recommendation:
The rule: "Simple fake objects suffice in most situations."
Why: The choice is not aesthetic. Fakes externalize sensing (the test controls what to check). Mocks internalize assertions (the mock controls what matters). When fakes proliferate with repetitive query methods, mocks become the more maintainable option. But mock overuse — especially mocking everything — produces brittle tests that break on implementation changes rather than behavior changes.
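The brittleness point can be made concrete with a small sketch. The Rates collaborator and total_with_tax function are hypothetical; the contrast between the two assertion styles is the point.

```python
from unittest.mock import Mock

class Rates:
    """Hypothetical collaborator (illustrative)."""
    def lookup(self, region: str) -> float: ...

def total_with_tax(amount: float, rates: Rates, region: str) -> float:
    return amount * (1 + rates.lookup(region))

stub = Mock(spec=Rates)
stub.lookup.return_value = 0.25

# Behavior-style assertion: only the output matters. Survives refactoring.
assert total_with_tax(100, stub, "CA") == 125.0

# Interaction-style assertion: pins HOW the result is computed. A refactor
# that caches the rate, or looks it up twice, breaks this test even though
# the observable behavior is unchanged.
stub.lookup.assert_called_once_with("CA")
```

Reserve interaction assertions for cases where the call itself is the behavior under test (the gateway must be charged exactly once), not an implementation detail.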
Write the report artifact to test-suite-quality-report.md using the template in the Outputs section.
| Input | Required | Description |
|---|---|---|
| Test directory or file paths | Yes | Where tests live in the codebase |
| Language and test framework | Yes | For correct grep patterns and timing commands |
| Source code (for context on collaborators) | Recommended | Helps identify what test doubles are substituting |
| Ability to run tests via Bash | Optional | Required for timing data; static analysis possible without |
Write test-suite-quality-report.md to the project root (or a location the user specifies).
Report template:
# Test Suite Quality Report
Generated: [date]
Test framework: [framework]
Total tests found: [N]
---
## Summary
| Category | Count | % of Total |
|---|---|---|
| Unit tests (passing all criteria) | | |
| Higher-level tests (intentional) | | |
| Characterization tests | | |
| Disqualified (claim to be unit, are not) | | |
| Unclear / needs review | | |
Slow tests (>100ms): [N]
Suite aggregate time: [Xs]
---
## Disqualification Violations
| Test | File | Disqualifier | Line | Recommendation |
|---|---|---|---|---|
| [test name] | [path] | [DB / network / FS / env] | [line] | [break dependency / reclassify] |
---
## Timing Results
| Test | Time (ms) | Category | Action |
|---|---|---|---|
| [test name] | [N]ms | [unit/higher/char] | [break dep / reclassify / ok] |
Top-10 slowest: [list]
---
## Test Double Assessment
| Double | File | Type (fake/mock) | Justified? | Recommendation |
|---|---|---|---|---|
| [name] | [path] | [fake/mock] | [yes/no] | [keep/convert/simplify] |
---
## Prioritized Improvement Plan
### High Priority (impacts speed most)
1. [Test name] — [action] — estimated gain: [Xs]
### Medium Priority (disqualification violations)
1. [Test name] — [action]
### Low Priority (reclassification / cleanup)
1. [Test name] — [action]
---
## Notes
- Higher-level tests left in place: [list with rationale]
- Characterization tests identified: [list]
- Tests requiring manual review: [list]
Speed is definitional, not optional. A test that takes 1/10 second to run is, by definition, a slow unit test. This is not a style preference — it is the criterion that separates tests that give fast feedback from tests that do not.
If it talks to DB, network, filesystem, or requires env setup, it is not a unit test. These four disqualifiers are binary and checkable. A test violating any one of them is a higher-level test, regardless of how small or well-named it is.
Fake objects first; mocks when expectations become the test. Fakes are simpler and always available. Move to a mocking framework when writing many similar fakes or when verifying call behavior is the core assertion.
Characterization tests are legitimate — do not demand they meet unit-test criteria. They document actual behavior, not intended behavior. Classifying them as "broken unit tests" and deleting them removes the safety net before refactoring.
Higher-level tests are not failures. A test that intentionally covers an integration boundary serves a different purpose than a unit test. The improvement goal is not to eliminate all higher-level tests — it is to ensure tests that claim to be unit tests actually are, and to write unit tests for logic that currently has none.
Situation: A team reports their "unit test" suite of 320 tests takes 8 minutes to run. They run it rarely as a result.
What the skill does:
1. Greps for DataSource, @Transactional, EntityManager, JdbcTemplate — finds 180 tests with DB patterns.
2. Runs mvn test and parses the Surefire report XML for per-test timing — confirms those 180 tests average 2.5 seconds each.
3. Recommends: reclassify those 180 as higher-level tests and move them behind a separate profile (mvn test -P integration). For the business logic inside those DB tests, extract pure computation into separate classes and add real unit tests. Use Parameterize Constructor or Extract Interface to inject a fake repository for the logic currently entangled with DB calls.

Situation: A team is writing tests for an OrderService that calls a PaymentGateway. Half the team wants to write a FakePaymentGateway class; the other half wants Mockito's when(gateway.charge(...)).thenReturn(...) pattern.
What the skill does:
1. Greps for existing doubles — FakePaymentGateway, MockPaymentGateway, StubPaymentGateway — finds none.
2. Recommends starting with a fake: FakePaymentGateway implements PaymentGateway with a getLastChargeAmount() query method. The test calls orderService.submitOrder(order), then asserts fakeGateway.getLastChargeAmount() equals the expected amount. No framework needed.
3. Notes the switchover point: when verifying call behavior becomes the core assertion, move to Mockito's verify() pattern. The mock framework earns its complexity at that point.

Situation: A Python codebase has 95 tests in tests/. The team has no idea which are unit tests. Some use requests_mock, some use pytest-django, some are plain assert statements.
What the skill does:
1. Globs tests/**/*.py, greps for: requests, os.path, open(, @pytest.mark.django_db, client.get(, tmpdir, os.environ.
2. Tests marked @pytest.mark.django_db — higher-level (DB), legitimate, leave in place.
3. Tests using requests_mock — these are unit tests (the fake intercepts network; no real network calls). Pass.
4. Tests using tmpdir (pytest fixture for temp filesystem) — examine case by case: if the business logic only needs a file path abstraction, reclassify and extract; if the test genuinely exercises file handling behavior, reclassify as higher-level.
5. Plain assert on pure functions — unit tests, fast, no dependencies.
6. pytest --durations=10 reveals the @pytest.mark.django_db tests average 300ms; pure tests average 3ms.
7. Test double assessment: requests_mock = mock pattern, recommended; unittest.mock.patch in 6 tests = mock overuse suspected — review whether patching internals is causing brittleness.

Reference files:
- references/unit-test-criteria.md — full text of Feathers' two positive qualities and four exclusion criteria (Chapter 2)
- references/fake-vs-mock-objects.md — structural anatomy of fake objects (two sides) and mock objects (internal assertions), with selection rule (Chapter 3, Chapter 5)
- references/test-type-taxonomy.md — unit / higher-level / characterization taxonomy with philosophy note on "testing to detect change" vs "testing to show correctness" (Chapter 2, Chapter 13)

CC-BY-SA 4.0. Derived from "Working Effectively with Legacy Code" by Michael C. Feathers (2004). Skill content is transformative — original book required for full context.
Related skills:
- characterization-test-writing — once you know which tests are characterization tests (this skill classifies them), use this to write more
- legacy-code-change-algorithm — the master procedure for safe change; this skill's output feeds directly into Step 4 (Write Tests) of that algorithm
- dependency-breaking-technique-executor — when this skill identifies tests that hit DB/network/FS, that skill provides the mechanics to break those dependencies and enable true unit tests