Install
```
openclaw skills install bookforge-unit-test-quality-checker
```

Evaluate a test suite against rigorous unit-test criteria, classify test types, and choose between fake and mock objects. Use whenever a developer asks 'are these unit tests?', 'why is my test suite slow', 'should I use mocks or fakes', 'fake vs mock', 'what's wrong with my tests', 'my tests hit the database / network / filesystem', 'how do I speed up tests', 'unit vs integration', or when reviewing a codebase's test quality. Activates for 'unit test', 'fake object', 'mock object', 'test double', 'test speed', 'test isolation', 'xUnit', 'JUnit / NUnit / CppUnit', 'slow test suite', 'flaky tests', 'test pyramid'.
Evaluate whether existing tests are true unit tests per Feathers' criteria, classify each test by type, assess test double choices, and produce a scored quality report with a prioritized improvement plan.
Use this skill for the trigger situations listed above: classifying existing tests, diagnosing a slow suite, or deciding between test doubles.
Do not use this skill to write brand-new tests from scratch. For writing characterization tests, use characterization-test-writing. For breaking the dependencies that make tests slow, use dependency-breaking-technique-executor.
Before beginning, confirm:
- Where the tests live (test/, src/test/, spec/, etc.)
- The language and test framework in use
- Whether the test suite can be run via Bash

If the user cannot run tests, timing analysis will be static (pattern-based). Flag this explicitly in the report.
Use Glob and Grep to enumerate test files and count individual test cases.
Glob: test/**/*.{java,py,ts,js,rb,cs,go} (adapt to language)
Grep: pattern for test method/function declarations
Record: total test count, test framework detected, test file structure.
Why: You cannot classify tests you cannot see. A count also reveals whether the suite is at a scale where slow tests compound into minutes of build lag.
For each test file, grep for patterns that signal disqualification from unit-test status.
The four disqualifiers — a test is NOT a unit test if it:
1. Talks to a database
2. Communicates across the network
3. Touches the filesystem
4. Requires special environment setup to run
Grep patterns to run (adapt to language):
| Disqualifier | Patterns to search |
|---|---|
| Database | DataSource, EntityManager, @Repository, sqlite3, psycopg2, mongoose, ActiveRecord, db.execute, cursor.execute |
| Network | HttpClient, requests.get, fetch(, RestTemplate, socket., urllib, axios, WebClient |
| Filesystem | File(, open(, fs.readFile, FileInputStream, Path.of, os.path, tmpfile, shutil |
| Env setup | System.getenv, os.environ, process.env, dotenv, application.properties loaded at test level, @SpringBootTest, @IntegrationTest |
For each test flagged, record: file, test name, disqualifier type, line reference.
Why: These four disqualifiers are concrete and checkable. A test can be logically small while still violating all four. The disqualifiers are the reason Feathers' definition diverges from the informal "small test" meaning — the exclusions exist because DB/network/FS/env dependencies make tests slow, flaky, and unable to localize failures.
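A sketch of this scan in Python, using a trimmed version of the pattern table above. Literal substring matching is an assumption; a real pass would use language-aware regexes and exclude comments.

```python
from pathlib import Path

# Trimmed pattern table from the step above (assumption: literal substring
# matching is good enough for a first pass).
DISQUALIFIERS = {
    "database":   ["DataSource", "EntityManager", "sqlite3", "psycopg2",
                   "db.execute", "cursor.execute"],
    "network":    ["HttpClient", "requests.get", "fetch(", "urllib", "axios"],
    "filesystem": ["fs.readFile", "FileInputStream", "open(", "shutil"],
    "env":        ["System.getenv", "os.environ", "process.env", "@SpringBootTest"],
}

def scan_file(path: str) -> list:
    """Return (disqualifier, line_number, line_text) hits for one test file."""
    hits = []
    lines = Path(path).read_text(errors="ignore").splitlines()
    for lineno, line in enumerate(lines, 1):
        for kind, needles in DISQUALIFIERS.items():
            if any(n in line for n in needles):
                hits.append((kind, lineno, line.strip()))
    return hits
```

Each hit carries exactly the fields the report template needs: disqualifier type, line reference, and the offending line for context.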
If Bash access is available, run the test suite (or a representative subset) and measure per-test timing.
```sh
# JUnit / Maven
mvn test -pl module 2>&1 | grep "Tests run"

# pytest with timing
pytest --tb=no -q --durations=20

# Jest
npx jest --verbose --testPathPattern="..."

# Go
go test ./... -v -timeout 60s 2>&1 | grep -E "^--- (PASS|FAIL)"
```
The threshold: any individual test taking longer than 1/10 second (100ms) is a slow unit test. This is not a guideline; it is the definition.
Flag every test exceeding 100ms. For suites where aggregate time exceeds 60 seconds, flag the top-10 slowest tests as priority targets.
Why: Speed is not a nice-to-have quality — it is definitional. A slow test suite stops giving developers fast feedback. When tests take minutes, developers stop running them. The entire feedback loop breaks.
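Applying the 100ms rule and the 60-second aggregate check can be sketched as below. The input is assumed to be a per-test timing map in milliseconds (e.g. parsed from `pytest --durations` output or a Surefire report); the function names are illustrative.

```python
def flag_slow_tests(timings_ms: dict, threshold_ms: float = 100.0):
    """Apply the 100ms rule: return (slow_tests, top10), slowest first."""
    slow = sorted(
        ((name, ms) for name, ms in timings_ms.items() if ms > threshold_ms),
        key=lambda item: item[1],
        reverse=True,
    )
    return slow, slow[:10]

def needs_priority_pass(timings_ms: dict) -> bool:
    """Aggregate check: suites over 60s get top-10 flagged as priority targets."""
    return sum(timings_ms.values()) > 60_000
```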
Using the exclusion results and timing data, classify each test into one of three categories:
Unit test: Passes all four exclusion criteria AND runs in under 100ms. Failure points to a specific, small scope of code.
Higher-level test: Intentionally covers interactions across multiple classes or components. May cross boundaries (DB, network, FS) by design. Slower. Useful for pinning down integration behavior when unit tests are hard to write. These are legitimate — do not demand they meet unit-test criteria. They serve a different purpose.
Characterization test: Written to document actual current behavior, not to prove correctness. Typically found in legacy codebases as safety nets before refactoring. Key marker: the assertion matches what the code currently does (possibly including bugs), not what it should do. Purpose: hold behavior stable while nearby code changes. Legitimate category — do not reclassify as bugs.
Edge/unclear: Tests that do not clearly fit the above — include in report for manual review.
Why: Treating all slow tests as "bad unit tests" is wrong. A higher-level test covering a checkout workflow is supposed to hit the database — that is its job. Misclassification leads to either abandoning useful tests or writing pointless unit tests for integration behavior. The classification enables the right improvement per category.
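A decision-rule sketch for this classification step, assuming the exclusion scan and timing data from the earlier steps. The characterization marker is caller-supplied (detecting "asserts current behavior, not intended behavior" needs human judgment); the signature is illustrative.

```python
def classify_test(hits: list, duration_ms: float,
                  is_characterization: bool = False) -> str:
    """Classify one test from its disqualifier hits and timing.

    hits: disqualifier kinds found ("database", "network", ...); empty if clean.
    is_characterization: caller-supplied flag (assertions pin current behavior).
    """
    if is_characterization:
        return "characterization"
    if hits:                 # crosses a boundary: higher-level by definition
        return "higher-level"
    if duration_ms <= 100:   # clean AND fast: a true unit test
        return "unit"
    return "unclear"         # clean but slow: flag for manual review
```

Note the last branch: a test with no boundary hits that is still slow does not get silently promoted to "unit"; it lands in the report's manual-review bucket.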
Grep for existing fakes and mocks in the test codebase.
Grep: "class Fake", "Mock", "Stub", "Spy", "when(", "verify(", "mock(", "thenReturn"
For each test double found, assess:
Is it a fake object or a mock object?
A fake object impersonates a collaborator. It has two sides:
- The interface side: it implements the same interface as the real collaborator, so production code can call it without knowing the difference.
- The sensing side: it exposes query methods (e.g., getLastLine(), getLastCall()) that reveal what the code under test did.

Sensing is external: the test calls a query method to inspect what happened.
A mock object asserts conditions internally. Expectations are set before the operation, and verify() is called after. The assertions live inside the mock, not the test body. Most mock frameworks (Mockito, unittest.mock, RSpec doubles) implement this pattern.
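The two anatomies can be contrasted in a short Python sketch. The Display/FakeDisplay names echo the fake-object description above; everything else is illustrative.

```python
from unittest.mock import Mock

class Display:
    """Collaborator interface (illustrative)."""
    def show_line(self, text: str) -> None: ...

# Fake: impersonates the collaborator. Side 1 implements the interface the
# production code calls; side 2 offers query methods for sensing.
class FakeDisplay(Display):
    def __init__(self):
        self._last = None
    def show_line(self, text: str) -> None:   # interface side
        self._last = text
    def get_last_line(self) -> str:           # sensing side
        return self._last

# Sensing is external: the TEST inspects the fake via a query method.
fake = FakeDisplay()
fake.show_line("$1.99")
assert fake.get_last_line() == "$1.99"

# Mock: the verification lives inside the mock framework, not the test body.
mock = Mock(spec=Display)
mock.show_line("$1.99")
mock.show_line.assert_called_once_with("$1.99")  # internal verification
```

The fake needs no framework at all, which is part of why it is the default recommendation.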
Selection recommendation:
The rule: "Simple fake objects suffice in most situations."
Why: The choice is not aesthetic. Fakes externalize sensing (the test controls what to check). Mocks internalize assertions (the mock controls what matters). When fakes proliferate with repetitive query methods, mocks become the more maintainable option. But mock overuse — especially mocking everything — produces brittle tests that break on implementation changes rather than behavior changes.
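The brittleness point can be made concrete with a small sketch. The Rates collaborator and total_with_tax function are hypothetical; the contrast between the two assertion styles is the point.

```python
from unittest.mock import Mock

class Rates:
    """Hypothetical collaborator (illustrative)."""
    def lookup(self, region: str) -> float: ...

def total_with_tax(amount: float, rates: Rates, region: str) -> float:
    return amount * (1 + rates.lookup(region))

stub = Mock(spec=Rates)
stub.lookup.return_value = 0.25

# Behavior-style assertion: only the output matters. Survives refactoring.
assert total_with_tax(100, stub, "CA") == 125.0

# Interaction-style assertion: pins HOW the result is computed. A refactor
# that caches the rate, or looks it up twice, breaks this test even though
# the observable behavior is unchanged.
stub.lookup.assert_called_once_with("CA")
```

Reserve interaction assertions for cases where the call itself is the behavior under test (the gateway must be charged exactly once), not an implementation detail.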
Write the report artifact to test-suite-quality-report.md using the template in the Outputs section.
| Input | Required | Description |
|---|---|---|
| Test directory or file paths | Yes | Where tests live in the codebase |
| Language and test framework | Yes | For correct grep patterns and timing commands |
| Source code (for context on collaborators) | Recommended | Helps identify what test doubles are substituting |
| Ability to run tests via Bash | Optional | Required for timing data; static analysis possible without |
Write test-suite-quality-report.md to the project root (or a location the user specifies).
Report template:
# Test Suite Quality Report
Generated: [date]
Test framework: [framework]
Total tests found: [N]
---
## Summary
| Category | Count | % of Total |
|---|---|---|
| Unit tests (passing all criteria) | | |
| Higher-level tests (intentional) | | |
| Characterization tests | | |
| Disqualified (claim to be unit, are not) | | |
| Unclear / needs review | | |
Slow tests (>100ms): [N]
Suite aggregate time: [Xs]
---
## Disqualification Violations
| Test | File | Disqualifier | Line | Recommendation |
|---|---|---|---|---|
| [test name] | [path] | [DB / network / FS / env] | [line] | [break dependency / reclassify] |
---
## Timing Results
| Test | Time (ms) | Category | Action |
|---|---|---|---|
| [test name] | [N]ms | [unit/higher/char] | [break dep / reclassify / ok] |
Top-10 slowest: [list]
---
## Test Double Assessment
| Double | File | Type (fake/mock) | Justified? | Recommendation |
|---|---|---|---|---|
| [name] | [path] | [fake/mock] | [yes/no] | [keep/convert/simplify] |
---
## Prioritized Improvement Plan
### High Priority (impacts speed most)
1. [Test name] — [action] — estimated gain: [Xs]
### Medium Priority (disqualification violations)
1. [Test name] — [action]
### Low Priority (reclassification / cleanup)
1. [Test name] — [action]
---
## Notes
- Higher-level tests left in place: [list with rationale]
- Characterization tests identified: [list]
- Tests requiring manual review: [list]
Speed is definitional, not optional. A test that takes 1/10 second to run is, by definition, a slow unit test. This is not a style preference — it is the criterion that separates tests that give fast feedback from tests that do not.
If it talks to DB, network, filesystem, or requires env setup, it is not a unit test. These four disqualifiers are binary and checkable. A test violating any one of them is a higher-level test, regardless of how small or well-named it is.
Fake objects first; mocks when expectations become the test. Fakes are simpler and always available. Move to a mocking framework when writing many similar fakes or when verifying call behavior is the core assertion.
Characterization tests are legitimate — do not demand they meet unit-test criteria. They document actual behavior, not intended behavior. Classifying them as "broken unit tests" and deleting them removes the safety net before refactoring.
Higher-level tests are not failures. A test that intentionally covers an integration boundary serves a different purpose than a unit test. The improvement goal is not to eliminate all higher-level tests — it is to ensure tests that claim to be unit tests actually are, and to write unit tests for logic that currently has none.
Situation: A team reports their "unit test" suite of 320 tests takes 8 minutes to run. They run it rarely as a result.
What the skill does:
1. Greps for DataSource, @Transactional, EntityManager, JdbcTemplate — finds 180 tests with DB patterns.
2. Runs mvn test and parses the Surefire report XML for per-test timing — confirms those 180 tests average 2.5 seconds each.
3. Recommends: reclassify those 180 as higher-level tests and move them behind a separate profile (mvn test -P integration). For the business logic inside those DB tests, extract pure computation into separate classes and add real unit tests. Use Parameterize Constructor or Extract Interface to inject a fake repository for the logic currently entangled with DB calls.

Situation: A team is writing tests for an OrderService that calls a PaymentGateway. Half the team wants to write a FakePaymentGateway class; the other half wants Mockito's when(gateway.charge(...)).thenReturn(...) pattern.
What the skill does:
1. Greps for existing doubles — FakePaymentGateway, MockPaymentGateway, StubPaymentGateway — finds none.
2. Recommends starting with a fake: FakePaymentGateway implements PaymentGateway with a getLastChargeAmount() query method. The test calls orderService.submitOrder(order), then asserts fakeGateway.getLastChargeAmount() equals the expected amount. No framework needed.
3. Notes the switchover point: when verifying call behavior becomes the core assertion, move to Mockito's verify() pattern. The mock framework earns its complexity at that point.

Situation: A Python codebase has 95 tests in tests/. The team has no idea which are unit tests. Some use requests_mock, some use pytest-django, some are plain assert statements.
What the skill does:
1. Globs tests/**/*.py, greps for: requests, os.path, open(, @pytest.mark.django_db, client.get(, tmpdir, os.environ.
2. Tests marked @pytest.mark.django_db — higher-level (DB), legitimate, leave in place.
3. Tests using requests_mock — these are unit tests (the fake intercepts network; no real network calls). Pass.
4. Tests using tmpdir (pytest fixture for temp filesystem) — examine case by case: if the business logic only needs a file path abstraction, reclassify and extract; if the test genuinely exercises file handling behavior, reclassify as higher-level.
5. Plain assert on pure functions — unit tests, fast, no dependencies.
6. pytest --durations=10 reveals the @pytest.mark.django_db tests average 300ms; pure tests average 3ms.
7. Test double assessment: requests_mock = mock pattern, recommended; unittest.mock.patch in 6 tests = mock overuse suspected — review whether patching internals is causing brittleness.

Reference files:
- references/unit-test-criteria.md — full text of Feathers' two positive qualities and four exclusion criteria (Chapter 2)
- references/fake-vs-mock-objects.md — structural anatomy of fake objects (two sides) and mock objects (internal assertions), with selection rule (Chapter 3, Chapter 5)
- references/test-type-taxonomy.md — unit / higher-level / characterization taxonomy with philosophy note on "testing to detect change" vs "testing to show correctness" (Chapter 2, Chapter 13)

CC-BY-SA 4.0. Derived from "Working Effectively with Legacy Code" by Michael C. Feathers (2004). Skill content is transformative — original book required for full context.
Related skills:
- characterization-test-writing — once you know which tests are characterization tests (this skill classifies them), use this to write more
- legacy-code-change-algorithm — the master procedure for safe change; this skill's output feeds directly into Step 4 (Write Tests) of that algorithm
- dependency-breaking-technique-executor — when this skill identifies tests that hit DB/network/FS, that skill provides the mechanics to break those dependencies and enable true unit tests