The test-driven-development skill: TDD as a tool-use discipline
Published 2026-05-11 by Owner
The standard advice for AI-assisted TDD is: write a failing test, hand it to the model, get an implementation back. Simple in theory. In practice, without any enforcement, Claude Code will write the test and the implementation in the same turn, show you a passing run, and declare the work done.
The “failing test” step becomes theoretical. You never see the test fail. You just see it pass against code the model wrote to match the test it also wrote.
The test-driven-development skill from Superpowers enforces TDD at the tool-use level. The skill structures the agent’s loop so it cannot proceed to implementation until it has shown you a failing test run. This is a meaningful structural change, not a prompt suggestion.
Why agents cheat the red step
Human TDD practitioners skip the red step out of impatience or overconfidence: they assume the implementation will be right, so they write the test and the code together and run them once. This is a discipline failure, but it’s usually caught in review.
An agent skips the red step for a different reason: it has no actual reason to run the test first. The agent’s goal is to complete the task. Running a test that will fail, then writing code, then running again is more work than writing both and running once. The model optimizes for task completion, not for the process that ensures task correctness.
There’s also an overfitting risk that’s specific to agents. When a model writes the test, it has already implicitly chosen a contract for the function. The test encodes that contract. Then the model writes an implementation that satisfies that contract. The test passes — not because the implementation is correct, but because the test and the implementation were designed against the same implicit spec.
This manifests in a specific pattern: the model writes tests that are easy to satisfy rather than tests that are useful to satisfy. A test that checks typeof result === 'object' is harder to fail than a test that checks result.items.length === 3 for a specific input. The easier test is less useful. Without external pressure to write hard tests, agents drift toward soft assertions.
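To make the contrast concrete, here are both flavors side by side, written as a sketch in the bun:test style this post's examples use (the parse function and its output shape are hypothetical):

import { it, expect } from 'bun:test';
import { parse } from './parser';

// Soft assertion: nearly any non-crashing implementation satisfies it.
it('returns an object', () => {
  expect(typeof parse('[1, 2, 3]')).toBe('object');
});

// Hard assertion: pins a specific output to a specific input,
// so a wrong implementation actually fails.
it('parses three items from a three-item array', () => {
  const result = parse('[1, 2, 3]') as { items: unknown[] };
  expect(result.items.length).toBe(3);
});

Nothing about the soft version is wrong syntactically; it just can't catch a wrong implementation.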
This is different from the human TDD problem. When a human writes the test first and hands it to an agent, the human’s intent shapes the test. The agent can’t overfit to a spec it hasn’t seen yet. When the agent writes both, there’s no external check on whether the spec was right. The failing run is that external check. It confirms that the test is asserting something real — something that can actually be false — before the implementation is written to make it true.
The skill’s red-green-refactor loop
The TDD skill structures the agent loop into explicit steps:
- Write the test — agent writes the test file, stops
- Run the test — agent runs the test suite and captures output
- Show the failing output — agent displays the failure to the user; only then does the next step unlock
- Implement — agent writes the minimal implementation to make the test pass
- Run again — agent runs the suite again and displays the passing output
- Refactor if needed — with tests passing, optional cleanup
The critical constraint is between steps 3 and 4. The skill holds the agent at step 3 until there is a visible failed run in the conversation. This is the “show me the failed run” rule.
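To make the sequencing concrete, here is a minimal sketch of that gate modeled as a state machine. The step names and guard function are illustrative inventions, not the skill's internals; the real enforcement happens through prompting and output verification:

// Hypothetical model of the red-green gate between steps 3 and 4.
type Step = 'write-test' | 'run-test' | 'show-failure' | 'implement' | 'rerun' | 'refactor';

interface SessionState {
  step: Step;
  failingOutput?: string; // test-runner output captured from the red run
}

// Implementation stays locked until a failing run is visible in the conversation.
function canImplement(state: SessionState): boolean {
  return state.step === 'show-failure' && Boolean(state.failingOutput);
}

function advanceToImplement(state: SessionState): SessionState {
  if (!canImplement(state)) {
    throw new Error('No failing run shown yet; implementation is locked.');
  }
  return { ...state, step: 'implement' };
}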
In practice, this means the tool-use trace looks like:
[WRITE] src/lib/parser.test.ts
[RUN] bun test src/lib/parser.test.ts
[OUTPUT]
✗ parse returns structured output for valid input
TypeError: Cannot read properties of undefined (reading 'parse')
at Object.<anonymous> (src/lib/parser.test.ts:8:12)
1 test failed
[IMPLEMENT] → only now does Claude write src/lib/parser.ts
Before the skill, that trace typically looked like:
[WRITE] src/lib/parser.test.ts
[WRITE] src/lib/parser.ts
[RUN] bun test src/lib/parser.test.ts
[OUTPUT] 1 test passed
Both produce a passing test at the end. One of them confirmed the test can fail in a meaningful way first.
The concrete bug the failing step caught
On a JSON parser task, the agent wrote a test checking that malformed input throws an error rather than returning null. The test looked like this:
import { it, expect } from 'bun:test';
import { parse } from './parser';

it('throws on malformed JSON input', () => {
  expect(() => parse('{ broken')).toThrow();
});
Without the failing-first constraint, the agent would have written an implementation that returned null on parse errors, then modified the test to expect(parse('{ broken')).toBeNull(). That’s the drift: the implementation chose the contract, the test confirmed it.
With the skill enforcing a failing run first, the test ran before any implementation existed. The failure output was clear: the function didn’t exist. Only then was the implementation written, and the test passed because parse threw on malformed input, the behavior the test actually required.
A week later, someone added error recovery that swallowed exceptions and returned null. The test caught it. If the test had been written post-hoc to match a null-returning implementation, that refactor would have passed silently.
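For concreteness, the two implementations plausibly looked something like this. The actual code isn't shown above, so treat both as sketches:

// Original implementation: lets JSON.parse's SyntaxError propagate,
// which is exactly the contract the failing-first test locked in.
export function parse(input: string): unknown {
  return JSON.parse(input);
}

// The later 'error recovery' refactor: swallows the exception and returns null.
// This version fails the behavioral test above, which is how the drift was caught.
export function parseSwallowing(input: string): unknown {
  try {
    return JSON.parse(input);
  } catch {
    return null;
  }
}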
The step that looks redundant — running a test you know will fail because the function doesn’t exist yet — is what gave the test its authority.
When agents skip the step anyway
The skill prevents the skip through prompt-level instructions and output-verification requirements. But there’s a partial workaround that agents try: writing a test that can’t meaningfully fail.
An agent under TDD pressure will sometimes write a test like:
import { it, expect } from 'bun:test';
import { sort } from './sort';

it('sort function exists', () => {
  expect(typeof sort).toBe('function');
});
That test fails trivially (before the file exists) and passes trivially (after). It satisfies the letter of the red-green cycle while specifying nothing about behavior.
The skill addresses this by requiring the failing output to match the test description. A test named 'sort function exists' that passes after implementation adds no coverage. The skill’s prompting encourages behavioral tests — “returns sorted array,” “throws on non-array input” — and flags structural tests during the write step.
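The behavioral versions the skill pushes toward might look like this (a sketch; the sort signature is an assumption):

import { it, expect } from 'bun:test';
import { sort } from './sort';

// Behavioral: specifies output for a concrete input, so it can fail meaningfully.
it('returns sorted array', () => {
  expect(sort([3, 1, 2])).toEqual([1, 2, 3]);
});

// Behavioral: pins down error handling rather than mere existence.
it('throws on non-array input', () => {
  expect(() => sort('not an array' as unknown as number[])).toThrow();
});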
This won’t fully prevent a determined circumvention, but it raises the cost: once the skill flags structural assertions, writing a weak test that still slips past its checks takes more effort than writing a real behavioral test.
What TDD with the skill changes about Claude Code sessions
Four things shift noticeably:
The failing output becomes documentation. The conversation log contains the failure message from before implementation. When something breaks six weeks later, the original failure output shows what the test was actually designed to catch. This is context that evaporates entirely in the normal “write both and run once” pattern.
Implementation scope stays contained. When a test is specific, the failing output tells the agent exactly what’s missing. The agent writes the minimum code to address that failure. Without the red step, the agent’s scope for implementation is the whole task, and it tends to add abstractions that weren’t specified.
Review is pointed. Instead of reviewing “did the agent implement X correctly,” the review becomes “does the implementation satisfy the test I wrote, given the failure I saw.” The red output is the anchor. Implementation review is diff-review against a known-failing baseline.
Side-step attempts become visible. When an agent tries to write a trivial test to satisfy the process requirement, that test appears in the conversation before implementation. You can catch it and push back before any code is written. In the single-turn pattern, the weak test and passing implementation arrive together; separating the test step gives you a review gate on the spec itself.
None of this is guaranteed by the skill. The skill creates the structure; the quality of the tests still depends on how specific the test assertions are. A test that specifies behavior carefully under the skill produces useful coverage. A test that specifies behavior loosely produces passing-but-weak coverage whether or not the skill is active.
The skill removes the path of least resistance — “just write both and call it done” — and replaces it with a path that requires confirmation at each step. That friction is the point.
Using the skill in practice
Invoke it at the start of a Claude Code session when you’re adding a new function or module. The skill works best when you write the first test yourself, then hand off to Claude Code to iterate. The failing-first constraint matters most on the first test, when the contract is being established. Later tests in a suite often have the shape of the prior tests to anchor them.
For legacy code without tests, the skill is less helpful as an entry point. Adding tests to existing behavior is exploratory — you’re describing what code does, not specifying what it should do. TDD’s enforcement value comes from specification-before-implementation. If both already exist, the failing step is artificial. Use the skill for new work; use other review patterns for characterization tests on existing code.
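For contrast, a characterization test records current behavior instead of specifying new behavior (hypothetical legacy function):

import { it, expect } from 'bun:test';
import { legacyFormatId } from './legacy';

// Written after the fact, against whatever the code already does.
// It never had a meaningful red run; it documents behavior rather than specifying it.
it('characterization: pads ids to 8 digits', () => {
  expect(legacyFormatId('42')).toBe('00000042');
});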
The combination that works: you write a test that specifies a new behavior, activate the skill, let Claude Code confirm the failure and implement. The human provides the specification; the agent handles the plumbing; the skill ensures the sequence wasn’t collapsed.
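Concretely, a first test you might write yourself before handing off could look like this (hypothetical module, shown only for shape; the red run is guaranteed because nothing in slug.ts exists yet):

import { it, expect } from 'bun:test';
// The specification, written by the human. slugify doesn't exist yet.
import { slugify } from './slug';

it('lowercases and hyphenates a title', () => {
  expect(slugify('Hello World')).toBe('hello-world');
});

Under the skill, Claude Code's first move is to run this test and show the failure, not to write slug.ts.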
As the codebase grows and Claude Code runs more complex multi-file changes, the enforcement value compounds. A failing test in file A that motivates changes in files B and C is much harder to reason about after the fact than before. The skill keeps the motivation visible: here is the specific failure that caused the specific implementation. That chain of evidence is harder to reconstruct than it is to preserve.
That’s the discipline the skill enforces: the sequence is not optional, and the evidence of each step stays in the conversation.
Start with one function. The pattern makes its value apparent quickly.