Tinker AI
Read reviews
intermediate 5 min read

TDD with aider: writing the test first and letting the model fill in the implementation

Published 2026-04-05 by Owner

Aider works better when you give it a specific failing test as context. Better than asking it to “implement and test feature X” simultaneously. The structure of TDD — write the test first, watch it fail, write enough code to pass — fits aider’s strengths in a way that the broader “do everything” prompt doesn’t.

I’ve been using aider in a TDD-first style for about six months. Here’s the workflow.

The problem with “implement and test”

If you ask aider to “add a function that validates email addresses and write tests for it,” what you get is:

  • An implementation that passes its own tests
  • A test suite that exercises the implementation
  • Both shaped by the model’s assumptions about edge cases

Where this goes wrong: the implementation defines what “correct” means. The tests check that the implementation does what the implementation does. If the implementation has the wrong idea about Unicode handling or domain length limits, the tests confirm that wrong idea.

You can review and catch this. But review is harder when the test is also AI-generated, because you’re checking a moving target.

The TDD alternative

Write the test by hand. Be specific about edge cases:

import { validateEmail } from './email';

describe('validateEmail', () => {
  it('returns true for a valid email', () => {
    expect(validateEmail('user@example.com')).toBe(true);
  });

  it('returns false for missing @ sign', () => {
    expect(validateEmail('user.example.com')).toBe(false);
  });

  it('returns false for empty string', () => {
    expect(validateEmail('')).toBe(false);
  });

  it('handles Unicode local parts', () => {
    expect(validateEmail('用户@example.com')).toBe(true);
  });

  it('rejects domains shorter than 4 characters', () => {
    expect(validateEmail('a@b.c')).toBe(false);
  });

  it('rejects local parts longer than 64 characters', () => {
    const long = 'a'.repeat(65);
    expect(validateEmail(`${long}@example.com`)).toBe(false);
  });
});

Save this. Run it. Watch it fail (the function doesn’t exist yet).

Now ask aider:

> /add src/lib/email.ts
> /add src/lib/email.test.ts
> implement validateEmail to make these tests pass. do not modify the tests.

The model has a precise specification. Each test is an assertion about behavior. The model writes code to satisfy them.

Why this produces better code

Several reasons:

The spec is in your hands. You decide what edge cases matter. The model doesn’t get to invent or skip cases.

Review is smaller. When you review aider’s output, you’re checking that the implementation matches your spec. The spec is something you wrote and thought about. The implementation is the only AI-generated part.

Failing tests are clear feedback. When aider’s first attempt doesn’t pass, the failing test names point at exactly what’s wrong. The next prompt can be specific: “the Unicode local parts test is failing; the regex isn’t matching 用户.”

You catch implementation tricks. Aider sometimes finds clever-but-wrong ways to pass tests (memoizing a hardcoded list, special-casing the test inputs). When the test suite is yours and you’ve thought about it, you notice this. When the tests are also AI-generated, the trick passes review.

A workflow rhythm

The rhythm I use:

  1. Write 1-3 tests by hand (the most important behaviors)
  2. Run the tests; verify they fail in expected ways
  3. Ask aider to implement the function to pass them
  4. Review the implementation
  5. Add more tests for additional cases
  6. Repeat

The cycle is fast — typically 2-5 minutes per iteration on a function. The implementation evolves with the test suite. By the end of a feature, I have implementation and tests that I trust because I wrote the spec.

For larger features, the same pattern scales. Write the test for the public API behavior. Let aider implement. Then write tests for the next layer down (helper functions, internal contracts) and repeat.

What aider does well in this mode

Boring implementations. When the test specifies “function returns sum of array,” aider produces a one-liner reduce. No drama, no over-engineering.

Edge case handling. When a test says “throws on empty input,” aider writes the throw. It doesn’t add a “graceful fallback” you didn’t ask for.

Mechanical translation. Pattern from the test (input shape, output shape) into implementation. Aider is reliable at this when the test is unambiguous.

Refactoring while preserving tests. “Rewrite this function to be more efficient; tests must continue to pass.” Aider stays within the spec.

What aider does poorly in this mode

Tests that are hard to write first. Some functions are hard to specify before you know what’s possible. Functions that integrate with external services, functions that depend on complex global state, functions that produce probabilistic output. For these, TDD isn’t always the right approach with or without aider.

Tests that imply more than they specify. A test that “user can log in successfully” specifies the happy path but not the error paths. Aider implements the happy path correctly and may or may not handle errors well. You’d add more tests; the workflow still works, just slower.

Performance constraints. Tests can verify behavior; they can’t easily verify “this completes in under 10ms with 10k inputs.” Aider produces working but potentially slow code. Add explicit performance assertions or know to look for performance issues in review.

The instruction that matters most

The single most important line in the prompt: “do not modify the tests.”

Without it, when aider’s implementation fails a test, the model occasionally “fixes” the test instead of the implementation. The test “fix” usually changes the assertion to match the buggy implementation. You end up with passing tests and broken behavior.

With “do not modify the tests” in the prompt, the model fixes the implementation. If it can’t, it tells you what’s wrong. That’s the right failure mode.

I include it in every TDD prompt. Aider follows it reliably. Cline and Cursor sometimes don’t follow it as strictly; for those tools, you’d want to verify tests didn’t change after each turn.

Where this changes what I write

A side effect of TDD with aider: I write more tests than I used to. The cost of writing a test was, before, “I have to write the test, then I have to write the implementation, then I run them.” With aider in the loop, the cost is “I write the test, I check that aider’s implementation passes it.” The implementation is largely free, which makes the test the bottleneck.

The natural response is to write more thorough tests. More edge cases get covered because the cost of covering them is just thinking about them, not implementing for them.

This is genuinely a productivity win. It’s the only workflow change from AI tools that I’d describe as “I write better-tested code now than I did before AI” rather than “I write code faster now.”