Should the AI write the test, the implementation, or both? Three patterns
Published 2026-05-11 by Owner
The question comes up every time you’re working with an AI coding assistant on non-trivial code: who writes the test, and who writes the implementation? There are three practical patterns, each with different tradeoffs. Two of them are solid. One has a failure mode that’s worth understanding before you use it.
The default for most developers is to let the AI do everything — describe the task, get back implementation plus tests, accept both, move on. That’s Pattern C, and on the surface it looks fine. The tests pass. CI is green. The problem only shows up later, when you change the function and the tests keep passing even though the behavior is wrong, or when a bug surfaces that the tests should have caught and didn’t.
Pattern A: AI writes tests, you write implementation
Give the model the function signature and a description of expected behavior. Ask it to write the test file. Then you write the implementation.
// Prompt: "Write tests for this function signature.
// parseAmount(input: string): { value: number; currency: string } | null
// Returns null if input can't be parsed.
// Handles '$12.50', '€ 99', '12 USD'. Rejects '12.50.00'."
// AI produces:
import { describe, it, expect } from 'vitest';
import { parseAmount } from './parseAmount';
describe('parseAmount', () => {
it('parses dollar amounts', () => {
expect(parseAmount('$12.50')).toEqual({ value: 12.5, currency: 'USD' });
});
it('parses euro amounts with space', () => {
expect(parseAmount('€ 99')).toEqual({ value: 99, currency: 'EUR' });
});
it('parses currency code suffix', () => {
expect(parseAmount('12 USD')).toEqual({ value: 12, currency: 'USD' });
});
it('returns null for double-decimal input', () => {
expect(parseAmount('12.50.00')).toBeNull();
});
it('returns null for empty string', () => {
expect(parseAmount('')).toBeNull();
});
});
You now have a specification written as runnable assertions. The tests are wrong in some ways — the AI doesn’t know your actual parsing rules, so you’ll edit them. But that editing process forces you to think precisely about the behavior before writing a line of implementation. That’s the hidden value of Pattern A: the AI’s imperfect attempt at a test forces you to confront the spec decisions you hadn’t made yet. Does '€ 99' get mapped to EUR? What about 'GBP 12' — is the currency prefix or suffix? The AI assumes some answer; you correct it; you’ve now made an explicit decision rather than an implicit one buried in implementation code.
The control stays with you. The model produced a scaffold that you own and verify before it drives any implementation decisions. The tests commit to nothing until you accept them. And because you’re writing the implementation yourself, you know exactly why each case is handled the way it is.
When this fits: exploratory work where the behavior isn’t fully specified yet. The test-writing step forces the spec into concrete form. Also useful when you want to stay close to the implementation logic because the domain is subtle or security-relevant — parsing financial data, handling auth tokens, anything where a subtle misunderstanding in the implementation would be a real problem. The AI gives you the test harness; the human judgment goes into the code that has to be right.
Pattern B: You write tests, AI writes implementation
This is the closest analog to classical TDD in an AI-assisted workflow. You write failing tests that encode your requirements. You hand them to the model and ask it to write implementation that passes them.
// You write:
describe('rateLimiter', () => {
it('allows requests under the limit', () => {
const limiter = createRateLimiter({ maxRequests: 3, windowMs: 1000 });
expect(limiter.check('user-1')).toBe(true);
expect(limiter.check('user-1')).toBe(true);
expect(limiter.check('user-1')).toBe(true);
});
it('blocks the request that exceeds the limit', () => {
const limiter = createRateLimiter({ maxRequests: 3, windowMs: 1000 });
limiter.check('user-1');
limiter.check('user-1');
limiter.check('user-1');
expect(limiter.check('user-1')).toBe(false);
});
it('resets after the window expires', () => {
vi.useFakeTimers();
const limiter = createRateLimiter({ maxRequests: 2, windowMs: 500 });
limiter.check('user-1');
limiter.check('user-1');
vi.advanceTimersByTime(501);
expect(limiter.check('user-1')).toBe(true);
vi.useRealTimers();
});
it('tracks users independently', () => {
const limiter = createRateLimiter({ maxRequests: 1, windowMs: 1000 });
limiter.check('user-1');
expect(limiter.check('user-2')).toBe(true);
});
});
The model implements createRateLimiter to pass these. The constraints are fully explicit — window expiry, per-user isolation, limit boundary — so the AI isn’t guessing at behavior. It’s solving a clearly defined problem. You can review the implementation as a reader of code that has to satisfy a clear contract, which is a much easier review than evaluating “does this implementation do the right thing” without a spec to compare against.
This pattern also works well for bug fixes. When a bug gets reported, the first step is to write a test that reproduces it — a test that currently fails. Then hand the failing test to the AI and ask it to fix the implementation. The fix is verifiable by definition: if the test passes, the specific bug is gone. The AI can’t accidentally “fix” it by removing the test or special-casing the assertion, because you wrote the test and you didn’t give the AI permission to change it.
When this fits: any time you can specify the behavior completely before writing code. API boundary functions, utility logic, data transformations, state machines. The harder the spec is to write, the more valuable this pattern is: writing a precise test suite for a rate limiter forces you to think through the window expiry semantics before any code exists. If writing the test feels hard, that’s useful information — it means the requirements aren’t as clear as they seemed, and the difficulty surfaced before any implementation was built.
Pattern C: AI writes both — and the cheating risk
The most common thing people actually do: hand the AI a task description and let it write tests and implementation together. This is fast. It also has a structural problem.
When the same model writes both, it can satisfy the test suite without testing real behavior. There are two failure modes:
Overfitting the implementation to the test. The implementation special-cases the exact inputs in the tests rather than implementing general logic. The tests pass; the code fails on any input outside the test cases.
Writing weak tests for its own implementation. The model knows what its implementation does. It writes tests that confirm what the code already does rather than probing whether the code does the right thing.
The second failure mode is harder to notice. The tests look real. They have descriptive names, multiple cases, reasonable structure. But they’re testing the implementation’s actual behavior rather than the required behavior.
// A test that looks fine but is almost useless:
it('processes the order', () => {
const result = processOrder({ items: [{ id: 1, qty: 2 }] });
expect(result).toBeDefined();
expect(result.orderId).toBeDefined();
expect(result.status).toBeDefined();
});
This test passes as long as the function returns an object with those fields. It says nothing about whether orderId is valid, whether status reflects actual processing logic, or what happens with edge-case input. An AI that wrote both the function and the test has no incentive to probe those things — it knows the function returns something with orderId, so it asserts that.
The subtler variant of this failure: the model writes tests that look behaviorally specific but are actually derived from the implementation. The model knows its processOrder function generates a sequential integer orderId starting from 1, so it writes expect(result.orderId).toBe(1). This looks like a value check — and it is — but it’s a value check that only captures what the implementation happens to do, not what the contract requires. The real spec might be “orderId is a non-negative integer” or “orderId matches the format returned by the order queue.” The test that passes 1 doesn’t distinguish between those.
This is why Pattern C carries a higher ongoing maintenance cost even when the initial tests look fine. When you change the implementation — maybe orderId now comes from a UUID generator rather than a counter — the test breaks in a confusing way. You’re debugging a test that you didn’t write, for behavior you didn’t specify, failing for a reason that’s actually fine. That’s expensive friction.
The assertion smell test
A quick heuristic for evaluating AI-generated tests: does each assertion test a value, or just that something exists?
| Weak (existence check) | Strong (value check) |
|---|---|
expect(result).toBeDefined() | expect(result.total).toBe(49.99) |
expect(result.status).toBeDefined() | expect(result.status).toBe('fulfilled') |
expect(errors).toHaveLength(1) | expect(errors[0].code).toBe('INVALID_EMAIL') |
expect(typeof result).toBe('string') | expect(result).toBe('2024-01-15') |
If more than a third of the assertions in an AI-generated test file are existence checks — .toBeDefined(), .not.toBeNull(), typeof === 'string' — the tests are probably describing the implementation rather than specifying behavior.
The fix is simple: go through each assertion and ask “what specific value should this be, and why?” If you can answer that, replace the existence check. If you can’t, the spec isn’t clear enough and you have a design problem to solve before worrying about the test.
A corollary: watch for toMatchObject assertions where the object shape barely constrains anything. expect(result).toMatchObject({ success: true }) is almost as weak as .toBeDefined(). The test passes whether result is { success: true } or { success: true, balance: -9999, accountDeleted: true }. A strong assertion constrains the whole relevant output, not just one field you happened to check.
This smell test works on AI-generated tests and human-written tests alike. It just shows up more reliably in Pattern C output, because when a human writes a test first, they have to think about what value to assert before the implementation exists — that friction forces specificity. When an AI writes tests after it has already written the implementation, the specific values are handed to it, so it asserts them. But the values it asserts come from the implementation, not from any external spec.
When to use each pattern
The patterns aren’t mutually exclusive on a project — they’re right for different situations within the same codebase.
Pattern A fits when the spec is fuzzy. Writing tests is a form of specification. Letting the AI draft them and then editing them as a human is often faster than drafting a full written spec first. Keep it for utility functions and boundary code where you need to stay close to the logic.
Pattern B fits when the spec is clear. You know what the function needs to do; you can write assertions before you know how it will be implemented. Rate limiters, parsers, validators, business rules with concrete acceptance criteria. This is also the right pattern for fixing bugs — write a test that reproduces the bug before asking the AI to fix it. The test proves the bug is fixed rather than just gone from the surface.
Pattern C fits for low-stakes one-offs: scaffolding a new file type, a quick helper that runs in a script, code you’ll throw away in a week. Use it with the assertion smell test applied afterward — spend five minutes reading the assertions before accepting the test file. Don’t use it for anything that touches money, auth, or data users trust you to handle correctly. The CI check that says “tests pass” is only as meaningful as the tests are — and in Pattern C, you didn’t write the tests, so you need to verify they’re actually testing behavior before you rely on them.
The earned insight from using all three: the assignment of “who writes the test” is also an assignment of “who owns the specification.” When the AI owns both, the specification is implicit in the implementation. When a human owns either, the specification is explicit — and explicit specifications are the only kind that actually get reviewed.