AI coding tools write tests competently. The output passes the linter, runs without errors, and produces a coverage number that goes up. For most teams adopting AI tools, this is one of the most-cited productivity wins. “We can finally cover all this untested code.”
Three months later, the team discovers that some of the AI-generated tests don’t actually catch bugs. They cover lines of code without verifying behavior. Coverage went up; bug detection didn’t.
This is the most common test-quality issue I see in teams using AI extensively. It’s not that AI writes bad tests. It’s that AI writes tests optimized for “this passes” rather than “this catches problems,” and the difference is invisible until something breaks.
The pattern
A function and an AI-generated test, simplified:
// The function
function calculateDiscount(price: number, customer: Customer): number {
  if (customer.tier === 'premium') return price * 0.20;
  if (customer.tier === 'gold') return price * 0.15;
  if (customer.tier === 'silver') return price * 0.10;
  return 0;
}

// The AI-generated tests
test('calculateDiscount handles premium customers', () => {
  const result = calculateDiscount(100, { tier: 'premium' });
  expect(result).toBeGreaterThan(0);
});

test('calculateDiscount handles gold customers', () => {
  const result = calculateDiscount(100, { tier: 'gold' });
  expect(result).toBeGreaterThan(0);
});

test('calculateDiscount handles silver customers', () => {
  const result = calculateDiscount(100, { tier: 'silver' });
  expect(result).toBeGreaterThan(0);
});

test('calculateDiscount handles unknown tier', () => {
  const result = calculateDiscount(100, { tier: 'unknown' });
  expect(result).toBe(0);
});
This test file:
- Has 4 tests
- Achieves 100% line coverage on the function
- Passes
- Catches almost no bugs
If a developer accidentally swaps the gold and silver discount rates (15% vs 10%), the test still passes — toBeGreaterThan(0) is satisfied by either. If the premium discount is changed from 20% to 2%, the test still passes. The test verifies that something happens, not that the right thing happens.
A better version of the test would assert on specific values:
test('calculateDiscount: premium gets 20%', () => {
  expect(calculateDiscount(100, { tier: 'premium' })).toBe(20);
});
This test catches the discount-rate-swap bug. The AI-generated version doesn’t.
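The same treatment applied to the remaining branches pins every constant in the function; the expected values follow directly from the function above:

test('calculateDiscount: gold gets 15%', () => {
  expect(calculateDiscount(100, { tier: 'gold' })).toBe(15);
});

test('calculateDiscount: silver gets 10%', () => {
  expect(calculateDiscount(100, { tier: 'silver' })).toBe(10);
});

test('calculateDiscount: unknown tier gets 0', () => {
  expect(calculateDiscount(100, { tier: 'unknown' })).toBe(0);
});

With these in place, swapping any two rates or fat-fingering a constant fails at least one test.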
Why AI writes tests this way
A few reasons the failure mode is structural, not just “the AI is bad”:
toBeGreaterThan(0) is in lots of training data. Real test code uses these patterns when the test author isn’t sure what specific value to assert. The AI imitates this in cases where a specific value would actually be appropriate.
Specific assertions risk being wrong. If the AI asserts toBe(20) and the actual value is 21 (because of a rounding rule), the test fails. The AI gets feedback that this is bad. toBeGreaterThan(0) never fails on plausible inputs, so there’s no negative signal.
The AI doesn’t know the spec. It can only infer behavior from the code itself. It can see that the function returns price * 0.20 for premium customers, but it doesn’t know whether 20% is the intended rate or whether someone made a typo when writing the function.
The AI’s tests therefore tend to be self-consistent — they verify that the function does what the function does, not that the function does what it should.
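One aside on the rounding worry above: specificity doesn’t have to mean brittleness. When a computation legitimately involves rounding or floating-point noise, Jest’s toBeCloseTo keeps the assertion pinned to a value while tolerating tiny differences. A minimal sketch, using a hypothetical applyRounding helper that isn’t part of the example above:

// Hypothetical helper that rounds a computed discount to two decimal places.
// toBeCloseTo(expected, numDigits) tolerates floating-point noise but still
// fails if the underlying rate or formula is wrong.
test('applyRounding: 19.999 rounds to 20.00', () => {
  expect(applyRounding(19.999)).toBeCloseTo(20.0, 2);
});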
The asymmetry
Here’s the part that makes this a real problem rather than just a theoretical concern:
A test that passes when it shouldn’t is invisible. You don’t see it. The team doesn’t notice. The CI is green. The coverage metric goes up. Everyone congratulates themselves on the new tests.
The cost shows up later, when:
- A developer changes the function’s behavior subtly (refactor, optimization, new tier added incorrectly)
- The change is technically a bug
- The “covered” tests pass anyway
- The bug ships
- A customer reports it
- Someone debugs and discovers the test wasn’t actually verifying behavior
This sequence is hard to attribute back to the AI’s test pattern. By the time the bug is caught, the test is months old. The connection between “AI wrote a too-loose test” and “bug shipped” is not visible from the bug ticket.
So the team doesn’t learn from it. The pattern continues. More AI-generated tests get written with the same problem. Coverage stays high. Bug detection stays low.
The pattern that fixes it
The discipline that catches this:
- Assert on specific values, not categories. toBe(20) not toBeGreaterThan(0). toEqual({ id: 1, name: 'Alice' }) not toMatchObject({ id: expect.any(Number) }). Specific assertions catch specific bugs.
- Test the contract, not the implementation. If the spec says premium customers get 20%, test that they get 20%, not “they get more than 0.” If the spec says invalid input throws a specific error, test for that error type, not “throws an error.” A sketch follows this list.
- Verify the test by mutating the code. A useful exercise: deliberately break the function (swap two return values, change a constant) and run the tests. If they pass, the tests are too loose. This is “mutation testing” applied informally; a concrete version appears a little further down.
- Be explicit in prompts. When asking AI to write tests, specify: “Use specific value assertions. For each branch, the test should fail if the constant in that branch were changed to any other plausible value.” This is more verbose than a prompt feels like it should need to be, but it changes the output substantially.
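For the contract point, here is a sketch of the difference between “throws an error” and “throws the specified error,” assuming a hypothetical validateOrder function whose spec says unknown tiers raise InvalidTierError:

// Loose: passes if *anything* is thrown, including the wrong error entirely.
test('validateOrder rejects unknown tiers (loose)', () => {
  expect(() => validateOrder({ tier: 'unknown' })).toThrow();
});

// Tight: fails unless the error type named in the spec is thrown.
test('validateOrder rejects unknown tiers (tight)', () => {
  expect(() => validateOrder({ tier: 'unknown' })).toThrow(InvalidTierError);
});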
The last bullet is the highest-leverage habit. Asking the AI for “tests” produces loose tests; asking for “tests that catch specific changes” produces tighter tests. The model can do both; you have to ask.
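To make the informal mutation check concrete with the earlier example: swap the gold and silver rates by hand and rerun the suite. The loose tests stay green; the specific-value tests fail immediately.

// Deliberately broken version of calculateDiscount: gold and silver swapped.
function calculateDiscount(price: number, customer: Customer): number {
  if (customer.tier === 'premium') return price * 0.20;
  if (customer.tier === 'gold') return price * 0.10;   // was 0.15
  if (customer.tier === 'silver') return price * 0.15; // was 0.10
  return 0;
}

// Against this mutant:
//   expect(calculateDiscount(100, { tier: 'gold' })).toBeGreaterThan(0)  still passes
//   expect(calculateDiscount(100, { tier: 'gold' })).toBe(15)            fails, as it should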
A real example
A real test the AI wrote for me last month, on a date-formatting helper:
test('formatDateForDisplay handles dates correctly', () => {
  const result = formatDateForDisplay(new Date('2026-03-15'));
  expect(typeof result).toBe('string');
  expect(result.length).toBeGreaterThan(0);
});
This test passes if the function returns literally any non-empty string. "hello" would pass it. The implementation’s actual behavior — formatting as “March 15, 2026” or “2026-03-15” or “15/03/2026” — is unverified.
After I noticed and re-prompted with “test the specific format we expect”:
test('formatDateForDisplay produces "Month DD, YYYY" format', () => {
  expect(formatDateForDisplay(new Date('2026-03-15'))).toBe('March 15, 2026');
});

test('formatDateForDisplay handles single-digit days correctly', () => {
  expect(formatDateForDisplay(new Date('2026-03-05'))).toBe('March 5, 2026');
});

test('formatDateForDisplay handles end-of-month correctly', () => {
  expect(formatDateForDisplay(new Date('2026-12-31'))).toBe('December 31, 2026');
});
These actually verify behavior. They’ll fail if someone changes the format. They’re useful tests.
The difference is in the prompt. The AI can write either kind. The default is the loose kind.
What teams should actually do
Four habits that catch this in teams:
Code review for tests, not just code. If your team’s PR review focuses on the implementation and skims the tests, AI-generated tests slip through. Make tests as scrutinized as the code.
Regular mutation testing. Even an informal version helps: once a quarter, run a tool like Stryker, or pick a few random functions and break them deliberately to see whether the tests notice. The share of deliberate breaks your tests catch is your real test-quality number.
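For the tool-assisted version, a minimal StrykerJS config is a reasonable starting point. The sketch below assumes a TypeScript project tested with Jest; the globs, runner, and filename are assumptions to adapt to your setup:

// stryker.conf.mjs (sketch; adjust globs and runner to your project)
export default {
  testRunner: 'jest',                    // needs @stryker-mutator/jest-runner installed
  mutate: ['src/**/*.ts'],               // files Stryker will generate mutants for
  reporters: ['clear-text', 'progress'], // print the mutation score to the terminal
  coverageAnalysis: 'perTest',           // only run the tests that cover each mutant
};
// Run with: npx stryker run
// The mutation score it reports (the share of mutants your tests kill) is the number to watch.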
Explicit prompt patterns for tests. Bake the “test the contract specifically” instruction into your team’s prompts. Make it a default, not an afterthought.
Don’t trust coverage as a metric. Coverage tells you which lines were executed by tests. It doesn’t tell you whether the tests actually verify behavior. A 90%-covered codebase with loose tests catches fewer bugs than a 60%-covered codebase with tight tests.
The honest summary
AI-generated tests are a real productivity gain — when the tests catch real bugs. The default output catches fewer bugs than human-written tests, not because the AI is incompetent but because its training distribution skews toward loose assertion patterns and away from the kind of specificity that makes tests useful.
The fix is a few prompt habits and some code-review attention. Teams that adopt these habits get the productivity gain without the bug-detection regression. Teams that don’t adopt them end up with high coverage and roughly the same bug rate as before, which is a worse outcome than either human-written tests or no tests at all, because the green coverage number buys false confidence.
Coverage is a vanity metric without test quality. AI makes coverage cheap. Make sure you’re optimizing for what coverage was supposed to proxy for, not just the number itself.