
Asking Cursor or Cline to write tests for a function takes 10 seconds and produces tests that pass. The output looks like real tests. The coverage report goes green. The team’s CI pipeline is happy.

I still write tests by hand most of the time. Here’s why I stick with the slower approach despite knowing the faster one exists.

Tests are a specification

A test isn’t just code that runs. It’s a statement: “this function, given this input, must produce this output.” The act of writing that statement forces you to decide what “must” means.

When I write a test by hand for a function I’m about to implement, I’m specifying behavior. What should happen on empty input? What should happen on negative numbers? What does the function do at the boundaries? I’m thinking about these before I write code, which means my code reflects the answers.
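For example, a spec-first suite for a hypothetical parseQuantity function (all names here are illustrative) pins those decisions down before a line of implementation exists:

import { describe, it, expect } from 'vitest';
import { parseQuantity } from './parseQuantity'; // hypothetical module, not yet implemented

describe('parseQuantity', () => {
  it('returns 0 for empty input', () => {
    expect(parseQuantity('')).toBe(0);
  });

  it('rejects negative numbers', () => {
    expect(() => parseQuantity('-3')).toThrow(RangeError);
  });

  it('accepts the documented maximum of 10000', () => {
    expect(parseQuantity('10000')).toBe(10000);
  });
});

Each it block is one “must” decision made explicit.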

When I ask AI to generate tests, the AI specifies behavior based on what it thinks is reasonable. Sometimes its idea of reasonable matches mine. Often it doesn’t, in subtle ways. The function gets implemented to satisfy the AI’s specification, which only approximates mine.

The consequence: my code does what the AI thought was right. The deviation from what I wanted is in the gap between AI’s reasonable defaults and my actual intent. Bugs live in that gap.

The “writes a passing test” problem

Plausible tests are the hard category. AI-generated tests are usually plausible. They have the right shape — describe block, it block, expect call. They cover the obvious cases — happy path, simple error case. They look right at a glance.

What they miss varies but tends toward:

Boundary conditions specific to the domain. A function that processes timestamps should handle UTC offsets, leap seconds, daylight saving transitions. AI tests cover “valid timestamp” and maybe “null.” Domain-specific edge cases require domain knowledge.

Semantic edge cases. A function that calculates a total should handle currency rounding the way your business handles it. AI doesn’t know your business’s rounding rules unless you’ve documented them somewhere it can see.

The cases your users actually hit. What inputs does production actually produce? The AI doesn’t know. Tests for the cases that matter come from looking at your production data and understanding which inputs are common, which are rare, and which are dangerous.
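To make the first category concrete, here is the kind of test that only domain knowledge produces: a daylight saving transition case, with illustrative dates and a hypothetical helper.

import { describe, it, expect } from 'vitest';
import { bucketByLocalDay } from './time'; // hypothetical helper

describe('bucketByLocalDay', () => {
  // 2024-03-31 in Europe/Berlin has only 23 hours (spring-forward).
  // An AI-generated suite is unlikely to invent this case unprompted.
  it('keeps both sides of the DST jump in the same local day', () => {
    const beforeJump = '2024-03-31T01:30:00+01:00';
    const afterJump = '2024-03-31T03:30:00+02:00';
    expect(bucketByLocalDay(beforeJump, 'Europe/Berlin'))
      .toBe(bucketByLocalDay(afterJump, 'Europe/Berlin'));
  });
});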

A test suite full of plausible-looking tests is worse than a smaller suite of carefully specified ones. The plausible suite gives false confidence. The CI passes; nobody questions it; the bug ships.

When AI-generated tests work

I do use AI for some tests. The pattern: I write the spec, AI writes the test code.

The flow:

  1. I write a list of behaviors the function should have, in plain language. About 8-15 behaviors for a typical function.
  2. I have AI translate the list into test code.
  3. I review every test for whether it actually checks the behavior I described.
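Concretely, step 1 and its translation might look like this for a hypothetical slugify function (behaviors abbreviated):

// Step 1: my behavior list, in plain language, before any test code.
// 1. Lowercases all input.
// 2. Collapses runs of whitespace into a single hyphen.
// 3. Strips characters outside a-z, 0-9, and hyphen.
// 4. Returns '' when nothing survives.
// ... (a typical function gets 8-15 of these)

import { describe, it, expect } from 'vitest';
import { slugify } from './slugify'; // hypothetical module

// Step 2: the AI's translation of behavior 2. Step 3 is me checking that
// this test really asserts what I wrote, not something adjacent to it.
describe('slugify', () => {
  it('collapses runs of whitespace into a single hyphen', () => {
    expect(slugify('hello   world')).toBe('hello-world');
  });
});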

This works because the thinking part — what should the function do — is mine. The typing part — translating thoughts into Vitest code — is AI’s. The split fits the strengths.

The flow that doesn’t work: “write tests for this function.” I’m outsourcing the thinking, and the AI does its best, and the result is a plausible test suite that doesn’t reflect my intent.

A specific example

Recently I implemented a function that calculates a user’s “engagement score” — a weighted sum of their activity over the past 30 days, with decay.
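For concreteness, the intended shape is roughly this (a simplified sketch; the decay constant is illustrative and this is not the production code):

type Activity = { date: string; weight: number };

const WINDOW_DAYS = 30;
const HALF_LIFE_DAYS = 7; // illustrative; the real constant is tuned

function engagementScore(activities: Activity[], now = new Date()): number {
  return activities.reduce((score, a) => {
    const ageDays = (now.getTime() - new Date(a.date).getTime()) / 86_400_000; // ms per day
    if (ageDays > WINDOW_DAYS) return score; // outside the window
    return score + a.weight * Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
  }, 0);
}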

I asked Cursor to write tests for it. Cursor produced:

import { describe, it, expect } from 'vitest';
import { engagementScore } from './engagementScore';

describe('engagementScore', () => {
  it('returns 0 for no activity', () => {
    expect(engagementScore([])).toBe(0);
  });

  it('returns higher score for more recent activity', () => {
    const score1 = engagementScore([{ date: '2026-03-20', weight: 1 }]);
    const score2 = engagementScore([{ date: '2026-01-20', weight: 1 }]);
    expect(score1).toBeGreaterThan(score2);
  });

  // ... 6 more tests in this style
});

These tests pass. They look reasonable. They miss:

  • What happens for activity older than 30 days? (Should it count? My business says no.)
  • What about activity with negative weight? (We have refund events with negative weight.)
  • Does the score respect a maximum? (We cap at 100 to avoid skewing dashboards.)
  • What about activity with future dates? (We’ve had timezone bugs that produced these.)
  • Does it normalize for new users vs. old users? (We don’t, deliberately, but the test should pin this down.)

These are the cases that actually matter. They come from knowing my business and how the function is used. AI doesn’t know.

I rewrote the tests by hand. Took about 20 minutes. The rewritten suite has 18 tests. Five of them caught real bugs in the implementation that the AI’s tests missed.
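A few of the hand-written cases, simplified to match the sketch’s signature above (dates and values illustrative):

import { describe, it, expect } from 'vitest';
import { engagementScore } from './engagementScore'; // hypothetical path

const now = new Date('2026-03-20T00:00:00Z');

describe('engagementScore, the cases that matter', () => {
  it('ignores activity older than 30 days', () => {
    expect(engagementScore([{ date: '2026-02-01', weight: 5 }], now)).toBe(0);
  });

  it('caps the score at 100', () => {
    const heavy = Array.from({ length: 200 }, () => ({ date: '2026-03-19', weight: 10 }));
    expect(engagementScore(heavy, now)).toBe(100);
  });

  it('ignores future-dated activity', () => {
    expect(engagementScore([{ date: '2026-04-01', weight: 1 }], now)).toBe(0);
  });
});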

The compounding effect

The efficiency argument for AI-generated tests is “saves 5 minutes per function.” The hidden cost is the bugs that ship past plausible-looking tests.

Six months later, one of those bugs bites in production. Investigation takes 4 hours. The fix takes 30 minutes. Customer impact is real.

In my accounting, the 5 minutes “saved” by AI-generated tests cost 4.5 hours later: 270 minutes spent to recover 5. The math is bad even before counting the customer impact.

This is the asymmetric cost that’s hard to see in the moment. AI generates tests fast and you ship faster. The tests don’t catch what they should. You ship bugs you wouldn’t have shipped if you’d written the tests yourself. The bugs surface later, in expensive places.

When I go faster anyway

There are categories where I let AI write tests and don’t worry about it:

Smoke tests. “Does this endpoint return 200?” There’s no specification to outsource; the test is just verifying the basic plumbing.

Type-level tests. TypeScript type tests where the assertion is “this type compiles” or “this type doesn’t compile.” The tests aren’t behavioral; they’re structural.

Tests for code I’m throwing away. Prototype code that’s going in the trash next week doesn’t need careful test specification.

Tests for code I didn’t write and don’t fully understand. When AI writes tests for legacy code I’m trying to refactor, the tests pin down current behavior. That’s useful even if I can’t vouch for what they’re checking — they’re a regression detector.
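For the first two categories, the tests are nearly mechanical. A sketch, with an illustrative endpoint and module path (the type test runs under Vitest’s typecheck mode):

import { describe, it, expect, expectTypeOf } from 'vitest';
import { engagementScore } from './engagementScore'; // hypothetical path

describe('smoke', () => {
  it('health endpoint returns 200', async () => {
    const res = await fetch('http://localhost:3000/health'); // illustrative URL
    expect(res.status).toBe(200);
  });
});

describe('types', () => {
  it('engagementScore returns a number', () => {
    expectTypeOf(engagementScore).returns.toBeNumber();
  });
});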

What I tell new engineers

When I’m onboarding someone new to a codebase, I tell them: write the tests yourself for code you’re writing. Use AI for the typing, not the thinking. The discipline of specifying behavior in test form is one of the highest-leverage things you can do for code quality. Don’t skip it because the AI will skip it for you.

This isn’t a Luddite take. I use AI heavily for other parts of the workflow. Tests are the part where the speed gain is illusory because the speed comes from skipping the work that matters.

For tests specifically: think first, then code (or have AI code for you). If you flip the order, you’re not testing your intent; you’re testing AI’s interpretation of your intent. That’s a worse foundation than no tests at all, because it looks like a foundation.