Tinker AI

The single most cited statistic about AI coding tools is some flavor of “40% faster” — sometimes 25%, sometimes 55%, but always a big number from a survey or a controlled study. Internal teams measuring their own usage often report numbers in the same range. Vendors quote them. Engineering leaders cite them in budget discussions.

I think most of these numbers are wrong, in a specific way that’s easy to miss and hard to fix.

The methodology problem

The studies fall into a few categories, and each has a structural issue:

Vendor surveys. Self-reported productivity gains from people who use the product. Selection bias is enormous: people who don’t find it useful stop using it and aren’t in the survey.

Controlled studies. Often involve completing pre-defined tasks (write a function, fix a known bug) under timed conditions. These tasks are unrepresentative of the real distribution of work — they’re skewed toward greenfield, well-specified problems where AI tools genuinely help most.

Internal team metrics. Usually one of: PR throughput, lines of code, story points delivered. All of these have well-known issues as proxies for actual value, and they’re particularly fragile in the AI context where verbose AI-generated code can inflate the numbers without inflating the value.

None of these are useless. All of them overstate the gains.

What I actually measured on myself

For three months, I tracked my own work in a way I hoped would be more honest:

  • Tasks tracked: 67 tasks across the period
  • Categories: greenfield features, bug fixes, refactors, test coverage, documentation, code review
  • What I recorded: start time, end time, AI tool used (or none), my subjective difficulty rating (1-5) at the start, whether the work shipped (see the record sketch just after this list)
  • What I tried to control for: task complexity (via the difficulty rating), my own energy level (via time of day), the codebase (single project throughout)
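
Concretely, every entry boiled down to those fields. If you want something to copy, here is a minimal sketch of a record format that captures them; the names and types are illustrative, not what any particular tool requires.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class TaskLogEntry:
        task_id: str          # short slug, e.g. "fix-login-redirect"
        category: str         # one of the task categories above
        difficulty: int       # subjective 1-5, rated before starting
        ai_tool: str | None   # tool name, or None for unassisted work
        started_at: datetime
        ended_at: datetime
        shipped: bool         # did the work actually land?

        @property
        def minutes(self) -> float:
            return (self.ended_at - self.started_at).total_seconds() / 60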

This isn’t a study. It’s a self-experiment with a sample size of one. But the numbers were less impressive than the surveys imply, and the breakdown by task type matters more than the overall average.

The breakdown

After normalizing for difficulty:

Task type                              n    AI gain      High-confidence range
Greenfield features (well-specified)   14   -32% time    -25% to -38%
Test coverage (existing patterns)      11   -41% time    -35% to -48%
Boilerplate (CRUD, scaffolds)           9   -45% time    -40% to -52%
Bug fixes (clear repro)                12   -18% time    -10% to -25%
Bug fixes (vague symptom)               8    -3% time    -10% to +5% (noise)
Refactors (clean, mechanical)           6   -12% time    -5% to -20%
Refactors (subtle invariants)           4    +8% time    -5% to +20%
Code review of others’ code             3    -2% time    noise

Overall weighted average across all 67 tasks: about -18% time. Finishing the same work in roughly 82% of the time works out to about a 22% productivity gain.

That’s real. It’s also roughly half of what the headline vendor numbers claim, and the average hides a more important pattern: the gains are concentrated in 3 of the 8 task categories and are much smaller, or indistinguishable from noise, in the other 5.
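
If you want to run the same arithmetic on your own log, here is a minimal sketch of one way to get the per-category numbers from entries in the format sketched earlier. The normalization scheme, comparing AI-assisted tasks to unassisted tasks at the same difficulty rating, is one reasonable choice rather than the only one.

    from collections import defaultdict
    from statistics import mean

    def per_category_time_saved(entries):
        """Fraction of time saved with AI per category, normalized by difficulty.

        Within each (category, difficulty) bucket, compare the mean duration of
        AI-assisted tasks to unassisted tasks; buckets missing either kind are skipped.
        """
        buckets = defaultdict(lambda: {"ai": [], "no_ai": []})
        for e in entries:
            key = (e.category, e.difficulty)
            buckets[key]["ai" if e.ai_tool else "no_ai"].append(e.minutes)

        savings = defaultdict(list)
        for (category, _), times in buckets.items():
            if times["ai"] and times["no_ai"]:
                saved = 1 - mean(times["ai"]) / mean(times["no_ai"])
                savings[category].append(saved)   # 0.32 means 32% less time

        return {category: mean(vals) for category, vals in savings.items()}

The overall figure is then just the task-count-weighted average of those per-category savings.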

What this means

For my work, AI tools are unambiguously useful for specific things:

  • Writing tests against existing patterns
  • Generating boilerplate that follows conventions
  • Scaffolding greenfield features with clear specs

Outside those zones, the productivity story is murky. Bug fixes get a small lift when the bug is well-understood, no lift when it’s vague. Refactors are mostly a wash; subtle ones can actually be slower because the AI introduces inconsistencies you have to spot.

If I had a job that was 80% the categories where AI helps and 20% the categories where it doesn’t, my real productivity gain would be closer to 35%. If I had a job that was 80% the murky categories and 20% the high-gain ones, my gain would be more like 8%.
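
The arithmetic behind that is a weighted average. Here it is as a generic helper rather than my exact numbers, so you can plug in your own category shares and measured savings.

    def blended_productivity_gain(mix, time_saved):
        """Turn per-category time savings into one throughput gain for a role.

        mix:        {category: share of working time}, shares summing to 1.0
        time_saved: {category: fraction of time saved with AI, e.g. 0.32}
        """
        saved = sum(share * time_saved.get(category, 0.0)
                    for category, share in mix.items())
        return saved / (1 - saved)   # 18% less time works out to ~22% more throughput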

The vendor average doesn’t apply uniformly across roles. The role mix matters more than the tool.

The cost side, which surveys ignore

The “ROI” question has a denominator. Most claims about productivity gains assume the cost is just the subscription price. The actual cost includes:

Cognitive overhead of reviewing AI output. Every AI-generated suggestion needs review. For greenfield code I’m familiar with, this is fast. For code I’m unsure about, the review is itself a chunk of work, and it’s harder than reading my own code because the AI’s logic isn’t constrained by my mental model.

Time spent on prompt iteration. When the first prompt doesn’t produce useful output, you re-prompt. Each retry is wall-clock time even when the API is fast. The 90-second iteration loop is real.

Skill atrophy in unmeasured ways. This one is hard to quantify. After three months of heavy AI use, I noticed I was less fluent at writing certain patterns I used to write reflexively. This isn’t a productivity loss now; it might be later. Surveys can’t measure it.

Context-switching cost. Switching between “I’m thinking” and “I’m reviewing AI output” is a different kind of interruption than staying in typing flow. It’s not exhausting, but it’s not free.

If I conservatively put the combined cost of these at 8-10 percentage points, the net productivity benefit drops from 22% to closer to 12-14%. Still useful, less impressive.

How to measure on yourself

If you want to do something like this for your own team:

Track time on tasks for a month without AI, then a month with. Same person, similar mix of work. Use a low-overhead timer; don’t make tracking the work part of the work.

Categorize tasks by type. The breakdown is more informative than the average. If your work mix doesn’t match someone else’s, neither do the gains.

Include the cost side. Track time spent reviewing AI output and time spent on retries, not just task-completion time. Net it out.
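
To make “net it out” concrete, here is a per-task sketch that assumes you log review and re-prompting minutes alongside the other fields; both are additions to the earlier record format.

    def net_minutes_saved(baseline_minutes, writing_minutes,
                          review_minutes, retry_minutes):
        """Net saving on one AI-assisted task versus an unassisted baseline.

        baseline_minutes: what a comparable unassisted task takes end to end
        writing_minutes:  hands-on time excluding AI review and re-prompting
        review_minutes:   time spent reading and checking AI output
        retry_minutes:    time spent re-prompting after unusable output
        """
        total_with_ai = writing_minutes + review_minutes + retry_minutes
        return baseline_minutes - total_with_ai   # can go negative: the tool cost you time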

Don’t trust subjective recall. “I feel like Cursor saves me 30%” is wrong about half the time when actually measured. Track the data, not the impression.

Sanity-check against shipped value. Did 22% time savings translate into 22% more shipped features? Often it doesn’t, because saved time gets absorbed into longer reviews, more polish, or just slower pace. Time-saving isn’t the same as output-increasing.

The decision framework

For most teams, the question isn’t “do AI tools provide gains?” — they do, modestly. The question is “are the gains worth the cost?”

At $20/seat/month for a team of 30, that’s $7,200/year. If your team’s productivity gain sits in the middle of the range for its work mix (say, 10-15% on average across task types), and each developer spends something like 60-80 hours a month on the coding work that gain applies to, that’s roughly 7-10 hours of saved engineering time per developer per month, or 90-120 hours per year per developer. Even at modest fully-loaded engineering rates, that’s $5,000+ per developer per year, well above the seat cost.
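
Here is a back-of-envelope version of that calculation. The seat price, team size, monthly coding-hours base, gain, and loaded hourly rate are all assumptions to swap for your own numbers.

    def seat_roi(team_size=30, seat_per_month=20.0, coding_hours_per_month=70,
                 gain=0.12, loaded_rate_per_hour=75.0):
        """Annual seat cost versus the annual value of saved engineering time."""
        seat_cost = team_size * seat_per_month * 12                 # $7,200/year for 30 seats
        hours_saved_per_dev = coding_hours_per_month * gain * 12    # ~100 hours/year at the defaults
        value_per_dev = hours_saved_per_dev * loaded_rate_per_hour  # ~$7,560/year at $75/hour
        return seat_cost, hours_saved_per_dev, value_per_dev

    cost, hours, value = seat_roi()
    print(f"Seats: ${cost:,.0f}/yr   saved: {hours:.0f} h/dev/yr   value: ${value:,.0f}/dev/yr")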

The math works. It works less impressively than the marketing claims, and it works only if the time savings actually translate into shipped output rather than evaporating into reduced urgency. That’s the part teams should watch, and it’s the part that doesn’t show up in any survey.

The honest framing

AI coding tools save some time on some tasks, modestly, with some cognitive overhead. That’s enough to be worth using. It’s not enough to justify the breathless claims.

The interesting questions aren’t about average productivity. They’re about which tasks gain most, which gain little, what the cost side actually is, and how the gains translate (or don’t) into team-level output.

If you’re a leader making a buying decision, treat the marketing numbers as upper bounds. Measure your own team. The answer will be smaller than the brochure and probably still positive. Both of those things matter.