Tinker AI
Read reviews
5 min read Editor

Most AI coding tools advertise some version of “automatic debugging.” You point the agent at a failing test or an error message, and it iterates until the issue is fixed. The marketing is compelling. The actual experience is mostly frustrating.

I’ve used autonomous debugging in Cline, Cursor’s agent, Windsurf Cascade, and Aider over the past year. Across maybe 200 debugging sessions, I’d estimate the AI succeeded autonomously about 30% of the time. The other 70% are interesting in their failure modes.

What autonomous debugging promises

The pattern: you have a failing test or a stack trace. Instead of investigating yourself, you tell the AI “this is failing, fix it.” The AI:

  1. Reads the failure
  2. Looks at the relevant code
  3. Forms a hypothesis
  4. Tries a fix
  5. Re-runs the test
  6. If still failing, refines the hypothesis and tries again

When this works, it’s magic. The AI walks through the steps you would have walked through, but faster. You come back from getting coffee and the test is green.

When it doesn’t work, it’s worse than starting over. The AI’s iterations leave the codebase in a half-changed state. Reverting to the original is messy. The AI’s hypotheses are encoded in commits or chat history that you have to wade through to understand what was tried.

The 30% where it works

The cases where autonomous debugging succeeds are predictable:

Simple errors with clear messages. TypeScript “property X does not exist on type Y” — the fix is obvious to a human and obvious to the AI. The AI tries it, the error is gone.

Failed tests with explicit assertions. “Expected 5, got 4” — if the function should return 5, the AI usually finds the off-by-one.

Errors caused by recent changes. When the failure is in code the AI just wrote, the AI usually fixes it. The model has the context for the change in mind.

Pattern-match bugs. “This async function isn’t awaiting somewhere; tests are flaky.” The AI scans for missing awaits and fixes them.

Linting and type errors. Mechanical fixes that don’t require understanding intent.

For these, autonomous debugging is genuinely faster than manual debugging. Use it.

The 70% where it doesn’t

The failure modes worth knowing about:

Wrong hypothesis, looks plausible. The AI generates a hypothesis (“the cache is stale”), implements a fix (“invalidate the cache before the test”), and the test passes. The cache wasn’t actually the problem; the test passes because the AI’s “fix” inadvertently changed behavior in a way that papers over the real bug. You’ve now shipped a worse codebase with a hidden bug.

Logic bug, not a bug-bug. Some test failures are because the test is wrong, not the code. The AI doesn’t usually consider this and tries to “fix” the code. It might succeed by changing the code to match the test’s wrong expectation, which is wrong.

Subtle invariant violation. Bugs in stateful systems where the invariant is implicit in the code. The AI sees the symptom (a test fails) but not the invariant. It changes things until the symptom goes away. Often it goes away because the AI “fixed” something that doesn’t actually fix the invariant.

Race condition. The AI sees an intermittent failure and adds retry logic, increasing timeout, or running the test in isolation. The race still exists; the symptom is now hidden.

Cross-system bug. The failure is in service A because service B is misbehaving. The AI looks at A, “fixes” something in A that paper-overs the actual issue. Service B is still broken.

In all these cases, the AI’s iteration converges on “test passes” rather than “bug is fixed.” The two are different. Tests can pass for the wrong reasons.

The expensive failure mode

The failure mode that costs the most is “AI partially fixes the bug but introduces a different one.” The original symptom goes away; a new symptom appears later. By “later” I mean often weeks later, when someone hits the new edge case in production.

Investigating that new bug requires re-tracing the AI’s iteration. The AI’s commits don’t explain the reasoning behind each change; the chat history is large and meandering. You’re effectively reverse-engineering what the AI was thinking, which is harder than just thinking it through yourself the first time.

This is the asymmetric cost. Autonomous debugging that works saves you 20 minutes. Autonomous debugging that fails badly costs you 4-8 hours later. If 30% works and 70% fails, even with mixed-severity failures, the math is unfavorable for non-trivial bugs.

When I use it anyway

Despite the failure rate, I do use autonomous debugging in specific cases:

For bugs I’d consider trivial. TypeScript errors, missing imports, obvious typos. The AI fixes these in seconds. Not worth thinking about.

As a first attempt before I dig in. I let the AI try once. If it succeeds in 1-2 iterations, great. If it doesn’t, I take over. I don’t let the AI iterate more than twice without me reviewing.

For bugs in code I just wrote. The AI has context. The bug is usually a small thing. Autonomous fix works most of the time.

For lint and CI failures. These are mechanical. The AI handles them well.

The limit I impose: autonomous debugging is a quick attempt, not a strategy. If it doesn’t work in two iterations, I take over.

What I do instead for hard bugs

For real bugs, the workflow is:

  1. Reproduce the bug myself, in isolation if possible
  2. Form my own hypothesis about the cause
  3. Use AI to test the hypothesis (write a probe, check a value, etc.) — not to fix
  4. Once I understand the cause, decide on a fix
  5. Implement the fix (myself or with AI help, depending)

The AI is a research tool here, not an autonomous fixer. It accelerates the investigation without trying to short-circuit the thinking. This is the workflow that produces real fixes, not symptom suppressions.

The pattern I’ve internalized: AI is great at executing a plan I’ve made. AI is bad at making the plan when the plan requires understanding causality in a complex system.

What the marketing should say

Honest marketing for autonomous debugging would be:

“AI debugging works for routine errors with clear messages. For complex bugs, AI iteration produces plausible-looking fixes that may not address the underlying cause. Use AI as a research aid for non-trivial bugs, not as an autonomous fixer.”

This is, of course, not the marketing. The marketing emphasizes the magical experience of pointing AI at a bug and walking away. That experience exists. It exists for ~30% of bugs. For the other 70%, the AI produces something that looks like a fix but isn’t.

The tool vendors aren’t lying. They’re highlighting the success cases. The buyer needs to know about the failure modes that the marketing doesn’t mention.

A take I’ll defend

Autonomous debugging is the feature that needs the most user discipline. With other AI features (autocomplete, code generation, refactoring help), wrong output is usually obvious — the code looks weird, the tests fail, the type checker complains. With autonomous debugging, wrong output is hidden — the test passes, the symptom is gone, you don’t see the new bug until later.

If you’re using autonomous debugging without a discipline of “check whether the fix actually addresses the bug or just hides it,” you’re shipping technical debt without realizing it. The cost will appear in the future, in incidents you have to debug, in customers reporting issues that should have been caught.

The marketing won’t tell you this. You have to know.