
The systematic-debugging skill: when the agent keeps proposing the wrong fix

Published 2026-05-11 by Owner

There is a specific pattern Claude Code falls into with non-trivial bugs. The agent reads the error, proposes a one-line fix, applies it, runs the test, sees the same error in a slightly different form, and proposes another one-line fix. After three rounds of this, the code has changed in three places, nothing is fixed, and the diff is a mess of half-measures that don’t connect to each other.

The underlying problem is that the agent is treating symptoms as causes. It sees TypeError: Cannot read properties of undefined and patches the undefined reference. It does not ask why the reference is undefined. The fix lands; the bug mutates and resurfaces.

This pattern has a specific name in the Superpowers skill library: the guess loop. The agent is not debugging — it is running experiments with no hypothesis. Each experiment changes the system slightly, the result is ambiguous, and the agent generates the next experiment based on the changed system. Progress is illusory; the underlying cause is unchanged.

The systematic-debugging Superpowers skill is a structured four-phase protocol that prevents this. The mechanism is an explicit gate: the agent cannot move to a fix until it has stated a root cause and given reasoning for it. Skipping ahead is not a move the protocol permits; the skill withholds the implementation turn until a hypothesis is on the record.

The Iron Law: no fix without a root cause

The Iron Law is the principle the skill enforces. Stated plainly: every bug has a root cause. Fixing anything other than the root cause produces a symptom fix. Symptom fixes recur.

This sounds obvious. In practice, agents violate it constantly because their training rewards producing a plausible-looking fix quickly. The training data — millions of GitHub commits, Stack Overflow answers, and tutorial code — is full of band-aids presented as solutions. The model learned from that.

The Iron Law resets the incentive. When the skill is active, a “try this and see” response is not a valid output. The agent must commit to a root cause before it can propose an implementation. This is uncomfortable for the model because it forecloses the easy path, but it’s what makes the fix durable.

The four phases

The skill structures debugging into four explicit phases. Each phase has a specific job. Phases must complete in order.

INVESTIGATE — gather observable facts about the failure. What is the error message, exactly? What is the stack trace? What are the inputs that trigger the bug? What are the conditions under which it does not appear? The goal is a complete, written-down picture of the failure. No hypotheses yet — just evidence.

ANALYZE — read the evidence. Trace through the relevant code. Identify the system that is failing: which module, which function, which data flow. The goal is to know where in the codebase the bug lives and what the code at that location is actually doing (as opposed to what it should be doing).

HYPOTHESIZE — propose a specific root cause with reasoning. Not “maybe the state is wrong” — that is a guess, not a hypothesis. A hypothesis is: “The state is wrong because fetchUser resolves before the auth token is set, so the request returns a 401, the response is discarded silently, and the component renders with an empty user object instead of falling back.” The hypothesis names a mechanism, not a symptom. A code sketch of this failure mode follows the phase list.

IMPLEMENT — fix the root cause identified in the hypothesis. If the hypothesis is wrong, the fix will not work and the investigation resumes. This is fine; the point is that the implementation is always targeted at an identified cause, not at whatever looks wrong at the surface.
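
To make the bar concrete, here is a minimal sketch of the race described in the HYPOTHESIZE example. Everything in it is hypothetical, written only to illustrate the causal chain: fetchUser, loadTokenFromStorage, and render are invented names, not any real API.

    // Hypothetical TypeScript sketch of the race in the HYPOTHESIZE example.
    // The request fires before the auth token exists; the 401 is swallowed,
    // so the caller renders an empty user instead of surfacing an error.

    const loadTokenFromStorage = async (): Promise<string> => "stored-token";
    const render = (user: { name: string }): void => console.log("render:", user);

    let authToken: string | null = null;

    async function fetchUser(): Promise<{ name: string }> {
      const res = await fetch("/api/user", {
        headers: authToken ? { Authorization: `Bearer ${authToken}` } : {},
      });
      if (!res.ok) return { name: "" }; // the silent discard that hides the 401
      return res.json();
    }

    async function init(): Promise<void> {
      const userPromise = fetchUser();          // starts with authToken still null
      authToken = await loadTokenFromStorage(); // resolves after the request left
      render(await userPromise);                // empty user object, no error
    }

    // A symptom fix would null-check the user inside render(). The root-cause
    // fix reorders init(): await the token first, then call fetchUser().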

The four-phase structure turns debugging from a loop of random changes into a directed investigation. Each phase terminates with a clear deliverable. The agent cannot plausibly skip HYPOTHESIZE and jump to IMPLEMENT — the skill will not produce an implementation turn without a stated hypothesis.
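
One way to picture that enforcement is as a small state machine with a gated transition. This is an illustrative sketch, not the skill’s actual implementation; only the four phase names come from the skill.

    // Illustrative sketch of the phase gate; not the skill's real code.
    type Phase = "INVESTIGATE" | "ANALYZE" | "HYPOTHESIZE" | "IMPLEMENT";

    const ORDER: Phase[] = ["INVESTIGATE", "ANALYZE", "HYPOTHESIZE", "IMPLEMENT"];

    interface DebugState {
      phase: Phase;
      evidence: string[];   // INVESTIGATE deliverable: observed facts
      location?: string;    // ANALYZE deliverable: where the bug lives
      hypothesis?: string;  // HYPOTHESIZE deliverable: a stated mechanism
    }

    function advance(state: DebugState): DebugState {
      const next = ORDER[ORDER.indexOf(state.phase) + 1];
      if (!next) throw new Error("Already in IMPLEMENT");
      // The gate: no implementation turn without a stated hypothesis.
      if (next === "IMPLEMENT" && !state.hypothesis) {
        throw new Error("Cannot enter IMPLEMENT without a root-cause hypothesis");
      }
      if (next === "HYPOTHESIZE" && !state.location) {
        throw new Error("Cannot enter HYPOTHESIZE before ANALYZE locates the bug");
      }
      return { ...state, phase: next };
    }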

One thing worth noting about HYPOTHESIZE specifically: the quality of the hypothesis matters more than the speed at which it is produced. A weak hypothesis — “maybe the state is stale” — is not sufficient. The required form is a complete causal chain: what is happening, why it is happening, and why the system is set up to allow it to happen. Writing that out forces the agent to confront whether the analysis actually supports the hypothesis or whether the analysis is incomplete.

When the hypothesis is right, the IMPLEMENT step becomes nearly mechanical. The agent knows the root cause, knows where it lives in the code, and writes a targeted fix. When the hypothesis is wrong, the failed fix is diagnostic: the investigation re-enters ANALYZE with new evidence from the failure.

The tells that the agent is guessing

Three behaviors signal that the agent is in symptom-fix mode rather than root-cause mode:

One-line fixes for multi-step bugs. If the agent’s proposed fix is a null check, a type cast, or a default value, but the error is appearing inside a complex async flow, the agent is patching a surface symptom. The actual bug is in the flow, not in the null.

“Try this and see if it works.” This is the clearest signal. A root-cause fix does not need empirical verification of whether it works — you know it works because you know why it works. “Try this” means the agent is guessing.

Changes without explanation of mechanism. If the agent edits code without explaining what causal link the edit breaks — why this change prevents the specific failure — the edit is a guess. Correct fixes come with a causal narrative.

When any of these appear, the debugging session has left root-cause territory. The right move is to restart from INVESTIGATE, not to apply the guess and see what happens.

A bug walkthrough

This is a condensed version of a real debugging session where the systematic-debugging skill changed the outcome.

The bug: a Next.js API route that performs database writes was occasionally returning a 500 error, but only in production. Local development was clean. Error logs showed Cannot read properties of undefined (reading 'id') but the stack trace was truncated in the production logging setup.

The agent’s first three guesses (without the skill):

Guess 1: "The user object might be undefined. Add a check:
if (!user) return res.status(401).json({ error: 'Unauthorized' })"

Applied. Bug still occurs.

Guess 2: "The database connection might be timing out. Add a retry:
await db.connect({ retry: true })"

Applied. Bug still occurs.

Guess 3: "The body parser might not be running. Add:
export const config = { api: { bodyParser: true } }"

Applied. Bug still occurs. Now three changes in the codebase, none of
them connected to each other, and the bug is unchanged.

At this point the diff has noise in it. Rolling back means removing three changes that each seemed plausible when they were made.

What the skill forced (back to INVESTIGATE):

The skill required gathering facts before proposing anything. That meant:

  • Reproducing in a staging environment with full logging enabled
  • Reading the full error, not the truncated version
  • Tracing which specific request triggered the 500

With full logging, the error was Cannot read properties of undefined (reading 'id') on the line that reads req.session.user.id. The stack trace pointed to the session middleware.

ANALYZE: The session middleware came from a session library whose behavior depends on the runtime. In production, the route was deployed to Vercel’s Edge Runtime; in development, it ran in Node.js. The library’s API differed between the two: in Edge Runtime, req.session was populated asynchronously and required an explicit await that was not in the code.

HYPOTHESIZE: The root cause is that req.session is not awaited before use in Edge Runtime. In that context the session is a pending promise, not a resolved object, so req.session.user evaluates to undefined (a promise has no user property), and reading .id on that undefined value produces the exact error observed. This does not occur locally because the Node.js runtime populates the session synchronously.

IMPLEMENT: Add await on the session read, with an environment check to avoid breaking local development.
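
In code, the shape of that fix is roughly the following. The sketch assumes the runtime-split session API the walkthrough describes, where req.session is a pending promise in Edge Runtime and a plain object in Node.js; the helper name is invented, and the NEXT_RUNTIME check is one plausible way to express the environment test.

    // Sketch only: assumes the session API described above, not a real library.
    type Session = { user: { id: string } };

    // Before (throws in Edge Runtime, where req.session is still a promise):
    //   const userId = req.session.user.id;

    async function readSession(
      req: { session: Session | Promise<Session> }
    ): Promise<Session> {
      // NEXT_RUNTIME is "edge" or "nodejs" in Next.js server code.
      return process.env.NEXT_RUNTIME === "edge"
        ? await req.session           // Edge: session arrives asynchronously
        : (req.session as Session);   // Node: already a resolved object
    }

Since awaiting a non-promise value is a no-op in JavaScript, an unconditional await would also have worked; the explicit branch mirrors the walkthrough’s concern about leaving local development behavior untouched.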

One change. The bug was gone. The three earlier changes were reverted: none of them was wrong exactly, just irrelevant.

The time from “start INVESTIGATE” to “bug fixed” was about 35 minutes. The guessing loop had already consumed 25 minutes without progress.

When this is overkill

The systematic-debugging skill is overhead for a category of bugs where the root cause is immediately visible.

A typo in a variable name is not a debugging problem — it is a reading problem. The agent sees userNaem and knows the fix. Running through four phases of a structured protocol for that is theater.

Clearly-wrong code is similar. If a function is supposed to return an array and is returning undefined because the return statement is missing, the cause is visible in the first read. No investigation is needed.

The useful threshold: if the bug requires understanding interactions between two or more pieces of code, or if it behaves differently across environments, or if the error message does not point directly at the cause, the skill is warranted. A bug that is confusing enough that you would reach for “try this and see” is a bug that needs the four phases.

For single-file, obvious-cause bugs, skip the skill and fix directly. The protocol is not valuable as a ritual; it is valuable as a guard against the specific failure mode of treating symptoms as causes.

What changes with the skill active

The practical change is that the agent’s first response to a bug report is a set of questions and observations rather than a proposed fix. It asks for the full error, the stack trace, the steps to reproduce, and what the expected behavior is. This takes one extra turn.

That turn is where most debugging sessions are actually won or lost. A session that starts with full information produces a correct hypothesis quickly. A session that starts with “here’s the error message” and immediately proposes a fix produces the guess loop.

The skill’s value is concentrated in two places: forcing the full-information gather before any hypothesis, and forcing the hypothesis to be a mechanism rather than a symptom description. Everything else follows from those two gates.

There is a secondary benefit that shows up over time. When the agent explains the causal chain for a bug fix, that explanation is reviewable. You can read the hypothesis, check whether the evidence supports it, and catch cases where the reasoning is weak before the change lands. With guessing-mode fixes, there is nothing to review — the change is presented as a fact, not as a reasoned conclusion.

The systematic-debugging skill is part of a broader class of Superpowers skills that impose structured protocols on tasks where agents tend to take shortcuts. It is most valuable exactly when the temptation to shortcut is highest — when you are under pressure, the bug is urgent, and “try this” sounds faster than “investigate first.”

It is not faster. It only feels faster.

Bugs fixed by root cause do not come back. That’s the payoff — not just the individual resolution, but the reduction in the class of recurring bugs that were never actually resolved, only masked.