Extended thinking on Claude Code: where the extra tokens earn their keep
Published 2026-05-11
There’s a class of Claude Code turns where the model gives you an answer that’s technically plausible but missing something — an edge case it didn’t consider, a constraint it underweighted, a path it should have tried before committing to the one it took. Extended thinking is the feature designed for exactly those situations. It’s also frequently enabled on tasks that don’t need it, running up token costs without improving results.
The mechanic is straightforward: before producing its visible response, the model generates a stream of reasoning tokens that only it can see. It works through the problem — tries approaches, rejects some, considers alternatives — and then produces an answer informed by that process. You pay for the thinking tokens at the output rate, but they never appear in your conversation. What you sometimes get access to is a thinking trace: a window into the reasoning the model performed before responding.
Whether you’re billed for thinking tokens you can’t see, and under what conditions the trace is surfaced, depends on how Claude Code exposes the API. The key point is that thinking tokens are real computational work billed at output token rates. Enabling extended thinking on a turn that didn’t need it is equivalent to asking the model to write a rough draft and then throw it away — you paid for the draft.
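For grounding, here is what the mechanic looks like at the API layer. This is a minimal sketch using the Anthropic Python SDK directly; Claude Code's own plumbing may differ, and the model name and token budget are illustrative placeholders rather than recommendations.
# Minimal sketch: enabling extended thinking on a raw Messages API call.
# Assumes the Anthropic Python SDK with ANTHROPIC_API_KEY set; the model
# name and token budget are placeholders, not recommendations.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model name
    max_tokens=4096,  # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{
        "role": "user",
        "content": "Walk through what can go wrong during a rolling "
                   "deploy of this migration, then write it.",
    }],
)

# The response interleaves thinking blocks (the trace) with text blocks
# (the visible answer). Both are billed as output tokens.
for block in response.content:
    if block.type == "thinking":
        print("[trace]", block.thinking[:200])
    elif block.type == "text":
        print("[answer]", block.text[:200])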
The tasks where extended thinking demonstrably helps
Extended thinking changes outcomes on tasks with a specific structure: there are multiple possible approaches, choosing poorly early makes things worse later, and the cost of a wrong assumption compounds as the turn progresses.
Ambiguous or underspecified requests. “Refactor this to be more maintainable” is a sentence that can mean twenty different things. Without extended thinking, the model matches the most statistically likely interpretation and runs with it. With extended thinking, it reasons through the possible interpretations, considers the codebase context, and often surfaces the ambiguity rather than silently resolving it in one direction. The result is either a better-targeted response or a clarifying question that saves you a bad refactor.
Multi-step tasks with ordering dependencies. Migrating a module that other modules depend on, or restructuring a shared type that threads through multiple files — these tasks have a natural ordering where getting step 3 wrong is harder to recover from if step 1 was already committed. Extended thinking lets the model plan the full sequence before touching anything, which surfaces the ordering problem at planning time rather than mid-execution.
“What breaks if we change X” hypotheticals. Impact analysis — understanding the downstream effects of a proposed change — requires holding a mental model of the system and tracing implications through it. This is exactly the kind of back-and-forth reasoning that the thinking phase is structured for. The model can work through the call graph, consider the edge cases, and give you a more complete picture of the risk surface.
Debugging sessions with multiple candidate causes. When a bug could plausibly come from three different places, the thinking phase lets the model evaluate the evidence for each cause and reason to a ranked hypothesis before proposing a fix. Without it, the model commits to the most salient candidate, which is often correct but sometimes wrong for reasons the thinking trace would have caught.
A simple example: asking Claude Code to write a database migration that needs to be backward-compatible and handle partial rollout states. The constraints interact in non-obvious ways, and the thinking phase lets the model reason through the interaction before producing SQL that might look fine on its face but fail under a specific deployment timing.
# A prompt that benefits from extended thinking:
# "This migration needs to handle the case where new code is running
# against the old schema during a rolling deploy. Walk through what
# can go wrong and then write the migration."
#
# vs. a prompt that doesn't:
# "Add a created_at column to the users table."
The second prompt has one correct answer. Thinking tokens spent on it are wasted.
Tasks where thinking is overhead
The flip side is equally important. Extended thinking adds cost without adding quality on tasks where the answer is essentially determined by the specification.
Well-specified file edits. “Rename getUserById to fetchUserById across all files in src/api/.” There is one correct response — do the rename. The model doesn’t need to reason through alternatives. Thinking tokens spent here produce a reasoning trace that says approximately “I will rename the function” and nothing more. You paid for the draft; the draft added nothing.
Mechanical transformations. Converting a list of types from one format to another, generating docstrings for a list of functions, reformatting a config file to a new schema. These are transcription tasks with a clear mapping. The quality ceiling is in the accuracy of execution, not in the depth of reasoning about what to execute.
Boilerplate generation from explicit specs. If you give the model a precise spec — “generate a REST endpoint that accepts these fields, validates them with this schema, and returns this response shape” — extended thinking doesn’t improve the output. The spec already contains all the information the reasoning phase would have surfaced. Thinking tokens spent on a complete spec are wasted tokens.
The threshold that works in practice: if you could write out the complete answer yourself in your head before asking the question, extended thinking probably won’t help. If you’re asking because you’re genuinely unsure how the pieces fit together, it probably will.
There’s also a subtler trap: tasks that feel complex but are actually well-determined. “Add pagination to this endpoint” sounds open-ended but usually has one obvious implementation given your existing patterns. The model doesn’t need to reason about alternatives — it needs to follow the pattern you already use. Framing the task as ambiguous when it isn’t is how thinking tokens get wasted on false complexity.
Cost implications and budget control
Thinking tokens are billed at the output token rate, which is typically several times the input token rate. Enabling extended thinking on a non-trivial turn can multiply the effective cost of that turn by a factor of 2 to 5 compared to the same turn without thinking.
This matters more on Claude Opus than on Sonnet or Haiku, because Opus already carries a higher base token cost. An Opus turn with extended thinking at a 5x thinking multiplier is expensive. An Opus turn without it is expensive but manageable. Running extended thinking uncritically on every turn is the fast path to a surprisingly large bill.
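To make the multiplier concrete, a back-of-the-envelope calculation. The per-token prices below are assumptions for illustration; check current pricing before trusting the absolute numbers.
# Rough turn-cost arithmetic. Prices are illustrative assumptions,
# not current list prices; thinking tokens bill at the output rate.
INPUT_PRICE = 15 / 1_000_000   # $/input token (assumed)
OUTPUT_PRICE = 75 / 1_000_000  # $/output token (assumed)

def turn_cost(input_tokens, visible_output_tokens, thinking_tokens=0):
    billable_output = visible_output_tokens + thinking_tokens
    return input_tokens * INPUT_PRICE + billable_output * OUTPUT_PRICE

plain = turn_cost(20_000, 1_500)                        # ~ $0.41
deep = turn_cost(20_000, 1_500, thinking_tokens=8_000)  # ~ $1.01, ~2.5x

print(f"without thinking: ${plain:.2f}")
print(f"with thinking:    ${deep:.2f} ({deep / plain:.1f}x)")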
The practical controls depend on how the API is exposed through Claude Code’s interface, but the general categories (sketched in code after this list) are:
- Per-turn budget caps — limiting how many thinking tokens can be generated on a single turn. Useful for ensuring that even a “thinking-heavy” turn has a cost ceiling.
- Per-session budgets — total thinking token limits for a session. Forces deliberate prioritization of which turns actually need thinking.
- Model routing — using extended thinking selectively on Opus for planning and analysis turns, while routing mechanical execution turns to Sonnet without thinking enabled.
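A sketch of what the first and third of these look like if you implement them yourself against the raw API. The task categories, budgets, and model names are assumptions chosen to illustrate the routing idea, not settings Claude Code exposes under these names.
# Hypothetical per-turn routing: model choice plus thinking budget keyed
# on task type. Categories, budgets, and model names are assumptions.
import anthropic

client = anthropic.Anthropic()

ROUTES = {
    # kind: (model, thinking budget in tokens; None disables thinking)
    "plan":    ("claude-opus-4-20250514", 8_000),
    "review":  ("claude-opus-4-20250514", 4_000),
    "execute": ("claude-sonnet-4-20250514", None),
}

def run_turn(kind, prompt):
    model, budget = ROUTES[kind]
    kwargs = {}
    if budget is not None:
        # Per-turn cost ceiling: the model can't think past this budget.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return client.messages.create(
        model=model,
        max_tokens=(budget or 0) + 4_096,  # max_tokens must exceed the budget
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
Per-session budgets sit a layer above this: accumulate usage.output_tokens across responses and stop passing the thinking parameter once the session total crosses your cap.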
The architect/editor pattern applies here too. Thinking-enabled Opus handles the turns where reasoning depth changes outcomes. Thinking-disabled Sonnet handles the turns where the plan is already set and execution is the task.
For a realistic multi-file refactor session, a sensible allocation might look like:
- Turn 1 (Opus + thinking): Analyze affected files, identify edge cases, produce migration plan
- Turns 2-8 (Sonnet, no thinking): Execute individual file changes per the plan
- Turn 9 (Opus + thinking): Review completed changes, check for inconsistencies
That pattern pays full thinking costs on two turns out of nine and gets the quality benefit on the turns where it matters.
One thing worth tracking: thinking token usage isn’t always visible in Claude Code’s standard cost display. If your session costs are higher than expected, extended thinking being enabled on more turns than intended is a common cause. Check whether thinking is on by default in your configuration and whether that default makes sense for how you actually use the tool.
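If you want ground truth rather than the cost display, the usage block on each raw API response is the place to look. A rough sketch, assuming SDK access and assuming thinking tokens are counted inside usage.output_tokens, which matches how they're billed:
# Estimate what fraction of a turn's output tokens went to thinking.
# Heuristic only: assumes ~4 characters per token for the visible text.
def thinking_share(response):
    visible_chars = sum(
        len(block.text) for block in response.content if block.type == "text"
    )
    visible_tokens_est = visible_chars / 4  # crude, not exact
    return max(0.0, 1 - visible_tokens_est / response.usage.output_tokens)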
Reading the thinking trace for debugging
When a Claude Code turn produces an answer that’s wrong or incomplete, the thinking trace is the most direct path to understanding why. It shows you the model’s internal reasoning, including the paths it considered and rejected and — critically — the assumptions it made that you never stated explicitly.
The failure mode it most reliably catches: the model formed a correct-looking plan based on a false premise, and the false premise went unexamined in the visible response. In the thinking trace, the premise is visible as an explicit claim. When you read it, you recognize it’s wrong, and the error in the final answer suddenly makes sense.
A few patterns I’ve seen consistently in thinking traces:
The premature commitment. The trace shows the model evaluating two approaches, picking one, and then not revisiting the choice even when later reasoning in the same trace suggests the other might have been better. The visible response reflects the committed approach; the trace shows the model talked itself out of reconsidering. Prompt fix: ask the model to explicitly evaluate both approaches before committing to either.
The invisible assumption. The trace includes a sentence like “assuming X is Y” that’s never stated in the visible response. X is Y in most codebases but not yours. The answer is wrong for exactly that reason. Prompt fix: state the correct value of X explicitly.
The scope creep. The trace shows the model reasoning about a larger change than the prompt requested, deciding to include it “for completeness,” and then including it in the response. The visible response contains more than you asked for. Prompt fix: add an explicit scope constraint (“do not change anything outside of X”).
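Two of these patterns leave lexical fingerprints in the trace. If you're sweeping a long session rather than reading each trace by hand, a throwaway filter can triage them; the marker phrases below are my own heuristics, not anything the model guarantees.
# Flag trace lines that look like unstated premises or silent scope
# expansion. Marker phrases are heuristic assumptions, nothing official.
MARKERS = ("assuming", "presumably", "for completeness",
           "i'll also", "it's likely that", "probably uses")

def flag_trace(response):
    for block in response.content:
        if block.type != "thinking":
            continue
        for line in block.thinking.splitlines():
            if any(marker in line.lower() for marker in MARKERS):
                print("FLAG:", line.strip())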
Using the trace to second-guess the final answer is usually counterproductive. If the reasoning is sound but you disagree with the conclusion, the right move is to argue with the model, not to pick a different answer from the trace. The trace is for understanding why the model went wrong, not for cherry-picking an intermediate state you prefer.
The other use for thinking traces is prompt iteration. If the model keeps getting a category of task wrong and you can’t figure out why from the visible responses, enabling extended thinking and reading the traces often reveals the misunderstanding quickly. Once you see it, you can fix it once in the prompt rather than correcting outputs case by case.
Where this is headed
The thinking token budget is a dial, not a switch. As Claude Code develops, the expectation is that this dial becomes more granular — routing specific task types to specific budget levels, and eventually letting the model itself determine when additional reasoning would change its answer.
That last capability, sometimes called adaptive thinking, is where the cost picture gets genuinely interesting: pay for thinking only when thinking would have changed the output. Until that’s reliable, the manual approach — treating thinking as a tool to reach for on specific classes of hard problems, and leaving it off for everything else — is the one that keeps quality high and costs reasonable.
The teams getting good results with extended thinking aren’t enabling it everywhere and watching the quality ceiling rise. They’re enabling it selectively on the turns that actually need it and staying disciplined about the rest.