Anthropic added per-request configuration for Claude’s extended thinking mode. Tools and developers can now specify, via a token budget, how much “extended thinking” the model does per request, trading latency for reasoning depth.
For AI coding tools, this is a meaningful capability. Different tasks benefit from different reasoning depths.
What thinking mode is
When Claude generates a response, it can produce internal reasoning before the visible output. This reasoning (“thinking”) helps with complex tasks but adds latency.
Before this release, thinking was either on (slow, higher quality) or off (fast, lower quality). The new API lets you specify a thinking budget per request:
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # extended thinking requires Claude 3.7 Sonnet or later
    max_tokens=16000,  # must be larger than budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 5000,
    },
    # ... rest of request
)
The budget is in tokens. Higher budgets allow deeper reasoning. The model chooses how to use the budget; it may use less if the task is simple.
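As a sketch, a small helper can validate and build the thinking parameter before a request. The 1024-token floor reflects the API’s documented minimum for budget_tokens; treat the exact value as an assumption.

```python
def thinking_params(budget_tokens: int) -> dict:
    """Build the per-request thinking config.

    Assumes a 1024-token minimum budget, per the API's documented lower bound.
    """
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1024")
    return {"type": "enabled", "budget_tokens": budget_tokens}
```

A tool would pass the returned dict as the `thinking` argument of `messages.create`.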
What this changes for tools
For tools using Claude:
Cline’s Plan mode could use longer thinking. Plan generation benefits from thoroughness, and a higher thinking budget produces better plans without changing Act-mode execution.
Cursor’s chat could match thinking to task type. Quick questions: low budget. Architecture questions: high budget. The tool routes appropriately.
Aider’s architect mode could use deeper thinking. Architect tasks benefit from reasoning. The editor model doesn’t need it.
Generic prompts could be tuned. “Find the bug in this code” benefits from thinking; “format this JSON” doesn’t.
The key is that tools can now make these decisions per request rather than choosing one mode globally.
Cost implications
Thinking tokens are billed as output tokens. A response that uses 5000 thinking tokens costs 5000 × the per-token output price more than a response without thinking.
For Claude 3.7 Sonnet at $15 per million output tokens, that’s about $0.075 extra per response with a fully used 5000-token thinking budget.
For interactive use, this is small per request. For high-volume use, it adds up. Tools need to think about when thinking is worth the cost.
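The worst-case extra cost is simple arithmetic, assuming the full budget is consumed and an output price of ~$15 per million tokens (the price constant is an assumption to be checked against current pricing):

```python
OUTPUT_PRICE_PER_MTOK = 15.00  # assumed $/M output tokens; verify against current pricing

def thinking_cost(budget_tokens: int) -> float:
    """Worst-case extra cost in dollars if the full thinking budget is consumed."""
    return budget_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
```

In practice the model may use less than the budget, so this is an upper bound per request.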
A specific use case
For Cline’s Plan mode specifically, a higher thinking budget should produce noticeably better plans on complex tasks.
I tested a refactor task with two configurations:
Default (no extended thinking):
- Plan generated quickly
- Plan covered ~70% of the relevant cases
- Some files affected weren’t in the plan
- Iteration to fix the plan: 2 turns
Extended thinking (5000 tokens):
- Plan took ~15 seconds longer
- Plan covered ~95% of the relevant cases
- All affected files identified
- No iteration needed
For a one-off plan, the 15 seconds and ~$0.07 cost are worth saving 2 iteration turns ($0.30+ in tokens, plus my time).
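The break-even reasoning above can be made explicit. The numbers (thinking cost, iteration cost) are the rough estimates from this test, not measured constants:

```python
def thinking_pays_off(extra_thinking_cost: float,
                      iterations_saved: int,
                      cost_per_iteration: float) -> bool:
    """Rough break-even check: does the thinking spend beat the iteration cost it avoids?

    Ignores the value of saved human time, which usually tips the balance further.
    """
    return iterations_saved * cost_per_iteration > extra_thinking_cost
```

With the estimates from this test (~$0.07 of thinking vs. two iteration turns at roughly $0.15 each), the budget pays for itself on token cost alone.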
What’s not yet there
A few things this release doesn’t include:
Auto-routing. The user (or tool) has to decide when to use thinking. An automatic “this looks complex, use thinking” router would help.
Visible thinking. The thinking happens but isn’t shown by default. Some users would want to see the reasoning.
Per-step thinking control. In multi-turn conversations, every turn uses the same thinking config. Per-turn override would help.
Streaming thinking. Currently thinking happens before the visible output. Streaming would let users see progress.
These are reasonable next steps. The current release is the foundation.
Implications for tooling
Tool developers will need to make decisions:
Default thinking budget. What’s the default for most requests?
Per-feature configuration. Should plan mode use more thinking? Should commit message generation use less?
Cost surfacing. Should users see when thinking is happening (and being charged for it)?
Override controls. Should users be able to manually request more or less thinking?
How tools answer these questions affects user experience and cost.
What I’d watch
A few things to track:
- Which tools adopt the new API and how they use it
- Whether competitive pressure pushes other providers to similar features (OpenAI’s o-series has similar capability; Google may follow)
- User feedback on the cost-quality tradeoff
- Anthropic’s continued releases on the reasoning front
The reasoning capability is one of Anthropic’s strengths. Continued investment here makes Claude more attractive for tasks that benefit from thinking.
Worth caring about?
For tool developers: yes. The configurability is a useful primitive. Tools that use it well will produce better outputs at managed costs.
For users on subscription tools: probably not directly visible. Tools will adopt this behind the scenes; you’ll experience it as “the tool got smarter on hard tasks.”
For BYOK users: yes if you’re building custom workflows. The per-request control gives you optimization opportunities.
The pricing observation
Reasoning models (o1, Claude with thinking, Gemini 2.0 Flash Thinking) are forming a tier above the standard models. The pricing:
- Standard model: ~$15/M output tokens (Claude 3.7 Sonnet class)
- Thinking model: same per-token output price, but you pay more in total because thinking adds output tokens
- “Reasoning” model (o1-pro): premium per-token pricing, plus even more thinking
The category is differentiating. For tasks that benefit from thinking, the cost is justified. For tasks that don’t, the standard tier suffices.
The configurability in this release lets tools make this decision dynamically rather than committing to a tier. Useful for tools that handle varied task types.
Closing
The extended thinking API is a small but meaningful addition. It expands what Claude can do for tools that want to make trade-offs. The tools that use it well will produce better outputs; the tools that ignore it will lag.
For tool builders, this is a feature to integrate intentionally. For users, expect the experience to improve as tools adopt it.