Output token budgets in Claude Code: avoiding the truncation surprise
Published 2026-05-11 by Owner
You paste a 600-line file and ask Claude Code to refactor the whole thing. The model starts generating. Somewhere around line 400, the output stops. No error. No warning. Just… less than you asked for.
That’s the output token ceiling, and it hits differently than running out of context.
Why output tokens are the real bottleneck
Most people track token spend on the input side—context windows, conversation history, how many files you’ve attached. But if truncation is your problem, the input side is the wrong place to watch.
Claude models have a separate limit on how many tokens they can generate per turn. For Claude 3.5 Sonnet and Claude 3.5 Haiku, the max output is 8,192 tokens. Claude 3.7 Sonnet and the extended thinking variants can go substantially higher, but in standard Claude Code operation you’re working with the 8k ceiling by default.
Eight thousand output tokens is roughly 400–600 lines of dense code, depending on how verbose the language is. TypeScript interfaces and JSDoc? You’ll hit the ceiling faster. Terse Python? You get a bit more room.
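A quick way to sanity-check a request before sending it is to estimate the output side of the exchange. A minimal sketch of that arithmetic, using the rough rule of thumb of about four characters per token (the ratio, the 8,192 constant, and the fitsInOneTurn helper are illustrative assumptions, not anything Claude Code exposes):

import { readFileSync } from "node:fs";

// Rough pre-flight check: will a full-file rewrite fit under the output ceiling?
// Uses the common ~4 characters-per-token rule of thumb, not an exact tokenizer.
const OUTPUT_CEILING = 8192;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsInOneTurn(expectedOutput: string, headroom = 0.8): boolean {
  // Leave ~20% of the budget for prose the model wraps around the code.
  return estimateTokens(expectedOutput) < OUTPUT_CEILING * headroom;
}

// A full-file rewrite produces output roughly as large as the file itself.
const source = readFileSync("src/auth/auth.service.ts", "utf8");
if (!fitsInOneTurn(source)) {
  console.warn("A full rewrite of this file likely won't fit in one turn.");
}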
The crucial distinction: a model can hold a 200k-token conversation in context and still only generate 8k tokens in a single response. The context window is for reading; the output budget is for writing. Running out of input context gives you a “context window exceeded” error. Running out of output tokens just ends generation silently.
Where it actually shows up
Truncation doesn’t happen uniformly. It clusters around specific request patterns:
Full-file rewrites. “Rewrite this component to use the new API” sounds like a scoped task. If the component is 700 lines, the rewrite is also 700+ lines. That’s a common path to a truncated result.
Large boilerplate generation from schemas. “Generate TypeScript types and Zod validators from this OpenAPI spec” can easily produce 1,000+ lines. Claude starts generating the first few types perfectly, then the output terminates mid-interface.
Test suite generation. “Write tests for this service class” where the service has 20 methods often hits the ceiling. The first 10-12 methods get thorough tests. The remaining methods get nothing, with no indication they’re missing.
Multi-file scaffolding. Asking Claude Code to scaffold an entire module—create the controller, the service, the repository, the tests—in a single prompt tries to push four files’ worth of content through a single output budget.
The maddening thing is that the truncated output is often not obviously incomplete. If you’re generating a test file and the last test happens to end on a closing brace, the file looks syntactically valid. It’s only when you run it and notice the missing coverage that you realize the generation stopped early.
Large YAML or JSON generation hits this just as often as code. Ask Claude Code to populate a config file from a complex schema, and the model will produce valid partial YAML that passes a syntax check but is missing half the keys. Nothing errors; the missing configuration just silently uses defaults—or doesn’t exist at all.
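One cheap guard against that failure mode is to diff the keys you got back against the keys you expect. A minimal sketch for JSON config, assuming you maintain your own list of required top-level keys (requiredKeys and checkGeneratedConfig are illustrative names, not part of any schema tool):

// Catch silently-missing keys in a generated JSON config.
// requiredKeys is assumed to come from your own schema; adjust per project.
const requiredKeys = ["database", "cache", "logging", "auth", "featureFlags"];

function checkGeneratedConfig(jsonText: string): string[] {
  const parsed = JSON.parse(jsonText) as Record<string, unknown>;
  // Report every required key the model never emitted.
  return requiredKeys.filter((key) => !(key in parsed));
}

// Stand-in for the model's output, captured however you run the generation.
const generatedConfig = '{"database": {}, "logging": {}}';
const missing = checkGeneratedConfig(generatedConfig);
if (missing.length > 0) {
  console.warn(`Config parses but is missing: ${missing.join(", ")}`);
}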
Spotting truncation before it ships
There are patterns to watch for as the generation runs and after it completes:
In streaming mode: generation that slows and stops mid-function. Normal generation has a consistent cadence. If you’re watching text appear token-by-token and the stream visibly stalls in the middle of a method body—not at the end of a logical unit—that’s the output ceiling.
Sentences that end mid-thought. The model has a strong completion instinct. It almost never ends on a fragment by choice. If a docstring reads “Validates the incoming request and throws a” and nothing follows, that’s a hard stop.
Unclosed code blocks. A code fence that opens with triple-backtick and never closes. A class that opens and never closes. A function missing its final return and closing brace. These are reliable indicators.
The response is much shorter than the prompt warrants. If you sent a 400-line file and received 200 lines back with no explanation of what was omitted, something stopped the generation.
“…” appearing mid-file. This one is subtle. Sometimes Claude will insert an ellipsis as a shorthand for “there’s more here” when it senses the output is getting long. This isn’t truncation by the ceiling—it’s the model preemptively abbreviating. But it’s a signal that you’re close to the boundary and future requests at the same scale will hit it hard.
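If you want something more mechanical than eyeballing the stream, most of the patterns above reduce to a few cheap string checks. A rough sketch (the looksTruncated heuristic is invented here for illustration; it is not a Claude Code feature and will produce false positives on prose-heavy responses):

// Heuristic truncation check on a generated response.
// No single check is definitive; together they catch most clipped outputs.
function looksTruncated(text: string): boolean {
  const trimmed = text.trimEnd();

  // Unclosed code fence: an odd number of ``` markers.
  const fenceCount = (trimmed.match(/```/g) ?? []).length;
  if (fenceCount % 2 !== 0) return true;

  // More opening braces than closing ones suggests a function or class that never closed.
  const opens = (trimmed.match(/\{/g) ?? []).length;
  const closes = (trimmed.match(/\}/g) ?? []).length;
  if (opens > closes) return true;

  // Ends mid-sentence or mid-statement rather than on a natural terminator.
  if (!/[.!?}\]);`"']$/.test(trimmed)) return true;

  return false;
}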
The incremental edit pattern
The fix for most truncation is restructuring requests so each turn stays well under the output ceiling.
Instead of:
Rewrite src/auth/auth.service.ts to use the new JWT library.
Break it into:
In src/auth/auth.service.ts, update only the `signToken` and `verifyToken`
methods to use the new jwt library. Leave everything else unchanged.
Then, once that looks right:
Now update the `refreshToken` method in the same file.
Each incremental edit is a fraction of the whole. Even if the full file is 800 lines, an edit that touches two methods generates maybe 60 lines of output plus context. You’re nowhere near the ceiling.
This isn’t just a token optimization. The incremental pattern has a second benefit: you review changes in smaller pieces and catch errors earlier. A bad full-file rewrite requires reviewing everything at once. A bad two-method edit is obvious immediately.
There’s also a subtler reason incremental edits work well: the model is better at editing than rewriting. When you give Claude Code a function that already exists and ask it to modify specific behavior, it can focus the entire output budget on the changed region. When you ask for a rewrite from scratch, it has to regenerate all the stable code it didn’t need to touch, burning output tokens on content that was already correct.
The same applies to test generation. Instead of “write tests for UserService”, try:
Write tests for the UserService.createUser method only. Cover:
- valid input creates a user and returns it
- duplicate email throws ConflictException
- database error propagates correctly
Three tests per turn, reviewed and committed, then move to the next method. Slower in wall-clock time, faster in iterations to a working test suite.
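If you drive this through the API rather than the interactive session, the same decomposition is easy to automate: one request per method, each comfortably under the ceiling. A sketch assuming the @anthropic-ai/sdk client and a method list you supply yourself (the UserService method names and the output file naming are made up for illustration):

import Anthropic from "@anthropic-ai/sdk";
import { writeFileSync } from "node:fs";

const client = new Anthropic();
// Hypothetical list of methods to cover, one request per method.
const methods = ["createUser", "updateUser", "deleteUser", "findByEmail"];

for (const method of methods) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 8192,
    messages: [{
      role: "user",
      content: `Write tests for the UserService.${method} method only. Cover the happy path, validation failures, and error propagation.`,
    }],
  });

  // Each per-method request uses a fraction of the output budget, so it can
  // finish naturally instead of clipping partway through the suite.
  const text = response.content
    .map((block) => (block.type === "text" ? block.text : ""))
    .join("");
  writeFileSync(`user.service.${method}.spec.ts`, text);
}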
Streaming vs. non-streaming truncation behavior
In Claude Code’s interactive mode, you see streaming output by default. This is actually the more forgiving mode for truncation, because you watch the generation as it happens.
When streaming, truncation looks like: generation stops in a visually obvious place. You’re watching a function appear token-by-token, and it stops at the start of the return statement. You can immediately recognize this as incomplete and re-prompt before you’ve accepted anything.
In non-streaming contexts—API calls with stream: false, or batch operations—the behavior is worse. You get a single returned response. It’s clipped at the output limit with no warning in the response body. The stop_reason field in the API response tells the story: it returns "max_tokens" instead of "end_turn" when generation was stopped by the output limit. If you’re building tooling on top of Claude Code or the Anthropic API, checking stop_reason === "max_tokens" is the reliable way to detect truncation programmatically.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const prompt = "Generate TypeScript types and Zod validators for the attached OpenAPI spec.";

const response = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 8192,
  messages: [{ role: "user", content: prompt }],
});

if (response.stop_reason === "max_tokens") {
  // Output was truncated — re-prompt with a narrower scope
  console.warn("Generation truncated. Request was too large for a single turn.");
}
In interactive Claude Code use, you don’t see stop_reason directly, but the streaming behavior makes the truncation visible if you’re paying attention. The rule: if a response ends without natural closure on what you asked, assume truncation and re-scope before building on top of it.
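If you’re building your own streaming tooling on the SDK rather than using the interactive CLI, the same signal is available when the stream finishes. A sketch using the SDK’s streaming helper (assuming @anthropic-ai/sdk; the prompt is a stand-in):

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const prompt = "Rewrite src/auth/auth.service.ts to use the new JWT library.";

const stream = client.messages.stream({
  model: "claude-sonnet-4-5",
  max_tokens: 8192,
  messages: [{ role: "user", content: prompt }],
});

// Print tokens as they arrive, the same view the interactive session gives you.
stream.on("text", (delta) => process.stdout.write(delta));

// The assembled final message carries the same stop_reason as a non-streaming call.
const finalMessage = await stream.finalMessage();
if (finalMessage.stop_reason === "max_tokens") {
  console.warn("\nStream ended at the output ceiling, not at a natural stop.");
}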
Adjusting the output budget
For API use, max_tokens is a ceiling you set. The default in many SDK examples is 1024 or 4096—well below the model’s actual limit. If you’re generating code and getting surprisingly short responses, check that you’ve set max_tokens to the model’s maximum: 8,192 for the 3.5-generation models, and higher for models that support longer outputs.
Raising max_tokens doesn’t make the model generate more—it removes a hard cutoff that was prematurely ending responses. The model will still stop at a natural point if the task is smaller than the budget.
The converse: intentionally lowering max_tokens can be a useful constraint for exploring solutions. If you set max_tokens: 200 and ask for a refactoring suggestion, you force the model to summarize the approach rather than generate full code. This is occasionally useful for planning before committing to a full implementation.
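In API terms that’s just a deliberately small budget on a planning call, reusing the client from the earlier example (the 200-token cap and the prompt wording are only an illustration):

// Force a summary-level answer by capping the output budget.
// Reuses the client constructed in the earlier example.
const plan = await client.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 200, // enough for an outline, not enough for code
  messages: [{
    role: "user",
    content: "Outline how you would migrate auth.service.ts to the new JWT library. Numbered steps, no code.",
  }],
});
console.log(plan.content.map((b) => (b.type === "text" ? b.text : "")).join(""));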
The pattern that saves the most time
After enough truncation surprises, the workflow shift that helps most is front-loading the scoping conversation. Before sending a request that involves writing a lot of code, ask Claude Code:
What's your plan for implementing this? Give me an outline before writing any code.
The plan is cheap—a few hundred tokens. It tells you how much output the implementation will require. If the plan has 12 steps and each step involves writing a complete function, you know the full implementation won’t fit in one turn. Break the request before you send it, not after you get a truncated response.
Output token limits will expand over time as models improve. Claude 3.7 Sonnet’s extended thinking can already generate substantially more in specialized modes. But the core constraint—there’s a ceiling on what a single response can contain—will persist. Knowing which requests approach that ceiling, and how to structure them around it, is the skill that stays useful regardless of what the specific limit is.