
What 1M context on Opus 4.7 actually buys you (and what it doesn't)

Published 2026-05-11 by Owner

The pitch is seductive: a million tokens means you can feed the model your entire codebase and it will reason across all of it simultaneously. Finally, no more “the model doesn’t know about that file.” Finally, true global understanding.

That’s not quite how it plays out in practice.

After a few months of using Claude Code with Opus 4.7 on moderately large codebases — 80k to 400k lines, multiple packages, long-running sessions — the picture is more nuanced. The 1M window is genuinely load-bearing in a handful of situations. In most others, it’s overhead you’re paying for without getting much back.

This guide isn’t about whether the 1M context window is a good feature. It clearly is. It’s about the gap between “technically possible” and “worth paying for,” and how to tell the difference before your bill arrives.

The naive expectation vs what actually happens

The mental model most people start with: bigger context = the model sees more = better answers. This is true at the margins but misleads about the mechanics.

A language model doesn’t “read” context the way you read a document linearly and retain everything. Attention is distributed across the full context, but the effective weight of distant tokens decays. In practice, content near the beginning and near the end of the context window gets the most attention. The middle — which is where most of your “paste the whole repo” content lands — gets diluted.

This has a name in the research literature: the “lost in the middle” problem. It was documented systematically in 2023 (Liu et al., “Lost in the Middle: How Language Models Use Long Contexts”), and the current generation of models has improved on it significantly, but it hasn’t been fully eliminated. Opus 4.7 is better than its predecessors at attending to the middle, but if you fill 800k tokens with every file in your monorepo and ask a question about one specific module, you’re not getting 800k tokens of focused reasoning — you’re getting inference that has to compete with a lot of noise.
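
You don’t have to take the benchmark papers’ word for this; the effect is cheap to probe on your own setup. The sketch below buries one known fact at different depths in filler text and checks recall, a minimal version of the standard needle-in-a-haystack test. It assumes the anthropic Python SDK, and the model id is a placeholder:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FILLER = "The sky is blue. The grass is green. " * 3000  # ~30k tokens of noise; scale up as needed
NEEDLE = "The deploy password is 'osprey-42'."           # the one fact that matters

for depth in (0.0, 0.5, 1.0):  # start, middle, end of the context
    cut = int(len(FILLER) * depth)
    haystack = FILLER[:cut] + NEEDLE + FILLER[cut:]
    reply = client.messages.create(
        model="claude-opus-4-5",  # placeholder model id; substitute your own
        max_tokens=50,
        messages=[{"role": "user",
                   "content": haystack + "\n\nWhat is the deploy password?"}],
    )
    print(f"depth={depth}: {reply.content[0].text}")

If mid-context recall lags on your real files rather than toy filler, that’s your signal to scope harder.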

The upshot: context quality still matters more than context quantity. Focused, relevant context outperforms bloated, comprehensive context even when the window has room for both. Stuffing the window to feel thorough is actively counterproductive beyond a threshold.

What actually degrades with large unfocused context:

  • Instruction following precision (the model’s “attention budget” is spread thin)
  • Response consistency on repeated queries (the model finds different “attractors” in the noise)
  • Cost, which scales linearly with input token count

None of this means the 1M window is a marketing gimmick. It means using it well is a skill.

There’s also a subtle trap in how “bigger context” feels. Loading more files creates a sense of having been thorough. It’s reassuring. That reassurance is somewhat misleading — the model isn’t necessarily making better decisions just because you gave it more to read. The relevant question is always whether the added content is load-bearing for the specific question being asked.

When 1M context is genuinely load-bearing

There are tasks where the large context window earns its keep. They share a common structure: the task requires holding multiple large artifacts in memory simultaneously, and the artifacts are genuinely interdependent.

Cross-file refactors spanning 30+ files. Renaming a core abstraction, changing a function signature that threads through many call sites, splitting a module that half the codebase imports — these are tasks where the model benefits from seeing all the affected files at once. With a 32k or 128k window you’re forced to work in passes, each pass losing context about what was decided in the previous one. With 1M, you feed it the full change surface upfront and the model can reason about the complete impact before touching anything.

An example workflow that works well (the symbol name is illustrative):

# Use ripgrep to list every file that actually references the abstraction being changed
rg -l 'createLegacyClient' src/ -t ts
# Check the size of that surface before loading it, so you know what you're paying for
rg -l 'createLegacyClient' src/ -t ts | xargs wc -l | sort -rn
# Pipe those files into your context: not everything, just the affected surface

The key is intentional scoping: you’re not pasting the whole repo, you’re pasting the 30-40 files that actually matter for this change. 1M makes this comfortable even when each file is substantial.

Multi-document synthesis. RFC + design doc + existing implementation + test suite, all at once. If you’re asking the model to check whether a new implementation matches its stated design, or to identify gaps between a spec and what was built, it needs to hold all four documents in context simultaneously. At 128k this requires compression and summarization; at 1M you can include them whole.
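
Mechanically, “include them whole” just means concatenating the artifacts with delimiters clear enough that the model can tell where one ends and the next begins. A minimal sketch, with placeholder paths:

from pathlib import Path

# Placeholder paths: substitute the actual RFC, design doc, implementation, and tests
artifacts = ["docs/rfc-142.md", "docs/design.md", "src/engine.ts", "tests/engine.test.ts"]

sections = [f"===== {path} =====\n{Path(path).read_text()}" for path in artifacts]
prompt = ("\n\n".join(sections)
          + "\n\nDoes the implementation match the design? List every gap you find.")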

Long conversation continuity. A session that runs for three or four hours accumulates a lot of context — the planning turns, the intermediate outputs, the corrections, the code diffs. At 1M, an afternoon of work stays coherent. The model remembers what was decided two hours ago without needing a re-brief. This is a subtler benefit than the “giant file dump” framing, but it’s probably the one I rely on most often day-to-day.

Before 1M context became available, long sessions required periodic re-briefs — summarizing the current state of work, what had been done, what decisions had been made, and passing that summary in as the new context for the next phase. With 1M, that overhead disappears. The session accumulates naturally and the model’s understanding of “where we are” updates from the actual conversation rather than from a summary you wrote. Summaries introduce distortion; the actual conversation doesn’t.

When 1M is wasted

The inverse cases are worth being explicit about, because the temptation to over-provision context is real.

Single-file edits. If you’re asking the model to refactor one file, add error handling to one function, or fix a bug in one component, 1M context is overhead. The model doesn’t need it and you’re paying for it anyway because pricing is per input token.

Short tasks. “Write me a utility function that does X.” “Translate this comment to English.” “What’s wrong with this SQL query?” These tasks don’t benefit from global codebase context. They’d be solved correctly with 8k tokens. Routing them to Opus with 1M context is expensive by roughly an order of magnitude compared to running them on Sonnet.

Repeated commands. If you’re running a loop that asks the model to generate 50 fixture files from a schema, each turn is independent. Each turn inherits the full context from all prior turns, but doesn’t need it. The cost compounds fast.
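
If you’re scripting that kind of loop against the API directly, the fix is to make each call genuinely independent: fresh minimal context every time, no inherited history. A sketch, with placeholder paths and model id:

import anthropic
from pathlib import Path

client = anthropic.Anthropic()
schema = Path("schema.json").read_text()  # placeholder path

for fixture in ("user", "order", "invoice"):  # ...and the other 47 in practice
    # Each call carries only the schema plus one instruction; no conversation
    # history accumulates, so the per-turn cost stays flat.
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder id; a cheaper model is plenty here
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": f"{schema}\n\nGenerate a realistic JSON fixture for '{fixture}'."}],
    )
    Path(f"fixtures/{fixture}.json").write_text(msg.content[0].text)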

“Context as comfort.” I’ve caught myself adding extra files to context not because they’re relevant but because I’m uncertain — maybe the model will find something I missed. Sometimes this is reasonable. More often it’s anxiety-driven provisioning. If you can’t articulate why a file needs to be in context, it probably shouldn’t be.

Cost and prompt caching: why 1M is usable at all

At Opus pricing, naively sending 1M tokens on every turn would be prohibitively expensive. What makes it workable is prompt caching.

The pattern in Claude Code sessions: the large stable context (your files, your plan, your conventions) gets cached after the first turn. Subsequent turns pay only for the new tokens added — typically the latest exchange plus whatever new content you’ve introduced. The cost of re-sending the preamble on turns 2 through N collapses to a cache-read charge, which is substantially cheaper than the full input token rate.
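
Claude Code manages this for you, but the mechanism is visible if you hit the API directly: you mark the stable prefix with a cache_control block, and the usage object on the response shows whether you paid the write premium or the read rate. A minimal sketch (the path and model id are placeholders):

import anthropic
from pathlib import Path

client = anthropic.Anthropic()
stable_context = Path("docs/design-spec.md").read_text()  # placeholder path

response = client.messages.create(
    model="claude-opus-4-5",  # placeholder model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": stable_context,
        # Everything up to and including this block gets cached; later calls
        # that resend the same prefix verbatim pay the cache-read rate instead.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Summarize the open design questions."}],
)
print(response.usage)  # cache_creation_input_tokens on turn 1, cache_read_input_tokens after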

The practical effect: a session that starts by loading 500k tokens of context is expensive on turn 1 and then relatively cheap on turns 2-20, because the cached preamble is nearly free to “re-read.” The cost curve is front-loaded, not linear across turns.
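
Rough numbers make the shape of that curve concrete. Anthropic has historically priced cache writes at about 1.25x the base input rate and cache reads at about 0.1x; treat the rates below as illustrative, not current pricing:

BASE = 15.00 / 1_000_000  # illustrative $/input token; check published pricing
PREAMBLE = 500_000        # tokens of stable context loaded on turn 1
PER_TURN = 2_000          # new tokens added on each later turn

turn_1 = PREAMBLE * BASE * 1.25                    # pay the cache-write premium once
turn_n = PREAMBLE * BASE * 0.10 + PER_TURN * BASE  # cache read plus new input
print(f"turn 1:  ${turn_1:.2f}")   # ~$9.38
print(f"turn 2+: ${turn_n:.2f}")   # ~$0.78 each

The preamble dominates turn 1 and nearly vanishes afterward, which is exactly the front-loaded curve described above.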

This also means session architecture matters. It’s better to load your full context once at the start of a session and work within it than to load partial context repeatedly across many short sessions. The first approach pays one large cache-fill cost; the second pays the full input rate repeatedly without the caching benefit kicking in.

For Claude Code specifically: the tool handles session state automatically, so you don’t manage cache headers directly. But you do control session boundaries. Starting a new conversation means losing the warm cache and paying the full load cost again. When you’re in the middle of a large refactor, staying in the session is cheaper than breaking out and starting fresh.

A few session hygiene habits that matter more at 1M scale:

  • Front-load stable content. Files that won’t change — reference docs, existing implementation, design specs — should go in before any rapidly-changing content. The cache invalidates from the point of change onward, so anything after a changed block gets re-priced at full rate.
  • Don’t restart sessions mid-task out of habit. Some people restart Claude Code sessions frequently just to “feel fresh.” At 128k context this was occasionally reasonable; at 1M it’s expensive. The context accumulation from a long session is a feature, not a problem to reset.
  • Large files that repeat across turns compound fast. If you’re loading a 50k-token file on every turn because you’re not sure it’s needed, caching softens the cost but doesn’t erase it: the first load pays a cache-write premium, later turns pay the read rate, and if the cache TTL lapses between turns you pay the write again. Check your actual usage patterns.

A concrete routing heuristic: when to drop to Sonnet

Opus 4.7 with 1M context is the right tool for planning and synthesis turns. It’s not always the right tool for execution turns.

The split that works in practice:

Keep on Opus: Any turn where the model needs to hold the full picture in mind — the initial analysis, architectural decisions, the final integration review. These are the turns where the model’s reasoning quality directly determines outcome quality.

Drop to Sonnet (4.6) for: Mechanical execution once the plan is clear. Generating boilerplate from a spec. Filling in test cases from examples. Writing docstrings for a list of functions. Translating a design into implementation when the design is already specified. These turns are guided by the plan established in the Opus turns; the cheaper model follows instructions competently without needing the full analytical horsepower.

The rough test I use: if removing 80% of the context would make the turn noticeably worse, stay on Opus. If the model is basically following a checklist and the quality ceiling is in the clarity of the instructions rather than the model’s reasoning, Sonnet is fine.
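
In a scripted pipeline, where you pick the model per call rather than per session, that test can be encoded directly. The model ids and the shape of Turn here are illustrative assumptions:

from dataclasses import dataclass

OPUS = "claude-opus-4-5"      # placeholder ids; substitute current models
SONNET = "claude-sonnet-4-5"

@dataclass
class Turn:
    kind: str                 # "plan", "execute", or "review"
    has_explicit_plan: bool = False

def route(turn: Turn) -> str:
    # Planning and integration review need the full picture: stay on Opus.
    if turn.kind in ("plan", "review"):
        return OPUS
    # Mechanical execution against an explicit plan: Sonnet is enough.
    if turn.kind == "execute" and turn.has_explicit_plan:
        return SONNET
    return OPUS               # when unsure, default to the stronger model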

A worked example: you’re migrating a codebase from a deprecated internal library to its replacement. The migration guide and the affected files together are 200k tokens.

  • Turn 1 (Opus, full context): Load the migration guide, the deprecated API surface, and 30 affected files. Ask for a migration plan that identifies patterns, edge cases, and ordering dependencies.
  • Turns 2-8 (Sonnet, scoped context): Take each affected file, load it plus the migration plan, and execute the migration for that file. Each turn is maybe 20k tokens instead of 200k. Sonnet can follow the explicit plan.
  • Final turn (Opus, full context again): Load all migrated files alongside the migration guide. Ask for a final consistency check — does the implementation match the spec, are there any files that were missed, does the new usage pattern look right throughout?

The cost difference between running this workflow vs running everything on Opus at full context is substantial. Most of the migration work (turns 2-8) runs at a fraction of the full-Opus-full-context rate. You’re paying Opus rates only for the two turns where Opus’s reasoning depth actually matters.
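
With illustrative rates (Opus at $15 and Sonnet at $3 per million input tokens, ignoring caching and output tokens for simplicity), the arithmetic looks like this:

OPUS_IN = 15.00 / 1_000_000   # illustrative rates; check published pricing
SONNET_IN = 3.00 / 1_000_000

all_opus = 9 * 200_000 * OPUS_IN       # every turn at full context on Opus
split = (2 * 200_000 * OPUS_IN         # plan + final review on Opus
         + 7 * 20_000 * SONNET_IN)     # turns 2-8 on Sonnet, scoped
print(f"all Opus:      ${all_opus:.2f}")  # $27.00
print(f"split routing: ${split:.2f}")     # $6.42

Roughly a 4x reduction on input cost alone, before prompt caching enters the picture.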

This is the architect/editor split that shows up in the Cline Plan/Act pattern and elsewhere. The expensive model sets direction; the cheaper model does the legwork. You’re paying Opus rates for the turns where that matters and Sonnet rates for the turns where it doesn’t.

The earned insight most guides skip

The real unlock with 1M context isn’t that you can load more stuff. It’s that you can stop managing context manually.

With a 32k or 128k window, a meaningful part of the engineering work in an AI-assisted session is context curation: deciding what to include, deciding when to summarize and compress, deciding when to start a fresh session and re-brief from scratch. This curation is cognitive overhead that doesn’t directly produce code.

With 1M context, the threshold where you have to think about context management moves far enough that most sessions never reach it. You load what’s relevant, you work, the session stays coherent, and you don’t spend ten minutes deciding which files to cut before asking your question.

That’s not nothing. The value isn’t that you’re doing smarter reasoning across a million tokens — it’s that you’re spending less energy on a task (context curation) that shouldn’t require engineering judgment in the first place.

The ceiling is still there, though. On the largest codebases — multi-repo enterprise setups, projects with millions of lines — even 1M gets tight. At that scale you’re back to curation, just with more headroom before it becomes necessary. And the fundamental tradeoff between context breadth and attention quality doesn’t disappear just because the window is large.

Use the window for the tasks that genuinely need it. Route everything else to the cheaper, faster model. The session architecture is more important than the raw context size.