Tinker AI

In 2023, “context window” was the spec everyone talked about. Claude went from 100k tokens (Claude 2) to 200k (Claude 2.1); that was news. Gemini 1.5 hit 1M tokens; that was a bigger story. The implication: bigger windows would unlock new capabilities.

Three years later, windows range from 128k (most models) to 2M (Gemini), and the differences barely matter for typical coding work. The arms race quietly stopped mattering. Here’s why and what does matter now.

Why bigger windows promised more

The pitch in 2023: with enough context, you can put your whole codebase in the prompt, and the model can reason about everything at once. No more chunking, no more retrieval, no more “wait, the model can’t see X.” The whole world fits in the window.

This was true in a literal sense — a million tokens really can hold most small-to-medium codebases. But the workflow promise didn’t materialize the way the spec suggested.

Why bigger windows don’t change much

Four reasons the bigger windows haven’t been transformative:

Effective attention degrades faster than total context grows. A 1M-token model is meaningfully worse at finding relevant info near token 800,000 than near token 50,000. Empirical “needle in a haystack” tests show degradation across all models past a few hundred thousand tokens. The full window is technically usable; the practical sweet spot is much smaller.

Cost scales linearly. Loading 500k tokens of context costs 5x what loading 100k costs. For most tasks, the larger context isn’t 5x more useful; it’s just more context. The marginal value drops faster than the marginal cost.

Latency scales too. Time-to-first-token is roughly linear in input size. Loading 500k tokens means 5-15 seconds before the first response token. Interactive workflows want sub-second TTFT. Large contexts don’t give you that.

Retrieval-based approaches got better. Indexing-based context (Cursor’s codebase index, Cline’s selective retrieval, embeddings-based search) got good enough that loading the whole codebase is rarely necessary. You can get the relevant 50k tokens without paying for the irrelevant 950k.
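
To make the retrieval point concrete, here’s a minimal sketch of budget-limited context selection. A toy bag-of-words scorer stands in for a real embedding model and index; the chunking, the 50k-token budget, and the chars-per-token estimate are illustrative assumptions, not how any particular tool does it.

```python
# Minimal sketch of retrieval-based context selection: score codebase chunks
# against the query and keep only the best ones, instead of loading everything.
# A toy bag-of-words vectorizer stands in for a real embedding model.
import math
import re
from collections import Counter

def vectorize(text: str) -> Counter:
    # Crude stand-in for an embedding: token frequency counts.
    return Counter(re.findall(r"[a-zA-Z_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_context(query: str, chunks: list[str], token_budget: int = 50_000) -> list[str]:
    """Return the highest-scoring chunks that fit within the token budget."""
    qv = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        est_tokens = len(chunk) // 4  # rough chars-per-token estimate
        if used + est_tokens > token_budget:
            break
        selected.append(chunk)
        used += est_tokens
    return selected
```

The design point is the budget: you get the relevant ~50k tokens for the price of 50k tokens, and the irrelevant 950k never enters the prompt at all.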

Where bigger windows still matter

A few specific use cases where window size genuinely matters:

Cross-cutting refactors. When the change has to be applied consistently across many files, having all the files in context helps the model produce a complete change list. Indexing-based approaches sometimes miss occurrences.

Tracing through unfamiliar codebases. “Where does this flow go?” When you don’t know the relevant files, loading more context lets the model find the path.

Documentation generation that spans many files. Producing a README, architecture doc, or design doc that synthesizes information from many sources. The synthesis benefits from the model seeing everything at once.

Long-running conversations. Multi-turn agent loops accumulate context. Larger windows mean longer sessions before you have to compact. For Cline’s autonomous loops in particular, the difference between 200k and 1M is “session lasts an hour” vs “session lasts four hours” before compaction. (A sketch of what compaction looks like follows this list.)

These are real but specialized. Most coding work doesn’t need them.
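
For the long-conversation case, compaction is the operative mechanism, so here’s a minimal sketch of what it can look like, assuming a rough token estimate and a stubbed summarizer (a real agent would ask the model to write the summary):

```python
# Minimal sketch of context compaction for a long-running agent loop:
# when the transcript approaches the window, fold the oldest turns into a
# summary and keep only recent turns verbatim.
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough chars-per-token estimate

def summarize(turns: list[str]) -> str:
    # Stub: a real agent would ask the model for a compressed summary here.
    return f"[summary of {len(turns)} earlier turns]"

def compact(transcript: list[str], window: int = 200_000, keep_recent: int = 20) -> list[str]:
    """Compact the transcript once it exceeds ~80% of the context window."""
    total = sum(estimate_tokens(t) for t in transcript)
    if total < int(window * 0.8) or len(transcript) <= keep_recent:
        return transcript
    old, recent = transcript[:-keep_recent], transcript[-keep_recent:]
    return [summarize(old)] + recent
```

The only thing a bigger window changes in this sketch is the `window` argument: how long a session runs before the summary step fires.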

What does matter now

Instead of context window size, here are the model attributes I notice mattering most:

Quality of attention within a moderate window. A 200k window where the model can effectively use all 200k is better than a 1M window where it gets lost past 300k. Claude’s attention quality across its full window has been a meaningful advantage.

Instruction following. Some models follow long, structured instructions reliably. Others don’t. For tasks like “follow these 30 rules carefully,” instruction-following dominates.

Tool use reliability. For agentic workflows, the model has to emit structured tool calls correctly every turn. Models with high tool-use fidelity (Claude, GPT-4o) outperform models that occasionally produce malformed calls. (A small validation sketch follows this list.)

Code-specific training. The flagship models all do code well. The differentiation is in specific languages and specific patterns. Some models are stronger on Rust; some on niche frameworks; some on SQL. The model’s exposure during training matters more than the window size.

Cost per token. When the workflow doesn’t benefit from massive context, lower price per token translates directly to more sessions per budget. Cheap models with reasonable windows beat expensive models with huge windows for most workloads.
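
On the tool-use point, here’s a minimal sketch of the kind of guard an agent loop puts in front of tool execution. The tool names and their parameter sets are hypothetical; the point is that a malformed call gets caught and turned into a retryable error instead of a crash or a silent no-op.

```python
# Minimal sketch of guarding an agent loop against malformed tool calls:
# parse the model's tool-call arguments and check them against the declared
# parameters before executing anything. Tool names and schemas are hypothetical.
import json

TOOLS = {
    "read_file": {"required": {"path"}, "optional": set()},
    "run_tests": {"required": set(), "optional": {"pattern"}},
}

def validate_tool_call(name: str, raw_args: str) -> dict:
    """Return parsed arguments, or raise with a message the agent can retry on."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        raise ValueError(f"arguments are not valid JSON: {e}")
    spec = TOOLS[name]
    missing = spec["required"] - set(args)
    unexpected = set(args) - spec["required"] - spec["optional"]
    if missing or unexpected:
        raise ValueError(f"missing: {sorted(missing)}, unexpected: {sorted(unexpected)}")
    return args
```

A model with high tool-use fidelity almost never trips this check; a model that doesn’t burns turns (and tokens) on retries.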

The KV cache angle

One technical advance that did matter, even though nobody markets it: prompt caching.

Claude’s prompt caching, OpenAI’s similar feature, and equivalent capabilities at other providers let you cache the prefix of a prompt — your codebase context, your system instructions — and reuse it across queries. The cached portion is much cheaper to “load” again.

This effectively makes large contexts cheaper for repeated use. If your conversation builds on the same 100k tokens of code, the second message reads from cache; the cost is much lower than the first.

Caching is the under-marketed feature that’s done more than window-size growth to make large contexts practical. It’s also the feature that’s hardest to use correctly — the cache has invalidation rules, and a small change to the prefix can blow the cache and re-cost the whole thing.
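
Here’s a hedged sketch of what using it looks like with Anthropic’s Python SDK: mark the large, stable prefix with cache_control and keep it byte-identical across requests. The model name, file path, and helper function are assumptions for illustration; check current provider docs for exact caching rules and pricing.

```python
# Hedged sketch of prefix caching with the Anthropic SDK: mark the large,
# stable prefix (system prompt + codebase context) with cache_control so
# follow-up requests read it from cache instead of re-paying full input price.
import anthropic

client = anthropic.Anthropic()
codebase_context = open("context/repo_snapshot.txt").read()  # hypothetical dump of relevant code

def ask(question: str):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any caching-capable model
        max_tokens=1024,
        system=[
            {"type": "text", "text": "You are a code assistant for this repository."},
            {
                "type": "text",
                "text": codebase_context,
                # Everything up to and including this block becomes the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        messages=[{"role": "user", "content": question}],
    )

# The second call with a byte-identical prefix reads from cache; edit or
# reorder the prefix and the whole thing is re-processed at full price.
ask("Where is the retry logic for the HTTP client?")
ask("Does anything else call that retry helper?")
```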

What the marketing still says

The marketing for new model launches still leads with context window. “2M tokens” sounds like a meaningful upgrade. In practice, the difference between 200k and 2M for typical coding work is negligible.

This isn’t malicious; it’s because window size is a clean number that’s easy to compare. Attention quality, instruction following, and tool use reliability are harder to communicate. Vendors will continue to advertise the easy number.

The user-side translation: when a new model launches with a bigger window, ignore the headline number. Test the model on your actual workflow. If it works better, it’s because of attention quality, not window size. If it doesn’t, no window size will help.

What I’d watch for

A few things that would actually change my workflow:

Faster TTFT on large contexts. Sub-second TTFT on 100k+ token prompts would change interactive use. Today’s TTFT scales roughly linearly with input.

Better effective attention past 500k. If a model could maintain quality across a true 1M window (not just nominal), some specialized workflows would unlock.

Cheaper extended caching. Longer cache TTLs and lower cached-token prices would make “always have my codebase in context” more practical.

These would matter. Bigger nominal windows, on their own, won’t.

The broader pattern

The window arms race is one example of a broader pattern: AI specs that grab attention turn out not to be the operative variables for most workflows. RLHF replaced fine-tuning as the marketing focus; chain-of-thought prompting became “thinking” as a model feature; mixture of experts had its run as the architecture to name-drop; and so on.

Each of these matters for some workflows. None of them, on their own, is the differentiator for most users. The actual differentiators — instruction following, attention quality, tool use, cost — are harder to market because they don’t reduce to a single number.

For deciding which model to use, ignore the headlines. Run the model on your actual work for a day. Notice what works and what doesn’t. The choice is rarely about the spec the marketing emphasizes.