Context window compression: the techniques and what they cost
Published 2026-05-11 by Owner
The context window is a fixed budget. Everything the model can “see” — your instructions, the conversation so far, file contents, tool outputs — has to fit inside it. When a session runs long, that budget fills, and the model loses access to earlier content.
Most developers hit this on their third or fourth AI coding session: the assistant starts forgetting what was decided twenty turns ago, hallucinates a function name that was defined earlier, or refuses further requests because the window is full. The usual response is to try to “compress” the context somehow. There are three ways to do that. All three lose information; they just lose different information.
Why context fills up faster than expected
The obvious cause is a long conversation. But long conversations are often not the main driver. In AI coding workflows, a few specific patterns eat context much faster than turn count alone would suggest.
Pasting large files. A 500-line config file or a full API response is cheap in time but expensive in tokens. That content stays in context for every subsequent turn — even after the conversation has moved on entirely. The paste is a one-time action; the token cost compounds.
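Back-of-envelope, the compounding looks like this (the four-characters-per-token figure and the file size are assumptions for illustration, not measurements):

```python
# Rough cost of one pasted 500-line file that stays in context for every later turn.
# Assumptions: ~4 characters per token, ~40 characters per line; real files vary.
chars_per_token = 4
file_lines = 500
chars_per_line = 40

paste_tokens = file_lines * chars_per_line // chars_per_token   # ~5,000 tokens

# The pasted content is re-sent with every subsequent request in the session.
later_turns = 30
resent_tokens = paste_tokens * later_turns                       # ~150,000 input tokens

print(f"one paste: ~{paste_tokens} tokens")
print(f"re-sent across {later_turns} later turns: ~{resent_tokens} input tokens")
```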
Verbose tool output. Grep results, test output, and find results can run to thousands of tokens. A single test suite run with full output can fill a quarter of a 32k context window. Repeated runs — run, fix, run again — stack those costs. Five test cycles with unfiltered output, and the test logs alone can exceed the budget.
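If you control the tooling layer, one mitigation is to cap tool output before it ever enters context. A minimal sketch; the helper name, the default budget, and the four-characters-per-token estimate are all assumptions:

```python
def truncate_output(output: str, max_tokens: int = 1_000) -> str:
    """Keep the head and tail of verbose tool output and elide the middle.

    Uses a rough 4-characters-per-token estimate; real tokenizers differ.
    """
    max_chars = max_tokens * 4
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    omitted = len(output) - max_chars
    return f"{output[:half]}\n... [{omitted} characters omitted] ...\n{output[-half:]}"
```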
Loading files that never get released. When a coding assistant reads a file, the contents enter context. If the session continues without explicitly dropping the file, those tokens stay resident regardless of whether the current task still needs them. Open five large source files across the course of a session and they remain loaded, occupying context for turns that have nothing to do with any of those files.
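If you are scripting the harness yourself, the fix is to track what is loaded and release it when the task moves on. A sketch of the idea, not any particular tool's API:

```python
class FileContext:
    """Track which files are currently resident in context so they can be released."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}   # path -> contents

    def load(self, path: str) -> None:
        with open(path, encoding="utf-8") as f:
            self.files[path] = f.read()

    def drop(self, path: str) -> None:
        # Releasing the file frees its tokens for every turn that follows.
        self.files.pop(path, None)

    def render(self) -> str:
        # Only files still loaded are serialized into the prompt.
        return "\n\n".join(f"### {path}\n{body}" for path, body in self.files.items())
```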
Tool call metadata. Every tool invocation — file reads, searches, command runs — adds not just the output but structured metadata: tool name, parameters, call status, return type. These accumulate across a long session and can account for a meaningful fraction of total context usage even before counting actual output.
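The record below is illustrative rather than any provider's actual wire format, but it shows how the envelope alone carries a cost even when the output is empty:

```python
import json

# Illustrative tool-call record; the field names are assumptions, not a real API schema.
call = {
    "tool": "read_file",
    "arguments": {"path": "src/auth/middleware.py"},
    "status": "success",
    "call_id": "call_0042",
    "output": "",   # even with empty output, the envelope itself costs tokens
}

overhead_tokens = len(json.dumps(call)) // 4   # rough 4-chars-per-token estimate
print(f"~{overhead_tokens} tokens of metadata per call, before any actual output")
```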
A 32k context in a paste-heavy session with verbose tool output can fill in under an hour. A 128k window buys more runway, but the dynamics are identical. It fills faster than most people expect because the cost-per-turn is higher than the conversation itself suggests.
The three compression strategies
Summarization
The model (or the tool) replaces some or all of the conversation history with a condensed summary. Something like: “We discussed implementing the auth middleware. The agreed approach was X. The relevant files are Y and Z.”
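Mechanically it looks something like the sketch below, where `summarize` stands in for whatever model call the tool makes and the split point is arbitrary:

```python
def compact_by_summary(messages: list[dict], summarize, keep_recent: int = 10) -> list[dict]:
    """Replace everything except the last `keep_recent` turns with one summary message.

    `summarize` is any callable that turns a transcript string into a short summary;
    in practice it is another model call.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation:\n" + summarize(transcript),
    }
    return [summary] + recent
```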
What it loses: the reasoning behind decisions. The summary keeps the conclusion but drops how you got there. If the rationale for a choice was “we ruled out approach B because of the edge case in the webhook handler,” the summary might only retain “approach A was chosen.” Three turns later, when an edge case surfaces that resurrects approach B, the model has no recollection of why it was ruled out. The conclusion is present; the argument for it is gone.
Summarization also collapses code snippets into descriptions. “The function was updated to handle null” versus the actual before/after diff. Descriptions lose the specifics that matter for follow-on work.
When it’s acceptable: exploratory sessions where the early turns were genuinely tentative — you were figuring out what the problem was — and the actual decisions are all in the recent half of the conversation. The early turns being summarized away doesn’t hurt much if they were exploratory rather than load-bearing.
Drop-old (sliding window)
The tool discards the oldest N turns entirely and keeps only the recent history. No summary; those turns are gone.
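In code this is the simplest of the three. A sketch; pinning the system prompt is an assumption about what the tool keeps, not part of the strategy itself:

```python
def sliding_window(messages: list[dict], keep_recent: int = 20) -> list[dict]:
    """Keep the system prompt (if present) plus the most recent turns; older turns are gone."""
    pinned = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return pinned + rest[-keep_recent:]
```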
What it loses: all backward references. If a constraint was established in turn 3 — “never mutate the input array, the caller owns that memory” — and the session is now on turn 45, that constraint is gone from context. The model will reintroduce mutations without knowing it’s violating an established decision. Later turns that implicitly rely on earlier ones break silently: the model doesn’t know it’s missing something, so it proceeds confidently on incomplete information.
The failure mode of drop-old: errors don’t look like context errors. They look like the model made a wrong choice. Without knowing the earlier turns were dropped, the wrong choice is hard to explain.
When it’s acceptable: sessions that are genuinely sequential and stateless — each turn is self-contained, and later turns don’t rely on specifics from much earlier. This is rarer in coding work than it sounds. Most coding sessions accumulate decisions that compound on each other.
Condense-old (compressed history)
A middle path: older turns are compressed into shorter representations rather than summarized into a single block or dropped entirely. Each turn is retained but shrunk. A turn that was 800 tokens becomes 150 tokens.
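A sketch of the shape of it, with `condense` standing in for the per-turn compression call (which is exactly where the extra compute goes):

```python
def condense_old_turns(messages: list[dict], condense, keep_recent: int = 10) -> list[dict]:
    """Shrink each older turn individually instead of merging or dropping them.

    `condense` is any callable that rewrites one turn's content into a shorter form;
    every old turn gets its own pass, which is why this variant costs the most.
    """
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    condensed = [{**m, "content": condense(m["content"])} for m in old]
    return condensed + recent
```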
What it loses: voice and texture. The compression strips filler, trims examples, and reduces elaboration. What remains is factual but stripped of context — why something was phrased a particular way, which parts were tentative versus confident. For coding this matters less than for creative work, but it does affect the model’s ability to reconstruct why a decision was made if the rationale lived in casual discussion rather than an explicit statement.
Condense-old preserves more specifics than summarization (individual turns still exist, just shorter) and more continuity than drop-old (nothing is deleted). It’s also the most computationally expensive, since each turn needs its own compression pass.
Per-tool support
Different tools handle this differently, and the differences matter for how much control you have.
Cursor applies compaction automatically as context fills. It kicks in invisibly — no user-facing toggle as of mid-2025. Behavior varies by model; some handle long context better than others. You don’t know compaction happened until the model is missing something it should know.
Cline lets you manually trigger compression from the UI. The sidebar shows current context usage as a percentage; a Compact button appears when usage is high. Auto-compaction is also available as a setting. Cline compacts against the whole conversation rather than a sliding window, so older turns get summarized rather than dropped entirely. Manual control is an advantage: you can compress at a logical stopping point rather than mid-thought.
Aider takes a different approach. Context is managed by tracking which files are active in its file list, and the /drop command removes files explicitly. Conversation history is less of a concern because each Aider command is relatively self-contained. When context fills, the solution is to drop files, not compress conversation turns. File context is managed explicitly rather than compressed implicitly — a different mental model, but a consistent one.
Raw API / Claude.ai give full control with no built-in compaction. The window fills and requests start failing once it’s exceeded. Most transparent (you always know exactly what’s in context), most demanding (you manage it).
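With no compaction layer, the counting is on you. A sketch using tiktoken as a rough counter; its encodings are OpenAI's, so for Claude treat the number as an estimate rather than an exact budget:

```python
import tiktoken

def estimate_tokens(messages: list[dict], encoding_name: str = "cl100k_base") -> int:
    """Approximate token count for a message list; exact accounting varies by provider."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(m["content"])) for m in messages)

def fits(messages: list[dict], window: int = 128_000, reserve: int = 4_096) -> bool:
    # Leave headroom for the model's reply; fail locally before the API does.
    return estimate_tokens(messages) <= window - reserve
```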
The practical tradeoff: auto-compaction is convenient but opaque. You often discover what was lost only when the model does something confusing that makes sense if you imagine a piece of context was stripped — and that discovery comes after the fact.
The workflow that avoids needing compression
The most reliable solution is not to need compression at all.
One session, one task. Start a fresh session for each discrete task. “Add the payment webhook handler” is a session. “Refactor the auth module” is a different session. Don’t let the webhook session grow into a general-purpose coding session that accumulates context from fifteen sub-tasks. A clear scope fits comfortably inside the window at normal density.
Close when done. When a task is complete, close the session. The temptation is to continue because the context is “warm” — the model knows the codebase, files are loaded, conventions have been established. This is the most common trap. A new session with a brief preamble (“working in this repo; here’s the relevant background”) starts clean without carrying the weight of the previous task.
Paste selectively. Paste the specific 20-line function, not the 500-line file it lives in. Let the model read files via tool calls when it needs more, rather than front-loading everything. Even tool-read files count against context, so load only what the current task genuinely requires.
Trim verbose output. If a test run is failing, paste the relevant failure messages, not the full test log. Grep results should be trimmed to the relevant lines. Extra work upfront, but it compounds: a session that consistently trims output stays lean across twenty turns rather than ballooning by turn five.
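This is easy to script when the output has a predictable shape. A sketch for pytest-style logs; the patterns are assumptions about what counts as relevant for your suite:

```python
import re

def failures_only(test_output: str) -> str:
    """Keep only the lines that mention failures or errors from a pytest-style run."""
    pattern = re.compile(r"FAILED|ERROR|AssertionError|short test summary", re.IGNORECASE)
    kept = [line for line in test_output.splitlines() if pattern.search(line)]
    return "\n".join(kept) or "no failure lines matched"
```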
A disciplined session that follows these patterns rarely needs compression. The average coding task — “implement this function,” “debug this test failure,” “explain what this code does” — fits comfortably in a 32k context if the session is kept focused. The failure mode is treating the AI assistant like a persistent process that should maintain awareness of everything worked on across multiple tasks. The context window is a working surface, not memory. When the task changes, clear the surface.
What compression does not fix
There’s a subtler problem compression cannot address: even before the window fills, quality degrades at high context utilization. Models pay less attention to content far from the current turn. Decisions made twenty turns ago are technically “in context” but practically invisible — the model’s attention is spread across thousands of tokens and older content receives less weight.
The useful limit is lower than the technical limit. A 128k window at 80% capacity is already showing effects: the model forgets things that are technically accessible. Compression at that point buys headroom, but the attention-dilution problem was already present before compaction ran.
Window size is a ceiling, not a target. The real number to manage is how much context the current task actually requires — and for most well-scoped coding tasks, that number is much smaller than the window’s limit. Compression is a recovery tool. A focused session is a prevention strategy. Prevention is cheaper.