
Prompt caching on Claude Code: the 5-minute TTL and how to actually save money

Published 2026-05-11 by Owner

Claude Code bills you on input tokens. For a long session, the preamble — CLAUDE.md, system prompt, conversation history — gets re-sent on every turn. Without caching, a 10k-token context costs 10k tokens per turn. With caching, the first turn sets a breakpoint and subsequent turns within the cache window pay a fraction of that.

The fraction is meaningful: Anthropic’s cached input price is roughly 10% of the uncached rate on Sonnet 3.7. On a multi-hour session, the delta adds up fast.
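
To put numbers on it, here is a back-of-the-envelope sketch. The multipliers are the relative rates discussed in this post (1x uncached, 1.25x cache write, roughly 0.1x cache read); the session shape is invented, and a real preamble grows as history accumulates rather than staying fixed:

  # Back-of-the-envelope: a 10k-token preamble re-sent over 30 turns.
  # Rates are relative to the base input price; the session shape is
  # illustrative, and a real preamble grows as history accumulates.
  PREAMBLE = 10_000      # stable prefix: CLAUDE.md, system prompt, history
  NEW_PER_TURN = 500     # fresh user content per turn
  TURNS = 30

  UNCACHED, CACHE_WRITE, CACHE_READ = 1.0, 1.25, 0.10

  # Without caching: the full preamble is billed at 1x on every turn.
  without = TURNS * (PREAMBLE + NEW_PER_TURN) * UNCACHED

  # With caching: turn 1 writes the cache at 1.25x; turns 2..N read the
  # prefix at 0.1x and pay 1x only for the new content.
  with_cache = (PREAMBLE * CACHE_WRITE + NEW_PER_TURN * UNCACHED
                + (TURNS - 1) * (PREAMBLE * CACHE_READ + NEW_PER_TURN * UNCACHED))

  print(f"uncached: {without:,.0f}  cached: {with_cache:,.0f}")
  # uncached: 315,000  cached: 56,500 -- under a fifth of the cost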

But the cache isn’t persistent. It has a 5-minute TTL, and most cost surprises in Claude Code trace back to not knowing what that means in practice.

How the cache actually works

When Claude Code sends a request, Anthropic’s API looks at the prefix of the prompt — the part that’s stable across turns. If the prefix matches a previously cached snapshot and the snapshot is still within its TTL, that prefix is served from cache and billed at the reduced rate. Only the new content in the request (the latest user turn) pays full price.
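
As a toy model of that lookup, here is a sketch assuming exact string prefixes and a TTL that refreshes on each hit (the refresh behavior is covered below). The real API works with cache breakpoints, not raw strings, so this is illustrative only:

  # Toy model of prefix caching: an exact-match prefix with a TTL that
  # refreshes on every hit. Billing multipliers are the ones discussed
  # in this post; everything else is a simplification.
  TTL = 5 * 60  # seconds

  class PrefixCache:
      def __init__(self):
          self.prefix = None
          self.expires_at = 0.0

      def request(self, prompt: str, prefix_len: int, now: float) -> str:
          prefix = prompt[:prefix_len]
          if prefix == self.prefix and now < self.expires_at:
              self.expires_at = now + TTL   # a hit refreshes the TTL
              return "hit: prefix billed at ~0.1x, suffix at 1x"
          self.prefix = prefix              # miss: rewrite the cache
          self.expires_at = now + TTL
          return "miss: prefix written at 1.25x, suffix at 1x"

  cache = PrefixCache()
  print(cache.request("SYSTEM...turn 1", 9, now=0))    # miss (cold start)
  print(cache.request("SYSTEM...turn 2", 9, now=120))  # hit (2 min later)
  print(cache.request("SYSTEM...turn 3", 9, now=820))  # miss (TTL expired)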

Claude Code doesn’t expose the raw API caching controls to users — it manages cache breakpoints automatically. What you can control is whether the prompt prefix stays stable enough across turns for the cache to actually hit.

The cache hit information surfaces in Claude Code’s session stats. After a session, the output includes something like:

Cache stats:
  Cache creation tokens: 12,847
  Cache read tokens:     89,203
  Uncached input tokens: 4,112

Cache creation tokens are billed at 1.25x the base rate (you’re paying to write the cache). Cache read tokens are billed at ~0.1x. Uncached input tokens are billed at 1x. A session where cache reads dwarf uncached tokens is a session that’s running efficiently.
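
Plugging the example stats into those multipliers gives a concrete dollar figure. The $3-per-million-token base rate is Claude 3.7 Sonnet's published input price at the time of writing; verify current pricing before leaning on the exact numbers:

  # Billing the example stats at the relative rates above, with a
  # base input price of $3 per million tokens (Claude 3.7 Sonnet's
  # published rate as of this writing).
  BASE = 3.00 / 1_000_000  # dollars per token

  creation, reads, uncached = 12_847, 89_203, 4_112

  actual = (creation * 1.25 + reads * 0.10 + uncached * 1.0) * BASE
  no_cache = (creation + reads + uncached) * BASE  # same tokens, uncached

  print(f"with caching:    ${actual:.4f}")    # ~$0.087
  print(f"without caching: ${no_cache:.4f}")  # ~$0.318
  print(f"reads/uncached ratio: {reads / uncached:.1f}x")  # ~21.7x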

A rough benchmark: if cache read tokens are at least 5x your uncached input tokens across a session, the cache is doing useful work. If they’re closer to 1:1, something is disrupting cache hits.

The 5-minute TTL and what it actually means

The TTL starts when the cached prefix is created and resets on each cache hit. Back-to-back turns — you send a message, Claude responds, you send the next one within a few minutes — stay warm. The prefix is reused and the discounted rate applies.

The problem is gaps. Step away from the keyboard for 10 minutes to run tests, answer a message, or context-switch. When you come back, the cache has expired. The next turn re-sends the full preamble at uncached rates, and a new cache entry is created at the 1.25x creation rate.

This plays out worse than it sounds in practice because CLAUDE.md files, system prompts, and conversation history tend to grow over a session. A 4k-token CLAUDE.md plus a few rounds of conversation context might push the preamble to 15k tokens. Every cold restart pays 15k tokens at full price before the new cache entry gets established.

The pattern worth internalizing: the cost-per-turn during a warm session is low; the cost of the first turn after a cache miss is high. A session with five 10-minute gaps is materially more expensive than a session of the same length without gaps, even if the same work gets done.
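
A sketch of that arithmetic with assumed numbers: a 15k-token preamble held fixed for simplicity, 40 turns of work, and each gap long enough to force a cold restart:

  # Cost effect of gaps: each gap past the TTL forces a full rewrite
  # of the preamble at 1.25x instead of a 0.1x cache read.
  PREAMBLE, NEW, TURNS = 15_000, 500, 40
  WRITE, READ, FULL = 1.25, 0.10, 1.0

  def session_cost(cold_restarts: int) -> float:
      # the preamble is written on the first turn and after each gap,
      # and read from cache on every other turn
      writes = 1 + cold_restarts
      reads = TURNS - writes
      return (writes * PREAMBLE * WRITE
              + reads * PREAMBLE * READ
              + TURNS * NEW * FULL)

  print(f"no gaps:   {session_cost(0):,.0f} token-units")
  print(f"five gaps: {session_cost(5):,.0f} token-units")
  # 97,250 vs 183,500 -- nearly double for the same work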

Patterns that destroy cache hits

Dynamic content near the start of the prompt. If anything in your CLAUDE.md changes between turns — even a date stamp, a counter, or a conditional block that evaluates differently — the prefix diverges and the cache misses. Cache breakpoints work on exact-match prefixes. One character different, no hit.
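
The exact-match behavior is easy to demonstrate. In this toy comparison, a generated date stamp stands in for any dynamic content:

  # Exact-prefix matching: one changed character anywhere in the
  # prefix means the whole cached span misses.
  stable = "# Project conventions\nUse pytest. Prefer small PRs.\n"
  turn_1 = stable + "Generated: 2026-05-11 09:14\n" + "user: fix the auth bug"
  turn_2 = stable + "Generated: 2026-05-11 09:16\n" + "user: now add a test"

  prefix_len = len(stable) + len("Generated: 2026-05-11 09:14\n")
  print(turn_1[:prefix_len] == turn_2[:prefix_len])  # False: miss every turn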

I’ve seen repos where CLAUDE.md reads the git SHA or current branch into some dynamic section. Every turn gets a fresh cache miss. The file looks reasonable from a human perspective but it’s cache poison.

Rotating context injections. Some teams append the current time, sprint name, or random tip-of-the-day to their system prompts. These feel like harmless personalization but they invalidate the cache on every turn.

Frequent task-switching within a session. Claude Code’s context carries all the conversation history. If you work on auth for 20 turns, then switch to a UI task for 20 turns, the context for the auth work stays in the preamble during the UI section. That’s not necessarily bad — it’s just expensive. Switching between unrelated tasks within a single session tends to grow the context faster than staying on one track, which means larger preambles and larger cache misses when the TTL expires.

Very short turns in sequence. Sending a one-line prompt, reading the response, then sending another one-line prompt still pays a cache write for each turn's appended content, so the ratio of billed tokens to useful work gets worse. The minutes spent reading each response between short turns also make it easier to drift past the 5-minute wall without noticing.

A workflow that keeps cache hit rates high

The mental model that helps most: treat the cache like a warm database connection. Once it’s open, keep it open. Queries are cheap. Reconnecting is expensive.

Concretely:

Front-load stable content. Keep CLAUDE.md stable across a session. Do not inject git SHAs, timestamps, or anything that changes turn-to-turn into the top of the file. Volatile instructions (e.g., “today we’re working on the payments module”) belong in your first user turn, not in the static system file.

Keep CLAUDE.md content ordered by stability. The cache breakpoint is anchored to the prefix. Put content that never changes at the top (coding standards, project structure, test conventions) and content that sometimes changes near the bottom, as in the sketch below. A change at the bottom of the file only invalidates the portion after the changed line; a change at the top invalidates everything.
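
A hypothetical layout along those lines (section names invented for illustration):

  ## Coding standards        <- never changes: anchors the cached prefix
  ## Project structure
  ## Test conventions
  ## Active modules          <- changes occasionally
  ## Current focus notes     <- most volatile: keep at the very bottom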

Take fewer, longer turns. One turn with a complete, multi-part request is cheaper than five turns asking for one piece at a time. The preamble costs the same either way, but spreading the work across five turns means five rounds of appended history written to the cache and five windows in which a gap can push you past the TTL.

Batch the planning upfront. Before you start a long task, spend one turn establishing context: the goal, the files involved, the constraints, the approach. This seeds a rich prefix that subsequent turns can cache against. Doing the planning in pieces over 20 minutes means paying cache creation costs repeatedly.

Work one task to completion before switching. Within a session, context-switching between unrelated tasks balloons the context and doesn’t reduce cost the way starting a fresh session would. If a task is truly unrelated to what you’ve been doing, close the session and start fresh rather than appending to a context that’s already carrying irrelevant history.

Know when to close and reopen. If a session’s context has grown to 80k tokens across multiple unrelated tasks and you’re starting something new, a fresh session is cheaper than continuing. The fresh session pays full preamble cost once; continuing the old session pays for 80k tokens of irrelevant history on every subsequent turn.
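
The arithmetic behind that call, under assumed numbers: 80k tokens of stale history, a 15k-token preamble the new task actually needs, 30 more turns, and a cache that went cold at the switch:

  # Continue the bloated session vs. start fresh for a new task.
  STALE, NEEDED, NEW, TURNS = 80_000, 15_000, 500, 30
  WRITE, READ, FULL = 1.25, 0.10, 1.0

  # Continuing: the stale history rides along in every cached read
  # (assume the cache went cold at the task switch, so one full write).
  cont = ((STALE + NEEDED) * WRITE
          + (TURNS - 1) * (STALE + NEEDED) * READ
          + TURNS * NEW * FULL)

  # Fresh session: write the 15k preamble once, then cheap reads.
  fresh = (NEEDED * WRITE
           + (TURNS - 1) * NEEDED * READ
           + TURNS * NEW * FULL)

  print(f"continue: {cont:,.0f}  fresh: {fresh:,.0f}")
  # continue: 409,250  fresh: 77,250 -- even at 0.1x, 80k tokens of
  # dead history on every turn dominates the cost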

Reading the session stats

Claude Code surfaces cache stats at the end of a session and sometimes mid-session on longer runs. The numbers to track:

  • Cache creation tokens — unavoidable on the first turn of a session or after a cold restart. High creation tokens relative to reads means lots of cache misses.
  • Cache read tokens — what you want more of. These are billed at roughly 10% of normal.
  • Uncached input tokens — new content per turn. Should be proportionally small relative to reads in a long session.

If cache creation tokens and cache read tokens are roughly equal across a multi-hour session, the cache is expiring and being recreated frequently. The 5-minute TTL is the usual cause. Pacing turns more tightly, so gaps stay under the TTL, will shift that ratio.

If uncached input tokens are surprisingly high, the prompt prefix isn’t staying stable. Check whether CLAUDE.md has dynamic content, or whether a hook or tool injection is changing something near the top of the context.
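
A small helper that applies both checks; the thresholds are this post's rules of thumb, not anything official:

  # Rule-of-thumb diagnosis of session cache stats. Thresholds are the
  # heuristics from this post, not official guidance.
  def diagnose(creation: int, reads: int, uncached: int) -> list[str]:
      notes = []
      if reads < 5 * uncached:
          notes.append("reads/uncached under 5x: something is disrupting "
                       "hits; check for dynamic content near the top")
      if creation >= 0.8 * reads:
          notes.append("creation ~= reads: the cache keeps expiring; "
                       "tighten the gaps between turns")
      return notes or ["cache looks healthy"]

  print(diagnose(creation=12_847, reads=89_203, uncached=4_112))
  # ['cache looks healthy']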

What this doesn’t fix

Prompt caching reduces the cost of repeating context you’ve already paid for. It doesn’t reduce the cost of genuinely new work. A fresh task that requires reading 15 files, running 20 tool calls, and iterating over several long responses will still be expensive — caching just means the growing conversation history doesn’t multiply that cost.

And caching only helps within a session. Closing Claude Code and reopening it the next day starts from zero. There’s no cross-session cache persistence. If you’re working on the same codebase daily and find yourself paying to re-establish context from scratch every morning, that’s not a caching problem — that’s a context-size problem that better CLAUDE.md structure or project-level memory tooling addresses differently.

The prompt caching story in Claude Code will likely improve as Anthropic extends TTL options and surfaces per-turn cache stats more prominently. The current 5-minute window is a constraint to work within, not a fixed ceiling.