Token economics for AI coding: per-model cost curves and where they break your budget
Published 2026-05-11 by Owner
Token pricing for AI coding is no longer a footnote. A well-configured autonomous session on a flagship model can spend $5–10 without doing anything obviously wrong. That’s fine if you know what you’re buying. It’s less fine when the spend is invisible until the end-of-month invoice arrives and you notice that three weeks of casual AI-assisted coding cost as much as a SaaS subscription you’d scrutinize.
This is a breakdown of where the money goes — per model, per turn, and per session — with enough concrete detail to make intentional tradeoffs rather than guess-and-check adjustments.
2026 price landscape
Prices shift constantly, so the useful unit is ratios rather than absolute figures. As of mid-2026, rough per-million-token costs (input / output):
| Model | Input | Output | Ratio vs. Haiku |
|---|---|---|---|
| Claude Opus 4.7 | ~$15 | ~$75 | 15× input, 15× output |
| Claude Sonnet 4.6 | ~$3 | ~$15 | 3× input, 3× output |
| Claude Haiku 4.5 | ~$1 | ~$5 | baseline |
| GPT-5 | ~$5 | ~$20 | 5× input, 4× output |
| Open source (self-hosted) | infra only | infra only | ~0 marginal |
The output/input ratio is what matters in autonomous coding. Opus charges 5× more for output than input. Every file the model writes, every tool call response it generates, every line of explanation it produces — those are output tokens. On Haiku, the same ratio holds but at a price floor where it barely registers. On Opus, it concentrates cost onto the most common operations in an agentic loop.
One asymmetry worth understanding: thinking tokens on models that support extended reasoning (Opus 4.7, comparable o-series models) count as output tokens. A single planning pass with a generous thinking budget can generate 10,000–30,000 tokens before a single word of visible code appears. At Opus output pricing, 20,000 thinking tokens cost $1.50. That’s one planning step.
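To make the ratios concrete, here is a minimal cost helper using the approximate prices from the table above. The model keys are shorthand, and thinking tokens are folded into output, matching how extended-reasoning models bill them:

```python
# Approximate $/M-token prices from the table above (input, output).
PRICES = {
    "opus-4.7":   (15.0, 75.0),
    "sonnet-4.6": (3.0, 15.0),
    "haiku-4.5":  (1.0, 5.0),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 thinking_tokens: int = 0) -> float:
    """Dollar cost of one request; thinking tokens bill as output."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price
            + (output_tokens + thinking_tokens) * out_price) / 1_000_000

# The planning pass from above: 20k thinking tokens, no visible output yet.
print(f"${request_cost('opus-4.7', 0, 0, 20_000):.2f}")  # -> $1.50
```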
Open-source self-hosted models (Qwen-72B, Llama-3-70B, DeepSeek Coder V3, and similar) have near-zero marginal cost once the inference server is running. The economics invert: cost is a sunk infrastructure expense, and the binding constraint becomes throughput and latency, not per-token spend. For teams already running GPU inference, routing lower-stakes coding work to self-hosted models is straightforward. The tradeoff is a lower quality ceiling plus the setup and maintenance burden, not economics.
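For the self-hosted case, the relevant arithmetic is a break-even volume rather than a per-request cost. A rough sketch, with entirely illustrative infrastructure and blended-price figures:

```python
GPU_MONTHLY = 1_800.0    # assumed flat monthly cost of one dedicated inference box
BLENDED_API_PER_M = 6.0  # assumed blended input+output $/M on a mid-tier API model

def breakeven_tokens_per_month() -> float:
    """Token volume above which the flat GPU cost beats per-token API spend."""
    return GPU_MONTHLY / BLENDED_API_PER_M * 1_000_000

print(f"{breakeven_tokens_per_month():,.0f}")  # -> 300,000,000 tokens/month
```

Below that volume the API is cheaper; above it, the box pays for itself and the constraint shifts to throughput, as noted above.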
How cost scales with tool turns
This is the part most explanations skip, and it’s the one that produces the most surprise when the bill arrives.
A single autonomous task isn’t one request — it’s a sequence of turns: read file, analyze, propose change, write file, run tests, read test failure, fix, run tests again. Each turn sends the full conversation history to the model. That history grows with every turn.
If turn 1 sends 10k tokens, turn 2 sends 13k (10k history plus 3k new context), turn 3 sends 16k, and so on, the cost is triangular, not linear. A 10-turn session with 3k tokens of new context per turn looks like this:
| Turn | Input tokens | Relative cost |
|---|---|---|
| 1 | 10k | 1.0× |
| 3 | 16k | 1.6× |
| 5 | 22k | 2.2× |
| 7 | 28k | 2.8× |
| 10 | 37k | 3.7× |

Total input: ~235k tokens. Naive estimate: ~100k tokens (10 × 10k flat). Actual multiplier: 2.35×.
The actual multiplier grows with session length. A 20-turn session with the same growth rate hits a 4× multiplier. A session where tool outputs are verbose — full grep results, complete file contents, long test output — might add 8k–15k tokens per turn instead of 3k, which pushes the multiplier toward 6× or higher.
This also means that task complexity multiplies cost nonlinearly. At the growth rate above, a task that takes 8 turns costs about 40% more than two 4-turn tasks producing the same output (164k vs. 116k input tokens), because the longer task carries the full accumulated context into each of the later turns; with verbose tool outputs the gap widens toward 2×. Short, focused tasks with clear stopping points are cheaper than one long session covering the same ground.
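The whole model fits in a few lines. This sketch reproduces the 10-turn numbers above and the split-task comparison, assuming the same 10k base and 3k-per-turn growth:

```python
def session_input_tokens(turns: int, base: int = 10_000, growth: int = 3_000) -> int:
    """Total input tokens when every turn resends history plus new context."""
    return sum(base + growth * t for t in range(turns))

print(session_input_tokens(10))            # 235,000, matching the table above
print(session_input_tokens(10) / 100_000)  # 2.35x over the naive flat estimate

# Splitting one long task into two short ones sheds accumulated context:
print(session_input_tokens(8))             # 164,000
print(2 * session_input_tokens(4))         # 116,000, roughly 30% cheaper
```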
One project’s monthly bill, broken down
A real month of daily coding on a TypeScript monorepo using Sonnet 4.6 — approximately 25 sessions of 20–30 minutes each, a mixture of feature work and bug fixes:
Preamble overhead (~22% of total): Each session begins with the tool loading the project’s context file, scanning relevant directory structure, and reading 2–5 files before the first substantive output. The project’s context file ran to about 900 words. Loading it 25 times costs a few cents at Sonnet pricing. But combined with file reads, directory scans, and configuration loading, preamble accounted for roughly a fifth of the monthly spend. The surprise wasn’t the total — it was how much of it was redundant. Files were read in full when only a function signature was needed.
Thinking tokens (~31% of total): Extended reasoning was on by default, with no budget cap set. For complex tasks — architectural changes, debugging multi-file failures — the model spent 8,000–25,000 thinking tokens before producing output. At Sonnet output pricing of $15/1M, a session with two complex tasks averaging 15,000 thinking tokens each costs about $0.45 in thinking alone, before any code generation. Across 25 sessions, thinking tokens were the single largest cost category.
Tool outputs in context (~28% of total): This was the most surprising category. Grep results across a large codebase return hundreds of matching lines. File reads include every comment, blank line, and import declaration even when only one function’s body mattered. Running tests returns the full test runner output — 200+ lines including passing tests, timing data, and coverage summaries — when only the failure message was needed. Each of these outputs lands in context and stays there for the rest of the session. Over 25 sessions, verbose tool outputs cost nearly as much as code generation itself.
Code generation (~19% of total): The thing that looks like “the work” — files written, diffs applied, suggestions accepted — was the smallest cost category. The expensive parts are setup and observation, not output.
The implication is that optimizing for fewer output tokens (shorter generated code, more concise suggestions) has lower ROI than optimizing for slower context growth. Smaller tool outputs, tighter preambles, and capped thinking budgets compound across every turn of every session.
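A quick comparison makes the ROI claim concrete. With illustrative Sonnet-tier parameters, cutting per-turn context growth by a third saves roughly twice as much as cutting output length by a third:

```python
PRICE_IN, PRICE_OUT = 3.0, 15.0  # Sonnet-tier $/M from the table above

def session_cost(turns=10, base=10_000, growth=3_000, out_per_turn=1_500) -> float:
    """Dollar cost of one session under the triangular context-growth model."""
    input_tokens = sum(base + growth * t for t in range(turns))
    return (input_tokens * PRICE_IN + turns * out_per_turn * PRICE_OUT) / 1_000_000

print(session_cost())                    # $0.93 baseline
print(session_cost(out_per_turn=1_000))  # $0.86: 33% shorter output saves ~8%
print(session_cost(growth=2_000))        # $0.80: 33% slower growth saves ~15%
```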
The three biggest cost sinks
Large preambles. A project context file that describes the full stack, all conventions, error-handling philosophy, deployment process, and test patterns is more expensive than it looks at any individual load. At Opus pricing, loading a ~3,000-token preamble 20 times in a day is 60,000 input tokens daily — about $0.90. At Sonnet that's $0.18, which sounds fine until it's a month of sessions on a team of four.
The fix isn’t to delete the context file; it’s to keep the always-loaded portion under 500 words and move detailed, rarely-needed content to per-file or per-directory rules that only load when actually relevant. Some tools support this natively with glob-scoped rules. Where they don’t, explicit structure in the main file — “if working on the payment module, read docs/payment-architecture.md” — achieves a partial effect.
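The savings from trimming are easy to estimate. A sketch at Sonnet input pricing, assuming ~22 working days per month and that a 500-word trim is roughly 650 tokens:

```python
SONNET_IN = 3.0  # $/M input tokens, from the table above

def preamble_cost(tokens: int, loads_per_day: int, days: int = 22) -> float:
    """Monthly input cost of an always-loaded context file."""
    return tokens * loads_per_day * days * SONNET_IN / 1_000_000

print(f"${preamble_cost(3_000, 20):.2f}")  # full preamble: ~$3.96/month per person
print(f"${preamble_cost(650, 20):.2f}")    # trimmed:       ~$0.86/month per person
```

Multiply by team size and model tier, and the always-loaded portion is the one piece of context paid for in every single session.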
Thinking tokens. Extended reasoning on a task that doesn’t need it is expensive and slow. A 20,000-token thinking pass on “rename this function across the codebase” wastes $1.50 on Opus ($0.30 on Sonnet) and adds 30–60 seconds of latency for no benefit. Most tools that expose thinking mode let you cap the budget per request or disable it session-wide.
The practical default: thinking-off for mechanical tasks (renaming, reformatting, boilerplate generation), thinking-on for debugging non-obvious failures and architectural decisions. Models with per-request thinking control give finer-grained management than session-level toggles.
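With the Anthropic SDK, per-request thinking control looks like the sketch below. The model ids and prompts are placeholders; the relevant part is that thinking is omitted entirely for the mechanical task and hard-capped for the hard one:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Mechanical task: no extended thinking (the default when the parameter is omitted).
rename = client.messages.create(
    model="claude-sonnet-4-6",  # placeholder model id
    max_tokens=2_000,
    messages=[{"role": "user",
               "content": "Rename fetchUser to loadUser in this diff: ..."}],
)

# Non-obvious debugging: thinking on, but with a hard budget so it can't run away.
debug = client.messages.create(
    model="claude-opus-4-7",    # placeholder model id
    max_tokens=8_000,           # must exceed the thinking budget, which bills as output
    thinking={"type": "enabled", "budget_tokens": 4_000},
    messages=[{"role": "user",
               "content": "Why does this test fail only under --parallel?"}],
)
```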
Tool-spew. Verbose grep results, complete file reads when only a function signature was needed, full test output when only pass/fail mattered — these inflate context at the fastest rate of any single factor. Where tools allow output configuration, use it: `grep -l` to return file names only, `head` or line-range reads instead of full-file reads.
Where the tool doesn’t expose this, instructions in the prompt help: “list only files that match, not line content” or “show only the failing test names and error messages, not the full output.” This requires some experimentation per tool, but halving the average tool output size effectively halves the context growth rate for the session.
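Where the integration lets you post-process tool results before they enter the conversation, a small clamp goes a long way. A minimal sketch that keeps failure lines preferentially and truncates the rest:

```python
def clamp_tool_output(text: str, max_lines: int = 40) -> str:
    """Trim a tool result before it lands in context for the rest of the session."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    # Prefer lines that look like failures; otherwise keep head and tail.
    errors = [ln for ln in lines if "FAIL" in ln or "Error" in ln]
    if errors:
        return "\n".join(errors[:max_lines])
    half = max_lines // 2
    return "\n".join(lines[:half] + ["... <truncated> ..."] + lines[-half:])
```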
A budget-tracking habit
The model provider’s usage dashboard updates daily, sometimes near-real-time. Checking it takes about 30 seconds. The useful signal isn’t the absolute dollar amount — it’s deviation from the expected range.
A session that should cost $0.60 showing $4.20 means something specific went wrong: thinking ran uncapped, a tool returned a massive payload that propagated through the rest of the session, or the agent looped on a fixable error for many extra turns. All three are diagnosable if caught the same day. After a week, the session is too far back to reconstruct reliably.
The habit that works: check the usage dashboard at the end of each working day, not after each session. Daily granularity is enough to catch drift. If the 7-day rolling average climbs without a proportional increase in work shipped, something changed — a new context rule loading more files, a switch to a more expensive model, a task type that doesn’t suit the agent loop well.
Monthly budget caps at the API level are useful circuit breakers, but they’re not a feedback mechanism. Hitting the cap stops the bleeding; it doesn’t explain what caused it. Daily checks are the feedback mechanism. The combination of both — a hard cap so a runaway session can’t drain a month’s budget overnight, plus a daily review to catch gradual drift — covers the two distinct failure modes: sudden spikes and slow accumulation.
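The daily check is scriptable if the provider exposes a usage export. A sketch of the drift signal, where the spend figures are made up and the 50% threshold is a starting point rather than a rule:

```python
from statistics import mean

def drift_alert(daily_spend: list[float], window: int = 7,
                threshold: float = 1.5) -> bool:
    """Flag when the latest week's average spend exceeds the prior week's by 50%."""
    if len(daily_spend) < 2 * window:
        return False
    recent = mean(daily_spend[-window:])
    prior = mean(daily_spend[-2 * window:-window])
    return recent > prior * threshold

print(drift_alert([1.1, 0.9, 1.2, 1.0, 1.1, 0.8, 1.0,    # prior week: ~$1.01/day
                   1.4, 2.1, 1.9, 2.3, 2.0, 2.2, 2.4]))  # this week: ~$2.04/day -> True
```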
One rough calibration point: a productive 25-minute session on Sonnet 4.6, moderate task complexity (3–5 files changed, 10–15 tool calls, 1–2 test/fix cycles) should land between $0.40 and $1.20. Sessions consistently above $2.00 on Sonnet without correspondingly complex work are worth investigating — the context is probably growing faster than it needs to.
The same session profile on Opus 4.7 multiplies that range by roughly 5×, which is why Opus is best reserved for tasks where the quality difference is perceptible and material, not used as the default model for everything. For routine tasks — familiar patterns, well-scoped changes, mechanical edits — Haiku or Sonnet will produce the same result at a fraction of the cost. The tier system only makes economic sense if the routing is intentional rather than defaulted to the highest-capability option available.
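One way to make the routing intentional rather than defaulted is to force a tier decision at dispatch time. A sketch in which the task labels and the tier map are assumptions; the point is that unknown work falls back to the mid tier, never the top one:

```python
TIER = {
    "mechanical":    "haiku-4.5",   # renames, reformatting, boilerplate
    "routine":       "sonnet-4.6",  # well-scoped features, familiar bug fixes
    "architectural": "opus-4.7",    # cross-cutting design, non-obvious debugging
}

def pick_model(task_kind: str) -> str:
    """Route by task type; unclassified work gets the mid tier, not the top one."""
    return TIER.get(task_kind, "sonnet-4.6")
```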