The Flash that got expensive

For two years “Flash” was the word Google used for the model you could call without thinking about the cost. Gemini 3.5 Flash, released at I/O on May 19, keeps the speed and quietly retires the rest of the meaning. It lists at $1.50 per million input tokens and $9 per million output — three times the rate of Gemini 3 Flash Preview and six times 3.1 Flash-Lite. The name still says cheap. The invoice says otherwise.

The number under the headline

The launch coverage led with speed. Google says 3.5 Flash produces output tokens about four times faster than competing frontier models, and the agentic benchmarks it published — 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas — are real improvements. None of that is in dispute. What got buried is that the price moved in the opposite direction from every prior Flash release. We are trained to read “Flash” as a budget signal, so a Flash launch reads as good news for the bill by default. This one isn’t, and the framing is doing a lot of work to keep you from noticing. I saw the launch summarized as a “faster and cheaper” model more than once this week; it is faster, and it is not cheaper, and the gap between those two facts is exactly where overspend lives.

I keep coming back to a point from the honest token bill: the cost of running models is real and rising, and most of the industry prices as though it were flat or falling. A Flash tier that triples in price is the cleanest evidence yet that the cheap default is gone. The thing you reached for precisely because you didn’t have to budget it now needs a budget.

Cheap to use, expensive to run

There is a tell in how Google shipped 3.5 Flash. It went into the free consumer products — the Gemini app, the surfaces ordinary users touch — at the same moment the API rate went up. Writing on launch day, Simon Willison made the point I keep coming back to: the labs are probing the price tolerance of their API customers. Give the model away where it builds habit and market share, charge more where the buyers are businesses that have already wired it into a pipeline and can’t easily rip it out.

That asymmetry matters, because the price signal you feel as a consumer is the opposite of the one you pay as a developer. In the app, 3.5 Flash feels free. In your build, it costs three times what the last Flash did. If your mental model of “Flash” was set by the consumer experience — and Google is working to set it there — you will under-budget the API by a factor of three and not know why the monthly number climbed.

Put a number on it. A team that scoped a feature against Gemini 3 Flash Preview at, say, $400 a month for an agent pipeline does not get a $400 bill on 3.5 Flash; it gets a $1,200 bill for the identical workload, before anyone has changed a single prompt. Nobody approved that increase. It rode in on a model upgrade that the changelog framed as an improvement, and the line item that moved was the one nobody re-checks after the model name stays the same word — “Flash.” The upgrade was opt-out by default in every place 3.5 Flash became the recommended Flash tier, and opt-out pricing changes are the ones that hit the quarterly review as a surprise.

The speed trap

Here is the part that actually worries me, because it is structural rather than a sticker price. A faster model invites more calls. When output streams back four times quicker, the friction that used to make you think before firing a request drops, and the natural response is to fire more requests. So the per-call cost rises at the exact moment the tool is encouraging you to make more of them. Those two curves multiply. A model that is 3× more expensive per call and used twice as often because it feels snappy is not 3× your old bill — it is 6×, and the speed is what hides the second factor.

I have watched this happen in my own usage within a day of a faster model landing. I stop batching. I stop pruning the prompt. I re-run things I would previously have gotten right the first time, because the re-run is instant and the cost of being sloppy is invisible until the statement arrives. Speed is a genuine feature and also a behavioral tax, and the tax is collected per token at the new higher rate.

The steelman, which is real

The case for paying it is not nothing. For latency-bound work — an interactive agent loop, a tool a human is sitting and waiting on, a pipeline where wall-clock time is the binding constraint — four times the throughput can pay for three times the rate and then some. If the faster model keeps one engineer in flow instead of context-switching while a slower model grinds, the math favors the expensive Flash easily. I am not arguing 3.5 Flash is overpriced for what it does. I am arguing that “what it does” now demands the same cost discipline you would apply to a frontier model, and the name actively discourages that discipline.

What I changed

So I stopped pricing models by their headline rate and started pricing them by cost per task: how much it actually takes to get one useful unit of work done, runs and retries included. That number is the only one that survives contact with a faster model, because it captures the call-volume effect the per-token rate hides. And I retired “Flash” as a synonym for “cheap” in my own head — it is a speed tier now, nothing more, and treating it as a budget tier is how the bill gets away from you. The next thing 3.5 Flash plugs into is an environment built to run agents unattended and in parallel, which is where the call-volume problem stops being a personal habit and becomes an agent that runs while you sleep. For the release details, see Google ships Gemini 3.5 Flash and Antigravity 2.0; for how the vendors converged on this pricing playbook, AI tool pricing convergence.