Using Gemini 3.5 Flash without surprise bills
Published 2026-05-22 by Owner
Gemini 3.5 Flash is available in three places a developer is likely to reach it: the GitHub Copilot model picker, Google AI Studio, and Antigravity 2.0. It is fast, and at $1.50 per million input tokens and $9 per million output it costs three times the Flash it replaced. The goal here is to use it on purpose — where the speed pays for itself — and to bound what it can cost when it does not.
Price the task, not the token
The number that matters is cost per completed task, not the per-token rate. A rough estimate for one agent turn, using the published rates:
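# Estimate the USD cost of one agent turn at the published per-million rates.
IN_TOKENS=12000   # input tokens in a typical turn
OUT_TOKENS=3000   # output tokens in a typical turn
# Multiply before dividing so bc's fixed scale does not truncate small token counts to zero.
echo "scale=4; ($IN_TOKENS*1.50 + $OUT_TOKENS*9.00)/1000000" | bc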
Run that for a typical turn, then multiply by how many turns a real task takes, including retries and the turns you fire only because the response is fast. That product is the figure to compare against a slower, cheaper model. At equal tokens per turn, a 3.5 Flash call that costs three times a 3.1 Flash-Lite call has to finish the task in fewer than a third of the turns to come out ahead on tokens alone. Halving the turns still leaves the task at 1.5 times the cost, and if the speed tempts you into extra turns, the gap widens.
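To make the per-task comparison concrete, here is a minimal sketch that multiplies the per-turn estimate by a turn count for two models. The function name task_cost, the turn counts, and the cheaper model's rates are all assumptions for illustration: the cheaper rates are simply one third of the 3.5 Flash rates to match the three-times ratio, not a published price. It uses awk rather than bc so the output keeps a leading zero.

```shell
# Cost in USD of one task: per-turn token cost times the number of turns.
# Args: input tokens per turn, output tokens per turn,
#       USD per million input tokens, USD per million output tokens, turns.
task_cost() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_rate="$3" -v out_rate="$4" -v turns="$5" \
    'BEGIN { printf "%.4f\n", (in_tok * in_rate + out_tok * out_rate) * turns / 1000000 }'
}

# 3.5 Flash at the published rates, finishing in 4 turns (turn count is illustrative).
FAST=$(task_cost 12000 3000 1.50 9.00 4)
# Placeholder cheaper model at one third of those rates, needing 8 turns.
CHEAP=$(task_cost 12000 3000 0.50 3.00 8)

echo "3.5 Flash:     \$$FAST per task"
echo "cheaper model: \$$CHEAP per task"
```

With these illustrative numbers the faster model comes to 0.1800 per task and the cheaper one to 0.1200, so halving the turn count does not by itself cover a threefold price difference.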
Select it deliberately
In Copilot, the model is chosen per session in the chat model picker; selecting Gemini 3.5 Flash there does not change your org’s base model. In Google AI Studio and the Gemini API, set the model explicitly on the request rather than relying on whatever the default is:
model: gemini-3.5-flash
Confirm that exact identifier against Google’s current API docs before wiring it into code — the string above follows Google’s naming convention but is not a guarantee. The point either way is to name the model on every request, so a default flip on Google’s side never silently moves your spend.
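As a rough sketch of naming the model on every request, the shell functions below build a Gemini API call from an explicit model argument and refuse to run without one. The endpoint path and the x-goog-api-key header are taken from Google's public REST documentation at the time of writing and should be verified before use. GEMINI_BASE, gemini_url, gemini_generate, and the GEMINI_API_KEY variable are illustrative names, not part of any official tooling.

```shell
# Base URL assumed from Google's public Gemini REST docs; verify before use.
GEMINI_BASE="https://generativelanguage.googleapis.com/v1beta/models"

# Print the generateContent URL for an explicitly named model.
# Fails instead of falling back to any default when the model is empty.
gemini_url() {
  if [ -z "$1" ]; then
    echo "refusing to build a request without an explicit model" >&2
    return 1
  fi
  printf '%s/%s:generateContent\n' "$GEMINI_BASE" "$1"
}

# Send one prompt to the named model (hypothetical helper; needs curl and GEMINI_API_KEY).
# The prompt is embedded in the JSON as-is, so keep it free of quotes and
# backslashes in this sketch.
gemini_generate() {
  url=$(gemini_url "$1") || return 1
  curl -sS -X POST "$url" \
    -H "Content-Type: application/json" \
    -H "x-goog-api-key: ${GEMINI_API_KEY}" \
    -d "{\"contents\":[{\"parts\":[{\"text\":\"$2\"}]}]}"
}
```

A call would then look like gemini_generate gemini-3.5-flash "Summarize this diff", with the model identifier visible at every call site.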
Cap what it can spend
Set a hard ceiling where the spend actually happens. In a script that drives the API, refuse to start a run that the estimate says could exceed your per-task budget:
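MAX_USD_PER_RUN=0.50
# Same per-turn estimate as above; multiply by your expected turn count for a multi-turn run.
EST=$(echo "scale=4; (12000*1.50 + 3000*9.00)/1000000" | bc)
# awk compares the two values numerically; abort before any API call if the estimate is over budget.
awk -v e="$EST" -v m="$MAX_USD_PER_RUN" 'BEGIN{exit !(e<=m)}' || { echo "over budget" >&2; exit 1; }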
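A pre-flight check only covers the run you planned, not the extra turns a fast model invites. One way to extend it, sketched here under assumptions, is to re-check the ceiling before every turn against a running total. The names can_afford_turn, record_turn, TURN_USD, and SPENT are hypothetical, and the per-turn figure is the same estimate as above rather than a measured cost.

```shell
MAX_USD_PER_RUN=0.50   # same ceiling as the pre-flight check
TURN_USD=0.045         # estimated cost of one turn at the published rates
SPENT=0                # running estimated total for this run

# Succeeds only if one more turn keeps the run at or under the ceiling.
# The 1e-9 term is a small tolerance for floating-point rounding.
can_afford_turn() {
  awk -v s="$SPENT" -v t="$TURN_USD" -v m="$MAX_USD_PER_RUN" \
    'BEGIN { exit !(s + t <= m + 1e-9) }'
}

# Add one turn's estimated cost to the running total.
record_turn() {
  SPENT=$(awk -v s="$SPENT" -v t="$TURN_USD" 'BEGIN { printf "%.4f", s + t }')
}

TURNS=0
while can_afford_turn; do
  # A real script would make the API call here.
  record_turn
  TURNS=$((TURNS + 1))
done
echo "stopped after $TURNS turns, estimated spend \$$SPENT"
```

With these numbers the loop stops after 11 turns at an estimated 0.4950, because a twelfth turn would cross the 0.50 ceiling. Replacing the fixed TURN_USD with the token counts the API reports per response would make the total track actual usage.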
For Copilot, the budget lever is premium-request limits rather than a per-call estimate; the mechanics are in managing Copilot costs.
When 3.5 Flash earns its price
Reach for it when wall-clock latency is the binding constraint: an interactive agent loop a human is waiting on, or a pipeline where four-times-faster output shortens a critical path that costs you more than the token premium. Reach for a cheaper or older model for bulk, low-stakes, non-interactive work where nobody is watching the clock: overnight batch jobs, large low-priority backfills, anything where latency is free. The heuristic that holds up: if a human is actively waiting on the output, the speed is probably worth the premium; if only a machine is waiting and nothing downstream is blocked on it, the speed is almost never worth it, because a machine does not mind waiting and the premium is pure loss. A nightly job that runs in eight minutes instead of thirty has saved nobody anything and paid three times the token price to do it. The general decision framework is in the AI model selection decision.
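The heuristic can be written as a break-even check: the speed pays off only when the value of the time saved exceeds the extra token cost. The sketch below is one way to express that. The function name speed_pays_off, the per-minute value of waiting, and the per-task costs are all assumed figures for illustration, not measurements.

```shell
# Succeeds when the value of the time saved exceeds the extra token cost.
# Args: minutes saved, USD value of one minute of waiting,
#       fast-model USD per task, cheap-model USD per task.
speed_pays_off() {
  awk -v mins="$1" -v per_min="$2" -v fast="$3" -v cheap="$4" \
    'BEGIN { exit !(mins * per_min > fast - cheap) }'
}

# Interactive loop: a person (valued at an assumed 1.00 USD per minute) saves 2 minutes.
if speed_pays_off 2 1.00 0.18 0.06; then
  echo "interactive loop: worth the premium"
fi

# Nightly batch: 22 minutes saved (30 down to 8), but nobody is waiting,
# so a minute of waiting is valued at 0.
if ! speed_pays_off 22 0 0.18 0.06; then
  echo "nightly batch: not worth the premium"
fi
```

The only input that changes between the two cases is the value placed on a minute of waiting, which is the point of the heuristic.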
Watch the number that arrives
Speed encourages more calls, so whatever you expect to save per task can be erased by call volume. Track spend weekly, not monthly: a faster model changes your habits within days, and a monthly invoice is too slow a feedback loop to catch the drift before it compounds. If the weekly number climbs while your output does not, the speed is spending your budget for you. The argument for why the volume effect is the part that bites is in “the Flash that got expensive”; the release details are in “Google ships Gemini 3.5 Flash and Antigravity 2.0”.
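A weekly number needs a log to come from. As a minimal sketch, the functions below append each call's estimated cost to a CSV tagged with the ISO week and then total it per week. The file name spend.csv, its two-column layout, and the function names log_spend and weekly_spend are assumptions for illustration; a provider's billing export, where available, is the more reliable source.

```shell
# Hypothetical log file: one "ISO-week,usd" line per call.
SPEND_LOG="${SPEND_LOG:-spend.csv}"

# Append one call's estimated cost, tagged with the current ISO week (for example 2026-W21).
log_spend() {
  printf '%s,%s\n' "$(date +%G-W%V)" "$1" >> "$SPEND_LOG"
}

# Print "week total-usd call-count" per week, oldest first.
weekly_spend() {
  awk -F, '{ usd[$1] += $2; calls[$1]++ }
           END { for (w in usd) printf "%s %.2f %d\n", w, usd[w], calls[w] }' \
    "$SPEND_LOG" | sort
}
```

Calling log_spend 0.045 after each request and weekly_spend once a week gives both the spend and the call count. The call count is the column that shows whether volume is what is driving the total.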