For users on bring-your-own-key (BYOK) setups, every token an AI tool consumes is a direct dollar cost. Reducing token usage saves money, and there’s an obvious temptation to optimize aggressively.
The trap: over-optimizing tokens often costs more than it saves. The right balance is non-trivial.
What costs tokens
The big sources of token usage:
Repeatedly loaded files. Every chat turn resends the conversation context, including any pinned or recently loaded files, so long sessions accumulate a lot of repeated content.
Auto-context loading. Tools like Cline auto-load files based on heuristics. Sometimes useful; sometimes wasteful.
Verbose system prompts. The base system prompt for most tools runs to hundreds or thousands of tokens, and custom instructions add more.
Long conversations. Each turn resends all prior turns, so the 50th message carries the 49 before it (or summaries of them). The sketch after this list shows how fast that compounds.
Tool outputs. When the AI runs tools (file reads, shell commands), the outputs become context for the next turn.
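To make the long-conversation cost concrete, here is a back-of-envelope sketch in Python. The 500-tokens-per-turn figure is a purely illustrative assumption; real turns vary widely.

```python
# Why long chats get expensive: every turn resends the accumulated history,
# so cumulative billed input tokens grow roughly quadratically with turn count.
PER_TURN_TOKENS = 500  # illustrative assumption

def cumulative_input_tokens(turns: int) -> int:
    # Turn t resends everything added in turns 1..t.
    return sum(PER_TURN_TOKENS * t for t in range(1, turns + 1))

print(cumulative_input_tokens(10))  # 27,500 input tokens billed across 10 turns
print(cumulative_input_tokens(50))  # 637,500 across 50 turns; fresh chats pay off
```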
What helps without hurting
A few optimizations that save tokens without degrading quality:
Aggressive .gitignore-style exclusions. Generated code, build outputs, vendored dependencies — exclude them from indexing so tools won’t load them. Savings can be 20-40% on monorepos.
New chat for new tasks. Don’t keep one chat going across multiple tasks. Each task starts fresh and doesn’t carry the previous task’s baggage.
Shorter system prompts. Audit your custom instructions. Remove fluff. Keep the rules; cut the explanations.
Smaller models for simple tasks. Don’t use Claude Sonnet for tasks Haiku can handle; the per-token savings are roughly 10x.
Caching where supported. Anthropic’s prompt caching and OpenAI’s automatic prompt caching make repeated context much cheaper; tools that use caching well cost noticeably less for long-lived context (see the sketch after this list).
These optimizations are pure wins.
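To make the caching point concrete, here is a minimal sketch using Anthropic’s prompt caching through the anthropic Python SDK. The model alias, file path, and prompt text are illustrative assumptions; in practice your tool issues these requests for you, but this is what “uses caching well” looks like at the API level.

```python
# Minimal prompt-caching sketch with the anthropic Python SDK.
# Marking a large, stable context block as cacheable lets later requests that
# reuse the same prefix be billed at the much cheaper cache-read rate.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

pinned_context = open("docs/architecture.md").read()  # hypothetical pinned file

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model alias
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a coding assistant for this repository."},
        {
            "type": "text",
            "text": pinned_context,
            "cache_control": {"type": "ephemeral"},  # cache the prefix up to here
        },
    ],
    messages=[{"role": "user", "content": "Summarize the build pipeline."}],
)

# usage reports cache_creation_input_tokens / cache_read_input_tokens, so you can
# verify that repeated turns are actually hitting the cache.
print(response.usage)
```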
What hurts when over-optimized
Several optimizations save tokens but hurt output quality:
Removing context the model needs. “I’ll save tokens by not pinning the type definitions.” The model writes code with wrong types; you spend more time correcting. Net negative.
Smaller context windows than the task needs. Truncating context to save tokens means the model misses important information and produces wrong output.
Cheaper models for harder tasks. Using Haiku for tasks Claude Sonnet would handle correctly. The cheaper model produces wrong output; iteration costs more than the model price difference.
Brief, ambiguous prompts to save input tokens. This saves a few tokens per prompt but produces vague output that needs more iteration.
Skipping tool outputs. Some tools let you suppress verbose tool outputs. Sometimes this hurts the model’s understanding of what’s happening.
These all save direct token cost but increase total cost (tokens + your time + iteration).
The real metric
The metric that matters isn’t tokens; it’s tokens-per-useful-output. A session that costs $5 and produces 800 lines of working code is more efficient than a session that costs $2 and produces 200 lines you have to rewrite.
Optimizing for raw token cost is a local minimum. Optimizing for tokens-per-useful-output is the actual goal.
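As arithmetic, the comparison reads like this (a trivial sketch; “useful” here means lines that survived review):

```python
# Dollars per line of code that survived review, for the two sessions above.
def cost_per_useful_line(api_cost_dollars: float, useful_lines: int) -> float:
    return api_cost_dollars / useful_lines

print(f"${cost_per_useful_line(5.00, 800):.4f}/line")  # $0.0063: the "expensive" session
print(f"${cost_per_useful_line(2.00, 200):.4f}/line")  # $0.0100: the "cheap" session
```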
A specific example
I tested two configurations on the same task:
Aggressive cost-saver:
- Smaller context (skip pinning, hope auto-context works)
- GPT-4o-mini for everything
- Brief prompts
Result: 3 hours of work, $2.40 in API costs, and output quality poor enough that I had to iterate roughly 3x more than I’d want.
Reasonable balance:
- Pinned 4-5 relevant files
- Claude 3.5 Sonnet for the work
- Clear, specific prompts
Result: 1.5 hours of work, $4.80 in API costs, and most outputs were merge-ready on the first try.
The “expensive” version cost twice as much in API spend but saved 1.5 hours of my time. At any reasonable hourly rate, the trade is overwhelmingly in its favor.
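Putting numbers on that trade, with a hypothetical hourly rate (substitute your own):

```python
# Total cost of each configuration = API spend + value of the engineer time it consumed.
HOURLY_RATE = 75.0  # hypothetical; use your own number

def total_cost(api_dollars: float, hours: float) -> float:
    return api_dollars + hours * HOURLY_RATE

print(f"Aggressive cost-saver: ${total_cost(2.40, 3.0):.2f}")  # $227.40
print(f"Reasonable balance:    ${total_cost(4.80, 1.5):.2f}")  # $117.30
```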
What I do
The pattern that’s worked for me:
- Pin the files I know are relevant. Don’t worry about saving tokens on this.
- Use the strongest model that’s reasonable for the task.
- Write specific prompts. Don’t truncate to save input tokens.
- Start fresh chats when topics change.
- Use cheap models only for tasks where they clearly suffice (commit messages, simple chat questions); see the routing sketch after this list.
This isn’t the cheapest possible setup. It’s the most effective. The cost is a few extra dollars per day; the benefit is hours of saved time.
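The routing idea can be made explicit. A hedged sketch with illustrative task categories and model aliases (adapt both to your own tooling):

```python
# Route tasks to models by difficulty instead of always picking the cheapest.
# Task names and model aliases are illustrative assumptions.
MODEL_FOR_TASK = {
    "commit_message": "claude-3-5-haiku-latest",   # cheap model clearly suffices
    "simple_question": "claude-3-5-haiku-latest",
    "refactor": "claude-3-5-sonnet-latest",        # needs the stronger model
    "new_feature": "claude-3-5-sonnet-latest",
}

def pick_model(task_type: str) -> str:
    # Default to the stronger model when unsure: wrong output costs more than tokens.
    return MODEL_FOR_TASK.get(task_type, "claude-3-5-sonnet-latest")

print(pick_model("commit_message"))  # cheap model
print(pick_model("new_feature"))     # strong model
```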
Token-conscious without being penny-wise
There’s a middle ground that’s reasonable:
- Don’t load the entire codebase if you don’t need it
- Don’t keep stale context around
- Don’t use the most expensive model for trivial tasks
- Don’t write 5000-token prompts when 500 would work
These are reasonable. The mistake is going further: aggressively optimizing past the point where output quality drops.
The threshold is “would the savings cover 5 minutes of my time?” If yes, the optimization is worth it. If no, it’s penny-wise.
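The same threshold as arithmetic, with a hypothetical hourly rate and savings figures:

```python
# An optimization is worth doing only if the tokens it saves are worth more than
# roughly five minutes of your time. All figures here are hypothetical.
HOURLY_RATE = 75.0
THRESHOLD = HOURLY_RATE * 5 / 60  # about $6.25

def worth_optimizing(expected_savings_dollars: float) -> bool:
    return expected_savings_dollars >= THRESHOLD

print(worth_optimizing(12.00))  # True: worth it
print(worth_optimizing(0.80))   # False: penny-wise
```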
What I’d recommend
For users on BYOK with cost concerns:
Track your spend weekly. Notice patterns. If costs are reasonable, you don’t need to optimize. If costs are high, focus on the biggest sources.
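A minimal tracking sketch, assuming you export per-request costs to a CSV with date, model, and cost_usd columns; the file name and columns are hypothetical, so adapt them to whatever usage export your provider or tool offers:

```python
# Sum API spend by ISO week from a hypothetical api_costs.csv export
# with columns: date (YYYY-MM-DD), model, cost_usd.
import csv
from collections import defaultdict
from datetime import date

weekly = defaultdict(float)
with open("api_costs.csv") as f:
    for row in csv.DictReader(f):
        year, week, _ = date.fromisoformat(row["date"]).isocalendar()
        weekly[(year, week)] += float(row["cost_usd"])

for (year, week), total in sorted(weekly.items()):
    print(f"{year}-W{week:02d}: ${total:.2f}")
```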
Optimize the biggest waste first. Generated code in indexing, repeated context across chats, expensive models for cheap tasks. Address these before micro-optimizations.
Don’t optimize prompts at the cost of clarity. Vague prompts save tokens but cost iteration. Keep prompts clear.
Don’t switch to cheaper models indiscriminately. Use the model that fits the task. Save by routing tasks to appropriate models, not by always picking the cheapest.
Recalibrate quarterly. Costs and capabilities shift. What was reasonable optimization 6 months ago might be unnecessary now (caching, cheaper models, etc.).
If you’re on a subscription tool, you don’t pay per token directly, but the same principles apply to your usage limits. Don’t waste; don’t over-optimize.
Closing
Token efficiency is a real concern but a secondary one. The primary optimization is “produce good output without iteration.” Tokens are one input to that; output quality is the goal.
Engineers who prioritize token savings over output quality tend to be less productive than those who don’t. The savings on the API bill are dwarfed by the cost of iteration.
Optimize tokens to a reasonable point. Stop there. Spend your optimization energy elsewhere — on prompt clarity, on tool selection, on workflow patterns. These compound; raw token optimization plateaus quickly.