
A decision framework for which AI model to use on which task

Published 2026-05-11 by Owner

Most developers who use AI coding tools settle into one model and stay there. Usually it’s whatever the tool defaulted to when they signed up, or the most expensive option because “expensive means better.” That instinct isn’t entirely wrong — the flagship models are genuinely more capable in measurable ways — but it ignores a more useful question: better at what, for whom, and at what cost per accepted suggestion?

Model selection treated as a routing problem — not a “pick the best once” decision — consistently produces better outcomes. The savings are real, and the quality tradeoffs are smaller than most people expect. The catch is that routing requires knowing which axis matters most for each task type.

The four axes

There are four dimensions that determine whether a given model is the right fit for a given task. Each one creates a different pressure, and they don’t all point in the same direction. Most task-routing decisions reduce to asking which axis is the actual bottleneck for this specific task, not which model is generally better.

Cost (per million tokens). The price difference between tiers is large. As of mid-2026, the gap between a flagship model and a fast, cheap model runs anywhere from roughly 10x to well over 100x per token, depending on which tiers you compare. On a busy day of coding, that spread is the difference between a $3-8 session and a $40-80 one.

Cost pressure is highest on tasks that are mechanical and high-volume: autocompletion, docstring generation, rename suggestions. It’s lowest on tasks that are rare and consequential: system design, architecture decisions, security audits. The asymmetry matters because most token volume in a typical coding session comes from the high-volume tasks, not the rare ones.

Most developers who look at their monthly AI spend for the first time are surprised by where the volume is. It’s almost never the architecture discussion that costs the most. It’s the completions that fired 400 times in a day while they were writing boilerplate.

Intelligence (benchmark and real-world). The flagship models are noticeably better at holding complex constraint sets in mind, catching subtle bugs, reasoning about tradeoffs across multiple abstraction layers, and understanding ambiguous requirements. Cheaper models handle routine tasks well but occasionally miss the third-order consequence.

The practical distinction: if completing the task requires tracking five separate requirements simultaneously, reach for the stronger model. If the task is “write a docstring for this function,” the intelligence gap won’t affect the output you accept.

One useful calibration: when a suggestion from a cheap model gets rejected, what was wrong? If the answer is “it didn’t understand the context” or “it missed a constraint,” that’s an intelligence gap — the model upgrade is warranted. If the answer is “wrong variable name” or “didn’t follow the formatting convention,” that’s a different problem, usually solved by better prompting or project-level rules rather than spending more on the model.

Speed (latency). Fast models produce a first token in under 2 seconds; flagship models can take 8-20 seconds to deliver a usable answer, depending on reasoning depth and output length. For completions inside a code editor, 15-second latency breaks the flow state that makes the tool worth using. For a one-time architecture document, speed is irrelevant.

Latency pressure is almost entirely about how often a task requires interrupting your thinking to wait. Autocomplete and short-turn debugging loops have high latency pressure; long-form generation tasks have low latency pressure.
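Putting numbers on the latency axis for your own provider is straightforward: time-to-first-token falls out of any streaming API. A minimal sketch using the OpenAI Node SDK as one example; the model names are placeholders, not recommendations, and the same pattern works with any provider that streams:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Measure time-to-first-token and total time for one prompt on one model.
async function measureLatency(model: string, prompt: string) {
  const start = performance.now();
  let firstTokenMs: number | null = null;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  for await (const chunk of stream) {
    if (firstTokenMs === null && chunk.choices[0]?.delta?.content) {
      firstTokenMs = performance.now() - start; // the number that breaks (or keeps) flow
    }
  }
  return { model, firstTokenMs, totalMs: performance.now() - start };
}

// Compare a fast tier against a flagship tier on the same short prompt.
// "fast-model" and "flagship-model" are placeholders for your provider's names.
const prompt = "Complete this line: const user = await db.";
console.log(await measureLatency("fast-model", prompt));
console.log(await measureLatency("flagship-model", prompt));
```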

Context size (1M vs 200k vs 32k or smaller). Large-context models let you load a substantial portion of a codebase into a single prompt. Cross-file refactors, dependency audits, and maintaining consistency across a large change all require seeing enough code at once. A 32k-context model cannot do what a 200k-context model can do on a 50-file refactor, regardless of other capabilities.

The context ceiling is a hard constraint; intelligence and cost are soft ones that can often be worked around with better prompting or task decomposition.
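Put together, the routing decision is simple enough to sketch in a few lines. The tier names and thresholds below are illustrative assumptions, not fixed rules; the point is the ordering, with the hard context constraint checked before anything else:

```typescript
type Tier = "cheap-fast" | "mid" | "flagship";

interface TaskProfile {
  contextTokensNeeded: number; // how much code must be visible at once
  constraintCount: number;     // separate requirements to track simultaneously
  interactive: boolean;        // does a human wait on every turn?
}

// Illustrative routing: context gates first; intelligence and speed trade off after.
function routeModel(task: TaskProfile): Tier {
  // Hard constraint: a model that cannot see the code cannot do the task.
  // (Assumes only the flagship tier offers a 1M-token window.)
  if (task.contextTokensNeeded > 200_000) return "flagship";
  // Many simultaneous constraints: pay for intelligence.
  if (task.constraintCount >= 5) return "flagship";
  // Interactive loops: pay for speed, accept the occasional miss.
  if (task.interactive) return "cheap-fast";
  return "mid";
}

// A 50-file refactor needs context; an autocomplete turn needs speed.
console.log(routeModel({ contextTokensNeeded: 400_000, constraintCount: 3, interactive: false })); // "flagship"
console.log(routeModel({ contextTokensNeeded: 2_000, constraintCount: 1, interactive: true }));    // "cheap-fast"
```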

Per-task recommendation matrix

These configurations have held up over months of daily use across several codebases of different sizes and languages.

Architecture and system design. Use the highest-intelligence model regardless of cost. The cost of a wrong architecture decision dwarfs the difference between a $0.40 prompt and a $4.00 prompt. One misguided database schema that survives into production costs more in rework time than a year of model upgrades. Load the relevant context — existing data models, constraints, the specific decision being made — and pay for the intelligence.

Inline completions and autocomplete. Use the cheapest fast model. The task is pattern-matching: completing a line, suggesting the next argument, proposing a variable name. A fast, cheap model is right 85-90% of the time on these; the 10-15% miss rate is acceptable when each turn costs a fraction of a cent and accept/reject is one keystroke. Routing completions to a flagship model is the single largest source of unnecessary API spend for most developers.

Cross-file refactors. Large context is the deciding factor; intelligence matters; cost is secondary. You need a model that can hold 200k+ tokens at once and track a change consistently through all its call sites. A smart model with a small context window will hallucinate or skip the parts it can’t see. An adequate model with a large context window will execute the mechanical parts of the refactor reliably.

Debugging. Medium-intelligence models, prioritizing speed. Debugging is iteration-heavy: form a hypothesis, evaluate the output, refine, repeat. Long latency between turns breaks the loop in a way that compounds badly.

A typical debugging session generates 15-30 turns before a fix lands. A fast, mid-tier model at $0.05/turn costs $0.75-1.50 total for the session; a slow flagship at $1.50/turn costs $22-45 for the same session with more waiting between each step.

The latency cost is also psychological. A 15-second wait on every turn in a 25-turn debug session means 6+ minutes of idle time during which the problem context fades. A 2-second wait keeps the mental model active. That alone changes how effective the session is.

Reserve the flagship for the bugs that have already defeated three turns of the mid-tier model. At that point, extra reasoning capacity is worth the price and the wait.

Code review and security audit. High intelligence, large context, cost secondary. The cost of missing a security issue is not denominated in API dollars. Load as much relevant code as the context window allows and use your strongest model. These reviews happen infrequently, stakes are high, and per-session token volume is bounded.

Test generation for existing code. Mid-tier models, medium context. Writing tests for code that already exists is mostly mechanical — enumerate cases, write assertions, match the testing library’s patterns. Reserve the strong model for generating tests for complex logic with subtle edge cases where a cheaper model keeps missing the non-obvious paths.

Documentation generation. Cheap fast model, low context needed. Generating a docstring or a function-level comment from the function body is about as mechanical as autocomplete. The cheap model handles this at high volume without quality issues.

A quick-reference summary:

Task                  | Intelligence | Speed | Context | Cost sensitivity
----------------------|--------------|-------|---------|------------------
Architecture          | High         | Low   | Medium  | Low
Autocomplete          | Low          | High  | Low     | High
Cross-file refactor   | Medium       | Low   | High    | Medium
Debugging             | Medium       | High  | Low     | Medium
Code review / audit   | High         | Low   | High    | Low
Test generation       | Medium       | Low   | Medium  | Medium
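For tools that support programmatic routing, the same matrix drops naturally into a config object. A sketch, with the pressures copied straight from the table above:

```typescript
type Pressure = "low" | "medium" | "high";

interface TaskPressures {
  intelligence: Pressure;
  speed: Pressure;
  context: Pressure;
  costSensitivity: Pressure;
}

// The quick-reference matrix, expressed as data a router could consume.
const taskMatrix: Record<string, TaskPressures> = {
  architecture:      { intelligence: "high",   speed: "low",  context: "medium", costSensitivity: "low" },
  autocomplete:      { intelligence: "low",    speed: "high", context: "low",    costSensitivity: "high" },
  crossFileRefactor: { intelligence: "medium", speed: "low",  context: "high",   costSensitivity: "medium" },
  debugging:         { intelligence: "medium", speed: "high", context: "low",    costSensitivity: "medium" },
  codeReviewAudit:   { intelligence: "high",   speed: "low",  context: "high",   costSensitivity: "low" },
  testGeneration:    { intelligence: "medium", speed: "low",  context: "medium", costSensitivity: "medium" },
};
```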

The “I always use the most expensive” trap

Using the flagship for every task easily costs 10x what it needs to, and on high-volume mechanical tasks the multiple runs far higher. A concrete worked example:

A developer runs about 500 autocomplete completions per day, a conservative figure for anyone with tab-completion enabled in Cursor or Copilot. Each completion averages 400 input tokens and 80 output tokens. At flagship pricing (roughly $15/M input, $75/M output as a mid-2026 reference point), that is about $6/day in autocomplete alone, or roughly $150/month. Routing those same completions to a fast, cheap model ($0.10/M input, $0.30/M output) costs about three cents a day, or around a dollar a month. The quality difference on mechanical code patterns is negligible.
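The arithmetic is worth checking against your own usage. A minimal sketch with this example's reference prices plugged in; swap in your provider's actual rates:

```typescript
// Per-million-token prices from the worked example above (mid-2026 reference points).
const flagship = { inputPerM: 15.0, outputPerM: 75.0 };
const cheap    = { inputPerM: 0.10, outputPerM: 0.30 };

// Daily cost of `n` completions averaging `inTok` input and `outTok` output tokens each.
function dailyCost(
  price: { inputPerM: number; outputPerM: number },
  n: number,
  inTok: number,
  outTok: number,
): number {
  return (n * inTok * price.inputPerM + n * outTok * price.outputPerM) / 1_000_000;
}

console.log(dailyCost(flagship, 500, 400, 80).toFixed(2)); // "6.00" per day, ~$150/month
console.log(dailyCost(cheap, 500, 400, 80).toFixed(2));    // "0.03" per day, ~$1/month
```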

The trap persists because of a real psychological pattern: “I already paid for the pro plan” or “I don’t want to compromise quality.” Both sound plausible. Neither survives contact with an itemized monthly bill and a look at what fraction of completions actually required intelligence — versus pattern-matching — to produce an accepted output.

The underlying error is treating model selection as a quality gate (“use the best so nothing goes wrong”) rather than a resource allocation problem (“match the capability to the requirement”). Expensive models are worth paying for when the intelligence gap is the actual bottleneck. For autocomplete, routine generation, and context-retrieval tasks, it usually isn’t. The money saved on mechanical tasks funds the sessions where flagship capability genuinely matters.

A calibration exercise

Pick a task you run frequently — test generation, docstrings, a simple component, a database query. Run it across three models: your current default, a mid-tier option, and the fastest cheap model available. For each run, track:

  • Wall-clock time from submit to first token
  • Total cost (most provider dashboards show per-request token usage)
  • Whether the output was accepted, partially accepted, or rejected
  • If rejected: was it wrong reasoning, wrong style, or wrong detail level?

Run 20-30 comparisons spread across a week. The results are usually surprising in two ways: the accept/reject ratio between the default model and the cheap model differs by less than expected, while the cost ratio differs by 10x or more. Rejection reasons also tend to cluster — once it’s clear that cheap-model rejections are mostly “wrong import path” rather than “wrong algorithm,” the fix is a prompt change, not a more expensive model.
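The bookkeeping does not need to be fancy. A sketch of a comparison log with illustrative rejection buckets; the categories are assumptions to adapt once you see how your own rejections cluster:

```typescript
type Outcome = "accepted" | "partial" | "rejected";
type RejectionKind = "wrong-reasoning" | "wrong-style" | "wrong-detail-level";

interface ComparisonRun {
  model: string;
  firstTokenMs: number;
  costUsd: number;
  outcome: Outcome;
  rejection?: RejectionKind; // only set when outcome is "rejected"
}

const runs: ComparisonRun[] = []; // append one entry per run across the week

// Summarize accept rate and average cost per model after the 20-30 runs.
function summarize(runs: ComparisonRun[]): void {
  const byModel = new Map<string, ComparisonRun[]>();
  for (const r of runs) {
    byModel.set(r.model, [...(byModel.get(r.model) ?? []), r]);
  }
  for (const [model, rs] of byModel) {
    const accepted = rs.filter((r) => r.outcome === "accepted").length;
    const avgCost = rs.reduce((sum, r) => sum + r.costUsd, 0) / rs.length;
    console.log(`${model}: ${((100 * accepted) / rs.length).toFixed(0)}% accepted, $${avgCost.toFixed(3)}/run`);
  }
}
```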

Here is what that exercise looked like for one mid-sized TypeScript project:

Task: generate a Zod schema from a TypeScript interface (repeated 20x)

Model        | Avg time | Cost/run | Accept rate | Rejection reason
-------------|----------|----------|-------------|----------------------
Flagship     | 11s      | $0.48    | 95%         | Minor style issues
Mid-tier     | 4s       | $0.09    | 90%         | Occasional wrong type
Cheap/fast   | 2s       | $0.02    | 82%         | Missing .optional() calls

The cheap model’s 82% accept rate looks bad until the rejection reason is considered: “missing .optional() calls” is fixable by adding one sentence to the prompt. After that change, the cheap model’s accept rate rose to 91% — within 4 points of the flagship — at 1/24th the cost per run.
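For concreteness, here is the shape of that miss, reconstructed as a minimal illustration rather than the project's actual code:

```typescript
import { z } from "zod";

// Source interface: note the optional field.
interface User {
  id: string;
  nickname?: string;
}

// What the cheap model tended to produce: nickname comes out required.
const userSchemaWrong = z.object({
  id: z.string(),
  nickname: z.string(), // missing .optional()
});

// What one added prompt sentence ("map `?` fields to .optional()") fixed:
const userSchemaRight = z.object({
  id: z.string(),
  nickname: z.string().optional(),
});

type UserFromSchema = z.infer<typeof userSchemaRight>; // matches User, optional field included
```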

The calibration exercise costs $5-15 total in API spend. Worth doing once per quarter, because model capabilities change faster than intuitions about them do. A tier that was noticeably weaker six months ago may have closed the gap.

One habit shift with the highest leverage

The change with the largest payoff: separating the autocomplete model from the chat model in the tool configuration.

In Cursor, chat and autocomplete can use different models. Go to Settings → Models and set a separate model for the Cursor Tab (autocomplete) feature. The flagship stays in chat — architecture questions, complex debugging, code review — while the cheap model handles inline completions. The split takes about ten minutes of configuration and, in the setup described here, produced roughly a 60% reduction in total monthly API spend with no perceptible change in the quality of accepted completions.

In Cline, version 3.4+ supports per-mode model selection. Plan mode gets the strong model for reasoning; Act mode gets the cheaper model for execution. The principle is the same: pay for intelligence where thinking is required, not where execution is mechanical.

The reason the savings are this large: most token volume in a typical session comes from autocomplete, not chat. A single chat turn for a complex question generates 2-5k tokens. A single autocomplete completion generates 400-600 tokens, but there are hundreds of them per day. Routing those to a cheap fast model cuts the dominant cost driver without touching the high-value interactions that justify paying for intelligence.

The same logic applies to any tool that fires model calls at high frequency: test generation scripts, CI linting passes, bulk docstring generation. Run those through cheap models by default; escalate to the flagship only when the cheaper model fails to produce an acceptable output after one or two attempts.
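That escalation policy is a few lines of code. A sketch; `generate` and `isAcceptable` are stand-ins for your provider client and whatever acceptance check fits the task:

```typescript
// Stand-ins: wire `generate` to your provider client; define "acceptable" per task
// (compiles, passes lint, matches a schema).
async function generate(model: string, prompt: string): Promise<string> {
  return `/* output from ${model} for: ${prompt} */`; // stub
}
function isAcceptable(output: string): boolean {
  return output.trim().length > 0; // stand-in check
}

// Try the cheap model up to `retries` times before paying for the flagship.
async function generateWithEscalation(
  prompt: string,
  cheapModel: string,
  flagshipModel: string,
  retries = 2,
): Promise<string> {
  for (let i = 0; i < retries; i++) {
    const out = await generate(cheapModel, prompt);
    if (isAcceptable(out)) return out; // most calls stop here
  }
  // Rare, expensive path: only after the cheap tier has failed repeatedly.
  return generate(flagshipModel, prompt);
}
```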

A side effect of this split: when the flagship is reserved for tasks that genuinely need it, the quality bar for what gets sent to it rises. Vague prompts get cleaned up before they reach the expensive model, because the cost feedback loop is tighter. The forced discipline tends to improve prompt quality across the board.


Model selection is not a one-time decision made at signup. It is a per-task configuration that rewards deliberate attention. The four axes — cost, intelligence, speed, context — apply different pressure depending on the task. Working through that mapping once, and revisiting it quarterly as model capabilities shift, is the highest-leverage configuration change most developers haven’t made.