Tinker AI
2026-02-17

OpenAI shipped o3-mini this week, the cheaper and faster sibling of o3. The notable result for AI coding tools: SWE-Bench Verified scores comparable to o1 at significantly lower cost and latency. For agent-based workflows that benefit from reasoning, this is a meaningful upgrade.

The benchmark numbers

Coding-relevant scores:

  • SWE-Bench Verified: 53% (up from 49% on o1, with much faster generation)
  • HumanEval: 92%
  • LiveCodeBench: 41%
  • Aider polyglot benchmark: 60% (Aider’s own test set across multiple languages)

The SWE-Bench number is the headline. 53% means o3-mini autonomously solves over half of real-world software engineering tasks in the benchmark. That’s competitive with the best of the flagship models.

The “cheaper and faster” part: o3-mini generates at ~70 tokens/sec (vs ~15 for o1) and costs about 60% less per token. At those rates, a 2,000-token response arrives in roughly half a minute instead of over two minutes.

The reasoning angle

OpenAI’s “o” series models do explicit reasoning before producing output. The model generates internal reasoning traces (not shown to users) that inform the final answer.

For coding tasks, this reasoning helps with:

  • Multi-step plans
  • Edge case identification
  • Trade-off analysis
  • Debugging chains

For tasks that don’t benefit from extended thinking, the reasoning is overhead. Asking o3-mini to write a single-line bug fix is like using a sledgehammer to push in a thumbtack.
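There is a knob for this: OpenAI’s API exposes a reasoning_effort parameter (“low” / “medium” / “high”) on the o-series models, so a tool can shrink the thinking budget for trivial requests. A minimal sketch with the OpenAI Python SDK; the prompts are placeholders of my own:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Heavier task: allow a large thinking budget before the answer.
plan = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{
        "role": "user",
        "content": "Plan a refactor that splits auth into token issuance and session validation.",
    }],
)

# Trivial task: shrink the thinking budget to cut latency.
fix = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="low",
    messages=[{
        "role": "user",
        "content": "Fix the typo in this docstring: 'Retuns the user id.'",
    }],
)

print(plan.choices[0].message.content)
print(fix.choices[0].message.content)
```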

Where this fits in tools

For tools that have model selection:

Aider: o3-mini works well as the architect model in architect mode (the --architect flag, with a separate --editor-model). Strong reasoning during planning; a faster editor model applies the actual edits.

Cline: o3-mini is a candidate for Plan mode. The plan benefits from reasoning; Act mode can use a cheaper model.

Cursor: Cursor’s chat panel can use o3-mini as a model option (when enabled). Useful for architectural questions and complex refactors.

Copilot: Copilot Business and Enterprise tiers can offer o3-mini in their model picker. Adoption depends on GitHub’s rollout.

For all these, o3-mini is best as a “complex task” model, not a default. Routine work goes to faster, cheaper models.
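That split amounts to a small routing decision inside the tool. A minimal sketch of the idea, assuming hypothetical task categories and gpt-4o-mini as the fast default (neither is any particular tool’s actual scheme):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical task categories that benefit from extended reasoning.
REASONING_TASKS = {"plan", "refactor", "debug"}

def pick_model(task_kind: str) -> str:
    # Use the reasoning-tier model only where thinking pays for its latency.
    return "o3-mini" if task_kind in REASONING_TASKS else "gpt-4o-mini"

def run(task_kind: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=pick_model(task_kind),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Routine generation goes to the cheap default; the multi-file plan
# is routed to o3-mini.
print(run("complete", "Write a docstring for parse_config()."))
print(run("plan", "Outline a migration from callbacks to async/await."))
```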

A practical observation

I’ve been testing o3-mini in Aider’s architect mode for a week. Patterns:

For multi-file refactors: o3-mini’s plans are noticeably better than Claude 3.5 Sonnet’s on my test cases. It spots cross-file implications more reliably.

For test-driven development: o3-mini reasons about what edge cases need tests. Better than Sonnet at this; comparable to o1 but faster.

For debugging: When I describe a confusing failure and ask for hypotheses, o3-mini’s hypotheses are more often correct than Sonnet’s. The extended reasoning seems to help.

For routine code generation: No noticeable advantage over Sonnet. The reasoning overhead adds latency without quality benefit.

The split: tasks that benefit from thinking benefit from o3-mini. Tasks that don’t, don’t.

The competitive picture

The reasoning-tier model space:

  • o3-mini (OpenAI): strong, fast, mid-priced
  • Claude 3.5 Sonnet thinking (Anthropic): comparable quality when extended thinking is enabled
  • Gemini 2.0 Pro with deep think (Google): emerging; the extended-thinking variant Google has announced

The gaps between these are narrowing. Each provider has a “thinking” tier; benchmarks are converging. The differentiation is in:

  • Latency (o3-mini is currently fastest in this tier)
  • Cost (varies)
  • Provider ecosystem (Anthropic for Claude tools, OpenAI for ChatGPT/Copilot integration)
  • Context window (Gemini has the largest)

For users, the choice depends on which provider you’re already on. Switching providers for marginal model gains rarely pays off.

Pricing

o3-mini pricing:

  • Input: ~$1.10/M tokens
  • Output: ~$4.40/M tokens

For a typical agent session that uses o3-mini for the architect role:

  • 50k input, 5k output → ~$0.08

That’s competitive with Claude 3.5 Sonnet for similar work. The cost barrier to using a reasoning-tier model has dropped.
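The arithmetic behind that figure, as a quick sketch (rates from the list above; the 50k/5k token counts are the example’s assumption about a typical architect session):

```python
# o3-mini per-token rates, from the published per-million pricing.
INPUT_RATE = 1.10 / 1_000_000   # $/input token
OUTPUT_RATE = 4.40 / 1_000_000  # $/output token

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session at the rates above."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# The architect-role session from the example: 50k in, 5k out.
print(f"${session_cost(50_000, 5_000):.3f}")  # $0.077, which rounds to ~$0.08
```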

What I’d watch

A few things to track:

OpenAI’s continued pace. o3-mini follows o3 quickly. The release cadence on the reasoning tier is fast.

Anthropic’s response. Claude 3.5 Sonnet’s thinking mode is comparable but priced differently. Whether Anthropic adjusts pricing or capability matters.

Tools that lean into reasoning. Cline’s Plan mode, Aider’s architect, Cursor’s “deep mode” if it ships — tools that productize reasoning could be the bigger story than any individual model release.

Open-weight reasoning models. DeepSeek and others have been working on reasoning-capable open models. If a strong open model ships, the BYOK calculus shifts.

Worth adopting?

For users in the OpenAI ecosystem, o3-mini is the obvious next step for tasks that benefit from reasoning. Set it as your “thinking” model alongside a fast model for routine work.

For Anthropic-default users, Claude 3.5 Sonnet’s thinking mode is roughly equivalent. No need to add OpenAI just for o3-mini.

For tooling builders, supporting reasoning-tier models is becoming table stakes. The capability is genuinely useful for agent workflows; users will increasingly expect it.

The continued pace of model improvements is the meta-story. Tools that can swap underlying models flexibly (Aider, Cline, Continue) benefit from this; tools locked to specific models lose ground unless their providers keep up.