Codex vs Claude Code: same job, different model families — concrete differences
Published 2026-05-11
Codex and Claude Code are the official terminal agents from the two dominant AI labs. Both were released in 2025, and both share the same basic architecture: a model with tool use, a sandboxed shell, file read/write, and an autonomous loop that runs until the task is done or the agent gets stuck. The surface-level experience is similar enough that the comparison feels like a product review.
It isn’t. The more useful comparison is behavioral — what running the same task through both reveals about how GPT-family and Claude-family models differ in their approach to the same problem. Those differences are consistent, reproducible, and actually matter for which tool you pick for a given task.
The test case: extract a hook from an entangled React component
The concrete task: a 280-line React component managing a multi-step form. It has local state for the current step, validation errors, form values, and a dirty-field tracker, all in one file, with four callsites elsewhere in the app. The goal: extract a useMultiStepForm hook that the component calls, and update all four callsites.
This is representative — non-trivial enough that autonomous help is worth having, bounded enough that the output can be evaluated clearly.
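For orientation, the state being extracted looks roughly like this. The field names match what the agents later report; the types are assumptions for illustration, not the project's actual code.

// Shape of the state the component manages inline before extraction.
// Names follow the description above; types are assumptions.
interface MultiStepFormState {
  currentStep: number;                 // which step of the form is active
  errors: Record<string, string>;      // validation errors keyed by field
  formValues: Record<string, unknown>; // current values across all steps
  dirtyFields: Set<string>;            // fields the user has touched
}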
What Codex did:
Codex read the component, then immediately started writing code. No questions asked. Within the first tool call, it had created a hooks/useMultiStepForm.ts file with most of the state extracted. It ran tsc --noEmit as its next tool call, found two type errors in the dirty-field tracker (the closure had captured stale state because the tracker hadn’t been included in the dependency array), and corrected them. Then it updated the four callsites.
The callsite updates were mechanical: it replaced the inline useState calls with the hook invocation and destructured the returned values. It did not read the callsites before writing the hook — it inferred the required interface from the original component and confirmed it was correct after the fact.
Total turns: 4. The intermediate state after turn 1 was broken — the TypeScript errors were real — but the loop self-corrected without prompting. Final output was correct.
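For readers who want the concrete shape of that bug: the stale-closure pattern in a dirty-field tracker typically looks like the following. This is a minimal reconstruction under assumed names, not Codex's actual diff.

import { useCallback, useState } from "react";

// Minimal sketch of a dirty-field tracker with a stale-closure bug and
// its fix. Identifiers are illustrative, not the project's real code.
function useDirtyTracker() {
  const [dirtyFields, setDirtyFields] = useState<Set<string>>(new Set());

  // Buggy version: the callback captures the dirtyFields value from the
  // render in which it was created, and the empty dependency array means
  // it is never recreated, so back-to-back calls can drop updates.
  const markDirtyStale = useCallback(
    (field: string) => setDirtyFields(new Set(dirtyFields).add(field)),
    [] // dirtyFields is missing from the dependency array
  );

  // Fixed version: a functional update always reads the latest state, so
  // the callback no longer depends on dirtyFields at all.
  const markDirty = useCallback(
    (field: string) => setDirtyFields(prev => new Set(prev).add(field)),
    []
  );

  return { dirtyFields, markDirty, markDirtyStale };
}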
What Claude Code did:
Before writing any code, Claude Code produced a short summary: “I see four state values — currentStep, errors, formValues, and dirtyFields — one reducer-pattern update for errors, and three event handlers. There’s also a handleSubmit that calls an external API. Should the API call live in the hook or stay in the component? The answer affects whether the hook is reusable elsewhere.”
That’s a question that matters. The API call has side effects. If the hook owns it, the hook becomes harder to reuse in a different form context. If the component keeps it, the hook stays clean.
Given the answer (API call stays in the component), Claude Code wrote the hook. The dirty-field tracker was handled correctly from the start, because the planning step had surfaced the state relationships explicitly before any code was written. It read all four callsites before writing the hook interface, which meant the returned object shape was designed around actual usage rather than inferred from the original component. Total turns: 5. The intermediate state was never broken.
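Given that decision, the hook surface ends up looking roughly like the sketch below, with the submit side effect left to the component. The helper names and types are assumptions for illustration, not either agent's actual output.

import { useCallback, useState } from "react";

// Sketch of the extracted hook with the API call left out, so the
// component keeps ownership of handleSubmit. Helper names and types
// are illustrative assumptions.
interface UseMultiStepFormResult {
  currentStep: number;
  errors: Record<string, string>;
  formValues: Record<string, unknown>;
  dirtyFields: Set<string>;
  setValue: (field: string, value: unknown) => void;
  setError: (field: string, message: string) => void;
  nextStep: () => void;
  prevStep: () => void;
}

function useMultiStepForm(stepCount: number): UseMultiStepFormResult {
  const [currentStep, setCurrentStep] = useState(0);
  const [errors, setErrors] = useState<Record<string, string>>({});
  const [formValues, setFormValues] = useState<Record<string, unknown>>({});
  const [dirtyFields, setDirtyFields] = useState<Set<string>>(new Set());

  // Writing a value also marks the field dirty, so callers never manage
  // the tracker separately.
  const setValue = useCallback((field: string, value: unknown) => {
    setFormValues(prev => ({ ...prev, [field]: value }));
    setDirtyFields(prev => new Set(prev).add(field));
  }, []);

  const setError = useCallback((field: string, message: string) => {
    setErrors(prev => ({ ...prev, [field]: message }));
  }, []);

  const nextStep = useCallback(
    () => setCurrentStep(s => Math.min(s + 1, stepCount - 1)),
    [stepCount]
  );
  const prevStep = useCallback(() => setCurrentStep(s => Math.max(s - 1, 0)), []);

  return { currentStep, errors, formValues, dirtyFields, setValue, setError, nextStep, prevStep };
}

The point of the sketch is the boundary: the hook owns state transitions, and anything with side effects stays with the caller.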
Neither output was wrong at the end. The path to correct was different.
A note on repeatability: this pattern held across multiple runs with different prompts. Codex consistently acted first and corrected via tool feedback. Claude Code consistently surfaced a scoping question before writing code when the task had any ambiguity. The specific number of turns varied, but the behavioral tendency was stable across sessions.
Model-family behavioral differences
Reasoning style. Claude-family models tend to externalize reasoning before acting. In practice, Claude Code narrates what it found and what it’s about to do, often in more words than are strictly necessary. GPT-family models in Codex tend to act, observe the result via a compiler or test run, and then correct. Both approaches converge on the right answer; the Codex path produces a broken intermediate state more often.
For solo work where you’re watching the agent in real time, the Codex behavior is fine — the loop corrects itself. For a task running in the background, or in a context where a mid-run check (like a CI gate or a pre-commit hook) could fail on the broken state, Claude Code’s approach is more reliable. The planning step costs tokens and turns, but it front-loads the error catching.
Code conventions. Claude tends to write TypeScript with more explicit type annotations: return types on every function, explicit types for event handler arguments that TypeScript could infer. Codex leans toward inference and is more willing to let the compiler fill in what it considers obvious. Neither is wrong. If your project has a stricter tsconfig or lint rules that require explicit annotations, Claude's defaults are closer to what those settings demand. If your project prefers inferred return types to reduce noise, Codex's output needs less cleanup.
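The difference is easiest to see side by side. Both functions below are valid TypeScript and behave identically; the example itself is invented, not output from either agent.

// Annotation-heavy style (closer to Claude's defaults): explicit return
// types everywhere, explicit types even where inference would suffice.
function formatStep(step: number, total: number): string {
  return `Step ${step + 1} of ${total}`;
}

// Inference-heavy style (closer to Codex's defaults): identical behavior,
// return type left to the compiler.
function formatStepInferred(step: number, total: number) {
  return `Step ${step + 1} of ${total}`;
}

Lint rules that mandate explicit return types make the first style compulsory, which is why stricter configurations tend to need less cleanup on Claude's first pass.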
For Python, the divergence shows up in docstring style. Claude writes Google-style docstrings with Args and Returns sections by default. Codex tends toward shorter inline comments. Both can be overridden, but the defaults matter for how much cleanup the first-pass output needs.
Willingness to ask clarifying questions. Claude Code asks them. Codex rarely does. This is the most visible behavioral difference in daily use.
For tasks where the scope is genuinely ambiguous — multiple valid interpretations of the same requirement — Claude’s questions surface decisions that would otherwise get made implicitly, and often incorrectly. The API-call question above is a real example: the “obvious” answer might be to extract it with the hook, but that’s not the right call here.
For tasks where you know exactly what you want and the scope is clear, the questions are friction. A lot of users find Codex faster for this reason — it trusts that you asked for what you want and goes.
Neither behavior is obviously superior. It’s context-dependent.
One underappreciated consequence: Claude Code’s questions create a natural checkpoint where you can correct misunderstandings before they propagate through fifteen file edits. If the model has a wrong mental model of your codebase, a clarifying question surfaces it cheaply. With Codex, the wrong mental model shows up in the output — sometimes as a type error caught by the compiler, sometimes as code that works but isn’t quite what you had in mind.
Error handling style. Claude-family models tend to write more defensive code on first pass — null checks, explicit error handling for async calls, guard clauses at function entry. Codex tends toward the optimistic path and adds defensiveness when a test or type error flags the gap. The Claude output sometimes has more error handling than the task strictly required; the Codex output sometimes has less than it should.
This shows up clearly in async code. Claude Code will typically wrap a fetch call in a try/catch with an error state update. Codex will often write the happy path and add the catch only if a linting rule or explicit instruction asks for it.
As a concrete illustration, both were asked to write a fetchUser function for the same TypeScript project. Claude Code’s first-pass output:
async function fetchUser(id: string): Promise<User | null> {
  try {
    const res = await fetch(`/api/users/${id}`);
    if (!res.ok) return null;
    return (await res.json()) as User;
  } catch {
    return null;
  }
}
Codex’s first-pass output for the same prompt:
async function fetchUser(id: string): Promise<User> {
  const res = await fetch(`/api/users/${id}`);
  return res.json();
}
Both are valid starting points for different codebases. One is not categorically better. But the defaults tell you something about which model family is going to require more or less cleanup in a given context.
Tool-use loop differences
Both agents execute shell commands in a sandboxed environment. The defaults diverge.
Codex runs with a strict sandbox by default. Network access is off, disk writes are constrained to the project directory, and subprocess spawning is restricted to an allow-list. Loosening these requires explicit flags at invocation time. The result: Codex feels safe to run in automated or semi-attended contexts, but it will hit restrictions on tasks that genuinely need to fetch a dependency, call an external API during execution, or spawn a subprocess outside the allow-list.
Claude Code’s permission model is more granular. The --allowedTools flag lets you specify which tool categories are permitted: file writes but not shell execution, or shell execution restricted to specific command patterns. You can allow npm install but block arbitrary curl calls. The defaults are also restrictive, but the controls are finer.
A concrete difference: if the task requires reading documentation from the web to figure out an API’s expected arguments, Codex will stop and ask (or infer from its training data and possibly get it wrong), while Claude Code will attempt the fetch if network access is enabled. For library code that’s well within the model’s training data, the difference rarely surfaces. For newer library versions or less common APIs, it matters.
Both agents support auto-approve mode for non-interactive use (--yes for Codex, --dangerously-skip-permissions for Claude Code in automated contexts). Both pause on destructive operations — file deletes, git history rewrites — unless explicitly told otherwise.
One other practical difference: Codex’s default output is more concise. Claude Code narrates tool calls as it makes them (“Reading src/hooks/useForm.ts to understand the current structure…”). This is useful when something goes wrong and you want to understand what the agent did. It’s noise when everything is working.
There is also a difference in how each agent handles errors from its own tool calls. Codex will typically retry a failed shell command with a modified version, adapting based on the error message. Claude Code tends to stop and report the failure before retrying, which gives you more visibility into what failed but also means more turns in the loop before recovery. For long-running automated tasks, Codex’s silent retry behavior can be an advantage or a liability depending on how you feel about an agent that silently works around problems.
Cost comparison
Both tools are available on subscription and via direct API access.
Codex CLI ships as part of ChatGPT Pro and is also available through the OpenAI API. Claude Code runs on Claude models — accessible through Claude.ai subscriptions or the Anthropic API directly.
At subscription pricing, both tiers are comparable for typical developer use. The API route is where the comparison gets more interesting, because token usage varies by behavior.
Claude Code’s tendency to plan before acting uses tokens before any tool calls execute. A planning turn might be 3,000 tokens before the first file is written. On a simple, well-specified task, this makes Claude Code slightly more expensive per completion. On a complex task where the planning prevents a multi-turn correction cycle, the math inverts — you’re not paying for a broken-state recovery that Codex would have needed.
The rough rule of thumb: for mechanical, well-scoped tasks with clear requirements, Codex uses fewer tokens per task because it acts directly. For ambiguous tasks or tasks with complex state interactions, the token-cost gap narrows, because Codex’s act-then-correct loop is also expensive, just distributed differently across turns.
Subscription users don’t feel this directly (both have usage caps rather than per-token billing at the user tier), but it’s worth understanding if you’re evaluating which to integrate into a workflow at API scale.
Context length is another cost variable. Claude Code uses longer context windows (Claude 3.5 Sonnet and Claude Opus 4 both support 200k tokens), which matters for tasks that require reading a large codebase before generating output. GPT-4o supports 128k tokens, which is sufficient for most individual task contexts but can become a constraint for tasks that need deep cross-file understanding. At the API tier, filling that longer window with codebase context also means more tokens billed per request, even when the active reasoning portion is short.
When running both on the same task makes sense
There’s a pattern worth making explicit: using both agents as independent reviewers on the same task.
Send the same prompt to Codex and Claude Code. Compare the outputs before committing either. This surfaces several things:
Model-specific blind spots. If both outputs miss the same edge case, it’s probably a genuinely hard case or an underspecified requirement. If only one misses it, that’s a model-family bias: something in how GPT-family or Claude-family models weight that particular pattern. The divergence tells you where to look.
Idiomatic disagreements. Codex and Claude Code will sometimes produce functionally equivalent code with different structure. One might extract a helper function; the other might inline it. Neither is wrong, but seeing both makes the choice explicit rather than implicit.
Scope interpretation differences. When a prompt is ambiguous, the two agents often interpret the ambiguity differently. Codex will pick an interpretation and execute it. Claude Code will often ask about it. Either way, where they diverge tells you something about the underspecified parts of your original prompt — which is useful signal before you commit.
The cost of running both is doubled token spend on that task. For exploratory work — figuring out the right structure for something new — the cost is worth it. For high-stakes refactors where you want a second opinion before touching shared code, same. For routine tasks you’ve done dozens of times with a clear pattern, it’s overhead with diminishing returns.
A practical workflow: run Codex first for the speed, then run Claude Code if the Codex output looks off or if the task was ambiguous enough that you want to see a different interpretation. You’re not looking for a winner; you’re looking for the places where the two diverge, because those are the places that warrant closer attention.
The dual-run pattern is most valuable for tasks you’re doing for the first time in a given codebase — writing the first data-fetching layer, extracting the first shared hook, adding the first auth middleware. These are moments where the model’s interpretation of your codebase’s conventions is uncertain, and divergence between two model families is meaningful signal. For tasks you’ve already done once (a second page component, a third API endpoint), the model family differences matter less because you have a reference implementation to check against.
What to expect going forward
Both agents are under active development. Codex is getting more configurable sandbox rules and more transparent logging. Claude Code is getting better at scoping its pre-action narration to only the parts that genuinely matter for the task.
The model families will continue to have different defaults and different blind spots. That’s a function of training differences that reflect genuinely different design choices about when to act and when to reason first — not differences that will converge with more compute.
The useful frame isn’t which agent to adopt permanently. It’s which behavioral profile fits the task at hand. Clear, mechanical tasks with a well-understood outcome: Codex’s act-first loop is faster and gets out of the way. Ambiguous tasks or refactors with complex state interactions: Claude Code’s planning step pays for itself. Tasks where the stakes are high enough to warrant a second opinion: run both and treat the divergence as a signal.
One prediction worth making: as both agents get more instruction-following capability, the default behavioral differences will become more like dials than fixed properties. Codex’s sandbox will get more configurable; Claude Code’s verbosity will get more tunable. The model-family differences in reasoning style and code conventions will likely persist longer than the tooling differences, because they’re rooted in training rather than configuration. Knowing which model family tends to catch which class of error — and where each tends to under-specify — will remain a useful thing to know even as the surface-level controls converge.