
Codex reasoning effort: how the low/medium/high knob actually changes behavior

Published 2026-05-11 by Owner

Most Codex sessions start at the default reasoning level and stay there. That default is medium — a reasonable prior, but not an optimal one. Leaving the knob in one position across a whole workday means paying high-mode prices for low-mode tasks and getting low-mode quality on decisions that needed more thought.

The reasoning_effort parameter is narrow in scope: it controls deliberation depth before the visible response. Before the model produces the text you see, it runs an internal reasoning pass — a private chain of thought that does not appear in the output but does influence it. The effort level sets the token budget for that pass. More budget means more steps, more considered tradeoffs, more self-correction. It also means more latency and more cost.

Used at the right moment, that extra deliberation is one of the most cost-effective quality levers in the Codex CLI. Used at the wrong moment, it is just waiting.

The three levels and what they actually do

The parameter accepts three string values: low, medium, and high. The visible output length is similar across levels. What changes is how much deliberation precedes it and how long that takes.

Low minimizes the reasoning budget. The model jumps to a response with minimal internal steps. For a narrow task where the answer is structural pattern-matching — adding a missing import, renaming a symbol across a file, completing a docstring — low produces the same output as medium and does so faster. Latency is sub-second to low single-digit seconds. The model is not less capable at low; it just does not have the token budget to second-guess itself.

Medium is the shipped default. The model runs a moderate reasoning pass: enough to decompose a multi-step task, evaluate a few approaches, and do one round of self-checking. For most of the distribution of real tasks — targeted feature additions, small refactors, debugging a stacktrace — medium lands the right answer reliably. Latency is roughly 5–20 seconds on a non-trivial prompt. The default is set here for a reason.

High turns on extended internal reasoning. The model may spend many more steps in deliberation before producing its response. For hard inference problems — designing a module boundary, evaluating concurrency options, assessing whether an abstraction will hold under extended requirements — high produces noticeably sharper output than medium. The cost: 30–90 seconds of response time per turn is common. On a complex multi-constraint problem, occasionally longer.

The reasoning token budget scales roughly 3–5x from low to high. A medium-effort session costing $0.08 might cost $0.25–0.35 at high on the same prompts. These numbers are directional; actual costs depend on model tier and prompt length. The cost delta is meaningful but usually manageable for occasional high-effort prompts. The latency delta is where most workflows actually feel the difference.

Latency as a workflow variable

Cost scales in a way that is easy to absorb. Latency scales in a way that changes the feel of working.

At low, Codex responds fast enough to feel close to autocomplete. The loop is tight: type, receive, type again. Iteration is cheap and the rhythm supports it. This matters for sessions where the work is essentially a sequence of mechanical changes — twenty small cleanups, a sweep of repetitive tests, filling out a boilerplate structure.

At high, each turn introduces a visible wait. Waiting 60 seconds for a response on a 10-word prompt is fine when the question is “how should this service handle partial failures?” It is expensive when the question is “add a newline here.” The accumulated latency over a two-hour session at high effort — on tasks that did not need it — can cost 20–30 minutes of waiting for nothing.

A session running at high for mechanical edits does not produce better mechanical edits. It produces slower ones.

One calibration from practice: a 90-minute cleanup session can stretch to 120 minutes at high effort, with no difference in output quality. The 30 extra minutes is pure wait — the cost of leaving the knob in the wrong position.

Quality differences by task class

Three categories worth distinguishing:

Architecture and design decisions. This is where high earns its latency premium. Designing a caching strategy, evaluating a schema migration path, choosing between two module structures — these require the model to hold multiple constraints simultaneously, identify edge cases it was not told about, and compare options across more dimensions than a quick pass can cover. At medium, the model produces a reasonable answer. At high, the answer is genuinely sharper: more edge cases surfaced, assumptions questioned, tradeoffs named explicitly. The improvement is observable, not marginal.

The other thing high mode does for design tasks is tighten reasoning about tradeoffs the medium pass glosses over. Medium might say “use a distributed cache.” High is more likely to follow up: “that introduces consistency lag — if read-after-write consistency is required, the cache invalidation strategy needs to account for that explicitly.” That follow-up reasoning is not present in every high-effort answer, but it appears far more often than at medium.

Refactoring existing code. Medium and high land in roughly the same place. Refactoring is bounded work: the constraints are visible in the existing code, the target state is well-defined, and the task rarely calls for novel inference. The model at medium correctly identifies what needs to change and produces it. High occasionally catches an additional thing the medium pass missed — but not reliably enough to justify the latency on every refactor.

Mechanical completion tasks. Low is correct. Renaming symbols, adding missing imports, writing a docstring from an existing signature, updating a constant — these tasks are pattern recognition, not reasoning. The model at low completes them correctly and quickly. Using medium or high for these tasks costs time without buying quality.

The categories are not always clean. A task that looks mechanical can have a non-obvious constraint — renaming a symbol used in a serialization format, for instance, where the rename breaks backward compatibility. For tasks with hidden constraints, the fix is including the relevant context in the prompt so the model knows the constraint exists — not pre-emptively raising the effort level.
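What that looks like in practice, using the CLI flag covered below (the symbol, file, and constraint here are illustrative, not from a real codebase):

codex --reasoning-effort low "rename user_id to account_id in billing.py; the field name is also the key in the exported JSON, so keep the serialized key as user_id"

Stated that way, the constraint travels with the task, and the effort level can stay where the task class puts it.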

A shortcut that holds up in practice: if the task requires the model to make a judgment call about design, effort, or approach — consider high. If the task is executing a pattern the model can recognize immediately — low works. Everything else is medium.

A rough mapping:

Task type                                  Starting level
System design, architecture decisions      high
Debugging a non-obvious root cause         high
Multi-file refactors                       medium
Feature additions with clear spec          medium
Single-file changes, completions           low
Mechanical symbol renames                  low

A worked example: designing a rate-limiting strategy for an API at high effort surfaces questions about burst handling, per-user vs. per-IP bucketing, token bucket vs. sliding window tradeoffs, and edge cases around auth failures. At medium, the model proposes a sensible strategy but may not surface the burst-handling question unprompted. At low, it produces a generic answer. The difference between high and medium here is the difference between catching a design issue in 90 seconds and catching it after three hours of implementation. That is where the latency premium on a single turn pays back.
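As a single invocation, that design turn might look like this (the prompt wording is illustrative):

codex --reasoning-effort high "design a rate-limiting strategy for the public API: cover burst handling, per-user vs. per-IP bucketing, token bucket vs. sliding window tradeoffs, and how failed auth attempts should count against quota"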

How effort interacts with model choice

reasoning_effort operates within a model tier. It is not a substitute for model selection.

Setting reasoning_effort: high on a mid-tier model gets more deliberation from that model. It does not get the output quality of a larger model at medium. If the ceiling of a given model is too low for a task — if it lacks domain knowledge, makes systematic errors on a specific class of problem, or consistently misses a kind of constraint — raising reasoning effort will not close that gap. The model thinks harder, but the quality of each reasoning step is bounded by its weights.

The correct order is: choose the right model tier for the class of work, then tune reasoning effort within it. High effort on a capable model is genuinely strong. High effort on the wrong model tier is slow and expensive.
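A minimal sketch of that ordering on the command line; the --model flag name is an assumption (the config below shows the corresponding key), and <larger-model> stands in for whatever higher tier is available:

codex --model o4-mini --reasoning-effort low "apply the mechanical rename across the module"
codex --model <larger-model> --reasoning-effort high "evaluate moving session storage out of process"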

Two axes, not one: when Codex gives a poor answer, the question is whether the failure is a model-tier limit or a deliberation limit. Does switching from medium to high change the answer? If yes, it was deliberation depth. If not, the fix is context or model tier, not effort.

The Codex CLI config sets the persistent default:

{
  "model": "o4-mini",
  "reasoning_effort": "medium"
}

Override per session with the CLI flag — the flag wins over config for that session:

codex --reasoning-effort high "design the caching layer for this service"
codex --reasoning-effort low "add missing semicolons in config.ts"
codex --reasoning-effort medium "refactor the auth module to use the new token interface"

The flag does not change the persistent config. The next session starts at whatever the config says. For shared team workflows, a checked-in config with a sensible default keeps individuals from running high-effort sessions on tasks that do not warrant it.

A heuristic for choosing the starting level

Start at medium and stay there until the task type pushes in a clear direction.

The signal to move to high: the model at medium gives shallow advice. It misses an edge case that should have been obvious, proposes a design that breaks under ordinary load, or produces a plan that ignores constraints already visible in the codebase. That is the model failing to think through the problem — not a knowledge gap, but a deliberation gap. Rerun the prompt at high.

The signal to move to low: the session is going to be long, the work is mechanical, and accumulated latency is the bottleneck. Switching to low recovers the tempo. If the output quality at low matches the output quality at medium on the tasks being run, there is no reason to stay at medium.

Think about the next hour of work, not just the next prompt. If the hour is “ship a refactor that has already been designed,” low or medium throughout. If the hour is “figure out how to structure a new module and start writing it,” open at high and drop effort for the implementation turns.

A concrete session pattern: use high for the design turn that opens a feature session — ask the model to surface tradeoffs and produce a plan. Then drop to medium or low for the implementation turns. The expensive reasoning happens once, on the question that actually requires it.
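Sketched as a sequence of turns (the feature, file names, and prompts are illustrative):

# Design turn: pay the high-effort latency once, on the question that needs it.
codex --reasoning-effort high "plan the notification module: propose the interfaces, surface the tradeoffs, and list the implementation steps"
# Implementation turns: the plan is fixed, so drop the effort.
codex --reasoning-effort medium "implement step 1 of the plan: the sender interface and the email backend"
codex --reasoning-effort low "add the missing imports and docstrings in notifications/sender.py"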

The failure mode is the reverse: using low for the design phase because it is faster, then discovering the implementation is built on a shallow foundation. One bad design decision that takes three hours to undo costs more than the extra 60 seconds the high-effort design turn would have taken.

The effort level is the cheapest thing to adjust in a Codex session. Matching it to the task type at the start of each work block is a low-friction habit with a real return.

What does not change with reasoning effort

reasoning_effort affects how deeply the model deliberates on a problem. Several things it does not affect:

Model knowledge. The model’s knowledge of libraries and APIs does not change with effort level. If it is wrong about a function signature at medium, it will be wrong at high. The fix for knowledge-limit failures is explicit context in the prompt — paste in the relevant docs or type signature. High effort applied to wrong premises produces confident wrong answers.

Hallucination on version-dependent behavior. If a model hallucinates a method that does not exist in a framework version, it will do so consistently across effort levels. More deliberation does not help with retrieval errors.

Mechanical accuracy. Low and high both get the right answer on structural code transformations that do not require judgment. Adding reasoning effort to a simple substitution request does not produce a more correct substitution.

Context window limits. High reasoning effort does not extend what the model can attend to. If relevant code is absent from the prompt, reasoning harder does not surface it. For tasks where the model keeps missing something, the first check is whether the relevant constraint is actually in the context.
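Of these, the knowledge limit has the most direct prompt-level fix: put the signature or doc excerpt in the prompt instead of raising effort. For example (the function and signature are made up):

codex --reasoning-effort medium "update the call sites to the new client API; the current signature is fetch_users(page: int = 1, *, timeout: float | None = None) -> list[User]"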

Understanding where the parameter applies and where it does not is what makes it a useful tool rather than a general-purpose quality switch. The mental model: reasoning_effort adjusts the depth of internal deliberation. It is the right lever when the quality bottleneck is deliberation depth — when the model is answering too quickly and missing things it could have found with more thought. It is not the right lever when the quality bottleneck is knowledge, context, or model tier.

The practical diagnostic for a failing prompt: does switching from medium to high change the answer? If yes, the issue was deliberation depth. If not, feed the model missing context and try again at medium.
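One way to run that check, assuming the one-shot invocation shown above writes its answer to stdout (the prompt is illustrative):

codex --reasoning-effort medium "why does the worker pool deadlock under load?" > answer-medium.txt
codex --reasoning-effort high "why does the worker pool deadlock under load?" > answer-high.txt
diff answer-medium.txt answer-high.txt

A materially different second answer points to deliberation depth; a near-identical one points to missing context or the model tier.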

As the Codex CLI matures, per-task reasoning effort configuration is the likely direction — setting the level in a task definition rather than for the whole session, so a pipeline can use high for planning and low for verification without manual switching. Until then, adjusting the session-level flag at task-type transitions is the available approximation.