
Codex review mode: a second opinion that doesn't agree with you for free

Published 2026-05-11 by Owner

The review queue is usually the bottleneck. A PR sits for hours or days while the human reviewer finds time. When the review finally arrives, half the comments are about things you could have caught yourself — a null dereference you rushed past, an N+1 you introduced without noticing, a variable name that made sense when you wrote it and makes no sense from the outside.

Codex CLI’s review mode is a first-pass filter for that category of comment. Run it before you push. Fix the obvious issues. By the time a human reads the diff, the mechanical problems are gone and the review can focus on what a human is actually good at: whether this is the right abstraction, whether it fits the codebase, whether the spec was read correctly.

What the command does

The core invocation is straightforward. Pass a diff or a file, get a structured list of findings:

# Review uncommitted changes (staged and unstaged) against HEAD
git diff HEAD | codex review

# Review a specific file
codex review src/lib/payments.ts

# Review a range of commits (the diff against the merge base)
codex review --from main --to HEAD
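
The piped form reviews whatever diff it receives, so standard git pathspecs narrow the scope without any extra flags:

# Review only the payments module, ignoring unrelated changes
# (path is illustrative -- use whatever subtree you care about)
git diff HEAD -- src/lib/payments.ts | codex review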

The output is a set of categorized findings, each with a severity label, the relevant line range, and a short explanation:

[HIGH] src/lib/payments.ts:47-49
  Potential null dereference: `user.subscription` is accessed without
  checking if `user` is null. The outer `if (response.ok)` does not
  guarantee `user` is defined.

[MEDIUM] src/lib/payments.ts:83
  N+1 query: `getInvoice` is called inside a loop over `items`.
  Consider batching or loading invoices before the loop.

[LOW] src/lib/payments.ts:112
  Unused variable `retryCount` is declared but never incremented.

The categories are: correctness, performance, security, and style. HIGH maps to correctness issues that can cause runtime failures; MEDIUM to performance or logic concerns; LOW to style and minor inconsistencies.
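
When the list is long, a quick filter isolates the blockers. This assumes the plain-text label format shown above; adjust the pattern if your version prints findings differently:

# Show only HIGH findings plus their explanation lines
git diff HEAD | codex review | grep -A 3 '^\[HIGH\]'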

What it surfaces well

Codex review mode is reliable for a specific class of problems — the ones that are local and mechanical:

Correctness bugs. Off-by-one errors, null dereferences, incorrect boundary conditions on range checks. These are the bugs that a type-checker can’t catch but a careful re-read usually would. Codex catches them because it’s doing that careful re-read on every path, not just the happy path you tested.

Performance concerns. N+1 queries inside loops are the canonical example. The model sees the loop structure and the database call pattern and flags the combination. It also catches things like sorting a large array inside a render function, or re-building a regex on every call when it could be compiled once.

Security issues. SQL injection from unparameterized queries, secret strings in log calls, user-controlled input passed directly to exec or eval. These are high-signal because the patterns are narrow and the model has seen a lot of them.

Style noise. Dead variables, inconsistent naming within the diff, commented-out code left in. Low severity, but catching these before review means reviewers aren’t spending words on trivia.

What it doesn’t surface

The review is shallow in exactly the places where human review is deep:

Architectural intent. Codex doesn’t know if the abstraction you chose is the right one for your codebase. It will catch a bug in the abstraction; it won’t tell you the abstraction itself is wrong. That call requires knowing the history of the codebase, what was tried before, and what the team agreed on.

Spec conformance. The model doesn’t have your requirements doc or your ticket. If you implemented the wrong thing correctly, the review will be clean. This is the most dangerous gap — Codex gives you confidence the code does what it looks like it does, not that what it does is what was asked.

Codebase voice. Style rules in aggregate, the patterns your team has settled on, the way error handling is done here — Codex hasn’t read the surrounding codebase. It will flag things that look wrong by general standards; it won’t flag things that look fine by general standards but are wrong for your project.

Missed test coverage. It won’t tell you the change needed a test and didn’t get one. It reviews what’s there; the absence of tests is invisible to it.

The practical implication: a clean Codex review is a useful signal, but it’s a narrow one. It means the code is probably not broken in obvious mechanical ways. It says nothing about whether the code solves the right problem, fits the team’s standards, or has adequate test coverage. Keep those expectations separate and the tool stays useful.
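
The test-coverage gap in particular is cheap to patch outside the tool. A minimal sketch, assuming code lives under src/ and tests under tests/ (adjust to your layout):

# Warn when a staged change touches src/ but no tests.
# Codex review can't flag missing tests; this catches the simplest case.
if git diff --cached --name-only | grep -q '^src/' && \
   ! git diff --cached --name-only | grep -q '^tests/'; then
  echo "note: src/ changed but tests/ did not"
fi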

Reading the output for signal

Most reviews mix useful findings with obvious-stuff-you-already-knew. Filtering for signal:

Run the review. Count the findings. If there are 12 findings, they’re not all equal. The pattern is usually 2-3 HIGHs that are worth stopping for, 3-4 MEDIUMs where some are real and some are noise, and several LOWs that you can address or discard in a minute.
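
The counting itself is a one-liner, assuming the severity labels sit at the start of a line as in the sample output above:

# Tally findings by severity before reading any of them
git diff HEAD | codex review | grep -oE '^\[(HIGH|MEDIUM|LOW)\]' | sort | uniq -c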

Go to the HIGHs first. Read the line in context, not just the summary. Codex sometimes flags a null dereference that’s protected by a guard the model missed — read the surrounding code before acting. When the HIGH is real, fix it immediately. This is the part of the review that saves you from a 2am incident.

For MEDIUMs: apply judgment. The N+1 concern is often real; the “consider caching this” suggestion is often noise. If you know the call volume doesn’t matter, mark it resolved and move on.

For LOWs: batch them. Fix the ones that take 10 seconds. Skip the ones that require a judgment call about naming philosophy. These aren’t worth spending review time on.

The goal isn’t a zero-finding review — it’s a review where the findings that remain are there because you made a deliberate choice, not because you didn’t look.

Integrating into a PR workflow

The pattern that works:

# 1. Finish the change. Stage the hunks you want reviewed.
git add -p

# 2. Run Codex review against the staged diff
git diff --cached | codex review

# 3. Fix the HIGHs and any MEDIUMs you agree with
# 4. Re-run to confirm they're gone
git diff --cached | codex review

# 5. Push and open the PR
git push origin feature/payments-retry

This costs maybe two minutes before push. What it buys: the PR description can say “Codex review clean” with a paste of the final output, human reviewers see a diff that’s been at least mechanically validated, and the comment thread will be about the things that actually need a human.

The anti-pattern: running Codex review after opening the PR, as if it’s a substitute for a review response. At that point you’re fixing findings in public, and reviewers have already read the diff with the issues still in it. Run it before, not after.
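
If “before, not after” needs enforcement rather than discipline, a pre-push hook can run the pass automatically. A sketch that assumes main is the merge target and makes no assumption about codex exit codes, so the abort decision stays with you:

#!/bin/sh
# .git/hooks/pre-push: print the mechanical pass before anything leaves the machine
# (interactive pushes only; a CI push has no tty and would need a bypass)
codex review --from main --to HEAD
printf 'Findings above. Enter to push, Ctrl-C to abort... '
read -r _ < /dev/tty

Ctrl-C exits the hook nonzero, which cancels the push; Enter lets it through.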

One more thing: the re-run step (step 4 in the workflow) is not optional. Fixes introduce new code. That new code can introduce new issues. A single review pass against the original diff and a second pass against the post-fix diff is the whole protocol — two passes, two minutes, reliable output.
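
Scripted, the protocol is two commands and a pause. A sketch, assuming fixes get re-staged with git add in between; the file name for the saved output is arbitrary:

# Pass one: find the issues
git diff --cached | codex review

# (fix HIGHs, re-stage with git add -p)

# Pass two: confirm, and keep a copy for the PR description
git diff --cached | codex review | tee codex-review.txt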

The “argue against me on purpose” use case

There’s a use case beyond standard review: asking Codex to specifically find fault with your approach, not just your implementation.

git diff HEAD | codex review --prompt "Focus on cases where this approach 
is fragile or where the assumptions are likely to break. I want reasons 
this is the wrong solution, not just bugs in the current one."

This works when you have a suspicion that you’re missing something but can’t identify it. The standard review mode is conservative — it flags clear problems. The prompted version can surface:

  • Edge cases in the assumptions (“this only works if the caller always holds the lock”)
  • Brittleness under future changes (“adding a second payment provider would require changes in 4 places”)
  • Approaches that would be simpler for the same result

The output here is less reliable than standard review — the model is reasoning about design space, not just reading code. Treat it as a source of questions, not verdicts. If a finding in this mode rings true, that’s a signal worth investigating. If it doesn’t, discard it without guilt.
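
The same prompt flag can chip away at the spec-conformance gap from earlier: hand the model the requirements alongside the diff. A sketch, assuming the ticket text lives in a local file (ticket.md here is hypothetical):

# Review the diff against the spec, not just general correctness
git diff HEAD | codex review --prompt "Here is the spec: $(cat ticket.md). 
Flag anything the diff does that the spec did not ask for, and anything 
the spec asked for that the diff skips."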

Where this fits in the larger picture

Codex review mode is useful because human review is expensive and scarce. It’s not useful as a replacement — the things it misses (architectural intent, spec conformance, codebase fit) are the things that matter most in a review.

The right mental model: Codex review is a pre-flight check. It clears the runway before the real review lands. Human reviewers have a finite amount of attention to spend on a diff. Every comment about a null dereference they have to write is a comment they’re not writing about whether this feature was the right call.

Tools that get better over time at the mechanical pass will make human code review more valuable, not less. The human reviewer who isn’t triaging null checks is the one who has time to notice that the abstraction you shipped will be a maintenance problem in six months.

The shift in review culture when a team adopts this: reviewers start spending their comments on meaning, not mechanics. The comment thread gets shorter and more useful at the same time. That’s a different outcome from “AI replaces review” — it’s AI handling the part of review that people find least interesting, so people can spend more time on the part they’re actually good at.

Run the mechanical pass before you ask for the thoughtful one.