Supervision from your pocket

On May 14 OpenAI put Codex inside the ChatGPT mobile app. You pair your phone to the Mac that is actually running the agent, and from then on the phone can review diffs, approve the commands Codex wants to run, and redirect a task that is already in motion. The work stays on the laptop. The approval moves to your pocket.

I want to be precise about why this bothers me, because the easy version of the complaint is wrong. The problem is not that phones are bad for code. The problem is which step they moved onto the phone.

The step they moved is the load-bearing one

A month ago I wrote about the supervision paradox: agentic tools only stay safe if a competent engineer reads the diffs and catches the wrong turn, and the act of delegating is what erodes the skill that reading requires. The whole argument turned on one fragile assumption — that the human actually does the reading. Everything good about agentic coding is downstream of that single step being done well.

Codex mobile takes that exact step and makes it something you do between other things. Not the writing, not the planning — the approving. The one point in the loop where human judgment is supposed to enter is now available on a six-inch screen while you are half-paying attention to something else. The release is well built and the onboarding is smooth, which is the part that worries me rather than reassures me.

Approving on a phone is not reviewing

Be honest about what diff review actually is. It is reading the change line by line, predicting what the next hunk will do before you scroll to it, and holding the surrounding file in your head well enough to notice when the change quietly crosses a boundary you cared about. None of that survives a thumb-scroll in a queue. You can look at a diff on a phone. You cannot review one, in the sense that matters, any more than you can proofread a contract from a notification preview.

The Anthropic figure I keep coming back to — debugging skill down 47% among heavy-AI users — did not come from engineers who refused to review. It came from people who reviewed less carefully than they believed they were reviewing. A phone is, structurally, a machine for reviewing less carefully than you think you are. It is optimized for glance-and-confirm, and glance-and-confirm is the precise motion the paradox said would erode you.

And the failure is invisible at the moment it happens. A diff you rubber-stamped from a queue does not announce itself: it merges clean, the tests stay green because the agent wrote the tests too, and the cost surfaces weeks later as a bug in code that no human ever actually read. That delay is the whole problem. The phone does not make you approve worse changes today; it removes the small moment of friction that used to be the place you caught the bad one, and it removes it precisely when you are least able to notice it is gone — moving, distracted, mid-errand. A skipped read feels exactly like a completed one until something breaks.

The steelman, which is real

There is a genuine counter-argument and I do not want to wave it away. Redirection is not review. Catching a running agent that has gone down the wrong path and stopping it from your phone is real value, and it has nothing to do with the reading problem — you are not evaluating a finished change, you are aborting a bad one. The same is true of unblocking: if the agent is stalled waiting on a yes/no you know cold, answering it from a queue is latency reduction on a decision you already made, not judgment compressed onto a small screen.

Those uses are good. They are also not the dangerous one, and the release bundles all of them into the same frictionless surface, which is how the dangerous one gets done by accident.

The line I draw

So I do not think Codex mobile is bad. I think it collapses two actions that should stay separate. Redirecting and unblocking from a phone: fine, sometimes excellent. Approving a diff you have not actually read, on a screen too small to read it on: that is the rubber stamp the paradox predicted, now with a notification badge. The danger was never that supervision is hard. It is that supervision is easy to skip, and the skip now has a great onboarding flow.

What I changed in my own setup is small and categorical. Codex mobile is on, and it does exactly two things for me: it kills a run that has gone wrong, and it answers a yes/no I would have answered identically at the desk. Diffs are not approved from the phone, ever. The rule has to be categorical, because the situational version — “I’ll only approve small diffs on mobile” — erodes under exactly the conditions that make the phone tempting in the first place: you are busy, you are away from the desk, and the agent is waiting on you.
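For anyone who wants the rule written down rather than remembered, here is a minimal sketch of it as an allow-list. Everything in it is hypothetical: the names `Surface`, `Action`, `PHONE_ALLOWED`, and `is_allowed` are mine for illustration and do not correspond to any Codex or ChatGPT setting or API. The point it shows is that the check takes no diff-size argument, so there is no situational version to erode.

```python
from enum import Enum


class Surface(Enum):
    """Where the decision is being made (hypothetical labels)."""
    DESK = "desk"
    PHONE = "phone"


class Action(Enum):
    """The three things a supervisor can do to a run (hypothetical labels)."""
    KILL_RUN = "kill_run"
    ANSWER_YES_NO = "answer_yes_no"
    APPROVE_DIFF = "approve_diff"


# Categorical allow-list for the phone: redirecting and unblocking only.
PHONE_ALLOWED = frozenset({Action.KILL_RUN, Action.ANSWER_YES_NO})


def is_allowed(surface: Surface, action: Action) -> bool:
    """Return True if the action may be taken from the given surface.

    Deliberately has no parameter for diff size or urgency: the rule
    depends only on the kind of action and where it is taken.
    """
    if surface is Surface.DESK:
        return True
    return action in PHONE_ALLOWED


if __name__ == "__main__":
    for action in Action:
        print(action.value, is_allowed(Surface.PHONE, action))
```

The design choice is the missing argument. A function that accepted a line count would invite a threshold, and a threshold is the situational rule under another name.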

The reason this matters past my own setup is that the same pressure is about to arrive at scale, from the other end: not one developer approving more from a phone, but entire orgs announcing that most of their code is now agent-written. The convenience and the scale are the same story at two zoom levels, and the cost lives in the same place: the step everyone is most tempted to skip. That is the next thing worth looking at honestly, in “the 60% claim, deconstructed.” For the release details, see “OpenAI brings Codex to the ChatGPT mobile app.”
