The Opus 4.7 stack reshuffle

Last week I argued that the average serious developer runs 2.3 AI coding tools, and that the number was stable because each tool held a moat the others could not cross. Opus 4.7 is the first thing since then that makes me want to re-examine that claim honestly, so this is me doing that in public rather than quietly pretending last week’s post still holds unchanged.

What actually changed on April 16

Anthropic released Claude Opus 4.7 on April 16. The numbers are not marginal. SWE-bench Verified went from 80.8% to 87.6%. SWE-bench Pro — the harder, multi-language variant that is a better proxy for real work — went from 53.4% to 64.3%, against 57.7% for GPT-5.4 and 54.2% for Gemini 3.1 Pro. A 10.9-point jump on the harder benchmark in a single release is a large move, and it is the kind of move that propagates downstream into the tools built on the model rather than staying on a leaderboard.
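For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the gaps from the scores quoted above. The dictionary keys and the point_gap helper are my own illustrative names, not anything from Anthropic or the benchmark maintainers, and the inputs are only the figures cited in this post, not independent measurements.

```python
# Scores as quoted in this post, in percent. Key names are illustrative labels only.
verified = {"before": 80.8, "opus_4_7": 87.6}
pro = {"before": 53.4, "opus_4_7": 64.3, "gpt_5_4": 57.7, "gemini_3_1_pro": 54.2}


def point_gap(a: float, b: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(a - b, 1)


verified_jump = point_gap(verified["opus_4_7"], verified["before"])
pro_jump = point_gap(pro["opus_4_7"], pro["before"])
lead_over_gpt = point_gap(pro["opus_4_7"], pro["gpt_5_4"])
lead_over_gemini = point_gap(pro["opus_4_7"], pro["gemini_3_1_pro"])

print(f"SWE-bench Verified jump: {verified_jump} points")
print(f"SWE-bench Pro jump:      {pro_jump} points")
print(f"Pro lead over GPT-5.4:   {lead_over_gpt} points")
print(f"Pro lead over Gemini:    {lead_over_gemini} points")
```

On these figures the pre-4.7 score of 53.4% sat below both GPT-5.4 and Gemini 3.1 Pro on SWE-bench Pro, so the release turns a deficit on that benchmark into a lead of several points rather than extending an existing one.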

It already has. One widely shared May scorecard now places Claude Code alone at the top for the first time since that scorecard began tracking these tools. I want to be careful with that claim: it is one evaluator’s opinion, not a fact about the market, and scorecards over-weight whatever the author personally tests. But it is a reasonable opinion held by someone who tests seriously, and it is exactly the kind of claim that, repeated across enough developers, becomes a reason people reach for one tool first, which is how benchmark leads turn into habit changes.

What this does to the 2.3 number

In last week’s post on the 2.3-tool developer, I said Claude Code owned the “complex agentic task” slot, Cursor owned the editor, and Copilot owned the GitHub thread, and that nobody crossed into anyone else’s slot. Opus 4.7 does not break that structure. It widens one moat.

Claude Code’s slot — hard, multi-step, multi-file work — is precisely the slot that benefits most from a model that is materially better at hard, multi-step software tasks. The release does not let Claude Code take Cursor’s editor or Copilot’s GitHub integration. It makes the slot Claude Code already owned harder to contest. So the honest update to last week’s post is narrow but real: the equilibrium holds, but it is less symmetric than I implied. I described three roughly co-equal moats. After 4.7, one of them is visibly deeper than the other two.

The tool most exposed by that asymmetry is not Cursor or Copilot, which own different slots and are insulated. It is any new entrant trying to win the complex-task slot on raw model quality, which is exactly the corner Grok Build launched into the same week. I made that argument about Grok in a separate post on the crowded agent race; 4.7 is the specific reason that corner got harder to win just as someone new tried to enter it.

Why a benchmark lead is not a stack win

Here is the part the scorecard framing flattens. SWE-bench measures the model. Your daily tool choice is decided by the harness around the model: latency on a real edit, how context is retrieved from a large codebase, how diffs are presented for review, whether the tool is already inside the editor or the pull request where the work actually happens. Opus 4.7 makes Claude Code’s engine the strongest in the field. It does nothing for Cursor’s editor integration or Copilot’s position inside GitHub Actions — and those integrations are the reason developers keep those slots filled in the first place. A better model widens the gap on the one axis the model controls and leaves every other axis exactly where it was.

This is also why I am not relaxed about the second-order effect. A stronger Claude Code is precisely the kind of thing that tempts you to read fewer diffs, because the output is right more often, and “right more often” is how the supervision problem gets worse without anyone deciding to make it worse. A better engine raises the value of the work and the supervision tax on it at the same time. The improvement is real and the cost is also real; they arrive in the same release.

How I actually rank the stack now

Unchanged in shape, shifted in weight. Cursor stays the daily editor. Copilot stays for GitHub-bound work. Claude Code stays for the hard tasks — but the gap between “Claude Code for hard tasks” and “second-best for hard tasks” is now wide enough that, for the first time, I would tell someone forced to run exactly one tool to run Claude Code, where six months ago I would have said Cursor without hesitating. That is the actual reshuffle: not a change in how many tools people run, but a change in which single tool you keep if you are only allowed to keep one.

What would move me off that position: a Cursor or Copilot release that closes the model gap by routing hard tasks to a comparable model, or a 4.7-class jump from someone else that resets the field. Neither has happened. Until one does, the 2.3 number stays roughly where it was — it just leans harder on one slot than it did the week I wrote it down, and a stack that leans hard on one slot is a different, more fragile thing than a stack of three equal legs, even when the headline number has not moved at all.
