When I review a PR these days, I’m increasingly aware that I don’t know who or what wrote it. The author field says a name. Some percentage of the diff was an agent’s output. The author edited the agent’s output. The author also wrote some lines themselves. The “code review” I’m doing is on a hybrid artifact whose origin I can’t fully reconstruct.
This isn’t a problem in itself. But it changes what review is actually for, in ways the standard rituals haven’t caught up to.
What review used to be for
Before AI agents, code review served several purposes:
- Catch bugs. Find logic errors, edge cases, integration issues.
- Maintain quality. Ensure code matches project conventions, is readable, is testable.
- Spread knowledge. The reviewer learns what the author did; both share understanding.
- Mentor. Senior reviewers teach junior authors through specific feedback.
- Gate-keep. Prevent the worst code from landing.
Each of these matters more or less depending on the team. A team of strong individual engineers might emphasize knowledge spreading; a team with many juniors might emphasize mentoring; a team under strict compliance requirements might emphasize gate-keeping.
The point: review served multiple purposes simultaneously. The act of reviewing produced multiple outputs.
What changes with agentic coding
When the code is partially or fully agent-produced, several of those purposes shift:
Catching bugs. Still important; arguably more important. Agents introduce specific failure modes (plausible-but-wrong code, missed edge cases) that a human reviewer is well placed to catch. The reviewer’s bug-catching role is unchanged, or has grown.
Maintaining quality. Mixed effect. Agents follow project conventions when configured well, often better than tired humans. Quality maintenance can be partly outsourced to the agent’s rules. But “is this the right design?” remains human.
Spreading knowledge. Reduced. The author may not know what the agent did. The reviewer learning what the author did doesn’t help if the author doesn’t fully know either. Knowledge spreading needs different rituals.
Mentoring. Mixed. Reviewers can still teach principles. But the author’s “I did this” is no longer true in the same way. The teaching becomes more about evaluating output than about authoring craft.
Gate-keeping. Different shape. Gates against agent failure modes (plausibility traps, mass-produced mediocrity) require different signals than gates against human failure modes.
The honest question
Reviewing a hybrid agent/human PR, the reviewer is asking different questions than they used to:
Pre-agent: “Did the author make good decisions?”
Post-agent: “Did the author + agent + the author’s evaluation of the agent produce something good?”
The question has more layers. The author’s role is partially evaluative — they accepted or rejected agent suggestions. The reviewer is now evaluating both the underlying code and the author’s judgment about the code.
This isn’t necessarily worse. It’s different. The skill of evaluating someone else’s evaluation is itself a skill, and it’s the one reviewers increasingly need.
What I look for now
After several months of this, the heuristics I use have shifted:
Does the author understand what’s there? The most important question. If I ask “why does this work?” and they can’t explain, the agent’s output is in the codebase but the author hasn’t internalized it. That’s a problem.
Are there agent-shaped mistakes? Code that looks right structurally but has subtle issues: defensive null checks where none belong, generic error handling instead of specific, “as any” type assertions added to quiet the typechecker. These are signatures of “agent produced, author didn’t catch.”
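Here’s a contrived TypeScript sketch of those signatures in one place (the User type and the endpoint are hypothetical, invented for illustration):

```typescript
// Contrived illustration; the User type and the /api/users endpoint
// are hypothetical, not from any real codebase.
type User = { id: string; email: string };

async function loadUser(id: string): Promise<User> {
  const res = await fetch(`/api/users/${id}`);

  // Agent-shaped: a defensive check that can never fire (fetch resolves
  // with a Response object), while the check that matters (res.ok) is missing.
  if (!res) {
    throw new Error("Something went wrong"); // generic, context-free error
  }

  const body: unknown = await res.json();

  // Agent-shaped: "as any" to satisfy the typechecker instead of
  // validating that the response actually has the User shape.
  return body as any;
}
```

Each line typechecks and looks defensible in isolation; the problem only shows up when you ask what failure the code is actually guarding against.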
Are there human-shaped mistakes? Different signatures. Off-by-one errors that the agent would have avoided, missing test cases an agent would have suggested. When the agent’s contribution was minimal, the human errors are still there.
Is the cohesion right? Agent output sometimes doesn’t fit the surrounding code. The patterns are right in isolation but jarring in context. Cohesion is something only a human reviewer can easily evaluate.
Is the test coverage real? Tests that exercise behavior vs. tests that pass without exercising the right thing. Agents are good at producing the latter; humans are needed to verify the former.
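The difference is easy to show in a Jest-style sketch (slugify is a hypothetical function, used only for illustration):

```typescript
import { slugify } from "./slugify"; // hypothetical module, for illustration

// Passes without exercising the right thing: it never pins down the output.
test("slugify returns a string", () => {
  expect(typeof slugify("Hello World")).toBe("string");
});

// Exercises behavior: specific inputs, specific expected outputs,
// including an edge case with surrounding whitespace.
test("slugify lowercases and hyphenates", () => {
  expect(slugify("Hello World")).toBe("hello-world");
  expect(slugify("  trim me  ")).toBe("trim-me");
});
```

Both tests raise the coverage number; only the second would catch a plausible-but-wrong implementation.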
What’s harder
A few things I find harder in agentic-PR review:
Estimating effort. Pre-agent, a 500-line PR was a meaningful chunk of work. The reviewer’s time was calibrated to “understand 500 lines someone thought through.” Now a 500-line PR could be 30 minutes of agent work the author barely thought about. Calibrating my review depth is harder.
Asking “why this approach?” Pre-agent, the author chose an approach. They could explain. Now the agent often chose the approach; the author accepted it. “Why this approach?” gets the answer “the agent suggested it.” That’s not enough.
Trusting the trail. Pre-agent, I could trust git blame and PR history to understand the context. Now I can’t tell from the diff what was agent vs. author. The trail is less informative.
Time-investment match. Pre-agent, the reviewer spent ~1/4 the time the author did. The ratio worked because authoring took longer than reviewing. Now, with agentic acceleration, the author’s time can drop while the review time can’t. The ratio is shifting and I’m not sure where it ends up.
What teams should change
A few practical adjustments I’ve seen work:
Require the author to articulate intent in the PR description. What were they trying to do? What did they consider? What did they reject? This forces the author to do the cognitive work the reviewer needs to evaluate against.
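As a sketch, a PR template along these lines (the file path follows GitHub’s pull request template convention; the section headings are illustrative, not a standard):

```markdown
<!-- .github/pull_request_template.md -->
## Intent
What were you trying to do, and why?

## Approach
What approach did you take? What alternatives did you consider and reject?

## Agent involvement
Which parts were largely agent-generated? What did you verify by hand?

## Risks
What could this break, and how would we know?
```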
Tag agent contributions explicitly. Some teams ask authors to flag “this section was largely agent-generated; I reviewed but the design wasn’t mine.” The flag changes how the reviewer reads.
Increase test coverage requirements. When the code might be plausible-but-wrong, tests that exercise behavior matter more. Bumping coverage requirements compensates for the higher base risk.
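As a concrete sketch, assuming a Jest-based setup (the threshold numbers are illustrative, not a recommendation):

```typescript
// jest.config.ts: a minimal sketch; the thresholds are illustrative.
import type { Config } from "jest";

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      branches: 85,
      functions: 85,
      lines: 85,
      statements: 85,
    },
  },
};

export default config;
```

With coverageThreshold set, the test run fails when coverage drops below the configured levels, turning the requirement into an automated gate rather than a review-time negotiation.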
Pair-review for high-risk PRs. Two reviewers on PRs that touch critical paths. The cognitive load of evaluating agent-influenced code is higher; one reviewer may miss what two would catch.
Spend less time on style, more on design. Agents handle style well. Reviewers should ignore style nits and focus on design, integration, and edge cases. The split is different from what it was pre-agent.
What I’m watching
The trends I’m watching for:
Whether teams shift to “less code, more time per line.” Agents can produce a lot of code. Whether teams adjust to “ship less, but each line carefully reviewed” is an open question.
Whether AI-assisted review tools mature. BugBot, Copilot Review, and similar are first-generation. The next generation may catch more agent-shaped mistakes. The arms race is interesting.
Whether post-merge incident rates change. Are teams shipping more bugs because of agentic code? Fewer? The data is starting to come in but isn’t clear yet.
Whether review becomes a primary skill. If authoring becomes more agent-mediated, reviewing becomes more critical. Reviewers may become a higher-status role on teams. Or maybe not.
Closing observation
Code review used to be a single thing with multiple purposes. With agentic coding, the purposes are shifting: some get harder, some get easier, and the overall shape of the practice is changing.
Teams that haven’t thought about this are still doing pre-agent reviews on post-agent code. The fit is poor. Bugs ship that wouldn’t have shipped before; reviews take longer than they used to; reviewers are tired and confused.
Teams that have adapted are getting closer to a sustainable rhythm. The new shape involves more upfront articulation by authors, more skeptical evaluation by reviewers, and tighter automated gates around the parts that humans no longer reliably catch.
We’re in the middle of the transition. The endpoint isn’t fully visible. What’s clear is that “code review the way we did it in 2022” doesn’t quite fit anymore, and pretending it does costs the team in ways that aren’t immediately visible.