
Outcome

Pre-review AI loop cut average human review time from 42 minutes to 29 minutes per PR, mostly by catching obvious issues before reviewers saw them


A four-engineer team tested a simple idea: before opening a pull request for human review, run the diff through AI review from more than one angle.

The tools were Cursor, GitHub Copilot, and Aider. The goal was not to replace reviewers. The goal was to stop wasting reviewer time on issues the author could have caught alone: missing tests, inconsistent naming, unused branches, vague error handling, and accidental behavior changes.

The result was useful, but only after we made the loop smaller and more opinionated.

The baseline

The team worked on a TypeScript and Go codebase:

  • two frontend engineers
  • two backend engineers
  • roughly 8 to 15 PRs per week
  • median PR size around 420 changed lines
  • required human review before merge
  • CI already ran tests, lint, typecheck, and integration checks

Before the experiment, reviewers spent a lot of time on issues that were valid but not deep:

  • missing test for a new branch
  • unclear error message
  • inconsistent function naming
  • forgotten loading state
  • accidental change to an exported type
  • docs not updated after behavior changed

Those comments were necessary, but they were not a good use of scarce reviewer attention.

The first version failed

The first version of the process was too broad:

  1. Ask Cursor to review the diff.
  2. Ask Copilot to review the diff.
  3. Ask Aider to review the diff.
  4. Paste all comments into the PR description.

That produced noise. Each tool found a few useful items, but also repeated the same generic advice. The author then had to review the AI review before the human review. That was not a win.

We changed the rule: each tool gets a narrow job.

The review loop that worked

The final loop:

  1. Cursor: review the local diff for product and frontend-state issues.
  2. Copilot: review the code in IDE context for missing tests and obvious language-level mistakes.
  3. Aider: run a repo-aware diff review focused on behavior changes and files affected by the patch.
  4. Author: fix or reject findings, then write a short PR note: “AI pre-review done; fixed X, ignored Y because Z.”

The prompts were short and reused unchanged on every PR.

Cursor prompt:

Review this diff for user-visible behavior changes, loading states,
empty states, and inconsistent UI behavior. Do not comment on style
unless it affects behavior.

Copilot prompt:

Review the selected files for missing tests, null handling, and
language-level mistakes. Keep findings specific. No generic advice.

Aider prompt:

Review the git diff for behavior changes that are not covered by tests.
Focus on exported APIs, data migrations, and error handling. Do not edit files.

The “do not edit files” instruction mattered for Aider. This was a review loop, not an implementation loop.

What each tool caught

Cursor caught frontend behavior issues. It was the best at noticing that a loading state existed on one tab but not another, or that a disabled button did not explain why it was disabled.

Copilot caught local test gaps. It was good at pointing out a new branch in a function that had no matching unit test. It also caught a few TypeScript narrowing issues before CI did.
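
For illustration, this is the shape of narrowing gap it tended to flag; the names below are invented, not taken from the team's code:

// A minimal sketch of a narrowing issue that strict typecheck in CI
// would eventually reject, caught earlier in the editor.
type PaymentMethod = { id: string; last4: string };

function renderLast4(method: PaymentMethod | undefined): string {
  if (method === undefined) {
    console.warn("no payment method selected");
    // Without this early return, `method` would still be
    // `PaymentMethod | undefined` below, and `method.last4` would
    // fail the strict typecheck step in CI.
    return "—";
  }
  // From here on, `method` is narrowed to PaymentMethod.
  return method.last4;
}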

Aider caught repo-level side effects. Because Aider’s workflow is diff-centric, it was better at asking, “This exported helper changed shape; where else is it used?” It caught two issues where a backend helper change affected a path not touched in the PR.

None of the tools was reliably better on everything. The value came from giving them different lanes.

The numbers

We ran the loop for 28 PRs over three weeks.

Metric                            Before               During loop
Average human review time         42 min               29 min
Median PR comments from humans    7                    4
PRs returned for missing tests    9 of previous 28     3 of 28
AI comments rejected by author    not tracked          about 45%
Production regressions            1 in prior period    0 during test

The rejected-comment number is important. Almost half of the AI findings were not worth acting on. The loop worked anyway because the author filtered them before the PR reached reviewers.

Examples of useful findings

Cursor found that a billing-settings page had an empty state for “no payment methods” but no empty state for “payment methods still loading.” The component showed a blank panel for about a second on slow connections. Easy fix.
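
For context, here is a hedged sketch of that kind of fix in React and TypeScript; the component, endpoint, and field names are hypothetical, not the team's actual code:

import { useEffect, useState } from "react";

type PaymentMethod = { id: string; last4: string };

function PaymentMethodList() {
  // undefined = still loading, [] = loaded and genuinely empty
  const [methods, setMethods] = useState<PaymentMethod[] | undefined>(undefined);

  useEffect(() => {
    fetch("/api/payment-methods") // hypothetical endpoint
      .then((res) => res.json())
      .then((data: PaymentMethod[]) => setMethods(data));
  }, []);

  if (methods === undefined) {
    // The branch that was missing: without it the panel rendered blank
    // while the request was still in flight.
    return <p>Loading payment methods…</p>;
  }
  if (methods.length === 0) {
    return <p>No payment methods yet.</p>;
  }
  return (
    <ul>
      {methods.map((m) => (
        <li key={m.id}>•••• {m.last4}</li>
      ))}
    </ul>
  );
}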

Copilot found that a new parseWebhookEvent branch accepted an invoice.deleted event but no test covered deleted invoices. The test took five minutes to add.
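
A test of that kind might look like the sketch below, written in jest/vitest style; the module path and the exact payload shape are assumptions, since the real code is not shown here:

import { describe, expect, it } from "vitest";
import { parseWebhookEvent } from "./webhooks"; // hypothetical module path

describe("parseWebhookEvent", () => {
  it("accepts invoice.deleted events", () => {
    // Payload shape is illustrative; the real event schema is not
    // shown in this write-up.
    const event = parseWebhookEvent({
      type: "invoice.deleted",
      data: { invoiceId: "inv_123" },
    });

    expect(event.type).toBe("invoice.deleted");
    expect(event.data.invoiceId).toBe("inv_123");
  });
});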

Aider found that changing a Go helper from returning nil to returning an empty slice altered JSON output from null to []. Both shapes were valid Go, but one API customer depended on the old shape.
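
The Go side is simple: encoding/json marshals a nil slice as null and an empty slice as []. Why that mattered downstream is easier to see from the consumer's side; here is a hedged TypeScript sketch with invented field names:

type InvoiceResponse = {
  id: string;
  // Old helper: null when there were no line items. New helper: [].
  lineItems: Array<{ description: string }> | null;
};

// The customer's check depended on the old shape: null meant "no line
// items". After the change the API returned [] instead, so this check
// stopped firing and downstream behavior shifted.
function hasNoLineItems(resp: InvoiceResponse): boolean {
  return resp.lineItems === null;
}

// A shape-tolerant version handles both null and []:
function hasNoLineItemsSafe(resp: InvoiceResponse): boolean {
  return (resp.lineItems ?? []).length === 0;
}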

These are exactly the issues that reviewers should not have to discover manually.

Examples of bad findings

The tools also produced weak comments:

  • “Consider improving error handling” on code that already used the project pattern.
  • “Add logging” in paths where logging would have leaked sensitive fields.
  • “This could be refactored for readability” with no concrete defect.
  • “Use memoization” on a component where render cost did not matter.

The team adopted a rule: if an AI comment is not tied to a concrete risk, ignore it. Do not debate it. Do not paste it into the PR.

The human review changed

The best outcome was qualitative. Human reviewers spent less time on obvious cleanup and more time on design questions:

  • Is this the right data model?
  • Does this endpoint shape age well?
  • Are we leaking internal concepts into the UI?
  • Does this migration have a rollback path?

That is where human review belongs. AI pre-review did not make review unnecessary. It made review less cluttered.

What I would keep

The loop is worth keeping, with constraints:

  • run it before opening the PR, not after reviewers are already engaged
  • assign each tool a narrow review lane
  • require the author to filter findings
  • never paste raw AI review output into the PR
  • track whether human review time actually drops

If review time does not drop, the loop is ritual. Kill it.
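
One rough way to track that last point, sketched with Octokit: this measures time from PR creation to the first submitted review as a proxy, not actual reviewer minutes, and the owner and repo values are placeholders:

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = "your-org"; // placeholder
const repo = "your-repo"; // placeholder

async function averageMinutesToFirstReview(): Promise<number> {
  // Pull the most recent closed PRs.
  const { data: prs } = await octokit.rest.pulls.list({
    owner,
    repo,
    state: "closed",
    per_page: 50,
  });

  const minutes: number[] = [];
  for (const pr of prs) {
    const { data: reviews } = await octokit.rest.pulls.listReviews({
      owner,
      repo,
      pull_number: pr.number,
    });
    const first = reviews.find((r) => r.submitted_at);
    if (!first?.submitted_at) continue;
    const opened = new Date(pr.created_at).getTime();
    const reviewed = new Date(first.submitted_at).getTime();
    minutes.push((reviewed - opened) / 60_000);
  }

  return minutes.reduce((sum, m) => sum + m, 0) / Math.max(minutes.length, 1);
}

averageMinutesToFirstReview().then((avg) =>
  console.log(`average minutes to first review: ${avg.toFixed(1)}`)
);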

Verdict

The three-tool loop worked because it made authors do better self-review. It did not produce a magic reviewer. It produced a structured pause before asking teammates for attention.

For teams already using Cursor, Copilot, or Aider, this is a practical pattern: make the AI review the boring parts first, then make humans review the judgment calls.