
Pair programming with AI: three patterns, three failure modes

Published 2026-05-11 by Owner

Traditional pair programming has a driver and a navigator. The driver types; the navigator thinks ahead, spots mistakes, and directs the next move. The roles swap periodically. When AI enters the pair, the mechanics invert.

With an AI coding tool, the human rarely holds the keyboard in any meaningful sense. The model types thousands of characters per minute. What the human does instead is closer to what the navigator always did: decide what to build, notice when the direction is wrong, and redirect. The difference is that the navigator seat never switches back. The human stays in it.

That role permanence changes what productive AI collaboration looks like. There isn’t one pattern — there are three, each suited to a different kind of problem, each with a distinct failure mode built in. Understanding which pattern fits which situation is more useful than any single prompt trick.

Pattern A: AI as junior developer

The most reliable pattern. The human knows what to build and how to build it; the AI executes it faster than typing permits.

The workflow is directive. Here’s a concrete example of a prompt that works well:

Add a `validateSlug` function to `src/lib/schema.ts`.
- Accepts a string
- Checks it against `/^[a-z0-9-]+$/`
- Throws a typed `ValidationError` on failure
- The `ValidationError` class is already in `src/lib/errors.ts`

The model produces the function, the import, possibly a test stub. The human reviews and accepts, adjusts, or rejects.
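For concreteness, the output for that prompt would look roughly like the following sketch (the `ValidationError` constructor signature is an assumption, since the prompt only gives its location):

```ts
// src/lib/schema.ts
import { ValidationError } from "./errors"; // assumes ValidationError(message: string)

const SLUG_PATTERN = /^[a-z0-9-]+$/;

/** Throws if `slug` contains anything outside lowercase letters, digits, and hyphens. */
export function validateSlug(slug: string): void {
  if (!SLUG_PATTERN.test(slug)) {
    throw new ValidationError(`Invalid slug: "${slug}"`);
  }
}
```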

This pattern works consistently because it plays to what AI models are actually good at: translating well-specified intent into code quickly, under tight constraints, without needing judgment. The human supplies the judgment up front. The model handles the mechanical execution.

The first failure mode is specification drift. When the instruction is vague — “add a slug validator” — the model fills the gaps with its own assumptions. Some assumptions will be fine. Others will quietly introduce patterns that don’t match the codebase: a different error-handling style, a dependency that already has an equivalent in the project, a parameter name that conflicts with an existing convention. None of this is malicious; it’s the model defaulting to what it would write in a generic context.

The fix is always the same: tighten the spec before sending the prompt. A rough heuristic that works well: the prompt should be specific enough that a competent human contractor could implement it correctly from the text alone, without asking follow-up questions. If the contractor would need to ask, the model will guess — and the model’s guesses will look confident and plausible regardless of whether they’re correct.

The second failure mode is scope blindness. The model correctly implements what was asked and misses what wasn’t asked for. “Add rate limiting to the API endpoint” produces rate-limiting code that is technically correct, but doesn’t add tests, doesn’t update the API documentation, and doesn’t handle the edge case the human had in mind. A junior human developer would ask about scope; AI models don’t.

Compensating for this means being explicit about what “done” looks like:

Add rate limiting to `src/pages/api/tools.ts`.
Done means:
- Implementation with configurable limit (default 60 req/min)
- Tests for the 429 path in `tests/api/tools.test.ts`
- A note in `docs/api.md`

That kind of done-definition feels like extra work. It’s actually the same work shifted earlier: the mental model you’d otherwise form while reviewing the output, written down before the output exists.
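Under that definition, the implementation bullet might come back as something like this fixed-window sketch (the module path and helper name are hypothetical):

```ts
// src/lib/rateLimit.ts (hypothetical module) — a minimal fixed-window counter
const WINDOW_MS = 60_000;

const windows = new Map<string, { count: number; startedAt: number }>();

/** True while `key` has made at most `limit` requests in the current window. */
export function allowRequest(key: string, limit = 60): boolean {
  const now = Date.now();
  const w = windows.get(key);
  if (!w || now - w.startedAt >= WINDOW_MS) {
    windows.set(key, { count: 1, startedAt: now });
    return true;
  }
  w.count += 1;
  return w.count <= limit;
}
```

The endpoint responds with a 429 whenever `allowRequest` returns false, which is exactly the path the tests bullet pins down.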

Pattern B: AI as rubber duck

The classic rubber duck technique is to explain a problem out loud until you hear the answer in your own voice. It works better with an AI duck, because this duck can ask questions back.

The setup differs from Pattern A. There’s no specific implementation request. The human describes a problem, and the model is steered toward questioning and reflection rather than immediately producing a solution:

“I’m trying to decide where to put the affiliate URL resolution logic. Right now it’s inside a component, but the tests are hard to write because of that. I could move it to a utility module, but then I’d need to thread affiliate data down through a lot of props or pull it from global state, and both of those feel wrong.”

A well-directed response to that prompt does not immediately recommend an architecture. It asks:

  • “What’s the primary test you’re trying to make easy to write?”
  • “Is the difficulty in mocking the data, or in isolating the function logic itself?”
  • “What would the ideal test look like if there were no structural constraints?”

Those questions make the human think. The answer often surfaces the solution.
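In this example, the questioning might converge on extracting the resolution into a pure function, so the test needs no component at all. A sketch, with hypothetical names since the original component isn’t shown:

```ts
// src/lib/affiliate.ts (hypothetical extraction) — the component becomes a thin caller
export interface AffiliateConfig {
  partnerId: string;
  baseUrl: string;
}

/** Pure function: testable with plain inputs, no rendering or global state required. */
export function resolveAffiliateUrl(toolSlug: string, cfg: AffiliateConfig): string {
  const url = new URL(cfg.baseUrl);
  url.searchParams.set("ref", cfg.partnerId);
  url.searchParams.set("tool", toolSlug);
  return url.toString();
}
```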

The earned insight from this pattern: AI as rubber duck is most useful when the human already has the knowledge but can’t find the right framing. The model isn’t adding expertise — it’s providing the questions that help organize thinking that already exists. When Pattern B works, the conversation ends with the human saying “right, I knew that” rather than “I learned something new.” That’s not a failure of the pattern; that’s the pattern working.

Pattern B is also the right tool when the problem isn’t fully formed. If the human doesn’t know whether the issue is the architecture, the test strategy, or the data shape, trying to write a Pattern A prompt is premature. The rubber duck pass identifies which problem is actually worth solving before any code is written.

The failure mode here is the model jumping ahead. Most AI coding tools are trained to produce solutions. Left to default behavior, the model will answer an architectural question with a recommendation before asking any clarifying question. That recommendation might be technically sound, but it short-circuits the thinking process. The human accepts a proposed answer without developing their own understanding of why it’s right — and later, when the solution breaks or needs to change, that understanding isn’t there.

To use Pattern B effectively, steer explicitly: “Don’t propose a solution yet. Ask me questions until I have enough information to decide.” Some models resist this and will produce a question and a solution in the same response. If that happens, ignore the solution, answer the question, and repeat. The thinking is what matters; the solution can come from Pattern A once the thinking is done.

Pattern C: AI as second senior

For genuinely contested architectural decisions, Pattern C works differently from the others. The human has formed an opinion. The goal is not execution (Pattern A) and not finding the answer through reflection (Pattern B). The goal is stress-testing a position before committing to it.

Concretely: “I’m going to build the pricing page generation as a static route that runs at build time with getStaticPaths, not as a server-rendered route. My reasoning: build time goes up, but runtime latency is zero and there’s no edge function to debug. Pricing data changes maybe once a month. What’s the strongest case against this?”
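For context, the approach being defended is roughly this (a sketch assuming an Astro-style `getStaticPaths`; the route file and data loader are hypothetical):

```ts
// Frontmatter logic of a hypothetical src/pages/pricing/[slug] route.
// getStaticPaths runs once at build time, so pricing is baked into the HTML.
import { getAllTools } from "../../lib/tools"; // hypothetical loader

export async function getStaticPaths() {
  const tools = await getAllTools();
  return tools.map((tool) => ({
    params: { slug: tool.slug },
    props: { pricing: tool.pricing },
  }));
}
```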

The phrase “strongest case against” is the pivot. It forces the model to steelman the opposite view rather than validate the stated preference. The responses worth paying attention to are the ones that surface constraints the human missed:

  • “If the tool catalog grows to 2000 entries each with a pricing variant, does the build-time tradeoff still hold?”
  • “Static works until a pricing update needs to go live in the next ten minutes — what does deployment look like in that scenario?”
  • “What happens if a tool’s pricing page needs to show real-time availability?”

These are the same questions a senior engineer in a code review would ask before approving an approach. The model isn’t deciding anything — it’s helping the human find the edges of their own reasoning before any code is written.

Pattern C also works well when comparing two approaches the human is already weighing. “I’m choosing between approach X and approach Y. I have a mild preference for X. Argue for Y as strongly as you can, then tell me where that argument breaks down.” This produces a more honest evaluation than asking “which approach is better?” because it prevents the model from simply validating the stated preference.

The failure mode in Pattern C is sycophancy. Most AI models, when they sense the human is committed to a direction, will agree more than is warranted. They raise mild objections, acknowledge the human’s reasoning generously, and land somewhere close to “sounds reasonable.” The “strongest case against” framing reduces this tendency but doesn’t eliminate it.

A stronger prompt for getting genuine pushback: “Assume you’re a senior engineer at a company with very different constraints — say, a team that updates pricing five times a day and has eight engineers. Would they make the same decision? Why or why not?” Shifting the context breaks the model out of the local frame and tends to produce more substantive disagreement. The human then decides whether those different constraints are actually relevant to their situation. Often they aren’t, but occasionally they reveal a real edge case worth addressing now rather than later.

The failure: AI as the lead

All three patterns share one invariant: the human makes the final call. In Pattern A, the human specifies; in Pattern B, the human discovers their own answer; in Pattern C, the human stress-tests their own judgment. The model is a tool in each case.

The failure mode that sits outside all three patterns is AI as the lead — letting the model make architectural decisions that the human then implements.

It looks like this: “What’s the best way to structure the data layer for this tool?” The model answers with a recommendation. The human, without a strong prior opinion, accepts it and builds accordingly. The recommendation was plausible and well-explained. Three weeks later, the consequence surfaces — a query pattern that becomes slow at scale, a separation of concerns that makes the next feature awkward to add, a caching assumption that breaks under a specific data shape.

The core problem isn’t that AI models are wrong. They’re often right in a technical sense. The problem is that “technically plausible” is not the same as “right for this codebase, this team, these constraints.” The model doesn’t know:

  • What you regret about the last architecture decision
  • Which patterns your team finds hard to debug at 2am
  • That your deployment target has a constraint that rules out a whole class of solutions
  • What the next three features are likely to be

It will sound confident about a recommendation that ignores all of this, because confidence is the default register of AI output regardless of how much context the model actually has.

The dynamic compounds over time. A single deferred architectural decision is a small thing. A codebase where architectural choices have consistently been made by following model recommendations is a codebase shaped by the model’s priors rather than the team’s judgment. Those priors come from the training corpus — the aggregated style of code the model has seen at scale — not from the specific context of the project. The codebase starts to look like a plausible generic example of the stack rather than a deliberately designed system.

The asymmetry that matters: the model can reason about tradeoffs across a wide range of considerations, often more patiently than a human can in a ten-minute thinking session. That reasoning is genuinely valuable as input. The mistake is treating it as authoritative rather than as one more data point that the human evaluates and decides on.

Protecting human agency

The habits that keep the human in the navigator seat are small and don’t add much overhead:

Specify before prompting. For any Pattern A task, write down in one or two sentences what the output should look like before opening the chat window. This forces the judgment step to happen before the model acts, not after reviewing what the model produced. The alternative — prompting and then deciding whether to accept — inverts the order in a way that makes acceptance feel like the path of least resistance, because the output already exists and rejecting it means starting over.

Ask the rubber duck question first. Before any non-trivial implementation task, spend two minutes describing what you’re about to build and why. This surfaces assumptions that the code will encode. If the description surfaces a question, that question is worth answering before the code is written rather than after.

Disagree by default on architecture. When a model makes an architectural recommendation unprompted, the first response should be a question rather than acceptance. “What does this break at scale?” or “What assumption is this most dependent on?” These are the questions that would come up in any code review, and they should come up here too.

Notice when you’re post-hoc justifying. The signal that the model has taken the lead is when the human finds themselves explaining why the model’s choice was correct, rather than evaluating whether it is. That’s rationalization, not reasoning. The moment to notice it is before accepting and implementing, not after three weeks of building on a foundation that wasn’t examined.

Keep a short list of what the model missed. Every production incident or design regret traceable to AI-generated structure is worth a brief note: what didn’t the model know, and could a better prompt have supplied that context? Over time this builds a personal vocabulary for where model judgment reliably falls short — and that vocabulary is more useful than any general advice about AI limitations.

A summary of the patterns and when each fits:

  • Pattern A — use when you know what to build and need execution speed
  • Pattern B — use when you know the domain but can’t find the right framing for the problem
  • Pattern C — use when you have an opinion and need it tested before committing code
  • None of the above — when the model would be deciding what to build rather than helping you build it

The underlying principle is simpler than any of these habits suggest: the human decides what to build, and AI helps build it faster. When those two things stay in that order, the pair is productive. When they swap, the pair is fast and plausible-sounding, which is harder to diagnose than simply slow — because plausible-looking progress feels like real progress until, suddenly, it doesn’t.