I’ve watched five teams try to use AI coding tools for architecture decisions over the past year. Two adopted them carefully and found them useful for narrow purposes. The other three pushed harder, asking the AI to “design the migration strategy” or “propose the right caching layer” and got back plausible-sounding answers that were wrong in subtle, expensive ways.
The pattern is consistent enough to be worth examining. AI coding tools are good at code, mediocre at modules, and bad at architecture. Understanding why this is structural — not just a matter of waiting for bigger models — is useful for figuring out where these tools actually belong in a team’s workflow.
What “architecture” means here
To head off the ambiguity that derails most of these discussions, by “architecture” I mean decisions about:
- How responsibilities are divided across services or modules
- How data flows between components
- Where state lives and how it’s synchronized
- Which third-party systems to depend on, and how
- Migration strategies between any of the above
These are different from “code-level” decisions like which sort algorithm to use or how to handle null inputs. Architecture decisions are evaluated against goals like maintainability, blast radius, observability, and reversibility. They’re judged years later by people who weren’t in the room.
What AI tools are good at
AI tools are good at things with a clear right answer that’s mostly determined by the immediate context:
- Writing the body of a function whose signature is given
- Filling in obvious patterns (CRUD endpoints, test scaffolds, validation rules)
- Translating between equivalent forms (JSON to TypeScript, REST to GraphQL queries)
- Suggesting names for things that follow naming conventions
- Autocompleting the next line in an established pattern
What these have in common: the right answer is heavily constrained by what’s already there. The AI’s job is pattern-matching against a corpus of similar code.
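To make that concrete, here’s the shape of task where the surrounding context does nearly all the work. The function below is just an illustration, not from any particular codebase:

```python
from datetime import date

# Given this signature and docstring, there is essentially one right
# body: the context fully determines the answer.
def parse_iso_date(value: str) -> date:
    """Parse a 'YYYY-MM-DD' string into a date; raises ValueError if malformed."""
    return date.fromisoformat(value)
```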
Why architecture isn’t pattern-matching
Architecture decisions don’t have a “right answer” determined by the immediate context. They have trade-offs evaluated against goals you haven’t fully specified, in a system that doesn’t fully exist yet, against constraints that will emerge later.
Consider: “Should we use a message queue or direct HTTP calls between these two services?” The right answer depends on:
- How often the producer service fails (and whether we’re OK with retries)
- Whether ordering matters
- Whether the consumer needs to scale independently
- How much operational overhead the team can absorb
- Whether observability tooling for the chosen mechanism exists in our stack
- What the team has experience operating
- The reliability targets of the calling service
- Whether downstream effects are idempotent
An AI tool can list these factors. It cannot weigh them, because the weights come from your specific organizational context — and your context isn’t in its training data. The AI will produce an answer that sounds reasonable, anchored on the modal answer for the modal team in its training corpus. That answer is decoupled from your situation.
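For concreteness, here’s roughly what the two mechanisms look like at the call site. This is a minimal sketch, assuming RabbitMQ via the `pika` client on the queue side; the hostnames, endpoint, and payload are all hypothetical:

```python
import json

import pika      # RabbitMQ client, standing in for any broker
import requests

ORDER_EVENT = {"order_id": "o-123", "status": "paid"}  # hypothetical payload

# Option A: direct HTTP call. The consumer must be up right now, and
# this request's latency and failure handling happen inline.
resp = requests.post(
    "https://fulfillment.internal/orders",  # hypothetical endpoint
    json=ORDER_EVENT,
    timeout=2.0,
)
resp.raise_for_status()

# Option B: publish to a queue. The consumer can be down, slow, or
# scaled independently, but now you own a broker, retries, dead-letter
# queues, and the observability story for all of them.
conn = pika.BlockingConnection(pika.ConnectionParameters("broker.internal"))
channel = conn.channel()
channel.queue_declare(queue="order-events", durable=True)
channel.basic_publish(exchange="", routing_key="order-events",
                      body=json.dumps(ORDER_EVENT))
conn.close()
```

Nothing in either snippet tells you which one is right; every factor on the list above lives outside the code.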
The “average plausible answer” failure mode
Here’s the failure pattern in concrete form. You ask: “We’re building a checkout flow that needs to call the payment processor and update the order DB. Should the payment call happen in a background job or inline with the API response?”
A typical AI response covers both options, lists pros and cons, and recommends one (usually background jobs because they’re “more scalable”). It’s a plausible answer.
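For concreteness, here’s the shape of each option, stripped of any framework. `charge_payment` and the `mark_order_*` helpers are hypothetical stand-ins for the processor call and the order DB:

```python
import queue
import threading

# Hypothetical stand-ins for the payment processor and the order DB.
def charge_payment(order_id: str) -> None: ...
def mark_order_pending(order_id: str) -> None: ...
def mark_order_paid(order_id: str) -> None: ...

# Option A: inline. The API response waits on the processor, so a slow
# or failing charge surfaces immediately as a failed request.
def checkout_inline(order_id: str) -> dict:
    charge_payment(order_id)  # may raise; the caller sees it right now
    mark_order_paid(order_id)
    return {"order_id": order_id, "status": "paid"}

# Option B: background job. The response returns fast, but the order
# sits in "pending" until a worker gets to it, and failures surface
# somewhere other than the request that caused them.
jobs: queue.Queue = queue.Queue()

def checkout_async(order_id: str) -> dict:
    mark_order_pending(order_id)
    jobs.put(order_id)  # a real system would use Celery, SQS, etc.
    return {"order_id": order_id, "status": "pending"}

def payment_worker() -> None:
    while True:
        order_id = jobs.get()
        charge_payment(order_id)  # retries and "stuck in pending" live here
        mark_order_paid(order_id)

threading.Thread(target=payment_worker, daemon=True).start()
```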
What it doesn’t know:
- Your team has never operated a background job system; the on-call rotation isn’t trained for it
- Your customer support team needs immediate visibility when payments fail; a job queue makes this harder
- Your payment provider is unreliable enough that you’ve decided sync timeout failures are preferable to “stuck in pending” states
- The “scalability” the AI invokes isn’t your bottleneck; you’re at 50 transactions per minute and might never grow past that
The AI’s answer is reasonable advice for the average company. You’re not the average company. The advice is wrong for you, and worse, it’s wrong in a way that’s hard to detect because it sounds right.
Why bigger models won’t fix this
The tempting response is: “Sure, current models are limited, but a smarter model with more context will handle this.”
Two reasons this is unlikely to be true even with much better models:
The information isn’t in the model. The trade-offs that matter for your decision are about your team, your operational context, your reliability targets, your customer expectations. None of this is in any pretraining corpus. You can put some of it in a prompt, but the full picture is more than will fit in any context window, and most of it is tacit knowledge nobody has written down.
The decisions aren’t statically correct. Code is mostly static — the function does what it does, regardless of who runs it. Architecture is dynamic — the same decision can be right for one team and wrong for another, right at one stage and wrong at another. A model trained to predict the most likely answer will collapse the variation, picking the modal answer that’s wrong for any specific case.
These aren’t “model gets bigger and figures it out” problems. They’re “the question doesn’t have a model-shaped answer” problems.
Where AI helps with architecture
This isn’t to say AI is useless for architecture work. There are real, narrow uses:
Brainstorming alternatives. “What are the ways to structure this?” produces a list. The list is useful even if the AI can’t pick from it. You evaluate against your context.
Surfacing trade-offs you hadn’t considered. “What could go wrong with this design?” sometimes produces failure modes you didn’t think about. Even when most of the answer is generic, the one or two relevant points are worth the prompt.
Explaining unfamiliar patterns. When you’re considering a system you don’t fully understand (event sourcing, CQRS, saga pattern), AI can explain it in your terms. That’s useful for evaluation, even if the AI can’t tell you whether to adopt it; a toy sketch of what that looks like follows this list.
Sanity-checking after the fact. “Here’s the design we settled on. What are the risks?” surfaces concerns you can verify against your context. Some are real and most are noise, but the ratio is still favorable.
What unifies these: the human is doing the deciding. The AI is a thinking aid, not a decision maker.
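As an example of the third use above: the kind of compressed explanation worth asking for when evaluating an unfamiliar pattern. A toy event-sourcing sketch, illustrative only:

```python
from dataclasses import dataclass

# Event sourcing in one screen: store the events that produced the
# state, not the state itself, and rebuild state by replaying them.
@dataclass
class Deposited:
    amount: int

@dataclass
class Withdrew:
    amount: int

def balance(events: list) -> int:
    total = 0
    for e in events:
        if isinstance(e, Deposited):
            total += e.amount
        elif isinstance(e, Withdrew):
            total -= e.amount
    return total

log = [Deposited(100), Withdrew(30), Deposited(5)]
assert balance(log) == 75  # current state is derived, never stored
```

Whether your system wants this is exactly the kind of question the sketch, and the AI that produced it, can’t answer for you.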
The actual failure mode
The teams I’ve watched go wrong with AI on architecture aren’t the ones that asked AI for advice and ignored most of it. They’re the ones that treated the AI’s confident-sounding answer as if it were a senior architect’s call. They didn’t notice the absence of context-specific knowledge because the answer’s tone implied that knowledge.
The real risk isn’t that AI gives bad architectural advice; every source of advice is mixed-quality on hard problems. The risk is that the confidence with which AI delivers its advice makes it harder to discount than a human’s, because humans signal their uncertainty and AI usually doesn’t.
What to actually do
For architecture decisions, treat AI as you’d treat a junior developer who’s read a lot of textbooks and never run a system in production. Their input is occasionally useful. You don’t let them make the call.
For implementation decisions inside the architecture you’ve chosen, AI is a useful tool. Use it.
The line between architecture and implementation is fuzzier than I’m making it sound. That fuzziness is its own problem — and it’s where the most expensive AI failures happen, in the gray zone where a decision feels like implementation but actually has architectural consequences.
When in doubt, treat it as architecture. The cost of consulting AI for a code-level decision and being wrong is small. The cost of consulting AI for an architecture decision and being wrong is large, and shows up over years. The asymmetry suggests caution.