
Where AI coding tools stop helping on legacy code

Published 2026-05-11 by Owner

Greenfield code is where AI coding tools look their best. Empty directory, clear requirements, modern conventions — the model’s training data maps neatly onto the task. The suggestions land, the structure holds, and the loop is fast.

Legacy code breaks this in four specific ways. Not because AI tools are bad, but because what they’re trained on is a poor match for what legacy code actually is. The failure modes are predictable once you know them. And they cluster.

The four things that confuse AI on legacy

Custom DSLs and internal frameworks. Lots of mature codebases have internal domain-specific languages, query builders, templating systems, or workflow engines written before the current generation of libraries existed. The AI hasn’t seen these. When it encounters them, it tends to treat them as malformed versions of something it does know — a template syntax that looks almost like Jinja, a query builder that looks almost like Knex. The suggestions it makes are plausible-looking and wrong. The model is pattern-matching to the closest known thing, which isn’t the actual thing.

This is not the AI hallucinating in the pejorative sense. It’s doing what it was trained to do — recognize patterns and extend them. The problem is that a bespoke 2011 query DSL doesn’t follow the patterns that dominate its training data. So the extensions it proposes fail at runtime.
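To make the failure concrete, here's a hypothetical sketch in TypeScript — the DSL, its verbs, and the AI's suggestion are all invented for illustration:

```typescript
// A bespoke 2011-era query builder. The verbs are close enough to
// Knex that a model will "correct" them toward what it knows.
class Q {
  private filters: Record<string, string> = {};
  private columns: string[] = [];
  private constructor(private tableName: string) {}
  static table(name: string) { return new Q(name); }
  match(field: string, value: string) { this.filters[field] = value; return this; } // exact-match filter
  pull(cols: string[]) { this.columns = cols; return this; }                        // column projection
}

// Correct usage in the legacy codebase:
const q = Q.table("users").match("status", "active").pull(["id", "email"]);

// Typical AI suggestion, pattern-matched to Knex — plausible and wrong:
//   Q.table("users").where("status", "active").select("id", "email")
// .where and .select don't exist on this builder; it fails at runtime.
```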

Undocumented invariants. A 15-year-old codebase accumulates rules that aren’t in comments, aren’t in docs, and aren’t in tests. They exist in the code’s behavior and in the heads of people who’ve been there long enough. “The state field must be written before status because the billing integration polls for status changes and state is the canonical source.” No comment. No test. Just a code path that breaks in production if the ordering changes. AI tools cannot know these invariants exist. They’ll happily suggest changes that violate them, and the violation will look like a clean refactor.

This is the deepest problem. The AI can read everything visible and still not know. The knowledge it’s missing was never written down because the people who knew it didn’t think to write it down, or left before they could, or assumed the next person would figure it out the hard way. The AI is in the same position as a contractor dropped into a codebase with no orientation. Except the contractor at least knows they don’t know.
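The billing example above is the shape of the problem. A hypothetical sketch (the names and the Db interface are invented):

```typescript
interface Db {
  update(table: string, id: string, fields: Record<string, string>): Promise<void>;
}

// Nothing in this code says the write order matters — but it does.
async function transitionOrder(db: Db, id: string, next: string) {
  // `state` must be written first: the billing integration polls for
  // `status` changes and reads `state` as the canonical source when it
  // sees one. Swap these two lines and billing reads stale state.
  await db.update("orders", id, { state: next });
  await db.update("orders", id, { status: next });
}
```

A refactor that reorders these writes, or parallelizes them with Promise.all, would look like a harmless cleanup in review.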

Dead code that looks alive. Modern IDEs and type systems catch most dead code. Legacy systems — especially dynamic-language codebases — don’t. There are functions invoked via string dispatch. There are methods called by a metaprogramming layer that no static analysis understands. There are config keys consumed by a service that’s no longer deployed but whose config format is still loaded on startup. The AI sees the code, sees no callers (because no static callers exist), and flags it as dead. Sometimes it’s right. Sometimes it’s the load-bearing pillar disguised as scaffolding.
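String dispatch is the classic case. A minimal TypeScript sketch (the job names are invented):

```typescript
// Handlers invoked by name, with the names arriving from a queue.
const handlers: Record<string, () => void> = {
  recalc_tax: () => { /* ... */ },
  sync_ledger: () => { /* ... */ },
};

function dispatch(jobName: string) {
  const handler = handlers[jobName];
  if (handler) handler();
}

// No static call site for sync_ledger exists anywhere in the repo.
// An AI (or grep) sees an "unused" function; production runs it
// hundreds of times a day.
```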

The risk here isn’t just that the AI flags real code as dead. It’s that the AI will sometimes propose to clean up the “dead” code as part of another change. The cleanup is embedded in a diff that looks like it’s doing something else entirely. Reviewers miss it. It ships. And three weeks later, something fails in a way that takes days to connect back to the removal.

Ancient conventions that clash with modern training data. An early-2010s Ruby codebase does not look like what a model trained on recent GitHub repos expects. A 2008 PHP application does not look like modern PHP. The error-handling style, the class structure, the database interaction patterns — all of it registers as “code that could be improved” to a model whose training corpus skews toward recent open-source work. The AI’s instinct is to fix what it sees. That’s where the trouble starts.

This is worth sitting with: the AI’s suggestions for “improvements” on old code aren’t random. They’re confident, specific, and would genuinely improve greenfield code. The problem is that the code in question is not greenfield. It’s load-bearing in ways that aren’t visible from the code itself. The AI’s suggestions are right in general and wrong in context.

The modernization trap

The single most expensive mistake AI tools make on legacy code: being asked to touch one thing and deciding to modernize the surrounding area.

An example that illustrates the pattern. A team needed to update a single API endpoint handler in a 12-year-old Express application. The handler used callback-style error handling — (err, result) callbacks throughout — because the app predated widespread async/await adoption. The task was simple: add a new X-Request-ID header to responses from that endpoint.

Cursor did that. It also converted the callback chain to async/await, extracted some inline logic to helpers, renamed a few variables to match modern conventions, and reformatted the file. The diff was clean. It looked like an improvement.

The header was added correctly. The async/await conversion broke production within 12 hours. Two of the callbacks in that chain relied on synchronous error propagation, which async/await doesn’t replicate: a throw that once surfaced on the same tick became a rejected promise instead. The tests didn’t catch it because the suite was written before anyone treated that propagation behavior as a critical invariant. The test suite passed; production didn’t.
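What follows is a hypothetical reconstruction of that failure mode, not the team’s actual code — the names are invented, the mechanics are the general ones:

```typescript
function validate(id: string) {
  if (!id) throw new Error("missing id");
}

// Callback era: a synchronous throw from validate() propagates up
// through the caller's try/catch on the same tick.
function loadAccount(id: string, cb: (err: Error | null, account?: object) => void) {
  validate(id); // throws synchronously on bad input — by design
  cb(null, { id });
}

try {
  loadAccount("", (err, account) => { /* ... */ });
} catch (e) {
  // The legacy handler relies on landing here for bad input.
}

// Naive conversion: the same throw now becomes a rejected promise.
// Any caller that invokes this without await — common when only one
// file in the chain has been converted — bypasses every try/catch
// and surfaces as an unhandled rejection instead.
async function loadAccountAsync(id: string): Promise<object> {
  validate(id);
  return { id };
}
```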

The header addition was five lines. The modernization was 80 lines of changes across a module the team didn’t fully understand. The combination was hard to test in the time available and expensive to debug once it failed.

This pattern repeats: AI sees old code, pattern-matches to “this doesn’t look right,” and expands scope beyond the original task. The scope expansion is the dangerous part. The original task was right-sized. The AI’s version wasn’t.

There’s an asymmetry that makes this worse: scope expansion on greenfield code is annoying but usually safe. The code is recent, the team understands it, the tests cover it. Scope expansion on legacy code is dangerous because the team often does not understand it fully, the tests don’t cover the invariants that matter, and “looks better” is not the same as “is equivalent.”

The right scope on legacy

The rule that consistently reduces incidents: on legacy code, the correct unit of AI-assisted change is small, isolated, and contains no rewrites.

Concretely:

  • One function, one concern. Don’t ask the AI to change a function and let it decide what “nearby” code also needs work. Constrain the target explicitly.
  • No style changes bundled with logic changes. If the change is logic, leave the style alone. If the change is style (renaming, formatting), leave the logic alone. Mixing them makes the diff unreviewable and removes your ability to test the logic change in isolation.
  • Explicitly prohibit scope expansion in the prompt. “Only add the header field. Do not change any other part of this function. Do not convert callback patterns. Do not rename variables.” Without these constraints, the model treats them as optional. They’re not optional on legacy code. (A fuller example follows this list.)
  • Review every line of the diff before accepting. On greenfield code, it’s reasonable to skim diffs and spot-check. On legacy code, every line matters because any line could be the one that breaks an invariant you didn’t know existed. The AI doesn’t get the benefit of the doubt on legacy the way it does on fresh code.
  • Treat “fix while I’m in here” impulses with suspicion. AI will notice adjacent things that look suboptimal and offer to address them. Each suggestion might be correct in isolation. On legacy code, “correct in isolation” and “safe to change” are different questions. The AI can only answer the first.
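Put together, a fully scoped request reads less like a task description and more like a contract. The file and function names here are placeholders:

In src/routes/orders.js, add an X-Request-ID header to the response in
handleGetOrder only. Do not change any other part of this function. Do not
convert callback patterns. Do not rename variables. Do not reformat the
file. If you think adjacent code needs changes, list them separately
instead of making them.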

Small scope also means reviewable scope. A five-line diff in a 2009 Java class can be understood by someone who hasn’t lived in it. An 80-line diff in that same class cannot be, not on a tight timeline and not by someone who doesn’t know the invariants.

There’s also a psychological dimension to this. When an AI produces a large, clean-looking diff, it’s easy to feel like the work is done and safe. The diff is coherent; the tests pass; the AI was confident. That confidence is contagious and often incorrect on legacy code. Keeping scope small keeps the scope of possible error small, which keeps the pressure to skip verification manageable.

Using AI as a read-only archaeologist first

The sequence that consistently works: read before write. Understand before change.

Before asking AI to modify anything in a legacy codebase, use it to explain what exists. This is the AI’s most reliable mode on legacy — reading and summarizing — and the one with essentially zero downside risk. No change, no breakage.

A practical set of archaeology prompts:

Before I make any changes: explain what this function does, who calls it,
and what assumptions seem to be relied on by the callers. Don't suggest
changes. Just describe what's there.

Walk me through the data flow from when a request hits [EntryPoint] to
when [Outcome] happens. Trace only what exists. Don't suggest improvements.
Focus on what might be surprising to someone not familiar with this code.

List every place in this module that looks like it might be a load-bearing
assumption — something that would silently break if changed. Don't fix
anything; just identify. Include things that look like dead code but might
not be.

This function uses a pattern I don't recognize. Describe what it's doing
without assuming it's wrong. What would make sense here given the surrounding
context and the era this code was likely written?

The “don’t suggest changes” and “don’t fix anything” clauses matter. Without them, AI tools frequently mix explanation with recommendation. The explanation is valuable. The recommendation, on legacy code, is often the dangerous part. Keep them separate.

A subtler benefit of the read-first sequence: it forces a conversation before any change happens. You might discover the code does something you didn’t expect. You might discover the function you were about to modify is called from seven places you didn’t know about. You might discover the AI can’t explain it consistently — which is itself a signal worth having before writing anything. Spending 15 minutes on archaeology before touching code has, more than once, prevented me from making a change that would have been correct-looking and wrong.

After the archaeology pass, the picture is clearer: here’s what the code does, here’s what the AI identifies as potentially load-bearing, here’s what it thinks it understands well versus poorly. With that picture in hand, the decision about whether to change anything — and how specifically to scope the change — becomes a human decision informed by better context. The AI produced the map. The engineer decides where to walk and whether the destination is worth the trip.

When to put the AI away

There’s a class of legacy code where AI assistance is net-negative: code that’s too entangled for changes to be safely isolated, where the model’s uncertainty about what’s load-bearing is high, and where a wrong suggestion costs more in debugging and incidents than the saved time is worth.

Signals that this threshold has been crossed:

Multiple conflicting calling conventions in one module. The module was updated by multiple people across multiple eras, each with different styles and assumptions. The AI sees the mix and can’t tell which convention was deliberate and which was accidental. Its suggestions will be locally consistent and globally incoherent.

Behavior driven by runtime state, database values, or environment rather than code. The AI reads the code and misses that actual behavior depends on something external it can’t see. Configuration tables, feature flags set in the database, environment-specific overrides applied at startup — the code looks one way; the system behaves another. Any suggestion the AI makes will be built on an incomplete model of what the code actually does when it runs.
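A small sketch of the gap (the config table and flag name are invented):

```typescript
type Config = Map<string, string>;

// Hydrated at startup from a database table. In the repo this map is
// effectively empty; in production it isn't, and it differs per environment.
async function loadConfigFromDb(): Promise<Config> {
  // e.g. SELECT key, value FROM runtime_config;
  return new Map(); // placeholder — real values exist only at runtime
}

async function processOrder(order: { id: string }) {
  const config = await loadConfigFromDb();
  if (config.get("billing.use_v2_pipeline") === "true") {
    // Reading only the source, an AI cannot tell whether this branch
    // is live in production or dead everywhere.
  }
}
```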

Tests that test the wrong thing. Legacy test suites often test implementation detail rather than observable behavior. Green tests mean the implementation is the same, not that the behavior is correct. AI tools can’t distinguish these two things. They’ll report that tests pass and infer that the change is safe, which is not the same inference.
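A hypothetical example of the pattern, in Jest — the assertion pins internal call order, not observable output:

```typescript
import { jest, test, expect } from "@jest/globals";

function saveUser(
  user: { name: string },
  mapper: (u: { name: string }) => object,
  writer: (row: object) => void
) {
  writer(mapper(user));
}

test("saveUser calls mapper before writer", () => {
  const mapper = jest.fn((u: { name: string }) => ({ ...u, normalized: true }));
  const writer = jest.fn();
  saveUser({ name: "a" }, mapper, writer);
  // Green as long as the internals are wired the same way — says nothing
  // about whether the saved record is correct.
  expect(mapper.mock.invocationCallOrder[0]).toBeLessThan(
    writer.mock.invocationCallOrder[0]
  );
});
```

A refactor that preserves this wiring but changes behavior keeps the suite green; one that fixes behavior but changes the wiring turns it red. Both signals mislead.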

The AI’s own explanation contradicts itself. If asking the AI to explain the same module twice produces substantially different answers, the model has low confidence in what it’s reading. Low confidence plus code changes is a high-incident combination. The uncertainty signal is real: when the AI isn’t sure what the code does, it shouldn’t be proposing changes to it, and you shouldn’t be accepting them.

The domain has a specialized vocabulary the AI doesn’t know. Finance, insurance, logistics, healthcare, manufacturing — any vertical with deep domain concepts tends to encode those concepts into code in ways that look like jargon to a generalist model. The AI can read the syntax but misses the semantics. “Premium calculation” in insurance means something precise; a model trained on general code will treat it as a generic concept and get the invariants around it wrong.

In these cases, the useful AI role is narrow and read-only: explain this isolated code block in isolation, or summarize what this specific function appears to do. Not propose changes, not identify what’s safe to modify, not refactor across files.

Treat these signals as a prompt to slow down rather than a reason to abandon AI entirely. The correct response is a narrower question, not a broader one.

The session structure looks different from standard AI-assisted development. AI reads one piece and summarizes. Engineer reads the surrounding context themselves and forms a judgment. Engineer writes the change. AI reviews the diff afterward if useful. The AI’s role at the writing step is minimal or absent.

This is a smaller role than the tools are usually marketed as having. It’s also the role where they stay net-positive on systems that have survived long enough to develop the kind of accumulated complexity that resists outside reasoning.

None of this means AI tools are useless on legacy code. It means they’re useful in a different way than on greenfield work. Archaeology is genuinely valuable. Summarization is genuinely valuable. Targeted small changes with explicit scope constraints are genuinely valuable. The mistake is reaching for the same loop — describe task, accept output, ship — that works on fresh code, and expecting it to hold on code that has decades of invisible weight behind it.


The earned insight from working on both greenfield and legacy code with AI tools: legacy code didn’t outlive its era by being easy to understand. It survived by being difficult enough to modify incorrectly that most incorrect modifications got caught before they stuck. AI tools don’t automatically inherit that caution. Building it into the workflow — explicit scope constraints, read before write, human judgment on what’s load-bearing — is what makes AI assistance safe on systems that matter. It’s slower than the AI-assisted greenfield loop. It’s faster and safer than the alternative.