
Using AI tools for legacy code archaeology: explaining what 2014-vintage code does

Published 2026-04-08 by Owner

A friend joined a company last fall and inherited a Rails application originally written in 2014. The codebase has been actively developed for 11 years. There’s no documentation worth reading. The original authors are long gone. The current team has tribal knowledge but it’s incomplete and contradictory in places.

This is the situation where AI coding tools are most underrated. Not for fixing the legacy code — they’re not good at that — but for understanding what it does. Three months in, my friend is productive on parts of the codebase that took prior new hires six months to crack.

This is the workflow that helped him.

Why this works

Legacy archaeology has a specific shape: you’re trying to build a mental model of a system from artifacts. The code is one artifact. The git history is another. Old comments, old tests, old config — each is a partial signal about what the system does and why.

AI tools are good at partial-signal synthesis when given enough material to work with. They can read across files, notice patterns, summarize behavior, and answer specific questions about code. They’re not good at producing correct architectural advice for legacy refactoring (the cost of being wrong is too high), but they’re well-suited to “explain what this is doing.”

The trick is asking the right questions, in the right order, with the right context.

The first-week archaeology workflow

When my friend started, the first week was mostly reading code. The structure that worked for him:

Day 1: orientation via repo structure

The opening prompt to Cursor (with the entire repo loaded as project context):

@workspace I'm new to this codebase. Without making any changes, give me an 
orientation:
- What's the top-level structure of this project?
- What are the main domains or business areas?
- What's the entry point for HTTP requests, background jobs, and other ways 
  the code can be invoked?
- What's the most-edited area in recent commits (top 10 files)?
- What's the least-edited area (oldest unchanged files)?
- What patterns or conventions are most common (style, architecture)?

Cursor produced a structured response covering each question. About 70% of it matched what the team confirmed manually; the other 30% was wrong, but even that gave him a starting point for asking specific follow-up questions.

This took the AI maybe 90 seconds. Doing the equivalent by hand — running commands, exploring directories, reading enough files to summarize — would have been a half-day. The compression of the orientation phase is where AI is most clearly net-positive.
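For calibration, here is roughly what one of those manual checks looks like. A minimal Ruby sketch, run from the repo root and shelling out to git, that answers the “most-edited files” question by hand:

# Manual equivalent of the "most-edited files in recent commits" question.
# Counts how often each file appears in commits from the last two years.
counts = Hash.new(0)
`git log --since="2 years ago" --name-only --pretty=format:`.each_line do |line|
  path = line.strip
  counts[path] += 1 unless path.empty?
end

counts.sort_by { |_, n| -n }.first(10).each do |path, n|
  puts format("%5d  %s", n, path)
end

Even with the commands memorized, chaining a dozen checks like this and reading their output is what eats the half-day; the AI’s summary collapses that loop into one answer you then spot-check.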

Days 2-3: domain by domain

For each domain identified in day 1, a focused prompt:

@src/billing Walk me through the billing module:
- What does it model? (Customers, subscriptions, payments, etc.)
- What are the main classes/models and their relationships?
- What external services does it integrate with?
- What scheduled jobs run, and when?
- Where are the integration points with other modules?
- What looks like dead code or no-longer-used patterns?

Repeated for each major domain. The “dead code” question was particularly valuable — the AI flagged about 8% of the billing code as having suspicious patterns (no callers, very old without changes, comments mentioning deprecated systems). Most of these were confirmed as dead by the team. Removing them was a quick win.
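Claims about dead code are cheap to sanity-check before deleting anything. Below is a rough Ruby sketch of that first check, using an invented class name (LegacyStatementMailer is hypothetical); it only catches textual references, so metaprogramming, jobs scheduled elsewhere, and external callers still need human review:

# Count textual references to a suspected-dead constant across the app.
# Zero hits outside its defining file is a hint, not proof.
require "find"

name = "LegacyStatementMailer"   # hypothetical class the AI flagged as dead
hits = []

Find.find("app", "lib", "config") do |path|
  next unless File.file?(path) && path.end_with?(".rb", ".erb", ".yml", ".rake")
  File.foreach(path).with_index(1) do |line, lineno|
    hits << "#{path}:#{lineno}: #{line.strip}" if line.include?(name)
  end
end

puts hits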

Days 4-5: specific questions

By day 4 my friend had a rough mental model and could ask sharper questions:

What does PaymentRetryQueue actually do? It's invoked from three places that I 
can see. Are those the only callers? What's the retry policy? What happens if 
the payment provider is down for an extended period?

Cursor’s answer involved reading PaymentRetryQueue, finding the callers, tracing the configuration, and producing a coherent narrative. The narrative was checkable: my friend could verify each claim by reading the relevant code, and most claims were correct.

This is the level where AI tools really earn their keep on legacy. The traditional alternative — read every relevant file, hold the model in your head, produce the narrative yourself — takes hours per question. AI compresses it to minutes, with the caveat that you still verify.

The questions that AI handles well on legacy

After watching my friend’s progression and doing similar work myself:

“What does X do?” AI is good at reading X and producing a behavior summary.

“Where is X used?” Both AI and IDE navigation handle this; AI is sometimes better at synthesizing across uses (“called from these 5 places with these patterns”).

“What’s the relationship between X and Y?” AI can read both and identify integration points, shared abstractions, and dependencies.

“What’s deprecated?” AI is good at flagging code that has all the markers of legacy — old comments, no recent changes, no remaining callers.

“Why was this written this way?” AI is sometimes good at this when the answer is in the code (style of an era, common pattern at the time, mitigation for a known framework limitation). It’s bad when the answer requires knowing the team’s history.

The questions that AI handles poorly on legacy

“Is this safe to change?” This requires knowing the system’s blast radius and your organization’s tolerance for risk. AI doesn’t know either.

“Why did the original team make this choice?” Often unknowable from the code alone. AI will produce plausible reasons that may or may not be correct.

“What will break if I refactor this?” AI can identify direct callers and tests but can’t predict runtime issues, environmental dependencies, or undocumented assumptions.

“What’s the right way to fix this?” Architectural advice on legacy systems is the AI’s worst category. The right answer depends on context the AI doesn’t have.

For these questions, ask humans. They have the context AI lacks.

A worked example: tracing a bug

A specific incident from my friend’s first month — illustrative of the archaeology workflow:

The bug: under specific conditions, a customer’s invoice was being generated with the wrong tax amount. The conditions were rare; the issue had existed for months without anyone tracking it down.

The investigation, with Cursor as a pair:

The issue: invoices for customers with both a Canadian billing address and 
US-based subsidiary billing show tax calculated at the US rate, but the actual 
charge applies the Canadian rate. The discrepancy appears only on invoices 
generated by the monthly billing job, not on real-time charges.

@src/billing/invoice_generator.rb @src/billing/tax_calculator.rb 
@src/billing/monthly_billing_job.rb

Walk me through how tax is calculated for the monthly job. Where could this 
inconsistency come from?

Cursor produced a 4-step narrative:

  1. The monthly job iterates over customers, calling InvoiceGenerator
  2. InvoiceGenerator calls TaxCalculator with the billing address
  3. TaxCalculator looks up the rate by country code
  4. The rate is applied during invoice creation, but the actual charge happens later via a separate ChargeProcessor

Step 4 was the lead. Cursor flagged: “ChargeProcessor uses the customer’s primary address, not the billing address passed to InvoiceGenerator. For customers with subsidiary billing, these differ.”

This was the bug. My friend confirmed by reading ChargeProcessor and finding the address resolution logic that picked primary over billing. The fix was a one-line change to use the same address resolution as InvoiceGenerator.
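To make the shape of the mismatch concrete, here is a hypothetical, self-contained Ruby sketch; all class and field names are invented, not the team’s actual code:

# Hypothetical reconstruction of the mismatch; real class and field names differ.
Address  = Struct.new(:country_code)
Customer = Struct.new(:primary_address, :billing_address)

# A Canadian parent company billed through a US subsidiary.
customer = Customer.new(Address.new("CA"), Address.new("US"))

# Invoice generation resolved the tax country from the billing address...
invoice_tax_country = customer.billing_address.country_code   # => "US"

# ...while the charge path resolved it from the primary address.
charge_tax_country = customer.primary_address.country_code    # => "CA"

# The one-line fix: make the charge path use the same resolution as invoicing.
charge_tax_country = customer.billing_address.country_code    # => "US"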

Total investigation time: about 90 minutes including the verification. Without AI, similar bugs in this codebase had taken 2-3 days to track down. The compression is real.

The discipline that survives

The verification habit that legacy work demands stays the same with or without AI:

  • Don’t trust an AI claim about behavior until you’ve checked the code
  • Don’t refactor based on AI advice about legacy systems without team review
  • When AI says “X is unused,” verify by checking callers, runtime logs, and team knowledge (see the probe sketch after this list)
  • When AI says “Y was probably done because Z,” treat the explanation as a hypothesis, not a fact
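Static caller checks (like the grep-style sketch earlier) cover the first part of that “unused” bullet. For runtime evidence, one low-risk option is a temporary logging probe in the suspect path, watched in production for a few weeks before anything gets deleted. A hypothetical sketch, assuming a Rails app and invented names:

# Hypothetical probe: log every call to a suspected-dead path before removing it.
# Rails.logger is standard in Rails apps; the class and message are invented.
class LegacyStatementMailer
  def deliver(customer)
    Rails.logger.warn("[dead-code-probe] LegacyStatementMailer#deliver customer_id=#{customer.id}")
    # ...existing behavior stays unchanged while the probe runs...
  end
end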

The verification isn’t optional. It’s where AI’s confident-sounding-but-wrong outputs get caught. Skipping verification on legacy code is how you create incidents.

What this saves

For my friend, three months in:

  • Time to productivity on a domain: about 2-3 days, vs. the team’s previous baseline of 2-3 weeks for similar new-hires
  • Tickets closed in first month: about double the team’s typical rate for new hires
  • Confidence about the codebase: subjectively higher, because the AI-assisted exploration meant he’d actually read more code in 3 months than prior new hires read in 6

The pattern: AI doesn’t make you understand legacy code without effort. It makes the effort more productive per hour. You still have to read. You still have to verify. The AI’s contribution is summarization, navigation, and answering specific questions faster than you could yourself.

For the work of inheriting a legacy codebase — which most engineers do at some point in their career — this is one of the highest-value applications of AI coding tools. Higher than autocomplete, in my opinion. Worth setting up as part of your onboarding workflow if you haven’t already.