I joined a 3-engineer startup last spring. Each engineer had been using AI coding tools individually for about a year. The codebase was small (about 35k lines of TypeScript) but already showing signs of a problem we eventually traced to inconsistent AI tool use across the team.
This is the breakdown of what was wrong, what we tried, and what stuck.
The symptoms
In my first two weeks reading the codebase, a few things bothered me:
- Three different React patterns for similar UI: one engineer’s components used `useReducer`, another’s used `useState` chains, the third’s used a custom hook abstraction. None of these is wrong; together they made the codebase feel incoherent.
- Inconsistent error handling: typed errors in some modules, `Result<T, E>` tuples in others, exception-throwing in a third. Same thing, three forms (sketched after this list).
- Test coverage that varied wildly by file: some directories at 90%+, some at 20%, no obvious pattern.
- PR descriptions of dramatically different quality: some terse, some essay-length, some with the same structure as code-generation prompts (clearly LLM-drafted).
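To make the error-handling divergence concrete, here’s a minimal sketch of the three styles as they coexisted; the names are hypothetical, standing in for our actual modules:

```ts
// Three shapes for the same "load a user" operation, as they coexisted in
// different modules. All names here are illustrative, not from the real codebase.
interface User { id: string; name: string }
declare const db: { users: { find(id: string): Promise<User | null> } };

// Style 1: typed error classes returned as values
class UserNotFoundError extends Error {
  constructor(public readonly userId: string) {
    super(`user ${userId} not found`);
  }
}
async function getUserA(id: string): Promise<User | UserNotFoundError> {
  return (await db.users.find(id)) ?? new UserNotFoundError(id);
}

// Style 2: Result<T, E> tuples the caller destructures
type Result<T, E> = [T, null] | [null, E];
async function getUserB(id: string): Promise<Result<User, string>> {
  const user = await db.users.find(id);
  return user ? [user, null] : [null, `user ${id} not found`];
}

// Style 3: throw, and let the caller (or middleware) catch
async function getUserC(id: string): Promise<User> {
  const user = await db.users.find(id);
  if (!user) throw new Error(`user ${id} not found`);
  return user;
}
```

Any one style is workable on its own; the cost shows up at call sites, where the reader has to remember which convention the current module follows.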
The team was shipping. The team also had a bug rate that felt high for a 35k-line codebase, and review took longer than I expected because every PR was stylistically different.
After a week of reading and asking around, I traced the inconsistency to AI tool use:
- Engineer A used Cursor with no rules files, prompted in short bursts
- Engineer B used Cursor with their own personal rules file (not shared)
- Engineer C used Copilot Chat plus occasional Aider
Each engineer’s output was internally consistent. The team’s output was not.
The diagnosis
The hypothesis: our team-level inconsistency wasn’t because the engineers were bad at AI tools. It was because each engineer’s tools were tuned to that engineer’s preferences, and the codebase showed the divergence.
This wasn’t unique to AI. Engineers have always had personal preferences that show up in their code. What changed with AI tools was that those preferences got amplified — the AI’s output reflected each engineer’s prompting style, multiplying their stylistic divergence.
The fix had to be team-level conventions, applied through team-level tooling, not individual reform.
What we tried
We had four working sessions over two weeks. The output was a small set of artifacts that became standard for the team.
1. Shared Cursor rules
We standardized on Cursor as the primary AI editor. Engineer C kept Copilot for inline completions but adopted Cursor for chat and Composer work, since that’s where the divergence had been worst.
We wrote a shared `.cursor/rules/` directory with four files:

- `general.mdc`: TypeScript style, error handling, naming conventions
- `react-components.mdc`: component patterns, hooks usage, state choice
- `api-routes.mdc`: backend patterns for our Next.js routes
- `tests.mdc`: Vitest patterns, what we test and how
About 180 lines total across the four files. Took us about 4 hours of debate to write. Most of the debate was about what we wanted, not about what to write — the writing was easy once we agreed.
The rules went into version control and applied to everyone immediately. New code from any engineer started looking similar within a week.
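For flavor, here’s an invented excerpt in the shape of `react-components.mdc`. The frontmatter fields (`description`, `globs`, `alwaysApply`) reflect Cursor’s project-rules format as I understand it; the guidance lines are illustrative, not our exact file:

```markdown
---
description: React component conventions
globs: ["src/components/**/*.tsx"]
alwaysApply: false
---

- Default to useState for local UI state; use useReducer only when state
  transitions are genuinely interdependent.
- Extract a custom hook once the same stateful logic appears in two components.
- Data-fetching errors are returned as Result-style values, never thrown
  from render.
```

Keeping each file short and declarative mattered more than completeness, which foreshadows the trimming lesson below.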
2. Shared prompt templates
A `team/prompts/` directory in the repo with markdown templates for common tasks:

- `new-component.md`: template for “build me a React component”
- `new-route.md`: template for “build me a backend route”
- `migration.md`: template for “migrate code from X to Y”
- `bug-investigation.md`: template for “investigate a bug given a repro”
Each template had a brief, an example, and a list of common pitfalls. Engineers could copy a template, fill in the specifics, and prompt with the result.
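As an illustration of the shape (the paths and specifics here are invented, not our real template), `new-component.md` looked roughly like this:

```markdown
# New React component

## Brief
Build <ComponentName> under src/components/. Follow react-components.mdc
for state choice and hooks usage.

## Context to include
- The parent component that will render it
- Relevant types and the API route it reads from
- One existing component that is closest in shape

## Out of scope
- Shared hooks and global state; propose changes in the PR instead

## Common pitfalls
- Missing loading and error states for async data
- Duplicating an existing component variant instead of extending it
```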
This sounds like overkill. In practice, engineers often forgot to include important context (constraints, file scope, what not to touch), and the templates acted as a checklist that caught those omissions. PR quality went up noticeably once these were in regular use.
3. PR description template (AI-aware)
A `.github/pull_request_template.md` with sections aimed at AI-augmented work:
## What this changes
(brief: what user-facing or internal behavior is different)
## How it was made
(if AI tools were involved heavily, mention which and what role they played —
this helps reviewers calibrate review depth)
## Tested
- [ ] Unit tests pass locally
- [ ] Manual testing of the affected flow
- [ ] Edge cases I considered: ...
## Reviewer notes
(anything specific reviewers should look at carefully)
The “How it was made” section was the new addition. Engineers learned to flag when AI did the bulk of a change vs. when it was mostly human work. Reviewers calibrated accordingly — heavier review on AI-heavy PRs, especially in security-sensitive areas.
4. Weekly tool retro
15 minutes every Friday: what AI tool moments worked this week, what didn’t, what new prompt or pattern would be worth sharing. No agenda beyond that.
Surprisingly useful. Engineers picked up patterns from each other faster in a structured 15 minutes than they did by osmosis. We stopped reinventing the same prompt patterns independently.
What changed in the metrics
Before/after comparison, over a two-month window on each side:
| Metric | Before | After |
|---|---|---|
| Avg time from PR open to merge | 1.8 days | 1.3 days |
| PRs requiring substantial revision before merge | ~35% | ~17% |
| Bugs reported in production per week | 4-5 | 2-3 |
| Hours of “fix style” PR comments per engineer per week | ~3 | ~0.5 |
The improvement in time-to-merge (1.8 to 1.3 days, roughly 28%) isn’t dramatic, but it compounded. The near-halving of the production bug rate was more meaningful: fewer hot-fix days, less context-switching, more sustained focus.
The reduction in style-fix PR comments was the most visible. Reviewing a teammate’s PR became reviewing the substance, not reviewing whether they used `const` correctly. The cognitive load of review dropped substantially.
What I’d warn other small teams about
A few patterns that didn’t work:
Trying to enforce style with linters alone. ESLint rules can catch a lot, but they can’t catch “use `useState` here, not `useReducer`” or “this should be a custom hook.” Written conventions, supplemented by linting, beat linting alone.
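A sketch of where that line sits, assuming ESLint’s flat config (the rule choices are illustrative, and a TypeScript parser such as typescript-eslint is assumed to be configured elsewhere in the array):

```js
// eslint.config.js: what a linter can and can't express (illustrative)
export default [
  {
    files: ["src/**/*.{ts,tsx}"],
    rules: {
      // Mechanical style is easy to enforce.
      "prefer-const": "error",
      // The closest a linter gets to "prefer useState" is a blanket syntax
      // ban. That forbids useReducer even where it is the right call, which
      // is exactly why the nuanced guidance lives in react-components.mdc.
      "no-restricted-syntax": [
        "warn",
        {
          selector: "CallExpression[callee.name='useReducer']",
          message: "Prefer useState for local UI state (react-components.mdc).",
        },
      ],
    },
  },
];
```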
Not standardizing the tool. Engineer C’s hybrid Copilot + Cursor + Aider setup was where most of the team’s divergence came from. Eventually he agreed to standardize on Cursor for the same kinds of work as the rest of us. Aider stayed in his toolkit for specific refactor work, which we agreed was a fine specialty use.
Writing rules that are too long. Our first version of `general.mdc` was 250 lines. Cursor effectively summarized it internally and the specific guidance got fuzzy. Trimmed to 60 lines, the rules became more enforceable.
Skipping the PR description template. We initially thought “everyone knows what a PR description should be.” The template formalized something that was leaking quality without us noticing. Worth writing down.
Not having the weekly retro. The first month after the standardization, things improved. Then they started backsliding — engineers diverging again, rules getting forgotten. The weekly 15-minute retro caught the drift before it became regression.
What this looks like for a smaller team
A 2-engineer team probably needs less of this. The communication overhead is lower; you each know what the other is doing.
A 5+ engineer team needs more — probably with a dedicated “AI tooling lead” who maintains the rules and templates as a part-time responsibility.
For 3 engineers specifically, the four artifacts above (rules, prompt templates, PR template, weekly retro) were the right amount. Light enough to maintain, structured enough to produce consistency.
What I’d carry to the next team
The patterns that I’m planning to use again on future projects:
Day-1 rules file. I’ll bring a templated `.cursor/rules/` directory from previous projects and adapt it to the new codebase. Starting with something to react to, rather than a blank page, saves the 4-hour debate.
Prompt templates. Even for solo work, I’ve started using these. They’re checklists for myself and they reduce my “forgot to include context” rate.
PR template with AI flagging. Belongs everywhere. The flag itself doesn’t have to do anything; the act of writing it makes you think about it.
Weekly tool retro. Even for a solo developer, 15 minutes on Friday “what worked this week with AI” creates a habit of reflection. Worth it.
The bigger lesson: AI tools amplify whatever team-level disciplines you have or don’t have. A team with strong shared conventions before AI gets even more from AI. A team with weak shared conventions gets a worse result with AI than they would have without it, because the AI amplifies the divergence.
This is the unsexy version of the AI-tool story. Tools matter; the human practices around them matter more. Three engineers using Cursor without conventions are worse off than three engineers not using AI but having shared conventions. Three engineers using Cursor with good conventions are meaningfully more productive than either. The conventions are doing most of the work.