
Aider's --lint-cmd and --test-cmd: closing the AI-feedback loop

Published 2026-05-11 by Owner

Aider’s --lint-cmd and --test-cmd flags are two lines in the docs that substantially change how Aider behaves. Without them, Aider edits files and stops. With them, Aider edits files, runs the command you supplied, reads the output, and — if there are failures — tries to fix them and runs again. That cycle repeats until the command passes or Aider gives up.

This isn’t a cosmetic feature. It changes the economics of the session and the failure modes you have to watch for. Most guides mention the flags in passing; this one focuses on what actually happens inside the loop, where the cost goes, and how to configure it so it converges instead of spinning.

What the flags do

Both flags take a shell command string that Aider runs in the project root after applying each set of edits.

aider --lint-cmd "ruff check --fix" --model gpt-4o
aider --test-cmd "pytest tests/unit" --model gpt-4o

--lint-cmd runs first (if set), then --test-cmd. Either command’s non-zero exit code is treated as a failure and the output is injected back into the conversation as a new message. Aider then produces another edit attempt targeting the reported failures.

You can set both at once:

aider \
  --lint-cmd "ruff check --fix" \
  --test-cmd "pytest tests/unit -x --tb=short" \
  --model gpt-4o

The flags persist for the whole session. If you want lint-on-every-edit behavior permanently, add them to .aider.conf.yml:

lint-cmd: ruff check --fix
test-cmd: pytest tests/unit -x --tb=short

The --lint-cmd string is executed with the shell, so you can chain commands: "ruff check --fix && black --check .". Aider doesn’t parse the command — it runs it, captures stdout and stderr, and checks the exit code. Exit 0 means success; anything else means failure and triggers the retry loop.

One important detail: --lint-cmd is expected to auto-fix in place. Aider reruns the command after applying its edits, so a command that only reports issues without modifying files will keep reporting the same issues on every retry. ruff check (without --fix) will churn on a fixable violation until Aider gives up; ruff check --fix will fix it and exit 0.
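The same rule applies if you want a formatter in the loop: run it in write mode, not check mode, so the command fixes files and then exits 0. A minimal wrapper sketch, assuming black and ruff are installed (lint.sh is a hypothetical name):

#!/usr/bin/env bash
# lint.sh (hypothetical name): fix in place first, fail only on what remains
set -euo pipefail
black .                               # reformats files in place; exits 0 on success
ruff check --fix --select E,W,F,I .   # auto-fixes what it can; exits non-zero only on leftovers

aider --lint-cmd "bash lint.sh" --model gpt-4o

Because the formatter runs in write mode before anything else, formatting issues are fixed rather than merely reported, and a second run of the wrapper on the same files comes back clean.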

The auto-fix loop

The sequence on every accepted edit is:

  1. Aider applies diffs to the working tree
  2. --lint-cmd runs; if exit code is non-zero, output appended to chat
  3. Aider reads the lint failure, produces new diffs to address it
  4. Diffs applied; lint runs again
  5. If lint now passes (or was never set), --test-cmd runs
  6. If tests fail, output appended to chat; Aider tries again
  7. Loop continues until both commands exit 0, or Aider hits its retry limit

This is a genuine closed loop, not a one-shot. Each attempt is a full model turn with the failure output in context. Aider typically resolves straightforward lint errors (unused imports, line-length violations, missing type annotations) in one or two turns. Test failures require the model to understand what the test expects, which takes longer and costs more.

The loop stops when the commands pass. If they never pass, Aider will eventually exhaust its attempts and report that it could not fix the remaining issues, leaving the working tree in its last attempted state.

One nuance: lint has to pass before the tests run. If lint passes on the second retry and the tests then fail, the test retries start counting from one; the two commands don't share a single retry budget. In practice this means a session with both flags set can do more total turns than you might expect from looking at either flag individually.

The cost

Each retry turn costs tokens. The failure output from the command is included in context, plus the conversation history up to that point, plus the new response. On a lint-noisy codebase — one where the linter reports 20 issues per file touched — a single Aider edit can trigger 3–5 retry turns before settling.

For a concrete illustration: editing a file that triggers 15 ruff violations might cost one turn for the main edit plus three turns to chase down the lint issues. At gpt-4o pricing, what would be a $0.04 edit becomes a $0.16–0.20 edit. Multiply that across a session where you accept 20 suggestions and the session cost is 3–4x what you’d expect without lint enabled.

Test retries are even more expensive. A pytest failure includes tracebacks, assertion diffs, and captured output. That context block can be 2,000–5,000 tokens per failure. If the test failure requires understanding a data fixture or mocking pattern the model doesn’t have in context, Aider will generate several wrong guesses before either fixing it or stalling.

The tradeoff is real: you pay more per edit, but you get edits that already pass lint and tests when you review them. Whether that’s worth it depends on how much time you spend cleaning up Aider’s output manually.

There’s a category of task where this math is clearly positive: writing new functions or classes from scratch. Aider generates the implementation, lint catches style issues, tests verify the contract — three layers of validation before the code reaches a human reviewer. The alternative is the human reviewing raw Aider output and running lint/tests by hand. For new code, the automated loop is usually faster and cheaper than manual cleanup.
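One way to lean into this for new code is to write the test yourself first and let the loop drive the implementation against it. A sketch; the file paths and module names are hypothetical:

# you write tests/unit/test_slugify.py by hand, then ask Aider for the implementation
aider src/slugify.py tests/unit/test_slugify.py \
  --test-cmd "pytest tests/unit/test_slugify.py -x -q" \
  --model gpt-4o

The test file defines the contract; the --test-cmd loop keeps retrying the implementation until that contract is met.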

The math is less clear for refactoring. Changing an existing function signature, for example, cascades through callers. Lint may pass immediately. Tests will likely fail in several files. Each test failure requires Aider to understand the calling context, which may not be in the session's context window, and the loop burns tokens on guesses. For refactors, consider running with the lint/test flags disabled and handling the breakage manually after reviewing Aider's structural changes.
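One workable split for that case, with a hypothetical file path:

# structural change with the loop switched off
aider src/payments/api.py --model gpt-4o

# review the change, then find the broken callers yourself (or in a follow-up session)
git diff
pytest -x -q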

Making it idempotent

The feedback loop only converges if the lint command is capable of terminating in a passing state. Commands that auto-fix (ruff check --fix, eslint --fix) cooperate well because running the same command twice on a fixed file should exit 0. The model edits; the linter fixes what it can; the linter runs again and finds nothing.

Commands that only report, and whose findings require human judgment, do not cooperate. mypy --strict on a dynamically typed codebase will generate a fresh list of errors every time Aider tries to fix the previous ones. The model patches one Any annotation; mypy reveals three more it introduced. The loop runs until the retry limit, not until it passes.

Lint configs that cooperate with the loop:

  • Auto-fixable rules only. If your ruff config has rules that require manual decisions (like TCH import categorization), those will loop. Restrict --lint-cmd to rules ruff can fix itself: ruff check --fix --select E,W,F,I.
  • A focused test subset. pytest tests/unit -x --tb=short runs fast and fails fast. Running the full suite on every edit is slow and gives the model too much failure text to reason about at once.
  • Exit 0 on no relevant issues. If your lint command exits non-zero for warnings you don’t care about fixing, you’ll pay for retry turns on issues you never intended to fix.

The principle: if a human couldn’t fix the error by only looking at the diff Aider just produced, the model won’t either. Scope the commands to errors that are locally fixable.
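A quick sanity check before handing a candidate lint command to Aider: run it twice by hand on a file with known issues. If the second run does not exit 0, the loop will not converge either.

ruff check --fix --select E,W,F,I . ; echo "first run: $?"
ruff check --fix --select E,W,F,I . ; echo "second run: $?"   # anything but 0 here means the command needs narrowing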

A practical note for JavaScript/TypeScript projects: eslint --fix works the same way as ruff check --fix for this purpose. You can combine it with tsc --noEmit as the test command for type-checking, keeping in mind that TypeScript errors can cascade badly on partial edits. A safer pattern is to run eslint --fix as the lint command and reserve tsc for a separate manual check outside the loop.
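A sketch of that safer split, assuming an ESLint setup where npx eslint --fix . matches your project layout (the exact invocation depends on your config):

aider --lint-cmd "npx eslint --fix ." --model gpt-4o

# after the session, type-check once outside the loop
npx tsc --noEmit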

The anti-pattern: lint that’s too strict

The failure mode looks like this: Aider fixes an unused import (lint error A), which causes ruff to surface a line-length violation on the line above (error B); fixing that changes a string that mypy now flags as the wrong type (error C). Each fix introduces a new issue, and the loop never converges.

This happens when:

  • The lint config has rules that conflict with each other on common patterns
  • The test command covers behavior that Aider’s edit broke but cannot easily restore
  • The model lacks context about why a pattern exists (e.g., a rule exception comment that was deleted)

When the loop stalls, the right move is to stop Aider, run the commands manually, and triage the real issue. Letting Aider loop further will not help — it is not making progress, it is spending tokens on wrong guesses. Check what state the working tree is in (git diff), revert if necessary, and restart with a more constrained command or a narrower task scope.
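A recovery sequence might look like this; run the linter without --fix here so triage doesn't modify files further (git restore needs Git 2.23+; use git checkout -- . on older versions):

# after interrupting the session
git diff                       # inspect what the last attempt left behind
ruff check --select E,F,I .    # see what the linter actually still reports
pytest tests/unit -x -q        # and what the tests actually still fail on
git restore .                  # discard the working tree if the attempt is not worth salvaging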

When the flags are new, it's worth watching Aider's output for a few minutes to get familiar with the loop. You'll see lines like Ran lint, found 3 issues and Attempting to fix lint errors... between edits. If the same lint error type appears on three consecutive retry turns, the config is not converging and you should interrupt. That pattern, the same error across more attempts, is a reliable sign that the loop will not self-resolve.

A useful circuit-breaker: set --no-auto-commits when using lint/test commands, so the repository history doesn't accumulate partially-fixed commits during a looping session. Review the final state in one diff before committing.

The setup worth starting with

If the flags are new to you, start narrow:

aider \
  --lint-cmd "ruff check --fix --select E,F,I" \
  --test-cmd "pytest tests/unit -x -q" \
  --no-auto-commits \
  --model gpt-4o

E,F,I covers pycodestyle errors, pyflakes, and isort, which are largely auto-fixable and have a low false-positive rate. A single targeted test directory with -x (stop on first failure) and -q (minimal output) keeps the feedback text small enough for the model to act on. --no-auto-commits gives a clean review checkpoint.

Once the loop is working reliably for that scope, expand. Add more rule sets; add more test directories. Each expansion increases the cost-per-edit but also increases the confidence level of the output.
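One possible next step, widening the selection to pyupgrade (UP, also largely auto-fixable in ruff) and adding a second test directory; adjust both to what your project actually has:

aider \
  --lint-cmd "ruff check --fix --select E,W,F,I,UP" \
  --test-cmd "pytest tests/unit tests/integration -x --tb=short" \
  --no-auto-commits \
  --model gpt-4o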

The flags are most valuable on green-field additions (new functions, new modules) where the model can generate passing code from scratch. They’re least valuable on deep refactors where fixing one thing breaks several others — there, the loop runs hot and the cost-per-accepted-change climbs fast.

For teams: if the repo already has a CI pipeline with lint and tests, --lint-cmd and --test-cmd essentially run a local version of CI after every Aider edit. The differences from CI are latency (local is faster) and granularity (Aider gets to fix issues before they ever reach a commit). Whether that's worth the token cost is a judgment call, but teams that already enforce CI tend to benefit more from these flags than teams that don't, because the model's output has to meet a bar that's already defined.
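If CI already defines that bar through make targets or an equivalent wrapper, pointing the flags at the same targets keeps the local loop and CI in sync. The caveat from earlier still applies: a full CI suite is usually too slow and too noisy for per-edit retries, so a trimmed target is often the better fit. Assuming hypothetical make lint and make test targets:

aider --lint-cmd "make lint" --test-cmd "make test" --model gpt-4o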