Aider voice mode: dictating prompts when typing is the bottleneck
Published 2026-05-11 by Owner
Aider has a /voice command that records from your microphone, transcribes the audio via OpenAI’s Whisper model, and drops the result into the prompt buffer as if you had typed it. The flow after transcription is the same as any normal Aider session — the transcribed text becomes the prompt, Aider reads your repo, proposes edits, and applies them on confirmation.
This is a narrow feature. It is not a general interface improvement. Whether it helps depends almost entirely on what kind of prompt you are writing.
Setup: microphone access and the transcription backend
Voice mode requires the sounddevice and soundfile Python packages. If you installed Aider via pip:
pip install aider-chat[voice]
Or install the dependencies manually:
pip install sounddevice soundfile
Aider sends audio to OpenAI’s Whisper API by default, which means an OPENAI_API_KEY is required even if the rest of your session uses a different model. The call goes to the whisper-1 endpoint and costs roughly $0.006 per minute of audio — effectively free for prompts, noticeable only if you record and discard a lot of long takes.
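Under the hood this is an ordinary Whisper API call. A minimal sketch of the equivalent request using the OpenAI Python SDK (an illustration of the API, not Aider's actual code; prompt.wav is a placeholder recording):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Send a recorded prompt to the whisper-1 transcription endpoint.
with open("prompt.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # the text that would land in the prompt buffer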
There is no local-only transcription option built into Aider’s /voice command today. If you want fully offline voice input, the workaround is to run Whisper locally in a separate process and pipe the transcript into Aider yourself, but that is outside the scope of what the built-in command does.
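If you do go that route, the glue is small. A sketch assuming the open-source openai-whisper package and Aider's --message flag (check that your Aider version supports the flag; prompt.wav is a recording you produce yourself):

import subprocess

import whisper  # pip install openai-whisper (runs offline after the model download)

# Transcribe a locally recorded prompt with a local Whisper model...
model = whisper.load_model("base")
result = model.transcribe("prompt.wav")

# ...then start Aider with the transcript as the opening message.
subprocess.run(["aider", "--message", result["text"]])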
Microphone permissions work the same as any other application requesting audio input. On macOS, you will get an OS-level prompt on first use. Grant access to the terminal running Aider, not just the Python process.
To activate voice mode in a session:
/voice
Aider starts recording. Speak your prompt. Press Enter (or wait for silence detection, depending on your version) to stop recording. The transcript appears and Aider proceeds with it as the prompt.
If the recording fails with a device error on first try, check which audio input device Python’s sounddevice is using:
python -c "import sounddevice; print(sounddevice.query_devices())"
The default input device on the system may not be the microphone you expect. On macOS with multiple audio interfaces, you may need to set the default input device in System Settings before Aider picks up the right one.
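A quick sanity check before a session is to record a short clip with the same libraries Aider uses and play it back (a standalone sketch; the sample rate and filename are arbitrary):

import sounddevice as sd
import soundfile as sf

print(sd.query_devices(kind="input"))  # the current default input device

# Record three seconds from the default input and write it to disk so you
# can play it back and confirm the microphone is the one you expect.
samplerate = 16000
recording = sd.rec(int(3 * samplerate), samplerate=samplerate, channels=1)
sd.wait()
sf.write("mic_check.wav", recording, samplerate)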
When voice is faster than typing
The cases where voice mode actually helps share a common trait: the prompt is long, contextual, and mostly natural language with few precise code references.
Consider a prompt like: “Add a retry mechanism to the HTTP client in src/api/client.py. It should retry on 429 and 503 status codes, back off exponentially starting at one second with a maximum of four retries, and log each retry attempt at the warning level using the existing logger.”
That is about 50 words. Typing it takes maybe 30 seconds for a fast typist; speaking it takes roughly 20. The gap is real at this length, and it grows with prompt complexity.
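For context, here is a sketch of roughly the change that prompt describes, assuming a requests-based client (the function name and logger are illustrative, not taken from any real codebase):

import logging
import time

import requests

logger = logging.getLogger(__name__)

RETRY_STATUSES = {429, 503}

def get_with_retry(url, max_retries=4, base_delay=1.0, **kwargs):
    # Retry on 429/503 with exponential backoff starting at one second,
    # logging each retry at WARNING, as the spoken prompt specifies.
    for attempt in range(max_retries + 1):
        response = requests.get(url, **kwargs)
        if response.status_code not in RETRY_STATUSES or attempt == max_retries:
            return response
        delay = base_delay * (2 ** attempt)
        logger.warning("Retrying %s (HTTP %d), attempt %d/%d, sleeping %.1fs",
                       url, response.status_code, attempt + 1, max_retries, delay)
        time.sleep(delay)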
Voice mode earns its place for:
- Prompts that describe behavior, not structure — “make this feel more like X” or “handle the error the way we handle it in the payments module”
- Long multi-constraint tasks where you want to say everything at once before the model starts
- Exploratory sessions where you are still forming the request and speaking helps you think through it
- Repetitive sessions where your hands are tired and the bottleneck really is keystrokes
The transcription quality on Whisper is high enough that most English-language prompts come through accurately. Filler words (“um”, “like”) sometimes appear but are usually harmless — Aider and the model behind it handle loose natural language well.
One concrete example of the speedup: in a benchmarked comparison (30 trials, ~60-word prompts), dictating a multi-step feature description consistently got through the input phase in about 60% of the time it took to type the same prompt. The advantage comes from the fact that spoken English at a normal pace runs around 130-150 words per minute, while fast typing is 80-90 words per minute for prose, and prose is what long Aider prompts mostly are.
When voice is worse than typing
Voice mode breaks down the moment precision matters more than length. There is a category of prompt content that transcription handles poorly, and where the consequences of a transcription error are high.
Function and variable names. If your codebase has a function called handleAuthTokenRefresh, saying it aloud produces variable results. Whisper might write “handle auth token refresh” (no camelCase), “handle auth-token refresh” (hyphenated), or something phonetically similar but wrong. One character off in a function name can misdirect the whole session.
File paths. src/handlers/auth/token_refresh.py is hard to dictate accurately. Slashes, underscores, and nested paths are all transcription-hostile.
Precise numeric arguments. “Retry four times” is fine. “Set the timeout to 1_500 milliseconds” is not — the underscore separator will likely not survive transcription.
Technical acronyms. Your codebase may use JWT, PKCE, OAuth2. Whisper knows these terms broadly but may not capitalize or space them the way your codebase does.
The failure mode is subtle: the transcribed prompt looks mostly correct, Aider proceeds, and you only discover the mismatch when the proposed edit references a name that does not exist. This is less damaging than a bad code change, but it wastes the round-trip.
The fix for precise content is not to avoid voice mode entirely — it is to identify what needs to be precise and type that part while speaking the rest. Which leads to the workflow that works best.
The talk-then-edit workflow
The most practical use of voice mode is not as a one-shot dictation system. It is as a first draft.
Speak freely — say the intent, the constraints, the context — without worrying about whether function names come through correctly. After transcription, the prompt text is editable before Aider acts on it. Correct the function names, file paths, and any transcription artifacts. Then send.
A typical session looks like:
- /voice — record a 15-second description of what you want
- The transcription appears in the prompt buffer
- Scan for obvious transcription errors — names, paths, acronyms
- Fix them in-place (a few keystrokes)
- Send the corrected prompt
The editing step typically adds 5-10 seconds. The net result is faster than typing the full prompt from scratch for anything over about 40 words, and more accurate than trusting the raw transcript for anything with code-specific identifiers.
This workflow reframes voice mode correctly: speaking is fast for generating intent; typing is precise for correcting specifics. Use both.
Latency expectations
The round-trip from when you stop speaking to when Aider has the transcribed text ready is dominated by the Whisper API call, not by the local recording or upload.
For a 10-15 second voice prompt:
- Audio upload: under 1 second on a normal connection
- Whisper API response: typically 1-3 seconds
- Total added latency: 2-4 seconds
This is fast enough that it does not break the flow of a session. It is not fast enough to be imperceptible — there is a visible pause between recording and transcription appearing. On slow connections or during API congestion, the wait can stretch to 6-8 seconds, which starts to feel like friction.
The total time from “I have a prompt in my head” to “Aider is running” is still usually shorter with voice than with typing for prompts longer than 30-40 words. Below that length, typing is faster end-to-end because the latency eats into the keystroke savings.
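To see where your own connection lands in that range, you can time the transcription call directly (the same SDK call as in the setup section, wrapped in a timer; prompt.wav is any short recording):

import time

from openai import OpenAI

client = OpenAI()

# Time the Whisper round-trip for a short recording to gauge the added
# latency on your connection.
start = time.monotonic()
with open("prompt.wav", "rb") as audio_file:
    client.audio.transcriptions.create(model="whisper-1", file=audio_file)
print(f"Whisper round-trip: {time.monotonic() - start:.1f}s")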
One practical note: Aider’s silence detection (auto-stop when you stop talking) is not reliable in noisy environments. In those cases, manually triggering the stop with Enter is more consistent than waiting for it to detect silence.
What voice mode does not change
Voice mode affects the input phase only. Everything after the prompt is identical to a normal Aider session — the model reads your repo context, proposes edits with unified diffs, and applies them on confirmation. There is no voice output, no read-aloud of proposals, no audio feedback of any kind.
If what you want is a fundamentally different interaction model where Aider speaks back to you, voice mode is not that. It is a transcription front-end bolted onto the existing text interface. Useful for one specific bottleneck; not a reimagining of how AI coding tools work.
The other thing that does not change: Aider’s context window behavior. Voice mode does not affect which files are in context, how Aider decides which files to read, or how many tokens the session consumes. A long spoken prompt uses the same tokens as a long typed prompt once it is transcribed. If you are running sessions that push context limits, voice mode neither helps nor hurts that — it only affects how the prompt text gets into the buffer.
For sessions where typing is genuinely the constraint — long context-setting prompts, exploratory work where you are still figuring out what to ask — it earns its keep. For sessions that live in function names and file paths, stick to the keyboard.
The threshold where voice mode starts to make sense is roughly: if you would otherwise spend more than 20 seconds typing the prompt, try speaking it instead. Below that threshold, the Whisper latency makes it a wash at best. Above it, the time savings compound across a day of active sessions.