Aider with Llama 3.1 70B via Groq: 600 tok/s changes how you work
Published 2026-04-07
Groq’s LPU hardware serves Llama 3.1 70B at a sustained 500-700 tokens per second, against the 50-150 tokens per second typical of cloud models. For aider, that gap matters: the “type a request, see the diff” loop feels like local autocomplete instead of a network round-trip.
I’ve been using Groq + aider for the past month. Here’s the practical experience.
Setup
Get an API key at console.groq.com. In .aider.conf.yml:
model: groq/llama-3.1-70b-versatile
weak-model: groq/llama-3.1-8b-instant
edit-format: whole
Set GROQ_API_KEY in your environment. Aider connects directly via the OpenAI-compatible interface.
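Concretely, the launch looks like this (the key value is a placeholder):

export GROQ_API_KEY=gsk_xxxxxxxxxxxx   # placeholder; paste your key from console.groq.com
aider                                   # picks up .aider.conf.yml from the current directory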
The edit-format: whole setting matters. Llama 3.1 doesn’t reliably produce search-replace blocks in the format aider expects. Whole-file edits work better — the model rewrites the entire file rather than producing surgical diffs. For files under 500 lines, this is fine.
Speed in practice
A simple aider turn (read one file, propose a small edit) on Llama 3.1 70B via Groq:
- Time to first token: ~150ms
- Generation rate: ~600 tokens/sec
- Total time for a 200-token response: ~500ms
The same turn on Claude 3.5 Sonnet:
- Time to first token: ~600ms
- Generation rate: ~80 tokens/sec
- Total time for a 200-token response: ~3 seconds
Groq is 6x faster end-to-end on this kind of turn. The difference is in the experience, not just the numbers. With Claude, you wait for the response. With Groq, the response is there before you finish reading what you typed.
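These figures vary with load and connection, so sanity-check them yourself if you like. A rough read is possible with curl against Groq’s OpenAI-compatible endpoint: with stream set to true, curl’s time_starttransfer approximates time to first token (the prompt is an arbitrary ~200-token request):

curl -s -o /dev/null \
  -w 'first token: ~%{time_starttransfer}s  total: %{time_total}s\n' \
  https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-70b-versatile", "stream": true, "max_tokens": 200,
       "messages": [{"role": "user", "content": "Explain binary search in about 200 tokens."}]}'

Dividing the ~200 generated tokens by (total time minus first-token time) approximates the generation rate.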
Quality reality
Llama 3.1 70B is good. It’s not Claude 3.5 Sonnet. The gap shows up in specific places:
Following long instructions. A conventions file with 50 lines of rules is only partially absorbed. Llama follows the first 20-30 lines reliably and gets fuzzier on the rest. Claude follows the whole thing.
Multi-file reasoning. When a task touches 5 files, Llama loses track of files in the middle. The first and last files are usually correct; the middle gets approximate.
Edge case awareness. Llama tends to write the happy path well and miss edge cases. “What if the user is null?” is a question Claude asks itself; Llama asks it less reliably.
Code style consistency. Llama is more variable in style across a session. Claude maintains more consistency.
For tasks that fit Llama’s strengths — single-file refactors, scaffolding, simple test generation — the speed advantage dominates. For tasks that need careful reasoning, you’ll catch yourself wishing for Claude.
Cost
Groq’s pricing on Llama 3.1 70B (at time of writing): $0.59/M input, $0.79/M output. For aider:
- A typical 30-minute session: ~500k input tokens, ~50k output tokens
- Cost: ~$0.30 + $0.04 = $0.34
Compare to Claude 3.5 Sonnet for the same session: ~$1.50.
About 4x cheaper, plus the 6x speed advantage. If your tasks fit Llama’s capabilities, the math is hard to argue with.
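As a sanity check on the arithmetic, using the per-million-token prices quoted above:

awk 'BEGIN {
  in_tok  = 500000; out_tok = 50000             # session totals from above
  in_cost  = in_tok  * 0.59 / 1e6               # $0.59 per million input tokens
  out_cost = out_tok * 0.79 / 1e6               # $0.79 per million output tokens
  printf "input: $%.4f  output: $%.4f  total: $%.4f\n", in_cost, out_cost, in_cost + out_cost
}'

which prints $0.2950 + $0.0395 = $0.3345, the ~$0.34 above after rounding.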
When the speed actually matters
Speed isn’t a uniform good. For some tasks, the speed of the model isn’t the bottleneck — your reading and reviewing speed is.
The tasks where Groq’s speed changes my behavior:
Iterative refinement. When I want to try three slightly different versions of an implementation, the cost of trying them is much lower at Groq speeds. With Claude, I commit to one approach because the round-trip cost is high. With Groq, I try alternatives because they cost almost nothing in time.
Quick checks. “What does this function do?” “What’s a typical pattern for X?” These are conversational queries. Claude is fine. Groq is faster than my browser tab to Stack Overflow.
Test generation. Generating tests is high-throughput, low-creativity work. Llama at Groq speeds generates a test file in about a second; I review and tweak. The savings compound across a day (a concrete invocation is sketched after this list).
Failed paths. When the model produces wrong code and I need to revise the prompt, the loop is faster. I’m more willing to iterate on the prompt instead of fixing the wrong code by hand.
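To make the test-generation case concrete: aider’s --message flag sends one instruction, applies the edits, and exits, which is exactly the shape of these quick loops. The file paths here are hypothetical:

aider --model groq/llama-3.1-70b-versatile --edit-format whole \
  --message "add pytest tests for the public functions in src/parser.py" \
  src/parser.py tests/test_parser.py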
Where the speed doesn’t compound
For tasks where I need to think hard about the result, model speed doesn’t matter. If I’m reviewing a 200-line refactor for correctness, the model could have produced it in 200ms or 20 seconds — my review takes the same time either way. The bottleneck is me, not the model.
For these tasks, Claude’s quality advantage is what matters, not Groq’s speed. The 200ms vs 20 seconds is invisible noise next to the 5 minutes of careful review.
Mixing both
The setup I’ve moved to: Groq for autocomplete-style and exploratory work, Claude for actual production code.
In .aider.conf.yml:
# default
model: claude-3-5-sonnet-20241022
weak-model: claude-3-5-haiku-20241022
Then I have a shell alias for switching to Groq mode:
alias aider-fast='aider --model groq/llama-3.1-70b-versatile --weak-model groq/llama-3.1-8b-instant --edit-format whole'
When I’m exploring or doing rough work, aider-fast. When I’m writing code that’s going to ship, plain aider (Claude).
The mental switch matches the work. Don’t write production code at Groq speeds; the quality drop matters. Don’t write throwaway code at Claude speeds; the cost and latency buy you nothing there.
Provider stability
Groq’s service has been reliable in my experience but not at Claude or OpenAI levels. I’ve seen ~3 outages in the past month, each lasting 5-30 minutes. For the kinds of work I use Groq for, this is fine — I just wait or switch to Claude. For mission-critical workflows, you’d want to plan for fallback.
The error responses are clean — Groq returns proper HTTP status codes when overloaded, so aider’s retry logic works correctly. No silent failures.
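If you want the fallback to be automatic rather than manual, a small shell wrapper can probe Groq first and drop back to the Claude defaults when the probe fails. A sketch, not a hardened script; the probe hits Groq’s standard OpenAI-compatible model-list endpoint, and the 5-second timeout is arbitrary:

aider-auto() {
  # probe Groq; -f fails on HTTP errors, -m 5 caps the wait at 5 seconds
  if curl -sf -m 5 https://api.groq.com/openai/v1/models \
       -H "Authorization: Bearer $GROQ_API_KEY" > /dev/null; then
    aider --model groq/llama-3.1-70b-versatile \
          --weak-model groq/llama-3.1-8b-instant --edit-format whole "$@"
  else
    echo "Groq unreachable; falling back to Claude" >&2
    aider "$@"   # defaults from .aider.conf.yml (Claude)
  fi
}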
Worth trying
If you do a meaningful amount of exploratory or iterative work in aider, Groq + Llama 3.1 70B is worth the 30 minutes to set up and try for a week. The speed shift will either change how you work (in which case you’ll keep it) or feel imperceptible (in which case you’ll go back to Claude with no time lost).
For production-shaped work, stay with Claude or GPT-4o. The capability gap is real and matters when the output ships.