Running local models in Cline with Ollama: when it's worth the trouble
Published 2026-05-04 by Owner
Cline plus Ollama gives you an AI coding agent that runs entirely on your machine. No API keys, no per-token billing, no data leaving the laptop. The pitch is appealing for privacy-sensitive work, offline development, or just curiosity about the gap between cloud and local.
I ran this setup as my primary for a week to see what’s actually usable. The setup is easy. The quality gap is real and unavoidable at current model sizes. This guide is about both — getting it working, and knowing where it’s worth using.
Hardware requirements, honestly
The minimum useful setup, in my experience:
- Mac with 36GB+ unified memory (M3 Pro or M4 Pro/Max) — runs 32B-class models smoothly
- Mac with 64GB+ unified memory (M3 Max or M4 Max) — runs 70B-class models, the threshold for “feels OK”
- Linux box with 24GB+ VRAM (RTX 3090, 4090, or 7900 XTX) — same as 36GB Mac, slightly faster
- Linux box with 48GB+ VRAM — 70B-class
Anything below this is going to disappoint. A 16GB Mac running 7B models is technically a coding agent, but the quality is closer to 2022-era models than to anything you’d compare to Claude or GPT-4.
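If you’re not sure what you’re working with, check before committing to a 19GB download. Nothing here is Ollama-specific — just stock macOS and NVIDIA tooling:

```bash
# macOS: unified memory, reported in bytes
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'

# Linux with an NVIDIA card: total VRAM per GPU
nvidia-smi --query-gpu=name,memory.total --format=csv
```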
I tested on an M3 Max with 64GB. That’s a high-end consumer machine. The performance numbers in this guide reflect that hardware.
Installing Ollama
```bash
curl -fsSL https://ollama.com/install.sh | sh
ollama serve   # runs in the foreground, or use the macOS app
```
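Before touching Cline, it’s worth confirming the server is actually answering. Both checks below are stock Ollama — the CLI version flag and the REST endpoint that lists pulled models:

```bash
ollama --version
curl -s http://localhost:11434/api/tags   # an empty model list is fine at this stage
```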
Then pull the models you want:
```bash
ollama pull qwen2.5-coder:32b   # ~19GB, my daily driver for local
ollama pull llama3.1:70b        # ~40GB, more general-purpose
ollama pull deepseek-coder:33b  # ~18GB, code-focused alternative
```
The first model download takes time depending on your connection — 20-40 minutes for any of these on a typical home network. Subsequent loads are instant since the model stays cached on disk.
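To confirm a pull actually finished and see what it costs on disk, `ollama list` is enough; on recent Ollama versions, `ollama show` also reports a model’s parameter count, quantization, and default context window:

```bash
ollama list                      # pulled models and their on-disk size
ollama show qwen2.5-coder:32b    # architecture, parameters, quantization, default context window
```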
Connecting Cline to Ollama
In Cline’s settings panel:
```
API Provider: Ollama
Base URL:     http://localhost:11434
Model:        qwen2.5-coder:32b
```
The base URL is Ollama’s default. If you’ve changed it, use whatever port you configured. Save and send a test prompt — “What’s 2+2?” is a fine smoke test. If you get a response, the wiring is good.
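If the test prompt hangs or errors, it helps to hit Ollama’s API directly and take Cline out of the loop. This is the standard /api/generate endpoint — if this fails too, the problem is on the Ollama side, not Cline’s:

```bash
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:32b",
  "prompt": "What is 2+2?",
  "stream": false
}'
```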
The quality gap, concretely
I ran the same set of 12 small tasks through Claude 3.5 Sonnet (cloud) and Qwen 2.5 Coder 32B (local). All twelve were boilerplate or moderately complex single-file work — the easy half of normal Cline tasks.
| Outcome | Sonnet | Qwen 32B local |
|---|---|---|
| First-pass correct | 9 / 12 | 5 / 12 |
| Correct after one revision | 12 / 12 | 8 / 12 |
| Eventually correct | 12 / 12 | 10 / 12 |
| Avg time per task | 35 sec | 90 sec |
| Avg cost per task | $0.10 | $0 |
For five of the twelve tasks, Qwen produced clean output the first time and there’s no meaningful difference in the result. For three more, Qwen needed an extra round of clarification but got there. Two others took several revisions before they came out right. For the last two, I gave up and switched to Sonnet — Qwen kept missing a piece I couldn’t get it to incorporate without basically writing the whole answer in the prompt.
The pattern: Qwen is competent for tasks where the answer is similar to many things in its training data. It struggles with tasks that require synthesizing across patterns it hasn’t seen together before.
Where local models are good enough
After a week of mixed use, the categories where I happily use Qwen 32B locally:
Boilerplate generation. “Add a getUserById function in this style” — Qwen handles this cleanly. The pattern is in its training data many times over.
Test scaffolding. Writing the structure of a test file, mocking the obvious things, then I fill in assertions. Qwen does the structural part well.
Refactor renames. “Rename UserContext to AuthContext across these three files.” Qwen handles this fine.
Documentation generation. Pulling JSDoc out of code, writing brief module docstrings. Qwen is roughly as good as Sonnet here.
Working on a flight or in a coffee shop with bad Wi-Fi. Quality drop is real but acceptable when the alternative is no AI assistance at all.
Where local models hit a wall
Multi-file reasoning beyond two or three files. Qwen 32B’s effective context is shorter than Sonnet’s, and even within its context it pays less attention to distant tokens. Cross-file inference that Sonnet handles routinely tends to leave Qwen confused.
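One partial mitigation, sketched below: Ollama’s default context window is small relative to the prompts Cline sends, so long requests can get truncated before the model even has a chance. Building a variant with a larger num_ctx via a Modelfile avoids the truncation. The model name and the 32768 figure are just examples, the larger window costs memory, and none of this changes how well the model attends to what it does see:

```bash
# Sketch: derive a larger-context variant of the base model.
# The tag "qwen2.5-coder-32k" and the num_ctx value are illustrative, not canonical.
cat > Modelfile.qwen-32k <<'EOF'
FROM qwen2.5-coder:32b
PARAMETER num_ctx 32768
EOF

ollama create qwen2.5-coder-32k -f Modelfile.qwen-32k
```

Then point Cline’s Model field at the new tag instead of the base one.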
Subtle bug diagnosis. “This test fails with X — what’s wrong?” Qwen often picks a plausible cause that isn’t the actual one. Sonnet is meaningfully better at causal reasoning over code.
Tasks requiring up-to-date library knowledge. Qwen’s training cutoff is older than the cloud models’, and library APIs change. I had multiple sessions where Qwen suggested deprecated React patterns or pre-Next.js-14 idioms.
Prompt iteration. Each Cline turn re-sends the conversation. With cloud APIs and prompt caching, this is cheap. Locally, every turn re-runs the model from scratch through the whole context. A 6-turn back-and-forth that takes 90 seconds on Sonnet takes 5+ minutes on local.
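One knob that does help: by default Ollama unloads a model after a few idle minutes, so a slow back-and-forth can also pay a reload penalty on top of re-processing the context. OLLAMA_KEEP_ALIVE is a real Ollama setting; the 30-minute value below is just an example, and it does nothing about re-reading the full conversation each turn:

```bash
# Keep loaded models resident for 30 minutes of idle time instead of the default few minutes.
OLLAMA_KEEP_ALIVE=30m ollama serve
```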
Performance, real numbers
On the M3 Max 64GB:
| Model | First token | Tokens/sec | Memory used |
|---|---|---|---|
| qwen2.5-coder:32b | 1.2s | ~30 t/s | 19GB |
| llama3.1:70b (Q4) | 2.5s | ~12 t/s | 40GB |
| deepseek-coder:33b | 1.4s | ~25 t/s | 18GB |
| qwen2.5-coder:7b | 0.3s | ~80 t/s | 5GB |
Qwen 32B at 30 tokens/second feels usable for short responses, slow for long ones. Llama 70B at 12 tokens/second is on the edge — long responses test your patience.
The 7B model is fast enough to feel snappy but the quality drop is much larger than the speed gain. Don’t bother unless you have hardware that can’t run 32B.
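These numbers are straightforward to reproduce on your own hardware. `ollama run --verbose` prints prompt and generation throughput after each response, and `ollama ps` shows the memory footprint of whatever is currently loaded:

```bash
ollama run qwen2.5-coder:32b --verbose "Write a binary search in Python."
ollama ps   # currently loaded models and how much memory they're using
```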
A realistic hybrid setup
What I actually run after the experiment:
- Cloud Sonnet for daily work — primary, no quality compromise
- Local Qwen 32B configured as a Cline profile — for offline situations and for tasks where I’d rather not send code to a third party
The local profile gets used maybe 5-10% of the time. The rest is cloud. That ratio is honest, not a pitch for local being more capable than it is.
When local actually wins
Three situations where local is unambiguously the right choice:
Code under NDA you can’t send to third-party APIs. Even with vendor data-privacy guarantees, some legal teams say no. Local is the answer.
Air-gapped or offline environments. Government, defense, certain medical contexts. The cloud option doesn’t exist; local is the only option.
Bandwidth-constrained development. I’ve used local Qwen on a slow tethered connection in a hotel because the cloud API was timing out repeatedly. It’s not the fastest setup, but it works without the network.
For everything else — typical commercial development on a fast network — cloud is currently better, faster, and cheaper than the human-time cost of working around local model limitations.
That balance will shift as local models improve. It’s already meaningfully better than it was a year ago. Watch the space, but don’t pretend it’s already at parity.