Cline + Ollama for fully self-hosted AI coding: realistic expectations
Published 2026-04-17 by Owner
Running coding AI fully on your own hardware appeals for two reasons: privacy (your code never leaves your machine) and zero per-token cost. Cline supports this through Ollama-compatible endpoints. The setup is easy. The experience requires recalibrating expectations.
I’ve spent two months running Cline with various local models on a workstation with an RTX 4090 (24GB VRAM) and 64GB RAM. Here’s what works, what doesn’t, and what hardware you actually need.
The setup
Install Ollama (brew install ollama on Mac, the official install script on Linux). Pull a coding model:
ollama pull qwen2.5-coder:32b
In Cline, configure the API provider as OpenAI Compatible:
Base URL: http://localhost:11434/v1
API Key: ollama (any non-empty string)
Model ID: qwen2.5-coder:32b
That’s it for the connection. Cline immediately works against the local model.
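If you want to confirm the endpoint is answering before pointing Cline at it, any OpenAI-compatible client will do. A minimal Python sketch, assuming the openai package is installed and ollama serve is running on the default port:

```python
# Sanity-check the local endpoint with the same values Cline will use.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # any non-empty string, per the config above
)

resp = client.chat.completions.create(
    model="qwen2.5-coder:32b",
    messages=[{"role": "user", "content": "Reverse a string in Python, one line."}],
)
print(resp.choices[0].message.content)
```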
Models I’ve tested
| Model | Size | VRAM needed | Speed | Quality |
|---|---|---|---|---|
| Qwen 2.5 Coder 32B | 19GB | 24GB | 8 tok/s | Best of the local set |
| Qwen 2.5 Coder 14B | 9GB | 12GB | 24 tok/s | Decent for simpler tasks |
| Qwen 2.5 Coder 7B | 4.5GB | 8GB | 60 tok/s | Acceptable for autocomplete-style |
| DeepSeek Coder V2 16B (lite) | 9GB | 12GB | 22 tok/s | Strong on code, weak on instructions |
| CodeLlama 34B | 19GB | 24GB | 6 tok/s | Older, mostly outclassed by Qwen |
| Llama 3.1 70B (4-bit quant) | 38GB | 48GB | 2 tok/s | Strong but slow on consumer hardware |
For Cline specifically, Qwen 2.5 Coder 32B is the best balance I’ve found on a single 24GB GPU. It’s not as good as Claude or GPT-4o, but it’s good enough for non-trivial work.
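The VRAM column is roughly parameter count times bytes per weight at Ollama's default quantization (typically 4-bit), plus headroom for the KV cache and runtime. A back-of-the-envelope sketch; the 20% overhead factor is my own rough assumption, not a measurement:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4, overhead: float = 1.2) -> float:
    """Weights at the given quantization plus ~20% headroom for KV cache and runtime."""
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params * (bits/8) bytes ~= GB
    return round(weight_gb * overhead, 1)

for name, size_b in [("qwen2.5-coder:32b", 32), ("qwen2.5-coder:14b", 14),
                     ("qwen2.5-coder:7b", 7), ("llama3.1:70b", 70)]:
    print(f"{name}: ~{estimate_vram_gb(size_b)} GB")
```

It tracks the table closely enough to tell whether a given model will fit on your card before you pull 20GB of weights.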
Where local models match cloud
Boilerplate generation. Writing CRUD endpoints, simple test cases, repetitive type definitions. Qwen 2.5 Coder produces this at quality comparable to Claude 3.5.
Code explanation. Asked to explain what a function does, the local models produce reasonable summaries. Not as polished as Claude, but accurate.
Single-file refactors. Renaming, extracting methods, converting between equivalent patterns. The models have enough capability to do this competently within a 2-4k token context.
Languages with strong open training data. Python, JavaScript, Go, Rust, Java. Models have seen plenty of these in pretraining; quality is reasonable.
Where local models fall short
Multi-file reasoning. Cline asks the model to plan a change across multiple files. Local models lose track of files outside the immediate one. Where Claude would say “this also affects services/billing.ts,” Qwen often misses the connection. Tasks that touch >3 files have a noticeable quality drop.
Tool use reliability. Cline's tool calling depends on the model emitting correctly structured XML every turn. Local models occasionally drift, emitting close-but-not-quite-correct tool calls that Cline rejects. The error rate per tool call is roughly 3-5x higher than with Claude.
Context-window utilization. A 32k context window in Qwen feels effectively shorter than a 32k window in Claude. Long contexts degrade attention more sharply on the local models. Once you’re past 16k tokens of conversation, quality drops noticeably.
Niche languages. Elixir, Crystal, F#, Zig, anything with sparse training data. Local models guess; cloud models also guess but guess closer to right.
Following long .clinerules files. Local models pay attention to the first 10-15 rules and improvise on the rest. Claude follows much longer rule sets.
Hardware reality check
The “I’ll just run a local model” calculation on consumer hardware:
RTX 4090 (24GB). Qwen 2.5 Coder 32B fits in VRAM with room for context. About 8 tokens/second sustained. For a Cline session that produces 500 tokens of response per turn, that’s roughly a minute per turn. Workable but slow.
RTX 4080 (16GB) or 3090 Ti (24GB). 14B models work well; 32B is on the edge with quantization. Qwen 2.5 Coder 14B at ~24 tok/s is the realistic choice.
Mac M3 Max (40GB unified). Surprisingly competitive. Qwen 2.5 Coder 32B runs at about 14 tok/s in my testing, noticeably faster than my 4090 numbers above. Probably the best consumer setup for this.
Mac M2 Pro (16GB). 7B models are realistic; 14B is tight. Acceptable for autocomplete-style use; underwhelming for full Cline workflows.
Anything with under 12GB of VRAM/RAM. Stick with 7B models. The quality is enough for inline completions but not for Cline’s autonomous loop.
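To translate tokens-per-second into felt latency, divide the output tokens per turn by the generation speed. A quick check using the 500-token turn from the 4090 example and the speeds quoted above:

```python
def seconds_per_turn(tokens_per_turn: int, tok_per_s: float) -> float:
    """Wall-clock generation time for one Cline response at a given speed."""
    return tokens_per_turn / tok_per_s

# Speeds are the figures quoted earlier for each setup.
for setup, speed in [("RTX 4090, 32B", 8), ("RTX 4080, 14B", 24), ("M3 Max, 32B", 14)]:
    print(f"{setup}: ~{seconds_per_turn(500, speed):.0f}s per 500-token turn")
```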
The cost picture
If your alternative is Claude 3.5 Sonnet via Cline, focused work costs me around $4-8 per hour. Call it $20 a day; over 100 Cline-heavy working days a year, that's about $2000.
A workstation that runs Qwen 2.5 Coder 32B comfortably costs $2500-4000 over a three-year amortization window (hardware plus electricity). For a single user, that puts breakeven against Claude at roughly 18-24 months at typical usage.
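The breakeven figure is just the workstation cost divided by what you would otherwise spend on cloud tokens, converted to months. A sketch with the numbers from this section; the exact result depends on where in the hardware range you land and how many days a year you actually use it:

```python
def breakeven_months(hardware_cost: float, cloud_per_day: float = 20, days_per_year: int = 100) -> float:
    """Months until the workstation pays for itself against per-token cloud spend."""
    annual_cloud = cloud_per_day * days_per_year  # e.g. $20/day * 100 days = $2000/yr
    return hardware_cost / annual_cloud * 12

for cost in (2500, 3250, 4000):
    print(f"${cost} workstation: ~{breakeven_months(cost):.0f} months to breakeven")
```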
For privacy-mandatory contexts (regulated industries, sensitive code), the math doesn’t matter — you can’t use cloud anyway. For everyone else, the local-model setup is mostly a cost play, and the cost play is marginal once you account for the quality difference.
Where local + cloud hybrid makes sense
The pattern that’s been working for me:
- Local model for routine tasks (formatting, simple refactors, autocomplete-style)
- Cloud model for hard tasks (multi-file refactors, debugging, anything ambiguous)
Cline supports per-mode model selection in 3.4+. I use Qwen for Plan mode (cheap planning) and Claude for Act mode (reliable execution). This cuts the cloud bill substantially while keeping the quality on the edits where it matters.
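Outside of Cline's own per-mode settings, the same split is easy to express in any script that talks to OpenAI-compatible endpoints. A minimal sketch for illustration only; the cloud model name, the key placeholder, and the hard/easy routing rule are my assumptions, not anything Cline does internally:

```python
from openai import OpenAI

# Local Ollama endpoint for routine work, a cloud endpoint for hard tasks.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI(api_key="YOUR_CLOUD_KEY")  # defaults to the hosted OpenAI endpoint

def ask(prompt: str, hard: bool) -> str:
    # Route by difficulty: cheap local model for routine edits, cloud for anything ambiguous.
    client, model = (cloud, "gpt-4o") if hard else (local, "qwen2.5-coder:32b")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Extract this block into a helper function: ...", hard=False))
```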
Pure local is feasible for some workloads. Pure cloud is the path of least resistance. Hybrid is the practical middle that most people end up in if they care about privacy or cost but don't need either pushed to its maximum.
What I’d skip
Don’t bother with smaller (sub-7B) coding models for Cline. The autonomous loop requires reliability that small models don’t have. They’ll work briefly and then get stuck in tool-call retry loops.
Don’t bother with general-purpose models (Llama 3.1 8B, Mistral 7B) over coding-specialized models of similar size. The coding-specialized variants are noticeably better at the specific task.
Don’t expect parity with Claude or GPT-4o. The gap is real. The local models are usable; they’re not equivalent.