A few months ago I spent a long weekend trying to make Claude Code work with a locally hosted model on my Mac Mini, with Ollama serving the weights. The setup is well documented. The appeal is obvious: privacy, no per token cost, no rate limits, no dependence on a single vendor. On paper, this is the dream configuration for anyone who treats coding assistants as part of their stable infrastructure rather than as a hosted experiment.
In practice, after living with it for a fortnight, I switched back to using the hosted model and have stayed there since. This is the writeup of why, and what would have to change for me to revisit the question. It is meant as a measured negative finding, not a complaint.
The setup people are excited about
The configuration is genuinely simple. Claude Code can be pointed at any compatible API endpoint, including a local Ollama instance, by setting a small number of environment variables. The largest Gemma 3 variant that runs comfortably on the Mini is the 12B, which sits happily in unified memory and gives me roughly 18 to 28 tokens per second on short prompts. I also tried a quantised 27B that just about fits if I close everything else, with the predictable trade off in throughput.
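For concreteness, here is the shape of it as a minimal launcher sketch rather than a copy-paste recipe. In my setup the variables that mattered were ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, and ANTHROPIC_MODEL; the port, the placeholder token, and the model tag below are assumptions for illustration, and the local endpoint still has to present an Anthropic compatible API, whether that is Ollama directly or a small proxy in front of it.

```python
import os
import subprocess

# Sketch: launch Claude Code against a local endpoint instead of the hosted API.
# The URL, token, and model tag are assumptions for illustration; the local
# server has to speak an Anthropic compatible API for this to work at all.
env = os.environ.copy()
env["ANTHROPIC_BASE_URL"] = "http://localhost:11434"  # assumed local endpoint
env["ANTHROPIC_AUTH_TOKEN"] = "local-placeholder"     # no real key needed locally
env["ANTHROPIC_MODEL"] = "gemma3:12b"                 # model tag as served by Ollama

subprocess.run(["claude"], env=env)
```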
The mental model is clean. The CLI is identical. The interaction loop is the same. The only thing that changes is which model is doing the thinking. Anyone reading the documentation for the first time would conclude that this is a Sunday afternoon project, and they would be right.
The appeal once it is running is real. The first hour of using a coding assistant that does not send your code anywhere feels different in a way that is hard to put into words until you have done it. There is no mental tax of remembering not to paste in a particular file. There is no soft ceiling on how much of the codebase you let it see. There is no awkward silence when you are working offline.
That hour is the part of the experience that the marketing for any local first AI tool sells you, and it is genuine. The next forty hours are where the cracks start to show.
What actually happened
Three failure modes appeared, in order of severity.
The first was latency on real coding tasks. Asking a model to suggest a one liner is one thing. Asking it to read three files, propose a refactor that touches all of them, and write the diff is a different workload. Even on the Mac Mini at the 12B size, the round trip on a task of any size sits in the tens of seconds, and on the larger model it pushes past the minute mark before any output streams back. That is long enough that I lose flow state and switch to doing the change by hand.
This is partly a hardware story. The Mini is doing well to keep up at all. A workstation with a discrete GPU and dedicated VRAM would close most of the gap, at the cost of a different machine and a different power profile.
The second was the capability gap on multi file edits. The hosted models I am used to are, as of writing, comfortably ahead of any local model I can run on a single Mac. The gap is largest exactly where I most want help, on the kind of mid sized refactor that touches three or four files and needs to keep types, imports, and naming aligned across all of them. The local model would happily produce a diff that compiled, that did the right thing in the file it was looking at, and that subtly broke an interface in a sibling file because it had not held the wider picture in mind.
The diffs were never confidently wrong in a single place. They were quietly wrong in a way that needed careful reading to catch, which is the worst possible failure mode for a coding assistant. I spent more time auditing the output than I would have spent writing it from scratch, and after a week I noticed that I had stopped trusting it on anything that was not trivial.
The third was reasoning quality on tasks that needed actual thought, as opposed to mechanical translation. Things like “tell me what is suspicious about this file’s behaviour” or “why does this test fail in the way it does”. The local model produced answers that sounded like answers but did not always survive contact with the actual code. The hosted models are still better at this kind of work in a way that matters.
The narrow set of cases where it does make sense
Local Claude Code is not useless. It has a real place. The honest version is that it shines exactly when the task is small and the privacy story is hard.
Quick local refactors on a single file fall in this category. Renaming a function and updating its callers within a single module. Generating a small SQL query from a description, where the schema fits in a single prompt. Writing a short utility function with a clear contract. On these tasks, the local model is fast enough, accurate enough, and the “no data leaves the machine” guarantee is genuinely valuable.
Anything that involves code that I am contractually required not to send to a third party also lives here. A handful of client codebases I work with come with confidentiality clauses that make a hosted assistant the wrong tool, full stop. For those projects, the local setup is the correct answer even if it is slower and weaker, because the alternative is no assistant at all.
The thing I do not have, and would like, is a clean configuration in Claude Code that lets me say “for this project, route to the local backend; for everything else, use the hosted one” without having to manage two separate environments. That is a tooling problem rather than a model problem, and the friction is the main reason my local setup has fallen out of daily use.
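A rough version of what I mean would be a small wrapper, sketched below, that checks for a marker file in the project root and swaps the environment before launching the CLI. The marker file name and the endpoint details are hypothetical, not anything Claude Code understands natively, and maintaining glue like this on every machine is exactly the two-environment juggling I would rather not do.

```python
from pathlib import Path
import os
import subprocess
import sys

# Sketch of per-project routing: if the project opts in via a marker file,
# point Claude Code at the local backend; otherwise leave the hosted default
# untouched. The marker file name and endpoint details are assumptions.
env = os.environ.copy()
if (Path.cwd() / ".use-local-model").exists():
    env["ANTHROPIC_BASE_URL"] = "http://localhost:11434"  # assumed local endpoint
    env["ANTHROPIC_AUTH_TOKEN"] = "local-placeholder"
    env["ANTHROPIC_MODEL"] = "gemma3:12b"

subprocess.run(["claude", *sys.argv[1:]], env=env)
```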
What would have to change
The thing that would tip the balance here is not tooling alone. It is better hardware and better models, in roughly equal measure.
On the hardware side, the relevant change is more memory bandwidth at the price point of a sensible desktop machine. Mac Studio class hardware closes much of the gap on the larger models, but the pricing makes it a workstation purchase rather than a casual one, and it is still not as quick on a 27B class model as the hosted equivalents are at frontier sizes.
On the model side, the relevant change is open weights models that close the multi file reasoning gap. There has been steady progress here, and a year on from the last time I tried this the local options are noticeably better than they were. Another year on, the 12B class local models may well be where the 27B class is now, and the 27B class may be where today’s hosted mid tier sits. If that happens, the calculation flips.
The third leg, which is sometimes overlooked, is the surrounding tooling. Better caching of model weights, better warm start behaviour, better integration with editor diagnostics, better ways to give the model a project map without dumping the whole tree into the context window. These are all incremental improvements that compound.
I am genuinely optimistic about all three. I am also clear that the current state of the art on local hardware is not where I want it to be for daily driver use, and I would rather be honest about that than pretend otherwise to make the demo look good.
A non grumpy conclusion
Local AI tooling for coding is in roughly the same place that local LLMs for chat were eighteen months ago. Genuinely useful for a narrow set of tasks. A reasonable answer when privacy makes the hosted option impossible. Not yet a sensible default for general work, even for someone who, like me, is willing to pay a meaningful productivity tax in exchange for a better privacy story.
The wider lesson maps neatly to the kind of build versus buy decisions that keep coming up in conversations with small businesses. The temptation to build everything in house, on infrastructure you fully control, is strong. It is sometimes the right answer. It is more often partially the right answer, where some workloads belong locally and others belong in a hosted service, and the engineering work consists of putting the line in the right place rather than choosing a side. That is the kind of question I help with through efficiency consulting, and the answers tend to be more nuanced than the loud version of either side suggests.
For now, my own configuration is a hosted model for daily driver coding, a local Ollama setup for the small and the sensitive, and a quarterly review of the gap between them. Each review I expect to find that the gap has narrowed by a notch. At some point the daily driver will flip back to local. Today is not that day.