A few months ago I wanted to build a small vision app for myself. Nothing ambitious. A browser page that could see what the laptop camera was pointing at, describe it when asked, and accept a voice instruction without any of that material leaving the machine on my desk. The exercise turned into a working prototype that I have used quite a lot since, and that has clarified my thinking about where local first vision actually fits.
The whole stack runs on the Mac Mini M4 I described in an earlier post. The browser is just a thin client. Nothing in the loop touches a third party API.
Why it matters where the camera feed goes
Cloud vision APIs are often the right answer. They are accurate, fast, well documented, and the per call price is usually low enough that you do not think about it. The tradeoff is that your camera feed is then someone else’s input. For a lot of use cases that is fine. Holiday photos, hobby projects, anything where you would happily post the picture in a group chat.
For a number of other use cases it is less fine. A small clinic that wants to use a phone camera to triage skin lesions has a real problem with the default cloud answer. So does a legal team scanning physical documents on a copy stand. So does anyone whose camera feed regularly captures other people who have not consented to having their faces sent to a US data centre.
The interesting design space sits between those two extremes. Most of what an office camera sees does not need to be processed locally for privacy reasons. Some of it clearly does. The space in the middle, where you would prefer local but you are not strictly required to do it that way, is where the architecture I am about to describe earns its keep.
Architecture
The shape is simple.
[Browser]  <----- HTTP / WebSocket ----->  [Mac Mini service]
    |                                          |
    |- camera capture                          |- RF-DETR (object detection)
    |- Whisper.js (voice in)                   |- Gemma 3 vision (description)
    |- canvas overlay (boxes)                  |
    |- chat UI                                 |
Three model components, doing genuinely different jobs.
RF-DETR runs on the Mac for object detection. It receives a frame, returns bounding boxes with class labels and confidence scores. Latency is low enough that you can run it at a few frames per second over a websocket and the overlay feels responsive.
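To make that concrete, here is a minimal sketch of the Mac side of that loop, assuming Roboflow's rfdetr Python package and the websockets library, with the browser sending each JPEG frame as a binary message. The port, threshold and payload shape are illustrative rather than a transcript of my actual service.
import asyncio
import io
import json
import websockets               # pip install websockets
from PIL import Image
from rfdetr import RFDETRBase   # assumption: Roboflow's RF-DETR package

model = RFDETRBase()

async def handle(ws):
    # One binary message in (a JPEG frame), one JSON message out (the boxes).
    async for frame_bytes in ws:
        image = Image.open(io.BytesIO(frame_bytes)).convert("RGB")
        detections = model.predict(image, threshold=0.5)
        payload = [
            {"box": box.tolist(), "class_id": int(cid), "score": float(conf)}
            for box, cid, conf in zip(
                detections.xyxy, detections.class_id, detections.confidence
            )
        ]
        await ws.send(json.dumps(payload))

async def main():
    async with websockets.serve(handle, "0.0.0.0", 8765):
        await asyncio.Future()  # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())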
Gemma 3 runs on the Mac for vision language reasoning. It receives a frame plus a question and returns a textual answer. Latency is significantly higher than detection, in the order of one to three seconds for a single frame and a short prompt, so it is a tool you reach for deliberately rather than continuously.
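Because Gemma is served through Ollama (the warm up behaviour further down is a consequence of that), the VLM call is a single HTTP request. A sketch, assuming the default port and a gemma3 model tag, both of which are configuration details rather than requirements.
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def describe(frame_jpeg: bytes, question: str) -> str:
    # Ollama accepts images as base64 strings alongside the text prompt.
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "gemma3",
            "prompt": question,
            "images": [base64.b64encode(frame_jpeg).decode("ascii")],
            "stream": False,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]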
Whisper.js runs in the browser, in WebAssembly, transcribing voice input on the device. Audio never leaves the laptop.
The browser holds it all together. You point the camera, speak a question, the question is transcribed locally, sent to the Mac with the current frame, and the Mac decides whether to answer from the bounding boxes alone or to call into the VLM.
Why RF-DETR rather than YOLO
I started with YOLO because that is what most people start with. It is fast, accurate, and the surrounding tooling is mature. Two things eventually pushed me to RF-DETR.
The first is the absence of non maximum suppression in the inference path. RF-DETR is a transformer based detector that emits a fixed set of object queries directly. There is no NMS step, which removes a class of awkward edge cases around overlapping boxes for the same object, and which simplifies the postprocessing code on the browser side considerably.
The second is latency consistency. YOLO is faster on the average frame, often noticeably so. RF-DETR is more consistent across frames, with fewer of the latency spikes you sometimes see when a YOLO model is asked to draw a lot of boxes. For a real time overlay, consistency matters more than peak speed. A tracker that hitches every few seconds is more annoying than a slightly slower one that never hitches.
The model size is reasonable. The variant I am running fits comfortably alongside a small VLM in unified memory and leaves enough headroom for the browser, the websocket server, and the rest of the stack.
Whisper.js in the browser
The browser side speech transcription is the part of this that surprised me most. Modern Whisper builds compiled to WebAssembly are genuinely good, including on a laptop without dedicated AI hardware, provided you stick to the smaller models.
The base model is the right choice for a single user push to talk interface. Cold start is in the order of one to two seconds the first time, while the WASM module and weights load. After that, transcription of a four to five second utterance completes in roughly the time it takes to lift your finger off the spacebar.
Where it gets less convincing is on continuous streaming transcription with poor microphones, in noisy rooms, or with strong accents that the model has not seen much during training. Push to talk with a decent headset is the path of least pain. Anything more ambitious starts to want a server side model and a more carefully tuned audio pipeline.
I had to make peace with two specific quirks. The first is that the cold cache is genuinely cold. If a user lands on the page and presses talk in the first two seconds, the first transcription will be late. A small priming step that loads the WASM module on page load fixes most of that. The second is that the models occasionally hallucinate filler words at the start of an utterance, particularly the word “you” for short phrases. A simple regex that strips a leading “you ” before passing the text to the next stage was the cheapest fix.
Calling Gemma only when it adds something
The VLM is the slowest piece of the chain by an order of magnitude. The architectural choice that makes the app feel responsive is treating it as a tool to be invoked, rather than a step that runs every frame.
Most camera questions can be answered from the bounding boxes. “How many people are in the room” is a counting query against the detector output, no VLM needed. “Is the laptop on the desk” is a presence query. “What is the dominant colour of the chair” is a colour query that I can answer from the average pixel inside the relevant bounding box.
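The helpers behind those box-only answers are a few lines each. A sketch, assuming the frame arrives as an HxWx3 RGB numpy array and each detection as a dict carrying a resolved class label, a confidence score and a pixel-space box; the field and function names are mine, not a library's.
import numpy as np

def count(detections, label):
    return sum(1 for d in detections if d["label"] == label)

def is_present(detections, label, min_score=0.5):
    return any(d["label"] == label and d["score"] >= min_score for d in detections)

def dominant_colour(frame, box):
    # Average pixel inside the box: crude, but good enough to answer
    # "what colour is the chair" in a chat reply.
    x1, y1, x2, y2 = (int(v) for v in box)
    r, g, b = frame[y1:y2, x1:x2].reshape(-1, 3).mean(axis=0)
    return int(r), int(g), int(b)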
For anything that requires actual scene description, “what is the person in the kitchen doing” or “describe the layout of this room”, you need the VLM. The orchestration code on the Mac decides which path to take based on a small classifier over the parsed question. When the answer is in the boxes, the response is back in a couple of hundred milliseconds. When the VLM is genuinely needed, the user sees a typing indicator for a couple of seconds while Gemma works.
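The router does not need to be clever to be useful. A sketch of its shape, with a keyword check standing in for the small classifier, extract_label as a hypothetical helper that pulls the object class out of the parsed question, and count, is_present and describe being the helpers sketched above.
def route(question, detections, frame_jpeg):
    q = question.lower()
    if "how many" in q:
        return str(count(detections, extract_label(q)))          # boxes only
    if q.startswith(("is there", "is the", "are there")):
        return "yes" if is_present(detections, extract_label(q)) else "no"
    # Anything that needs actual scene understanding falls through to Gemma.
    return describe(frame_jpeg, question)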
This split is the part of the design I am most pleased with. It mirrors the way human attention works, where you only invoke the slow careful look when the fast glance is not enough.
Limitations
Several pieces of the stack are imperfect in ways I have not solved.
Model swap times on the Mac are noticeable. If I have not used the VLM for a while and Ollama has unloaded the weights, the first call takes a few extra seconds while it warms up. A keepalive ping every few minutes mostly hides this. Mostly.
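The ping itself is trivial. A sketch, relying on Ollama's documented behaviour that a request with an empty prompt loads the model without generating anything, and on the keep_alive parameter to stretch the unload window; the interval and window are numbers I picked, not recommendations.
import time
import requests

def keep_warm(model="gemma3", interval_s=240):
    while True:
        # Empty prompt: load (or re-pin) the weights, generate nothing.
        requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": "", "keep_alive": "10m"},
            timeout=120,
        )
        time.sleep(interval_s)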
The browser side WebRTC plumbing is fragile. Camera permissions, microphone permissions, and WebSocket reconnections all need careful handling, and the failure modes are different across browsers. The app works reliably in Safari and Chrome on macOS. The Firefox story is messier, mostly because the WASM Whisper performance is more variable.
The detector’s class vocabulary is the COCO classes plus a small custom finetune I bolted on for a few items I care about. It is not a general purpose semantic model. For genuinely open vocabulary detection you want a different architecture, and that opens a different set of latency tradeoffs.
The thing I would most like to add is a small persistent memory layer, so that the assistant remembers what it has seen during a session. That is mostly a matter of writing a sensible state store on the Mac, but I have not had the appetite for the prompt engineering work that comes with it.
Where this fits commercially
I built this for fun, and because I wanted to understand the architecture in my fingers rather than only on paper. Sitting back and looking at it now, the commercial relevance is clear enough.
In the UK, regulated sectors like healthcare, legal services, and parts of financial services all have genuine reasons to keep camera and document data on premises. The current default of sending everything to a hosted model has been chosen by inertia, not by analysis. A local first stack like the one above is well within reach for any organisation with a competent technical lead and a small budget for hardware. The work involved is mostly engineering rather than research, and the results are predictable rather than experimental.
If you run a UK SME and you are wondering whether this kind of architecture is worth a closer look, I do fractional business analysis through the consultancy, focused exactly on this question of where local first AI earns its keep and where the hosted option is still the right answer. The honest version of the answer depends on what your data is, who has agreed to what about it, and how much pain a vendor outage would cause you. None of which is the kind of thing a generic blog post can settle.
What I would build next
The version of this prototype that I would actually deploy somewhere would have three changes.
A small fleet management layer for the Mini, so that updates to the detector or the VLM weights can be rolled out cleanly. An auth layer in front of the WebSocket, so that the browser is not the only thing protecting access. And a serious logging layer that records, locally, what the system was asked and what it answered, so that an operator can audit what has happened in a given week.
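The logging layer, at least, is small enough to sketch here, assuming a SQLite file on the Mini and a schema I made up for illustration.
import sqlite3
import time

conn = sqlite3.connect("audit.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS interactions "
    "(ts REAL, question TEXT, route TEXT, answer TEXT)"
)

def record(question, route, answer):
    # Everything stays on the local disk, which is the point.
    conn.execute(
        "INSERT INTO interactions VALUES (?, ?, ?, ?)",
        (time.time(), question, route, answer),
    )
    conn.commit()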
None of those are hard problems. None of them are interesting problems either. They are the kind of work that gets a prototype across the line into something that an organisation with a compliance team can actually live with, and they are the work that local first AI projects most often skip on the way to the demo.