Journal
Product · Published 2026-01-29 · 7 min read

VisionFlux: a roadmap for local visual understanding

Our second product takes shape. What local vision-language models can do today — and where we are betting they go next.

Today's local VLMs are real

A year ago, "vision-language model on a phone" was a research thread. Today it is a product surface. Quantized 2–4B parameter VLMs (Qwen-VL, MiniCPM-V, Idefics, MLX-VLM ports of Llama-3.2-Vision) answer real questions about real images at interactive latency on flagship NPUs.

The capability frontier has crossed the consumer-product line: OCR with reasoning, document Q&A, scene description, screen understanding, and visual grounding now all fit in a phone-sized memory budget when carefully quantized.

What VisionFlux will ship first

  • Document Q&A. Point at a page, ask anything. Layout-aware, table-aware, bilingual.
  • Translate the world. Menus, signs, packaging — overlay translation in place, on-device, zero round-trip to a server.
  • Accessibility narration. Describe a scene to a low-vision user with adjustable verbosity, controllable refresh rate, and a privacy guarantee no cloud-narration product can match.
  • Receipt and form ingestion. Photograph an expense, get structured fields, file it locally — no third-party SaaS.
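Turning a photographed receipt into structured fields mostly comes down to prompting the VLM for JSON and then validating what comes back. A minimal sketch of that validation step, assuming a hypothetical prompt that asks the model for a `merchant`/`total`/`currency`/`date` object (the schema and field names here are illustrative, not VisionFlux's actual contract):

```python
import json
from dataclasses import dataclass


@dataclass
class Expense:
    merchant: str
    total: float
    currency: str
    date: str


def parse_receipt_output(raw: str) -> Expense:
    """Parse the model's JSON answer into a structured expense record.

    Assumes the VLM was prompted to emit a flat JSON object with these
    keys; anything malformed raises ValueError so the caller can re-prompt
    rather than file a bad record.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}")
    try:
        return Expense(
            merchant=str(data["merchant"]),
            total=float(data["total"]),
            currency=str(data["currency"]),
            date=str(data["date"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"missing or malformed field: {exc}")


# Hypothetical model output for a photographed cafe receipt:
raw = '{"merchant": "Cafe Luna", "total": "14.50", "currency": "EUR", "date": "2026-01-12"}'
expense = parse_receipt_output(raw)
```

The re-prompt-on-failure loop matters more than the parsing itself: small local models occasionally emit prose around the JSON, and a strict validator plus one retry catches most of it without any cloud fallback.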

Where we are going

  • Persistent, opt-in local memory of the things you have looked at, indexed by a small embedding model so you can ask "where did I see that thing?".
  • Scene mode — a continuously-running narration of a changing environment, throttled by motion and battery.
  • Multimodal notes — capture a thought as a photo + voice clip and let the model file it under the right project, with the right tags.
  • AR overlays when the platform allows — translation, accessibility, contextual help, all rendered locally.
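The "where did I see that thing?" memory above is, at its core, nearest-neighbour search over embeddings stored on-device. A toy sketch of that shape, with hand-picked three-dimensional vectors standing in for a real embedding model's output (class and method names here are illustrative):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class VisualMemory:
    """Opt-in local index of things you've seen: (caption, embedding) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, caption: str, vector: list[float]) -> None:
        self.items.append((caption, vector))

    def query(self, vector: list[float], k: int = 3) -> list[str]:
        # Brute-force ranking is fine at phone scale (thousands of items).
        ranked = sorted(self.items, key=lambda it: cosine(it[1], vector), reverse=True)
        return [caption for caption, _ in ranked[:k]]


mem = VisualMemory()
mem.add("red bicycle outside the bakery", [1.0, 0.0, 0.0])
mem.add("blue umbrella by the station", [0.0, 1.0, 0.0])
mem.add("handwritten wifi password on a whiteboard", [0.0, 0.0, 1.0])

# A query embedding close to the first item should retrieve it:
hits = mem.query([0.9, 0.1, 0.0], k=1)
```

In practice the vectors would come from a small on-device embedding model and live in a local store, but the retrieval logic stays this simple until the index grows past what linear scan can serve interactively.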
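"Throttled by motion and battery" in scene mode reduces to one policy decision per frame: is it worth re-running the model right now? A minimal sketch of such a policy; the thresholds and the function itself are illustrative assumptions, not the shipping heuristic:

```python
def should_refresh(last_refresh: float, now: float,
                   motion_score: float, battery_level: float) -> bool:
    """Decide whether scene narration should re-run.

    Hypothetical policy: the base refresh interval stretches as the
    battery drains, and a strong motion signal (scene changed a lot)
    forces a much earlier refresh. Times are in seconds; motion_score
    and battery_level are normalized to [0, 1].
    """
    base_interval = 2.0 if battery_level > 0.5 else 6.0
    if motion_score > 0.8:
        # Big scene change: narrate promptly even on low battery.
        base_interval = min(base_interval, 0.5)
    return (now - last_refresh) >= base_interval
```

A pure function like this is easy to unit-test and keeps the battery/motion trade-off in one auditable place, separate from the camera and inference loops.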

What we are not building

We are not building a cloud-vision API competitor. The frontier of "what's the most you can extract from a single image with unlimited compute" is not where on-device wins. We win at "what's the most you can get from the camera in your pocket, right now, with no network."

That is a different product, and it is the one we want to ship.

Want updates like this in your inbox?

No newsletter platform. No tracking. We send a single email per launch.
