Journal
Product · Published 2026-01-29 · 7 min read

VisionFlux: a roadmap for local visual understanding

Our second product takes shape. What local vision-language models can do today — and where we are betting they go next.

Today's local VLMs are real

A year ago, "vision-language model on a phone" was a research thread. Today it is a product surface. Quantized 2–4B parameter VLMs (Qwen-VL, MiniCPM-V, Idefics, MLX-VLM ports of Llama-3.2-Vision) answer real questions about real images at interactive latency on flagship NPUs.

The capability frontier has crossed the consumer-product line: OCR with reasoning, document Q&A, scene description, screen understanding, and visual grounding now all fit in a phone-sized memory budget when carefully quantized.

What VisionFlux will ship first

  • Document Q&A. Point at a page, ask anything. Layout-aware, table-aware, bilingual.
  • Translate the world. Menus, signs, packaging — overlay translation in place, on-device, zero round-trip to a server.
  • Accessibility narration. Describe a scene to a low-vision user with adjustable verbosity, controllable refresh rate, and a privacy guarantee no cloud-narration product can match.
  • Receipt and form ingestion. Photograph an expense, get structured fields, file it locally — no third-party SaaS.
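Turning a photographed receipt into structured fields mostly comes down to prompting the VLM for JSON and then validating what comes back. A minimal sketch of that validation step, assuming a hypothetical prompt that asks the model for a `merchant`/`total`/`currency`/`date` object (the schema and field names here are illustrative, not VisionFlux's actual contract):

```python
import json
from dataclasses import dataclass


@dataclass
class Expense:
    merchant: str
    total: float
    currency: str
    date: str


def parse_receipt_output(raw: str) -> Expense:
    """Parse the model's JSON answer into a structured expense record.

    Assumes the VLM was prompted to emit a flat JSON object with these
    keys; anything malformed raises ValueError so the caller can re-prompt
    rather than file a bad record.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}")
    try:
        return Expense(
            merchant=str(data["merchant"]),
            total=float(data["total"]),
            currency=str(data["currency"]),
            date=str(data["date"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"missing or malformed field: {exc}")


# Hypothetical model output for a photographed cafe receipt:
raw = '{"merchant": "Cafe Luna", "total": "14.50", "currency": "EUR", "date": "2026-01-12"}'
expense = parse_receipt_output(raw)
```

The re-prompt-on-failure loop matters more than the parsing itself: small local models occasionally emit prose around the JSON, and a strict validator plus one retry catches most of it without any cloud fallback.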

Where we are going

  • Persistent, opt-in local memory of the things you have looked at, indexed by a small embedding model so you can ask "where did I see that thing?".
  • Scene mode — a continuously-running narration of a changing environment, throttled by motion and battery.
  • Multimodal notes — capture a thought as a photo + voice clip and let the model file it under the right project, with the right tags.
  • AR overlays when the platform allows — translation, accessibility, contextual help, all rendered locally.
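The "where did I see that thing?" memory above is, at its core, nearest-neighbour search over embeddings stored on-device. A toy sketch of that shape, with hand-picked three-dimensional vectors standing in for a real embedding model's output (class and method names here are illustrative):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class VisualMemory:
    """Opt-in local index of things you've seen: (caption, embedding) pairs."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, caption: str, vector: list[float]) -> None:
        self.items.append((caption, vector))

    def query(self, vector: list[float], k: int = 3) -> list[str]:
        # Brute-force ranking is fine at phone scale (thousands of items).
        ranked = sorted(self.items, key=lambda it: cosine(it[1], vector), reverse=True)
        return [caption for caption, _ in ranked[:k]]


mem = VisualMemory()
mem.add("red bicycle outside the bakery", [1.0, 0.0, 0.0])
mem.add("blue umbrella by the station", [0.0, 1.0, 0.0])
mem.add("handwritten wifi password on a whiteboard", [0.0, 0.0, 1.0])

# A query embedding close to the first item should retrieve it:
hits = mem.query([0.9, 0.1, 0.0], k=1)
```

In practice the vectors would come from a small on-device embedding model and live in a local store, but the retrieval logic stays this simple until the index grows past what linear scan can serve interactively.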
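"Throttled by motion and battery" in scene mode reduces to one policy decision per frame: is it worth re-running the model right now? A minimal sketch of such a policy; the thresholds and the function itself are illustrative assumptions, not the shipping heuristic:

```python
def should_refresh(last_refresh: float, now: float,
                   motion_score: float, battery_level: float) -> bool:
    """Decide whether scene narration should re-run.

    Hypothetical policy: the base refresh interval stretches as the
    battery drains, and a strong motion signal (scene changed a lot)
    forces a much earlier refresh. Times are in seconds; motion_score
    and battery_level are normalized to [0, 1].
    """
    base_interval = 2.0 if battery_level > 0.5 else 6.0
    if motion_score > 0.8:
        # Big scene change: narrate promptly even on low battery.
        base_interval = min(base_interval, 0.5)
    return (now - last_refresh) >= base_interval
```

A pure function like this is easy to unit-test and keeps the battery/motion trade-off in one auditable place, separate from the camera and inference loops.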

What we are not building

We are not building a cloud-vision API competitor. The frontier of "what's the most you can extract from a single image with unlimited compute" is not where on-device wins. We win at "what's the most you can get from the camera in your pocket, right now, with no network."

That is a different product, and it is the one we want to ship.

Want updates like this in your inbox?

No newsletter platform. No tracking. We send a single email per launch.
