WhisperFlux preview: speech, speakers, summaries — all local
Our flagship enters internal beta. A walkthrough of the streaming pipeline and what we learned shipping ASR on real phones.
Two months in the field
We have been quietly testing WhisperFlux internally for two months. This post is the first public look at how it works — and why it works the way it does.
The hard part is not transcription
The core challenge of on-device transcription is not transcription itself. Whisper-class models trained on multilingual corpora deliver word-error rates that are competitive with cloud services for most consumer languages. The hard part is diarization: figuring out who said what, in real time, with no cloud help and a 4-watt power envelope.
We landed on a streaming clustering approach with three properties:
- End-to-end latency under 1.2 s on flagship NPUs (under 2 s on midrange)
- Robust to overlapping speech, when multiple speakers talk at once
- Refines past labels as new audio arrives, without flickering the UI
A 12 MB embedding model produces a 192-dimensional speaker vector for every voiced segment; an online agglomerative clusterer with a six-second update window assigns speaker labels. Past labels can be merged or split as evidence accrues, and the transcript view animates each change rather than popping it into place.
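To make the clustering step concrete, here is a minimal sketch of an online assignment loop: each voiced-segment embedding is matched to the nearest speaker centroid by cosine distance, and a new speaker is created when nothing is close enough. The class name, the 0.45 distance threshold, and the running-mean update are illustrative assumptions, not WhisperFlux's actual implementation, which also merges and splits past labels inside its update window.

```python
import numpy as np

class OnlineDiarizer:
    """Toy online speaker clusterer: nearest-centroid assignment
    with a cosine-distance threshold (values are hypothetical)."""

    def __init__(self, merge_threshold=0.45):
        self.threshold = merge_threshold
        self.centroids = []  # running mean embedding per speaker
        self.counts = []     # segments seen per speaker

    def assign(self, embedding):
        """Return a speaker label for one voiced-segment embedding."""
        e = embedding / np.linalg.norm(embedding)
        if self.centroids:
            dists = [1.0 - float(e @ c / np.linalg.norm(c))
                     for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                # close enough: fold into the matched speaker's running mean
                n = self.counts[best]
                self.centroids[best] = (self.centroids[best] * n + e) / (n + 1)
                self.counts[best] += 1
                return best
        # no centroid within threshold: start a new speaker
        self.centroids.append(e)
        self.counts.append(1)
        return len(self.centroids) - 1
```

A production clusterer would also revisit earlier segments, which is what allows labels to merge or split later; the threshold above stands in for that richer agglomerative criterion.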
Summaries that survive context length
Long meetings overflow the model's context. WhisperFlux summarizes in chunks of ~3 minutes with a rolling memory: each chunk produces a structured "what happened" record (decisions, action items, open questions) that the next chunk reads. Final summaries are assembled from these records, never by stuffing the full transcript into the prompt. This is the difference between a coherent end-of-meeting briefing and a hallucinated paragraph.
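The rolling-memory scheme can be sketched as follows: each chunk is summarized into a structured record, only that record is passed to the next chunk, and the final briefing is assembled from the records rather than the raw transcript. The `ChunkRecord` fields mirror the post's "decisions, action items, open questions" structure; `summarize_chunk` stands in for the on-device model call and is a hypothetical callable, not a real API.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    """Structured 'what happened' record produced per ~3-minute chunk."""
    decisions: list = field(default_factory=list)
    action_items: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

def summarize_meeting(chunks, summarize_chunk):
    """summarize_chunk(text, memory) -> ChunkRecord, where memory is the
    previous chunk's record. The full transcript is never prompted."""
    memory = ChunkRecord()
    records = []
    for text in chunks:
        record = summarize_chunk(text, memory)
        records.append(record)
        memory = record  # only the structured record rolls forward
    # final briefing: concatenate the per-chunk records
    return ChunkRecord(
        decisions=[d for r in records for d in r.decisions],
        action_items=[a for r in records for a in r.action_items],
        open_questions=[q for r in records for q in r.open_questions],
    )
```

The key property is that memory stays bounded: no matter how long the meeting runs, each model call sees one chunk of audio plus one fixed-size record.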
A calm UI for a paranoid moment
A privacy-preserving recorder should feel calm, not clinical. We obsessed over the recording state — the dot pulse rate, the wave envelope, the timer typography — until the app quietly reassures the user that it is working, without showing off. The single bright affordance is the airplane icon at the top of the screen, indicating that nothing is being uploaded.
What is next
Speaker labels you can rename. Per-speaker action-item assignment. Export to the notes app you already use — by file, never by API.
Want updates like this in your inbox?
No newsletter platform. No tracking. We send a single email per launch.
Subscribe