Engineering · Published 2026-02-12 · 11 min read

Speaker diarization on a phone: a deep dive

How we run end-to-end speaker diarization in real time on a 4 W power budget — without uploading any audio.

The problem

Diarization answers "who spoke when?" It is the difference between a transcript that is a wall of text and a transcript that is a conversation. Without diarization, "let's ship it" is attributed to nobody; with diarization, it lands on a name and turns into an action item.

The pipeline

  1. VAD (voice activity detection) — gate everything else. We only spend cycles on actual speech. Our VAD is a 200 KB CNN on log-mel features running every 10 ms.
  2. Speaker embedding — extract a fixed-length 192-dim vector per voiced segment. We use an ECAPA-TDNN-style model distilled to 12 MB int8.
  3. Online clustering — agglomerative clustering with refinement. New evidence updates past labels.
  4. Smoothing — short alternations under 400 ms are merged to prevent UI flicker.
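The four stages above can be sketched as one streaming loop. This is a minimal illustration, not our production code: `vad`, `embedder`, and `clusterer` are stand-ins for the real models, and the frame/segment bookkeeping is simplified.

```python
import numpy as np

FRAME_MS = 10      # VAD decision granularity from stage 1
MIN_TURN_MS = 400  # stage 4: alternations shorter than this are merged

def diarize_stream(frames, vad, embedder, clusterer):
    """Toy end-to-end loop: VAD gates all later work, each voiced
    segment yields one embedding, the clusterer assigns a speaker id."""
    segments = []  # (start_ms, end_ms, speaker_id)
    voiced = []    # frames of the current voiced segment
    t_ms = 0
    for frame in frames:
        if vad(frame):                                # stage 1: speech gate
            voiced.append(frame)
        elif voiced:                                  # segment just ended
            start = t_ms - FRAME_MS * len(voiced)
            emb = embedder(np.concatenate(voiced))    # stage 2: 192-dim vector
            spk = clusterer.assign(emb)               # stage 3: online clustering
            segments.append((start, t_ms, spk))
            voiced = []
        t_ms += FRAME_MS
    return smooth(segments)

def smooth(segments):
    """Stage 4: absorb turns under MIN_TURN_MS into the previous
    speaker so the transcript does not flicker."""
    out = []
    for start, end, spk in segments:
        if out and end - start < MIN_TURN_MS:
            prev_start, _, prev_spk = out[-1]
            out[-1] = (prev_start, end, prev_spk)     # merge into previous label
        else:
            out.append((start, end, spk))
    return out
```

In the real pipeline the stages run on different cadences (the VAD every 10 ms, the embedder once per voiced segment), but the data flow is the same.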

What is hard on mobile

  • The embedding model has to be small enough to run alongside the ASR model and an LLM summarizer in the same memory envelope.
  • Clustering has to be incremental — we cannot re-cluster the entire meeting at every step. We use a constant-memory online agglomerative clusterer with cosine distance.
  • The UI cannot flicker as labels are refined. We animate label changes with a 200 ms crossfade rather than swapping the text.
  • Overlapping speech (two people talking at once) breaks single-label assumptions. We label such regions with the dominant speaker and mark the segment as "overlap" so downstream summarization can lower its weight.
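To make the constant-memory clustering constraint concrete, here is one way an online agglomerative clusterer with cosine distance can work: keep a single running centroid per speaker and open a new cluster when nothing is close enough. The threshold value and the centroid-averaging rule below are illustrative assumptions, not our tuned parameters.

```python
import numpy as np

class OnlineClusterer:
    """Constant-memory sketch: one unit-norm running centroid per
    speaker; memory grows with the number of speakers, not with
    meeting length."""

    def __init__(self, threshold=0.45):
        self.threshold = threshold  # max cosine distance to join a cluster
        self.centroids = []         # unit-norm running mean per speaker
        self.counts = []            # embeddings folded into each centroid

    def assign(self, emb):
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            dists = [1.0 - float(emb @ c) for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                # fold the new embedding into the running centroid
                n = self.counts[best]
                c = (self.centroids[best] * n + emb) / (n + 1)
                self.centroids[best] = c / np.linalg.norm(c)
                self.counts[best] = n + 1
                return best
        # nothing close enough: open a cluster for a new speaker
        self.centroids.append(emb)
        self.counts.append(1)
        return len(self.centroids) - 1
```

Each `assign` call is O(number of speakers), which is why it stays cheap even an hour into a meeting.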

What we ended up with

A 12 MB embedding model, a streaming clustering algorithm with a 6-second update window, and a transcript view that animates label changes instead of jumping. Median end-to-end first-label latency is 950 ms on iPhone 15 Pro, 1.6 s on Pixel 8a.
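One way to read the "6-second update window" is that only recent segments are eligible for relabeling as centroids improve, while older labels are frozen so the visible transcript stays stable. The sketch below assumes that interpretation; `refine_window`, its arguments, and the nearest-centroid rule are hypothetical names for illustration.

```python
import numpy as np

WINDOW_MS = 6000  # trailing refinement window

def refine_window(segments, embeddings, centroids, now_ms):
    """Bounded relabeling sketch: a segment whose end falls inside the
    trailing window may switch to its nearest current centroid (by
    cosine similarity); anything older keeps its existing label."""
    refined = []
    for (start, end, spk), emb in zip(segments, embeddings):
        if now_ms - end <= WINDOW_MS:
            e = emb / np.linalg.norm(emb)
            spk = int(np.argmax([float(e @ c) for c in centroids]))
        refined.append((start, end, spk))
    return refined
```

Freezing labels outside the window bounds both the compute per update and how far back the UI can ever change under a user's eyes.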

What still hurts

  • Soundalikes — the same vocal range, same accent, same gender — confuse the embedding for the first ~30 seconds. We surface this as a "still learning voices" hint in the UI.
  • Crosstalk in noisy rooms degrades sharply. We are exploring a tiny on-device source-separation pre-stage.
