Speaker diarization on a phone: a deep dive
How we run end-to-end speaker diarization in real time on a 4 W power budget — without uploading any audio.
The problem
Diarization answers "who spoke when?" It is the difference between a transcript that is a wall of text and a transcript that is a conversation. Without diarization, "let's ship it" is attributed to nobody; with diarization, it lands on a name and turns into an action item.
The pipeline
- VAD (voice activity detection) — gate everything else. We only spend cycles on actual speech. Our VAD is a 200 KB CNN on log-mel features running every 10 ms.
- Speaker embedding — extract a fixed-length 192-dim vector per voiced segment. We use an ECAPA-TDNN-style model distilled to 12 MB int8.
- Online clustering — agglomerative clustering with refinement. New evidence updates past labels.
- Smoothing — short alternations under 400 ms are merged to prevent UI flicker.
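The smoothing step can be sketched as a single pass over labeled segments: any alternation shorter than 400 ms (the threshold from the pipeline above) is absorbed into the previous speaker. The function name and `(start, end, label)` segment representation are illustrative, not our production code:

```python
MIN_TURN_S = 0.4  # alternations shorter than this get absorbed

def smooth(segments):
    """segments: sorted, non-overlapping (start_s, end_s, label) tuples."""
    out = []
    for start, end, label in segments:
        if out and out[-1][2] == label:
            # Same speaker continues: extend the previous segment.
            prev_start, _, _ = out[-1]
            out[-1] = (prev_start, end, label)
        elif out and (end - start) < MIN_TURN_S:
            # Short alternation: fold it into the previous speaker.
            prev_start, _, prev_label = out[-1]
            out[-1] = (prev_start, end, prev_label)
        else:
            out.append((start, end, label))
    return out
```

A 200 ms interjection sandwiched between two segments of the same speaker collapses the whole span into one turn, which is exactly the flicker case the UI needs to avoid.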
What is hard on mobile
- The embedding model has to be small enough to run alongside the ASR model and an LLM summarizer in the same memory envelope.
- Clustering has to be incremental — we cannot re-cluster the entire meeting at every step. We use a constant-memory online agglomerative clusterer with cosine distance.
- The UI cannot flicker as labels are refined. We animate label changes with a 200 ms crossfade rather than swapping the text.
- Overlapping speech (two people talking at once) breaks single-label assumptions. We label such regions with the dominant speaker and mark the segment as "overlap" so downstream summarization can lower its weight.
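A constant-memory online clusterer of the kind described above can be sketched with one running centroid per speaker: each new embedding either joins its nearest centroid by cosine similarity or opens a new cluster. The similarity threshold and speaker cap are assumed values, and the real refinement that updates past labels is more involved than this:

```python
import math

SIM_THRESHOLD = 0.6  # cosine similarity needed to join a speaker (assumed)
MAX_SPEAKERS = 8     # constant memory: cap on centroids (assumed)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class OnlineClusterer:
    """One running centroid per speaker; memory is O(speakers), not O(segments)."""

    def __init__(self):
        self.centroids = []  # list of (sum_vector, count)

    def assign(self, emb):
        # Find the nearest existing centroid by cosine similarity.
        best, best_sim = None, -1.0
        for i, (s, n) in enumerate(self.centroids):
            sim = cosine(emb, [x / n for x in s])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= SIM_THRESHOLD:
            # Close enough: fold the embedding into the running centroid.
            s, n = self.centroids[best]
            self.centroids[best] = ([a + b for a, b in zip(s, emb)], n + 1)
            return best
        if len(self.centroids) < MAX_SPEAKERS:
            self.centroids.append((list(emb), 1))
            return len(self.centroids) - 1
        return best  # at capacity: fall back to the nearest speaker
```

Keeping only centroid sums and counts is what makes the memory constant regardless of meeting length.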
What we ended up with
A 12 MB embedding model, a streaming clustering algorithm with a 6-second update window, and a transcript view that animates label changes instead of jumping. Median end-to-end first-label latency is 950 ms on an iPhone 15 Pro and 1.6 s on a Pixel 8a.
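One way a 6-second update window can work is to freeze labels for segments that have aged out of the window while leaving recent segments open to relabeling as clustering refines. This split is a hypothetical illustration of that idea, not the shipped logic:

```python
WINDOW_S = 6.0  # revision window from the post

def split_frozen(segments, now):
    """segments: (start_s, end_s, label) tuples. Returns (frozen, revisable).

    Segments that ended more than WINDOW_S ago are frozen; anything
    newer may still be relabeled on the next clustering update.
    """
    cutoff = now - WINDOW_S
    frozen = [s for s in segments if s[1] <= cutoff]
    revisable = [s for s in segments if s[1] > cutoff]
    return frozen, revisable
```

Bounding relabeling to the window is what keeps each update cheap while still letting new evidence correct the last few seconds of labels.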
What still hurts
- Soundalikes — the same vocal range, same accent, same gender — confuse the embedding for the first ~30 seconds. We surface this as a "still learning voices" hint in the UI.
- Accuracy degrades sharply with crosstalk in noisy rooms. We are exploring a tiny on-device source-separation pre-stage.