Engineering · Published 2026-02-12 · 11 min read

Speaker diarization on a phone: a deep dive

How we run end-to-end speaker diarization in real time on a 4 W power budget — without uploading any audio.

The problem

Diarization answers "who spoke when?" It is the difference between a transcript that is a wall of text and a transcript that is a conversation. Without diarization, "let's ship it" is attributed to nobody; with diarization, it lands on a name and turns into an action item.

The pipeline

  1. VAD (voice activity detection) — gate everything else. We only spend cycles on actual speech. Our VAD is a 200 KB CNN on log-mel features running every 10 ms.
  2. Speaker embedding — extract a fixed-length 192-dim vector per voiced segment. We use an ECAPA-TDNN-style model distilled to 12 MB int8.
  3. Online clustering — agglomerative clustering with refinement. New evidence updates past labels.
  4. Smoothing — short alternations under 400 ms are merged to prevent UI flicker.
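The four stages above can be sketched as one streaming loop. This is a minimal illustration, not our production code: `vad`, `embedder`, and `clusterer` are stand-ins for the real models, and the frame/segment bookkeeping is simplified.

```python
import numpy as np

FRAME_MS = 10      # VAD decision granularity from stage 1
MIN_TURN_MS = 400  # stage 4: alternations shorter than this are merged

def diarize_stream(frames, vad, embedder, clusterer):
    """Toy end-to-end loop: VAD gates all later work, each voiced
    segment yields one embedding, the clusterer assigns a speaker id."""
    segments = []  # (start_ms, end_ms, speaker_id)
    voiced = []    # frames of the current voiced segment
    t_ms = 0
    for frame in frames:
        if vad(frame):                                # stage 1: speech gate
            voiced.append(frame)
        elif voiced:                                  # segment just ended
            start = t_ms - FRAME_MS * len(voiced)
            emb = embedder(np.concatenate(voiced))    # stage 2: 192-dim vector
            spk = clusterer.assign(emb)               # stage 3: online clustering
            segments.append((start, t_ms, spk))
            voiced = []
        t_ms += FRAME_MS
    return smooth(segments)

def smooth(segments):
    """Stage 4: absorb turns under MIN_TURN_MS into the previous
    speaker so the transcript does not flicker."""
    out = []
    for start, end, spk in segments:
        if out and end - start < MIN_TURN_MS:
            prev_start, _, prev_spk = out[-1]
            out[-1] = (prev_start, end, prev_spk)     # merge into previous label
        else:
            out.append((start, end, spk))
    return out
```

In the real pipeline the stages run on different cadences (the VAD every 10 ms, the embedder once per voiced segment), but the data flow is the same.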

What is hard on mobile

  • The embedding model has to be small enough to run alongside the ASR model and an LLM summarizer in the same memory envelope.
  • Clustering has to be incremental — we cannot re-cluster the entire meeting at every step. We use a constant-memory online agglomerative clusterer with cosine distance.
  • The UI cannot flicker as labels are refined. We animate label changes with a 200 ms crossfade rather than swapping the text.
  • Overlapping speech (two people talking at once) breaks single-label assumptions. We label such regions with the dominant speaker and mark the segment as "overlap" so downstream summarization can lower its weight.
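To make the constant-memory clustering constraint concrete, here is one way an online agglomerative clusterer with cosine distance can work: keep a single running centroid per speaker and open a new cluster when nothing is close enough. The threshold value and the centroid-averaging rule below are illustrative assumptions, not our tuned parameters.

```python
import numpy as np

class OnlineClusterer:
    """Constant-memory sketch: one unit-norm running centroid per
    speaker; memory grows with the number of speakers, not with
    meeting length."""

    def __init__(self, threshold=0.45):
        self.threshold = threshold  # max cosine distance to join a cluster
        self.centroids = []         # unit-norm running mean per speaker
        self.counts = []            # embeddings folded into each centroid

    def assign(self, emb):
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            dists = [1.0 - float(emb @ c) for c in self.centroids]
            best = int(np.argmin(dists))
            if dists[best] < self.threshold:
                # fold the new embedding into the running centroid
                n = self.counts[best]
                c = (self.centroids[best] * n + emb) / (n + 1)
                self.centroids[best] = c / np.linalg.norm(c)
                self.counts[best] = n + 1
                return best
        # nothing close enough: open a cluster for a new speaker
        self.centroids.append(emb)
        self.counts.append(1)
        return len(self.centroids) - 1
```

Each `assign` call is O(number of speakers), which is why it stays cheap even an hour into a meeting.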

What we ended up with

A 12 MB embedding model, a streaming clustering algorithm with a 6-second update window, and a transcript view that animates label changes instead of jumping. Median end-to-end first-label latency is 950 ms on iPhone 15 Pro, 1.6 s on Pixel 8a.
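One way to read the "6-second update window" is that only recent segments are eligible for relabeling as centroids improve, while older labels are frozen so the visible transcript stays stable. The sketch below assumes that interpretation; `refine_window`, its arguments, and the nearest-centroid rule are hypothetical names for illustration.

```python
import numpy as np

WINDOW_MS = 6000  # trailing refinement window

def refine_window(segments, embeddings, centroids, now_ms):
    """Bounded relabeling sketch: a segment whose end falls inside the
    trailing window may switch to its nearest current centroid (by
    cosine similarity); anything older keeps its existing label."""
    refined = []
    for (start, end, spk), emb in zip(segments, embeddings):
        if now_ms - end <= WINDOW_MS:
            e = emb / np.linalg.norm(emb)
            spk = int(np.argmax([float(e @ c) for c in centroids]))
        refined.append((start, end, spk))
    return refined
```

Freezing labels outside the window bounds both the compute per update and how far back the UI can ever change under a user's eyes.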

What still hurts

  • Soundalikes — the same vocal range, same accent, same gender — confuse the embedding for the first ~30 seconds. We surface this as a "still learning voices" hint in the UI.
  • Crosstalk in noisy rooms degrades sharply. We are exploring a tiny on-device source-separation pre-stage.
