Journal
Engineering · Published 2026-01-22 · 7 min read

TranslateFlux: building a private, offline universal translator

Notes on translation latency, quality, and the engineering tricks that let a small model feel competitive with a much larger cloud one.

The problem

Cloud translation is excellent, fast, and free — until you are on a plane, in a foreign country with no roaming, in a hospital where sending audio off-site is a violation, or simply someone who would rather not stream conversations to a third party.

TranslateFlux is our answer: a fully on-device translator for text, voice and images, in the languages people actually need.

Why a small model can be enough

A 1–3B parameter MoE-style or distilled translation model, run in 4-bit quantization, hits a quality ceiling shockingly close to the cloud incumbents for the top 50 language pairs. The gap that remains is in long-form, high-context translation — the kind that needs 2–3 paragraphs of preceding context — and we can close most of it with retrieval over your own previous translations.
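The retrieval step can be very simple. A minimal sketch (our assumption of one workable shape, not the shipped implementation): score past source sentences by token overlap and feed the top matches back as context.

```python
from dataclasses import dataclass


@dataclass
class PastTranslation:
    source: str
    target: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve_context(query: str,
                     history: list[PastTranslation],
                     k: int = 3) -> list[PastTranslation]:
    """Return the k past translations whose source is most similar to the
    new sentence; these become the preceding context for the small model."""
    ranked = sorted(history, key=lambda p: jaccard(query, p.source), reverse=True)
    return ranked[:k]
```

In practice an embedding index would replace the bag-of-words score, but the shape is the same: a few retrieved pairs prepended to the prompt stand in for the long context a frontier model carries natively.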

Two engineering tricks make a small model feel large:

  • Glossary injection. Per-conversation, the model is given a small dictionary of terms you have used before — names, acronyms, technical vocabulary — and uses them consistently. Cloud APIs do not have this signal.
  • Style steering. A short style preamble ("formal Japanese, business email") nudges output far more cheaply than an extra billion parameters would.

Latency budget

For voice translation, our budget is 600 ms from end-of-utterance to start-of-response. We split it as:

  • ASR finalization: 100–200 ms (using WhisperFlux's streaming endpoint with confidence-triggered finalization).
  • Translation prefill + first token: 200–300 ms (the small model has small KV cache, and we keep prompts tight).
  • TTS first audio: 100–200 ms (Kokoro / Piper / on-device platform TTS).
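The arithmetic behind "tight" is worth making explicit. A small sketch that sums the per-stage ranges above against the 600 ms budget (the stage names are our labels for the bullets):

```python
BUDGET_MS = 600

# Per-stage (low, high) budgets in milliseconds, taken from the list above.
STAGES = {
    "asr_finalization": (100, 200),
    "translate_prefill_first_token": (200, 300),
    "tts_first_audio": (100, 200),
}


def budget_report(stages: dict[str, tuple[int, int]], budget_ms: int) -> dict:
    """Sum best- and worst-case stage latencies and flag budget overruns."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return {
        "best_case_ms": best,
        "worst_case_ms": worst,
        "worst_case_over_budget": worst > budget_ms,
    }
```

Best case lands at 400 ms; worst case sums to 700 ms, which overshoots the 600 ms target — every stage hitting its upper bound at once is exactly the situation the word "tight" is covering.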

Total: tight, but interactive. Subjectively, a 600 ms gap feels like a slightly polite person — and the gap a cloud translator gets to is rarely below 350 ms once network round-trips are included.

What ships first

  • 50 languages, 50×50 pairs.
  • Text, voice (push-to-talk and continuous), and image (translate-the-photo).
  • A "live conversation" mode with two parallel decoding contexts, one per direction.
  • Glossary, style and tone presets per contact / per use case.
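The "two parallel decoding contexts" bullet deserves a sketch. A minimal shape, assuming one independent context (model state, history) per translation direction — the class names and the placeholder model call are ours, not the product's API:

```python
from dataclasses import dataclass, field


@dataclass
class DecodeContext:
    """One decoding context per direction, so each side keeps its own
    conversation state (and, in the real system, its own KV cache)."""
    src_lang: str
    tgt_lang: str
    history: list[str] = field(default_factory=list)

    def translate(self, text: str) -> str:
        # Placeholder for the on-device model call; kept per-direction so
        # glossary hits and prior turns never leak across directions.
        self.history.append(text)
        return f"[{self.src_lang}->{self.tgt_lang}] {text}"


class LiveConversation:
    """Live conversation mode: two parallel contexts, one per direction."""

    def __init__(self, lang_a: str, lang_b: str):
        self.a_to_b = DecodeContext(lang_a, lang_b)
        self.b_to_a = DecodeContext(lang_b, lang_a)

    def handle(self, speaker: str, text: str) -> str:
        ctx = self.a_to_b if speaker == "a" else self.b_to_a
        return ctx.translate(text)
```

Splitting the state this way means neither direction's context is evicted when the other speaker talks, which keeps consistency cheap in back-and-forth conversation.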

What we are honest about

  • For poetic, literary, or extremely culture-bound translation, a frontier cloud model is still better. We will recommend you use one (your choice of provider), and we will never route anything to it ourselves.
  • For very low-resource pairs, quality drops. We will be explicit about which pairs we have evaluated and at what BLEU/COMET range.

That honesty is part of what we want TranslateFlux to feel like.
