Designing Flux Engine: one runtime for every product
A peek at the architecture choices behind our shared on-device inference runtime — and the constraints that shaped them.
A runtime, not a wrapper
Building one app on-device is hard. Building five is harder. Building five that share a coherent technical foundation, ship updates in lockstep and never regress on battery life — that is a runtime problem.
Flux Engine is our answer. It is a thin, opinionated layer over the best open-source runtimes in their respective lanes — speech (whisper.cpp / faster-whisper), vision (llama.cpp / mlx-vlm), language (llama.cpp / MLC-LLM, ExecuTorch on Android) — coordinated by a single scheduler that understands what is currently in memory, what the user is doing, and what the device can sustain.
The three constraints
1. Cold-start latency
Models live on disk most of the time. The user does not. Flux Engine prewarms the most likely next model based on app-usage patterns and a tiny on-device classifier. The classifier itself is a 50 KB MLP — small enough to run on every wake.
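A 50 KB MLP is small enough to evaluate on every wake. A minimal sketch of that idea, assuming a single hidden layer with ReLU and an argmax over candidate models (the TinyMlp type, layer sizes, and feature choices here are invented for illustration, not the actual classifier):

```cpp
#include <array>
#include <algorithm>
#include <cstddef>
#include <iterator>

// Illustrative sizes only; the real feature set and model list are not public.
constexpr std::size_t kFeatures = 4;   // e.g. hour-of-day, last app, battery, charging
constexpr std::size_t kHidden   = 8;
constexpr std::size_t kModels   = 3;   // e.g. ASR, vision, LLM

struct TinyMlp {
    std::array<std::array<float, kFeatures>, kHidden> w1{};
    std::array<float, kHidden> b1{};
    std::array<std::array<float, kHidden>, kModels> w2{};
    std::array<float, kModels> b2{};

    // Returns the index of the model most likely to be needed next,
    // so the engine can prewarm it before the user asks.
    std::size_t predict(const std::array<float, kFeatures>& x) const {
        std::array<float, kHidden> h{};
        for (std::size_t i = 0; i < kHidden; ++i) {
            float acc = b1[i];
            for (std::size_t j = 0; j < kFeatures; ++j) acc += w1[i][j] * x[j];
            h[i] = std::max(0.0f, acc);  // ReLU
        }
        std::array<float, kModels> out{};
        for (std::size_t i = 0; i < kModels; ++i) {
            float acc = b2[i];
            for (std::size_t j = 0; j < kHidden; ++j) acc += w2[i][j] * h[j];
            out[i] = acc;
        }
        return static_cast<std::size_t>(std::distance(
            out.begin(), std::max_element(out.begin(), out.end())));
    }
};
```

At these sizes the forward pass is a few hundred multiply-adds, cheap enough that running it on every wake costs effectively nothing next to loading the wrong model.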
2. Sustained throughput on small NPUs
We never spawn parallel workloads that compete for the same accelerator. The scheduler is queue-based and priority-aware, with a hard yield to the foreground app. ASR streams take precedence over background summarization; tap-to-translate takes precedence over a background indexing pass.
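The shape of that scheduler can be sketched as a single-consumer priority queue with FIFO ordering inside each band. The Job, JobQueue, and Priority names below are invented stand-ins, not the Flux Engine API:

```cpp
#include <queue>
#include <string>
#include <vector>

// Lower enum value = runs first: foreground ASR beats background indexing.
enum class Priority { Foreground = 0, Interactive = 1, Background = 2 };

struct Job {
    Priority priority;
    long seq;            // FIFO tie-break within a priority band
    std::string name;
};

struct JobCompare {
    bool operator()(const Job& a, const Job& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.seq > b.seq;  // earlier submission runs first
    }
};

class JobQueue {
public:
    void submit(Priority p, std::string name) {
        queue_.push(Job{p, next_seq_++, std::move(name)});
    }
    // Pops the highest-priority pending job. Because there is one consumer
    // per accelerator, background work only runs when nothing foreground
    // or interactive is queued.
    Job next() {
        Job j = queue_.top();
        queue_.pop();
        return j;
    }
    bool empty() const { return queue_.empty(); }

private:
    std::priority_queue<Job, std::vector<Job>, JobCompare> queue_;
    long next_seq_ = 0;
};
```

One queue per accelerator is the whole point: two workloads can never race each other for the same NPU, because only one is ever dispatched at a time.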
3. Graceful degradation when memory is scarce
If the user is editing a 4K video, we will not OOM their phone to run a summary. The engine reports a soft "capacity remaining" signal that products honor — features may downshift to a smaller model, defer to background, or surface a "ready when your phone cools off" affordance.
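A product-side policy over that signal might look like the following sketch. The thresholds, field names, and Action enum are illustrative assumptions, not the real report format:

```cpp
#include <cstdint>

// Three outcomes a product can honor when capacity is scarce.
enum class Action { FullModel, SmallModel, Defer };

struct Budget {
    std::uint64_t free_bytes;   // soft memory headroom reported by the engine
    float thermal_headroom;     // 0.0 = throttling now, 1.0 = cold
};

// Downshift to a smaller model when memory is tight; defer entirely when
// the device is hot ("ready when your phone cools off").
inline Action choose(const Budget& b,
                     std::uint64_t full_model_bytes,
                     std::uint64_t small_model_bytes) {
    if (b.thermal_headroom < 0.1f) return Action::Defer;
    if (b.free_bytes >= full_model_bytes) return Action::FullModel;
    if (b.free_bytes >= small_model_bytes) return Action::SmallModel;
    return Action::Defer;
}
```

The key property is that the signal is soft: the engine never kills anything, it just tells products what the device can absorb right now and lets each feature decide how to degrade.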
What Flux Engine actually exposes
A handful of C++ abstractions and a thin Swift / Kotlin / Rust binding:
- FluxSession — a long-lived inference session bound to a model + device.
- FluxStream — incremental output, cancellable mid-token.
- FluxBudget — the device-aware capacity report.
- FluxRouter — picks the right model for a request based on device tier, battery, thermal state and product preference.
That is it. Four primitives. Every Flux* product builds on top of these.
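To make the composition concrete, here is a self-contained sketch of how a product might thread these together. Every type and signature below is an invented stand-in, since the real API is not shown in this post:

```cpp
#include <cstdint>
#include <string>
#include <utility>

struct FluxBudget { std::uint64_t free_bytes; };  // stand-in capacity report

struct FluxRouter {
    // Illustrative policy: downshift below 2 GiB of headroom.
    std::string route(const std::string& task, const FluxBudget& b) const {
        if (b.free_bytes < (2ull << 30)) return task + "-small";
        return task + "-full";
    }
};

// A long-lived session bound to whatever model the router picked.
struct FluxSession {
    std::string model;
    explicit FluxSession(std::string m) : model(std::move(m)) {}
};

// Incremental output, cancellable mid-token.
struct FluxStream {
    bool cancelled = false;
    void cancel() { cancelled = true; }
};

// Typical product flow: consult the budget, route, bind a session.
inline FluxSession open_session(const FluxRouter& r, const std::string& task,
                                const FluxBudget& b) {
    return FluxSession(r.route(task, b));
}
```

The point of the small surface area is that this flow looks the same in every product: no product talks to whisper.cpp or llama.cpp directly, so runtime swaps stay invisible above the binding layer.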
What this gets us
One bug fix, one perf win, one quantization improvement — five products get better at once. One regression fix in the scheduler — five products stop spiking battery at once. The economic case for a shared runtime, on a team our size, is overwhelming.
What we got wrong (so far)
- We over-invested in a graph compiler in the first six months. Hand-tuned model loaders won.
- We under-invested in our profiling story. We rebuilt it from scratch in month nine.
- We assumed a single quantization scheme across products. We now ship per-product mixes.