Designing Flux Engine: one runtime for every product
A peek at the architecture choices behind our shared on-device inference runtime — and the constraints that shaped them.
A runtime, not a wrapper
Building one app on-device is hard. Building five is harder. Building five that share a coherent technical foundation, ship updates in lockstep and never regress on battery life — that is a runtime problem.
Flux Engine is our answer. It is a thin, opinionated layer over the best open-source runtimes in their respective lanes — speech (whisper.cpp / faster-whisper), vision (llama.cpp / mlx-vlm), language (llama.cpp / MLC-LLM, ExecuTorch on Android) — coordinated by a single scheduler that understands what is currently in memory, what the user is doing, and what the device can sustain.
The three constraints
1. Cold-start latency
Models live on disk most of the time. The user does not. Flux Engine prewarms the most likely next model based on app-usage patterns and a tiny on-device classifier. The classifier itself is a 50 KB MLP — small enough to run on every wake.
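A 50 KB MLP is small enough to evaluate on every wake. A minimal sketch of that idea, assuming a single hidden layer with ReLU and an argmax over candidate models (the TinyMlp type, layer sizes, and feature choices here are invented for illustration, not the actual classifier):

```cpp
#include <array>
#include <algorithm>
#include <cstddef>
#include <iterator>

// Illustrative sizes only; the real feature set and model list are not public.
constexpr std::size_t kFeatures = 4;   // e.g. hour-of-day, last app, battery, charging
constexpr std::size_t kHidden   = 8;
constexpr std::size_t kModels   = 3;   // e.g. ASR, vision, LLM

struct TinyMlp {
    std::array<std::array<float, kFeatures>, kHidden> w1{};
    std::array<float, kHidden> b1{};
    std::array<std::array<float, kHidden>, kModels> w2{};
    std::array<float, kModels> b2{};

    // Returns the index of the model most likely to be needed next,
    // so the engine can prewarm it before the user asks.
    std::size_t predict(const std::array<float, kFeatures>& x) const {
        std::array<float, kHidden> h{};
        for (std::size_t i = 0; i < kHidden; ++i) {
            float acc = b1[i];
            for (std::size_t j = 0; j < kFeatures; ++j) acc += w1[i][j] * x[j];
            h[i] = std::max(0.0f, acc);  // ReLU
        }
        std::array<float, kModels> out{};
        for (std::size_t i = 0; i < kModels; ++i) {
            float acc = b2[i];
            for (std::size_t j = 0; j < kHidden; ++j) acc += w2[i][j] * h[j];
            out[i] = acc;
        }
        return static_cast<std::size_t>(std::distance(
            out.begin(), std::max_element(out.begin(), out.end())));
    }
};
```

At these sizes the forward pass is a few hundred multiply-adds, cheap enough that running it on every wake costs effectively nothing next to loading the wrong model.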
2. Sustained throughput on small NPUs
We never spawn parallel workloads that compete for the same accelerator. The scheduler is queue-based and priority-aware, with a hard yield to the foreground app. ASR streams take precedence over background summarization; tap-to-translate takes precedence over a background indexing pass.
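The shape of that scheduler can be sketched as a single-consumer priority queue with FIFO ordering inside each band. The Job, JobQueue, and Priority names below are invented stand-ins, not the Flux Engine API:

```cpp
#include <queue>
#include <string>
#include <vector>

// Lower enum value = runs first: foreground ASR beats background indexing.
enum class Priority { Foreground = 0, Interactive = 1, Background = 2 };

struct Job {
    Priority priority;
    long seq;            // FIFO tie-break within a priority band
    std::string name;
};

struct JobCompare {
    bool operator()(const Job& a, const Job& b) const {
        if (a.priority != b.priority) return a.priority > b.priority;
        return a.seq > b.seq;  // earlier submission runs first
    }
};

class JobQueue {
public:
    void submit(Priority p, std::string name) {
        queue_.push(Job{p, next_seq_++, std::move(name)});
    }
    // Pops the highest-priority pending job. Because there is one consumer
    // per accelerator, background work only runs when nothing foreground
    // or interactive is queued.
    Job next() {
        Job j = queue_.top();
        queue_.pop();
        return j;
    }
    bool empty() const { return queue_.empty(); }

private:
    std::priority_queue<Job, std::vector<Job>, JobCompare> queue_;
    long next_seq_ = 0;
};
```

One queue per accelerator is the whole point: two workloads can never race each other for the same NPU, because only one is ever dispatched at a time.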
3. Graceful degradation when memory is scarce
If the user is editing a 4K video, we will not OOM their phone to run a summary. The engine reports a soft "capacity remaining" signal that products honor — features may downshift to a smaller model, defer to background, or surface a "ready when your phone cools off" affordance.
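A product-side policy over that signal might look like the following sketch. The thresholds, field names, and Action enum are illustrative assumptions, not the real report format:

```cpp
#include <cstdint>

// Three outcomes a product can honor when capacity is scarce.
enum class Action { FullModel, SmallModel, Defer };

struct Budget {
    std::uint64_t free_bytes;   // soft memory headroom reported by the engine
    float thermal_headroom;     // 0.0 = throttling now, 1.0 = cold
};

// Downshift to a smaller model when memory is tight; defer entirely when
// the device is hot ("ready when your phone cools off").
inline Action choose(const Budget& b,
                     std::uint64_t full_model_bytes,
                     std::uint64_t small_model_bytes) {
    if (b.thermal_headroom < 0.1f) return Action::Defer;
    if (b.free_bytes >= full_model_bytes) return Action::FullModel;
    if (b.free_bytes >= small_model_bytes) return Action::SmallModel;
    return Action::Defer;
}
```

The key property is that the signal is soft: the engine never kills anything, it just tells products what the device can absorb right now and lets each feature decide how to degrade.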
What Flux Engine actually exposes
A handful of C++ abstractions and a thin Swift / Kotlin / Rust binding:
- FluxSession — a long-lived inference session bound to a model + device.
- FluxStream — incremental output, cancellable mid-token.
- FluxBudget — the device-aware capacity report.
- FluxRouter — picks the right model for a request based on device tier, battery, thermal state and product preference.
That is it. Four primitives. Every Flux* product builds on top of these.
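To make the composition concrete, here is a self-contained sketch of how a product might thread these together. Every type and signature below is an invented stand-in, since the real API is not shown in this post:

```cpp
#include <cstdint>
#include <string>
#include <utility>

struct FluxBudget { std::uint64_t free_bytes; };  // stand-in capacity report

struct FluxRouter {
    // Illustrative policy: downshift below 2 GiB of headroom.
    std::string route(const std::string& task, const FluxBudget& b) const {
        if (b.free_bytes < (2ull << 30)) return task + "-small";
        return task + "-full";
    }
};

// A long-lived session bound to whatever model the router picked.
struct FluxSession {
    std::string model;
    explicit FluxSession(std::string m) : model(std::move(m)) {}
};

// Incremental output, cancellable mid-token.
struct FluxStream {
    bool cancelled = false;
    void cancel() { cancelled = true; }
};

// Typical product flow: consult the budget, route, bind a session.
inline FluxSession open_session(const FluxRouter& r, const std::string& task,
                                const FluxBudget& b) {
    return FluxSession(r.route(task, b));
}
```

The point of the small surface area is that this flow looks the same in every product: no product talks to whisper.cpp or llama.cpp directly, so runtime swaps stay invisible above the binding layer.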
What this gets us
One bug fix, one perf win, one quantization improvement — five products get better at once. One regression fix in the scheduler — five products stop spiking battery at once. The economic case for a shared runtime, on a team our size, is overwhelming.
What we got wrong (so far)
- We over-invested in a graph compiler in the first six months. Hand-tuned model loaders won.
- We under-invested in our profiling story. We rebuilt it from scratch in month nine.
- We assumed a single quantization scheme across products. We now ship per-product mixes.