Engineering · Published 2026-04-08 · 9 min read

Designing Flux Engine: one runtime for every product

A peek at the architecture choices behind our shared on-device inference runtime — and the constraints that shaped them.

A runtime, not a wrapper

Building one app on-device is hard. Building five is harder. Building five that share a coherent technical foundation, ship updates in lockstep and never regress on battery life — that is a runtime problem.

Flux Engine is our answer. It is a thin, opinionated layer over the best open-source runtimes in their respective lanes — speech (whisper.cpp / faster-whisper), vision (llama.cpp / mlx-vlm), language (llama.cpp / MLC-LLM, ExecuTorch on Android) — coordinated by a single scheduler that understands what is currently in memory, what the user is doing and what the device can sustain.
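To make the "single scheduler that understands what is currently in memory" idea concrete, here is a minimal sketch of a residency-aware admission check. Everything here — the class name, the LRU policy, the megabyte accounting — is an illustrative assumption, not Flux Engine's actual internals.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: a scheduler that tracks which model weights are
// resident in memory and only admits a job once its model fits, evicting
// the least-recently-used resident model when it must.
class ResidencyScheduler {
public:
    explicit ResidencyScheduler(size_t budget_mb) : budget_mb_(budget_mb) {}

    // Returns true if the model is (now) resident and the job may run.
    bool admit(const std::string& model, size_t size_mb) {
        if (resident_.count(model)) { resident_[model] = ++clock_; return true; }
        while (used_mb_ + size_mb > budget_mb_ && !resident_.empty()) evict_lru();
        if (used_mb_ + size_mb > budget_mb_) return false;  // never fits
        resident_[model] = ++clock_;
        sizes_[model] = size_mb;
        used_mb_ += size_mb;
        return true;
    }

    bool is_resident(const std::string& model) const {
        return resident_.count(model) > 0;
    }

private:
    void evict_lru() {
        auto lru = resident_.begin();
        for (auto it = resident_.begin(); it != resident_.end(); ++it)
            if (it->second < lru->second) lru = it;
        used_mb_ -= sizes_[lru->first];
        sizes_.erase(lru->first);
        resident_.erase(lru);
    }

    size_t budget_mb_, used_mb_ = 0;
    long clock_ = 0;
    std::map<std::string, long> resident_;  // model -> last-use tick
    std::map<std::string, size_t> sizes_;   // model -> footprint in MB
};
```

The point of the sketch is the single choke point: every backend asks one object before touching the accelerator, so no two lanes can independently blow the memory budget.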

The three constraints

1. Cold-start latency

Models live on disk most of the time. The user does not. Flux Engine prewarms the most likely next model based on app-usage patterns and a tiny on-device classifier. The classifier itself is a 50 KB MLP — small enough to run on every wake.
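A 50 KB MLP is small enough to write out by hand. The forward pass below is a sketch of what "run on every wake" can look like; the feature layout, layer sizes and weights are invented for illustration and say nothing about the real classifier.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative only: a tiny two-layer MLP scoring which model to prewarm
// next from a few usage features (e.g. hour of day, last app, battery).
constexpr std::size_t kIn = 3, kHidden = 4, kOut = 3;  // 3 candidate models

std::array<float, kOut> prewarm_scores(const std::array<float, kIn>& x,
                                       const float (&w1)[kHidden][kIn],
                                       const float (&w2)[kOut][kHidden]) {
    std::array<float, kHidden> h{};
    for (std::size_t j = 0; j < kHidden; ++j) {
        float s = 0;
        for (std::size_t i = 0; i < kIn; ++i) s += w1[j][i] * x[i];
        h[j] = s > 0 ? s : 0;  // ReLU
    }
    std::array<float, kOut> y{};
    for (std::size_t k = 0; k < kOut; ++k)
        for (std::size_t j = 0; j < kHidden; ++j) y[k] += w2[k][j] * h[j];
    return y;
}

// Index of the highest-scoring candidate = the model worth prewarming.
std::size_t argmax(const std::array<float, kOut>& y) {
    std::size_t best = 0;
    for (std::size_t k = 1; k < kOut; ++k) if (y[k] > y[best]) best = k;
    return best;
}
```

At this size there is no tensor library, no allocation and no accelerator involved — which is exactly why it is cheap enough to run on every wake.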

2. Sustained throughput on small NPUs

We never spawn parallel workloads that compete for the same accelerator. The scheduler is queue-based and priority-aware, with a hard yield to the foreground app. ASR streams take precedence over background summarization; tap-to-translate takes precedence over a background indexing pass.
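The ordering described above — ASR over summarization, tap-to-translate over indexing — can be sketched as a plain priority queue. The tier names and job struct are illustrative assumptions, not Flux Engine's actual job taxonomy.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hedged sketch of the priority ordering: lower tier number = higher priority.
enum class Tier { ForegroundASR = 0, TapToTranslate = 1,
                  BackgroundSummarize = 2, BackgroundIndex = 3 };

struct Job {
    Tier tier;
    std::string name;
};

// std::priority_queue is a max-heap by default; invert the comparison so
// the lowest tier number surfaces first.
struct LowerTierFirst {
    bool operator()(const Job& a, const Job& b) const {
        return static_cast<int>(a.tier) > static_cast<int>(b.tier);
    }
};

using JobQueue = std::priority_queue<Job, std::vector<Job>, LowerTierFirst>;

// Pops the highest-priority job. Because the queue is drained strictly in
// tier order, background work never runs ahead of a queued foreground
// stream — the "hard yield to the foreground app".
Job next_job(JobQueue& q) {
    Job j = q.top();
    q.pop();
    return j;
}
```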

3. Graceful degradation when memory is scarce

If the user is editing a 4K video, we will not OOM their phone to run a summary. The engine reports a soft "capacity remaining" signal that products honor — features may downshift to a smaller model, defer to background, or surface a "ready when your phone cools off" affordance.
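The degradation ladder products walk down can be sketched as a small policy function. The thresholds and the shape of the "capacity remaining" signal are assumptions for illustration; the post does not specify them.

```cpp
#include <cassert>

// Sketch of a degradation policy over the soft capacity signal.
enum class Plan { FullModel, SmallModel, DeferToBackground, WaitForCooldown };

// capacity_remaining: 0.0 (nothing to spare) .. 1.0 (device idle).
Plan pick_plan(float capacity_remaining, bool thermally_throttled) {
    if (thermally_throttled)
        return Plan::WaitForCooldown;          // "ready when your phone cools off"
    if (capacity_remaining > 0.6f)
        return Plan::FullModel;
    if (capacity_remaining > 0.3f)
        return Plan::SmallModel;               // downshift to a smaller model
    return Plan::DeferToBackground;            // e.g. user is editing 4K video
}
```

The key design point survives the simplification: the engine reports capacity, but the product decides how to degrade, so each feature can pick the downshift that makes sense for its UX.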

What Flux Engine actually exposes

A handful of C++ abstractions and thin Swift, Kotlin and Rust bindings:

  • FluxSession — a long-lived inference session bound to a model + device.
  • FluxStream — incremental output, cancellable mid-token.
  • FluxBudget — the device-aware capacity report.
  • FluxRouter — picks the right model for a request based on device tier, battery, thermal state and product preference.

That is it. Four primitives. Every Flux* product builds on top of these.
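To show how the four primitives might fit together, here is a plain-C++ sketch. The real headers are not shown in this post, so every field, signature and model name below is an invented assumption.

```cpp
#include <cassert>
#include <functional>
#include <string>

// FluxBudget: the device-aware capacity report (fields assumed).
struct FluxBudget {
    float memory_fraction;     // 0..1 of the engine memory budget still free
    bool thermally_throttled;
};

// FluxStream: incremental output, cancellable mid-token (shape assumed).
struct FluxStream {
    std::function<void(const std::string&)> on_token;
    bool cancelled = false;
};

// FluxSession: a long-lived session bound to a model + device (shape assumed).
struct FluxSession {
    std::string model;
    std::string device;        // e.g. "npu", "gpu", "cpu"
};

// FluxRouter as a function: pick a model from device tier + budget.
// Model names and tier cutoffs are purely illustrative.
std::string route(const FluxBudget& b, int device_tier) {
    if (b.thermally_throttled || b.memory_fraction < 0.2f) return "lang-1b-q4";
    return device_tier >= 2 ? "lang-7b-q4" : "lang-3b-q4";
}
```

The sketch mirrors the division of labor in the list above: the budget describes the device, the router turns that into a model choice, and the session and stream carry the actual inference.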

What this gets us

One bug fix, one perf win, one quantization improvement — five products get better at once. One regression fix in the scheduler — five products stop spiking battery at once. The economic case for a shared runtime, on a team our size, is overwhelming.

What we got wrong (so far)

  • We over-invested in a graph compiler in the first six months. Hand-tuned model loaders won.
  • We under-invested in our profiling story. We rebuilt it from scratch in month nine.
  • We assumed a single quantization scheme across products. We now ship per-product mixes.
