Journal · Engineering · Published 2026-03-11 · 8 min read

Squeezing a 7B model onto your phone: a quantization field guide

Q4_K_M, AWQ, GPTQ, SmoothQuant — what actually matters when you only have 4 GB of RAM and a 4 W power budget.

Quantization is the whole game on mobile

The bottleneck on mobile is not FLOPs — it is bytes. A 7B-parameter model in fp16 is 14 GB. Your phone has 4–8 GB total, of which you can realistically borrow 1.5–3 GB for an AI workload before the OS starts paging out the foreground app. The arithmetic does not work without aggressive quantization.
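The arithmetic above fits in a few lines. The figures below are raw weight storage only, ignoring activations, the K/V cache, and quantization metadata such as scales:

```python
def model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in bytes: one number per parameter at the given width."""
    return n_params * bits_per_weight / 8

GB = 1e9  # decimal gigabytes, matching the 14 GB figure above
n = 7e9   # 7B parameters

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {model_bytes(n, bits) / GB:.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB; only int4 is near a phone's budget
```

Even 3.5 GB is above the 1.5–3 GB a phone will realistically lend you, which is why sub-4-bit schemes and a quantized K/V cache both matter.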

What we actually ship

  • Q4_K_M (k-quants, mixed groupwise) for general LLM workloads — the best quality-per-byte we have measured under 4 bits per weight, with surprisingly graceful failure modes on long-tail tokens.
  • AWQ-int4 (activation-aware weight quantization) for code and mathematical reasoning, where activation outliers matter and naive RTN quantization shows visible degradation.
  • fp16 hot weights for the embedding table and the final lm_head projection — these two layers account for a small fraction of total weights but a disproportionate share of perceived quality.
  • int8 K/V cache with a per-token scale, keeping the cache budget tractable without the catastrophic loss seen at int4 KV.
  • GGUF as the on-disk container so we can swap quantization schemes without re-engineering the loader.
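The per-token int8 K/V-cache scheme in the list above can be sketched as follows. This is a minimal illustration that collapses the layer and head dimensions into one flat vector per token; the names are illustrative, not our production kernels:

```python
def quantize_token(vec):
    """Quantize one token's K (or V) vector to int8 with a single per-token scale."""
    amax = max(abs(x) for x in vec) or 1e-8  # guard against an all-zero vector
    scale = amax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize_token(q, scale):
    return [x * scale for x in q]

vec = [0.03, -1.2, 0.5, 0.0]
q, s = quantize_token(vec)
approx = dequantize_token(q, s)
# Per-element reconstruction error is bounded by scale / 2.
```

One scale per token keeps the overhead tiny (one float per cached token) while letting each token use the full int8 range, which is what keeps int8 KV well-behaved where int4 KV collapses.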

Perplexity is not the right metric

Users do not care about next-token cross-entropy. They care about whether the summary is correct, whether the translation reads naturally, whether the diff message describes the change. We built a small in-house eval harness that scores end-task quality on real meeting transcripts, real menu translations, and real commit diffs. It disagrees with perplexity about 18% of the time. Sometimes a quantization scheme that costs 0.3 bits of next-token cross-entropy wins 4 points of summarization quality, because the failure mode shifts from "subtly worse next token" to "fewer wild excursions."
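One way to quantify that disagreement is to compare every pair of candidate schemes and count the pairs the two metrics rank in opposite order. The scores below are made-up illustrative numbers, not our measurements, and this is a sketch rather than our actual harness:

```python
from itertools import combinations

def disagreement_rate(ppl, task):
    """Fraction of scheme pairs the two metrics rank oppositely.
    ppl: lower is better; task score: higher is better."""
    pairs = list(combinations(range(len(ppl)), 2))
    disagree = sum(
        1 for i, j in pairs
        if (ppl[i] < ppl[j]) != (task[i] > task[j])
    )
    return disagree / len(pairs)

ppl  = [5.1, 5.4, 6.0]    # schemes A, B, C (illustrative)
task = [71.0, 74.0, 65.0]  # B beats A on the task despite worse perplexity
print(disagreement_rate(ppl, task))  # one of three pairs flips
```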

What surprised us

  • Group size matters more than bit width for some layers. Shrinking the FFN group size from 128 to 64 recovered more quality than raising the attention weights from 4 to 5 bits.
  • Outliers cluster. A handful of channels in every transformer block carry disproportionate activation magnitude. Treating them with a higher precision branch (à la SmoothQuant or per-channel scales) is more efficient than uniformly raising the bit width.
  • Calibration data is leverage. A 256-sample calibration set drawn from the actual product workload beats an 8K-sample set drawn from C4 every time.
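The group-size effect is easy to reproduce with a toy round-to-nearest quantizer. This is a generic RTN sketch, not the k-quants or AWQ kernels we ship: a single large-magnitude channel inflates its group's scale, and smaller groups confine the damage to fewer neighbors.

```python
import random

def rtn_groupwise(weights, group_size, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric int4
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = (max(abs(w) for w in group) or 1e-8) / qmax
        out.extend(round(w / scale) * scale for w in group)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
w = [random.gauss(0, 1) for _ in range(256)]
w[10] = 8.0  # one outlier channel, as in the bullet above

err_64  = mse(w, rtn_groupwise(w, 64))
err_128 = mse(w, rtn_groupwise(w, 128))
print(err_64 < err_128)  # smaller groups confine the outlier's inflated scale
```

With group size 128, the outlier stretches the scale for 128 weights; with group size 64, only 64 weights pay that cost, so total error drops even though the bit width is unchanged.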

The takeaway

Do not pick a quantization scheme by paper. Pick it by your eval — and your eval should look exactly like your product.
