Why Edge ML Inference is the Next Frontier

March 20, 2025

Hot take: if your ML model needs a round trip to the cloud to make a prediction, you're building yesterday's product.

I'm Usamah Zaheer, and I work on ML inference optimisation at Arm. Before that, I deployed perception models on actual robots at Dyson. The gap between what works in a Jupyter notebook and what runs on a 4-watt chip is where the real engineering lives — and it's where I've spent the last several years of my career.

This post is the most comprehensive thing I've written on the subject. I'm going to walk through why edge inference matters, the gnarly engineering problems underneath it, and where I think this is all heading.

The latency argument is just the beginning

Yeah, edge inference is faster. You cut out the network hop and get sub-millisecond predictions. But that's the obvious part. The real reasons edge ML matters:

Privacy by architecture. When your model runs on-device, user data never leaves the hardware. You don't need to write a privacy policy for data you never collect. That's not a feature — that's a fundamentally different trust model. In a world where GDPR fines are measured in billions and users are increasingly privacy-conscious, on-device inference isn't just nice to have — it's a competitive moat.

Reliability. Your cloud-dependent model is one DNS outage away from being a very expensive paperweight. Edge models work in airplane mode, in a factory basement, in the middle of the ocean. I've seen production systems go down because someone's WiFi was flaky. Edge doesn't care. When I was at Dyson, the robots couldn't pause and wait for a server response while navigating a room — they needed perception that worked regardless of connectivity.

Cost at scale. Run inference for a million users in the cloud and your CFO will have questions. Run it on-device and your marginal cost per user approaches zero. The math gets really compelling really fast. I've seen teams spend more on their inference API bills than on their entire engineering headcount. That's not sustainable, and it's not necessary.

Sovereignty and compliance. For industries like healthcare, defence, and automotive, data often can't leave the device or the country. Edge inference solves the compliance problem at the architecture level rather than the policy level.

The quantization deep dive

Going from FP32 to INT8 without destroying your model's accuracy is the bread and butter of edge ML engineering. I've spent weeks — sometimes months — tuning quantization parameters for a single model. Here's what I've learned.

Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT). PTQ is the fast path: you take a trained model, calibrate it with a representative dataset, and convert the weights and activations to lower precision. It works surprisingly well for CNNs and many transformer architectures. But when PTQ drops accuracy below your threshold, QAT is the answer — you simulate quantization during training so the model learns to be robust to reduced precision. QAT typically recovers 1-3 percentage points of the accuracy lost to PTQ, which can be the difference between shipping and not shipping.
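To make the PTQ calibration step concrete, here's a minimal sketch of the core mechanic: track the activation range over calibration batches, derive an affine scale and zero-point, and round-trip values through INT8. The function names and toy data are my own, not any framework's API:

```python
# Illustrative post-training quantization: calibrate an activation range,
# then quantize/dequantize to INT8. Toy data, not a real calibration set.

def calibrate(batches):
    """Track the min/max seen across calibration batches."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    return lo, hi

def affine_params(lo, hi, n_bits=8):
    """Scale and zero-point for asymmetric affine quantization."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zp):
    return [max(0, min(255, round(v / scale + zp))) for v in x]

def dequantize(q, scale, zp):
    return [(v - zp) * scale for v in q]

# Two tiny batches stand in for "a representative dataset".
batches = [[-1.0, 0.2, 0.9], [-0.5, 1.5, 0.0]]
lo, hi = calibrate(batches)                 # (-1.0, 1.5)
scale, zp = affine_params(lo, hi)
x = [-0.9, 0.0, 1.4]
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
max_err = max(abs(a - b) for a, b in zip(x, x_hat))
assert max_err <= scale  # error bounded by one quantization step
```

Values that fall outside the calibrated range get clipped, which is exactly why the calibration set needs to be representative.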

Per-channel vs. per-tensor quantization. Per-tensor quantization uses a single scale factor for an entire weight tensor. Simple, fast, but lossy. Per-channel quantization assigns a scale factor to each output channel of a convolution or each row of a linear layer. The overhead is minimal, but the accuracy improvement is significant — especially for depthwise separable convolutions, which are the backbone of most efficient architectures like MobileNet and EfficientNet.
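The effect is easy to demonstrate with a toy weight matrix of my own invention: a channel with a small weight range gets crushed under a single shared scale, but survives with its own:

```python
# Per-tensor vs per-channel symmetric INT8 weight quantization, on a
# made-up two-channel weight matrix with very different ranges.

def sym_scale(vals, n_bits=8):
    """Symmetric scale from the max absolute value."""
    return max(abs(v) for v in vals) / (2 ** (n_bits - 1) - 1)

def quant_dequant(vals, scale):
    return [round(v / scale) * scale for v in vals]

def total_error(rows, scales):
    """Sum of absolute round-trip errors, one scale per row."""
    return sum(abs(v - q)
               for row, s in zip(rows, scales)
               for v, q in zip(row, quant_dequant(row, s)))

weights = [[4.0, -3.5, 2.0],        # wide-range output channel
           [0.02, -0.015, 0.01]]    # narrow-range output channel

per_tensor = sym_scale([v for row in weights for v in row])
err_tensor = total_error(weights, [per_tensor, per_tensor])
err_channel = total_error(weights, [sym_scale(r) for r in weights])
assert err_channel < err_tensor  # per-channel preserves the small channel
```

Depthwise separable convolutions hit exactly this situation: each depthwise filter sees different input statistics, so the per-channel ranges diverge wildly.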

Mixed-precision quantization. Not all layers are equally sensitive to quantization. The first and last layers of a network, attention layers in transformers, and layers with small weight ranges tend to need higher precision. Mixed-precision approaches keep sensitive layers in FP16 while quantizing everything else to INT8 or even INT4. The trick is identifying which layers are sensitive — you can use sensitivity analysis (quantize one layer at a time and measure accuracy impact), Hessian-based methods, or learned approaches like HAQ.
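The simplest of those approaches, one-layer-at-a-time sensitivity analysis, can be sketched like this. The two-layer network, its random weights, and the 4-bit fake-quantization are all toy stand-ins of mine; a real sweep would measure task accuracy, not output MSE:

```python
import numpy as np

# One-layer-at-a-time sensitivity analysis on a toy two-layer network:
# quantize just one layer, measure how much the output moves.

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))
layers = [rng.normal(size=(8, 8)), rng.normal(scale=0.01, size=(8, 4))]

def forward(x, ws):
    h = x
    for w in ws[:-1]:
        h = np.maximum(h @ w, 0.0)   # ReLU between layers
    return h @ ws[-1]

def fake_quant(w, n_bits=4):
    """Symmetric round-trip through n_bits, back to float."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

ref = forward(x, layers)
sensitivity = []
for i in range(len(layers)):
    ws = [fake_quant(w) if j == i else w for j, w in enumerate(layers)]
    sensitivity.append(float(np.mean((forward(x, ws) - ref) ** 2)))

# The most sensitive layer is the one to keep in higher precision.
most_sensitive = int(np.argmax(sensitivity))
```

The same loop scales to real models; the cost is one evaluation pass per layer, which is why Hessian-based and learned methods exist for very deep networks.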

The tooling landscape. TensorRT handles quantization well for NVIDIA hardware. ONNX Runtime's quantization tools are hardware-agnostic and improving rapidly. For Arm hardware, ArmNN and the Arm Compute Library (ACL) provide optimised INT8 and FP16 kernels. PyTorch's native quantization API has matured significantly, and tools like AIMET from Qualcomm offer advanced techniques like AdaRound and cross-layer equalization. Each tool has its strengths, and in practice, you often use multiple tools in a single deployment pipeline.

The key insight is that quantization isn't a one-shot process. It's an iterative loop: quantize, measure, profile, adjust. The engineers who treat it as a checkbox ("we quantized the model, ship it") are the ones who end up with models that are fast but wrong.

Memory is the real bottleneck

Everyone talks about compute — TOPS, FLOPS, operations per second. But on edge devices, memory bandwidth is usually what kills you first. A model might need 2 billion multiply-accumulate operations per inference, but if those operations are bottlenecked by how fast you can feed data to the compute units, all those TOPS are wasted.

Cache hierarchies matter. Modern Arm processors have multi-level cache hierarchies: L1 (fast, small, ~64KB), L2 (slower, larger, ~256KB-1MB), and sometimes L3. If your working set fits in L1, you're golden. If it spills to L2 or main memory, you can see 10-100x latency increases for memory accesses. Tiling your computations — breaking large matrix multiplications into cache-friendly chunks — is essential. This is something ML compilers handle, but understanding why it matters makes you a better engineer.
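Tiling is easiest to see in code. This is a deliberately naive sketch: the tile size of 32 is arbitrary, where a real kernel would derive it from the cache geometry, and NumPy is standing in for the inner microkernel:

```python
import numpy as np

# Cache-friendly tiling: the same matmul computed in TILE x TILE blocks
# so each block's working set can stay resident in a small cache.

def tiled_matmul(a, b, tile=32):
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # One small block of A times one small block of B;
                # slicing handles ragged edge tiles automatically.
                out[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile]
                    @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return out

rng = np.random.default_rng(1)
a, b = rng.normal(size=(96, 64)), rng.normal(size=(64, 80))
assert np.allclose(tiled_matmul(a, b), a @ b)
```

Same arithmetic, same result; the only thing that changed is the order of memory accesses, which is the whole point.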

Bandwidth calculations. Here's a back-of-envelope calculation that I do constantly: A typical edge SoC might have 8-16 GB/s of memory bandwidth. A ResNet-50 in FP32 has ~100MB of weights. At 30fps, you need to read those weights 30 times per second — that's 3 GB/s just for weight reads, before accounting for activations, input data, or anything else. Quantize to INT8 and you cut that to 750 MB/s. That's the difference between "works" and "doesn't work."
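That arithmetic is simple enough to write down directly, using roughly 25M parameters for ResNet-50:

```python
# The back-of-envelope weight-bandwidth arithmetic from the paragraph
# above: bytes streamed per second just to read the weights every frame.

def weight_bandwidth_gbs(n_params, bytes_per_weight, fps):
    """GB/s needed to stream the weights once per inference at `fps`."""
    return n_params * bytes_per_weight * fps / 1e9

fp32 = weight_bandwidth_gbs(25_000_000, 4, 30)
int8 = weight_bandwidth_gbs(25_000_000, 1, 30)
assert fp32 == 3.0    # GB/s, FP32 weights at 30fps
assert int8 == 0.75   # GB/s, same model in INT8
```

Activations, input frames, and intermediate buffers all add to this, so the real number is worse; the weights alone are enough to sink an 8 GB/s SoC.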

Operator fusion. The standard deep learning graph has a convolution, followed by batch normalisation, followed by ReLU. Naively, each operation reads its input from memory and writes its output back to memory. Operator fusion combines these into a single kernel that reads once, does all three operations, and writes once. This can reduce memory traffic by 2-3x for common patterns. It sounds simple, but getting fusion right across the full zoo of deep learning operators is a massive engineering effort — one that teams at Arm work on continuously.
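The conv + batch norm half of that pattern can even be fused algebraically, by folding the normalisation into the preceding layer's weights and bias. Here's a sketch for a linear layer (equivalently a 1x1 conv); the per-channel math for a full convolution is identical, and all the tensors are random stand-ins:

```python
import numpy as np

# Folding batch norm into the preceding layer: two memory passes become one.
rng = np.random.default_rng(2)
w = rng.normal(size=(16, 8))             # (out_channels, in_channels)
b = rng.normal(size=16)
gamma, beta = rng.normal(size=16), rng.normal(size=16)
mean, var, eps = rng.normal(size=16), rng.uniform(0.5, 2.0, size=16), 1e-5

def conv_bn(x):
    """Unfused reference: linear layer followed by batch norm."""
    y = x @ w.T + b
    return gamma * (y - mean) / np.sqrt(var + eps) + beta

# Fold: w' = w * gamma / sqrt(var + eps) per output channel,
#       b' = (b - mean) * gamma / sqrt(var + eps) + beta.
s = gamma / np.sqrt(var + eps)
w_fused = w * s[:, None]
b_fused = (b - mean) * s + beta

x = rng.normal(size=(4, 8))
assert np.allclose(conv_bn(x), x @ w_fused.T + b_fused)
```

Fusing the ReLU on top is then just a matter of applying the clamp inside the same kernel before the output is written back.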

Memory planning and scheduling. When you have a fixed amount of SRAM (say, 2MB on a microcontroller), you need to plan exactly when each tensor is allocated and freed. Two activations that are never alive at the same time can share the same memory. This is essentially a graph colouring problem, and getting it right can mean the difference between a model fitting on your target hardware and needing to move to a more expensive chip.
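A minimal version of that planner fits in a few lines: give each tensor a live range, then greedily place larger tensors first, bumping the offset only past tensors whose lifetimes actually overlap. The tensor names, sizes, and lifetimes below are invented, and a production planner would be considerably smarter:

```python
# Greedy lifetime-based memory planning: tensors whose live ranges don't
# overlap can share the same arena offset.

# (name, first_use_step, last_use_step, size_in_bytes)
tensors = [("input", 0, 1, 300_000), ("act1", 1, 2, 400_000),
           ("act2", 2, 3, 400_000), ("act3", 3, 4, 200_000)]

def plan(tensors):
    placed = []   # (start, end, offset, size)
    offsets = {}
    for name, start, end, size in sorted(tensors, key=lambda t: -t[3]):
        offset = 0
        # Bump past already-placed tensors that overlap in both
        # lifetime and address range.
        for s, e, o, sz in sorted(placed, key=lambda p: p[2]):
            if start <= e and s <= end and o < offset + size and offset < o + sz:
                offset = o + sz
        placed.append((start, end, offset, size))
        offsets[name] = offset
    arena = max(o + sz for _, _, o, sz in placed)
    return offsets, arena

offsets, arena = plan(tensors)
assert arena < sum(t[3] for t in tensors)  # sharing beats naive allocation
```

Here "input" and "act3" reuse memory freed by earlier tensors, so the arena is 800KB instead of the 1.3MB a naive allocator would need. On a 2MB microcontroller, that difference is the whole ballgame.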

The ML compiler stack

Most ML engineers interact with PyTorch or TensorFlow and never think about what happens between model.forward() and actual hardware execution. But there's a deep and fascinating stack in between, and understanding it gives you superpowers. I wrote a dedicated deep dive on ML compiler optimisation, but here's the overview.

The compilation pipeline looks roughly like this: PyTorch model → export to an intermediate representation (ONNX, TorchScript, or a framework-specific IR) → graph-level optimisations (constant folding, dead code elimination, operator fusion) → lowering to hardware-specific kernels → memory planning → final binary or runtime-loadable artifact.

Graph-level optimisations operate on the computation graph before any hardware-specific decisions are made. They include things like constant folding (precomputing operations on static inputs), algebraic simplification (replacing expensive operations with cheaper equivalents), and layout transformations (converting between NCHW and NHWC memory layouts depending on what the hardware prefers).
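Constant folding is the easiest of these to show end to end. Here's a toy pass over an invented expression IR of mine; real compilers do the same walk over a much richer graph:

```python
# A toy graph-level pass: constant folding on a tiny expression IR.
# Nodes are (op, *arg_names); the graph dict is topologically ordered.

def fold_constants(graph):
    """Replace ops whose inputs are all constants with precomputed consts."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    folded = {}
    for name, node in graph.items():
        if node[0] == "const":
            folded[name] = node
        elif node[0] in ops and all(folded[a][0] == "const" for a in node[1:]):
            value = ops[node[0]](*(folded[a][1] for a in node[1:]))
            folded[name] = ("const", value)   # computed at compile time
        else:
            folded[name] = node               # depends on runtime input
    return folded

graph = {
    "two":   ("const", 2.0),
    "three": ("const", 3.0),
    "scale": ("mul", "two", "three"),   # static: foldable
    "x":     ("input",),
    "y":     ("mul", "x", "scale"),     # needs runtime input: kept
}
out = fold_constants(graph)
assert out["scale"] == ("const", 6.0)
assert out["y"][0] == "mul"
```

The payoff on real models is things like pre-multiplied normalisation constants and pre-transposed weights: work done once at compile time instead of on every inference.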

Kernel selection is where the rubber meets the road. For a given operation (say, a 3x3 convolution with 256 input channels and 512 output channels), there might be a dozen possible implementations: direct convolution, im2col + GEMM, Winograd, FFT-based, and hardware-specific instructions. The "right" kernel depends on the exact dimensions, the data type, the target hardware, and what else is happening in the pipeline. ArmNN and the Arm Compute Library maintain extensive kernel libraries, and newer projects like KleidiAI are pushing the boundaries of what's possible with hand-tuned assembly for specific Arm architectures.
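To make "a dozen possible implementations" less abstract, here's the im2col + GEMM trick next to direct convolution, for a single-channel case with stride 1 and no padding. Both compute the same thing; they just stress the memory system differently:

```python
import numpy as np

# im2col + GEMM: each input patch becomes a row of a matrix, turning
# the convolution into a single matrix multiply.

def conv2d_direct(x, w):
    """Direct convolution. x: (H, W), w: (kh, kw), valid padding."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def conv2d_im2col(x, w):
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    cols = np.stack([x[i:i + kh, j:j + kw].ravel()
                     for i in range(oh) for j in range(ow)])
    return (cols @ w.ravel()).reshape(oh, ow)  # one big GEMM

rng = np.random.default_rng(3)
x, w = rng.normal(size=(8, 8)), rng.normal(size=(3, 3))
assert np.allclose(conv2d_direct(x, w), conv2d_im2col(x, w))
```

im2col duplicates input data (every pixel appears in up to kh*kw patches), trading memory for access to highly tuned GEMM kernels, which is exactly the kind of trade-off the kernel selector weighs.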

Auto-tuning takes this further by searching the space of possible implementations for each operation and measuring which one is actually fastest on the target hardware. TVM's AutoTVM, Meta's AITemplate, and similar projects have shown that auto-tuned kernels can significantly outperform hand-written ones — sometimes by 2-3x. The catch is that auto-tuning is computationally expensive and needs to be done for each hardware target.

Deploying across the Arm ecosystem

One of the unique challenges — and opportunities — of working at Arm is the sheer breadth of the hardware ecosystem. Arm's architecture spans everything from tiny Cortex-M microcontrollers running at a few hundred MHz to high-performance Cortex-X cores in flagship smartphones.

Cortex-A series powers most smartphones and many edge computing devices. These cores have NEON SIMD units and, increasingly, dedicated matrix multiplication instructions (like the I8MM and BF16 extensions in Armv8.6+). For ML inference, you're typically working with models in the tens-of-megabytes range: MobileNets, EfficientNets, small transformers.

Cortex-M series is the microcontroller end of the spectrum. We're talking about devices with 256KB-2MB of SRAM and no operating system. Running ML on these devices requires extreme optimisation: models need to be tens of kilobytes, operations need to be implemented in hand-tuned assembly, and every byte of memory needs to be carefully planned. CMSIS-NN and TensorFlow Lite Micro are the key frameworks here.

Mali GPUs provide parallel compute for ML workloads on mobile and embedded devices. They're particularly effective for large batch inference and operations that parallelise well. The Arm GPU Best Practices guide is essential reading if you're targeting Mali.

Ethos NPU is Arm's dedicated neural processing unit, designed specifically for ML inference. It handles common operations (convolutions, pooling, activation functions) in dedicated hardware with extreme efficiency — often 10-100x more energy-efficient than running the same operations on the CPU. The challenge is that NPUs support a finite set of operations, so complex models often need to be split across NPU and CPU, with the NPU handling what it can and the CPU handling the rest.

ArmNN and ACL provide the abstraction layer that makes all this manageable. ArmNN takes a model in TFLite, ONNX, or another format and maps it to the best available backend — whether that's CPU, GPU, or NPU. The Arm Compute Library (ACL) provides the optimised kernels underneath. Together, they let you write once and deploy across the Arm ecosystem with reasonable performance. It's not zero effort — you still need to profile and tune for each target — but it's dramatically better than writing platform-specific code for each device.

From robots to phones

My journey to edge ML started in a very different place. At Dyson, I was deploying VLMs on robotic perception systems — making robots understand their environment in real-time. The constraints were brutal: limited compute, strict latency requirements, and failure modes that were physical rather than digital. When a robot's perception model gets it wrong, things break. Literally.

That experience fundamentally shaped how I think about ML deployment. In the research world, you care about accuracy on a benchmark. In the robotics world, you care about worst-case latency, memory footprint, power consumption, and what happens when the model is wrong. Every one of those concerns carries directly over to edge ML on phones, wearables, and IoT devices.

The transition from Dyson to Arm was a natural progression: from deploying models on one specific edge platform to building the tools and infrastructure that enable deployment across all edge platforms. The problems are the same — quantization, memory optimisation, hardware-aware compilation — but the scale and impact are different. At Arm, the work I do touches billions of devices.

What connects all of it is a conviction that ML belongs on the device, close to the data, close to the user. I've worked across the full spectrum of edge ML: academic research on CNNs for satellite imagery at the University of Leicester, robotic perception at Dyson, and inference optimisation at Arm. The pattern is always the same: the real engineering challenges emerge when you leave the cloud behind.

Where this is all going

The convergence of better model architectures, better hardware, and better toolchains means we're approaching a tipping point.

On-device LLMs are becoming real. Models like Gemma, Llama, and Phi are being aggressively optimised for on-device deployment. With INT4 quantization and speculative decoding, you can run a capable 3B parameter model on a flagship phone at useful speeds. In 2-3 years, running a local LLM will be as unremarkable as running a local spell checker.

The NPU revolution. Dedicated neural processing units are showing up everywhere — in phones, laptops, cars, cameras, even microcontrollers. As NPU silicon matures and the software stack catches up, we'll see a step change in what's possible on-device. The hardware is ahead of the software right now, which means the biggest gains in the near term come from better compilers, better runtimes, and better tools.

Hybrid inference architectures. The future isn't purely edge or purely cloud — it's intelligent routing between the two. Simple queries run locally. Complex queries go to the cloud. Context stays on-device. This requires careful system design, and it's related to the cost-aware routing patterns I saw in the agent space.

Federated learning and on-device training. Inference on-device is just the beginning. Training — or at least fine-tuning — on-device enables personalisation without data leaving the hardware. This is already happening with keyboard prediction models, and it's going to expand dramatically.

The engineers who understand both the ML and the systems side — who can reason about cache hierarchies and attention mechanisms in the same conversation, who can read a paper on model architecture and also read the assembly output of a compiler — are going to be absurdly valuable. That's the intersection where I'm building my career, and I'm currently deepening that foundation through the MS in Artificial Intelligence at UT Austin.

That's the bet I'm making. And so far, it's paying off.