Most ML engineers treat the compiler as a black box. You export your model, point a tool at it, and hope the output is fast. I work inside the box.
I'm Usamah Zaheer, and at Arm, my work sits at the intersection of ML models and the hardware they run on. A significant part of that work involves understanding — and improving — the compilation pipeline that transforms a high-level model definition into efficient machine code. This is the layer of the stack that most ML engineers never see, but it determines whether your model runs in 5ms or 50ms on the same hardware.
From PyTorch to binary
The journey from model.forward() to actual hardware execution is longer and more complex than most people realise. Here's the pipeline:
Step 1: Export. Your PyTorch model gets exported to an intermediate representation — ONNX, TorchScript, or a portable compiler IR like StableHLO. This step captures the computation graph: which operations happen, in what order, and with what shapes. The export process needs to resolve all dynamic Python control flow into a static graph, which is why torch.export can be finicky with models that have data-dependent branching.
Step 2: Graph-level optimisation. The IR gets fed through a series of graph transformation passes. These are hardware-independent optimisations that simplify the computation without changing its semantics. More on this below.
Step 3: Lowering. The optimised graph gets lowered to hardware-specific representations. Abstract operations like "convolution" become concrete implementations — specific kernels chosen for the target hardware. This is where the compiler needs to know whether it's targeting a Cortex-A78 with NEON SIMD units, a Mali GPU, or an Ethos NPU.
Step 4: Memory planning. The compiler determines when each intermediate tensor is allocated and freed, minimising peak memory usage. On edge devices with limited SRAM, this step can determine whether a model fits on the hardware at all.
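To make the memory-planning step concrete, here is a toy greedy planner. It is a sketch, not any real compiler's algorithm: tensor names, sizes, and lifetimes are all made up, and real planners use far more sophisticated placement strategies.

```python
# Hypothetical mini memory planner. Each tensor has a size and a lifetime
# (first_use, last_use); tensors whose lifetimes overlap must not share
# memory, while dead tensors' buffers can be reused.

def plan_memory(tensors):
    """tensors: list of (name, size, first_use, last_use).
    Returns (offsets, peak): byte offset per tensor, and peak memory."""
    offsets = {}
    placed = []  # (offset, size, first_use, last_use)
    for name, size, start, end in sorted(tensors, key=lambda t: t[2]):
        # Regions occupied by tensors whose lifetime overlaps this one.
        busy = sorted((off, sz) for off, sz, s, e in placed
                      if not (e < start or s > end))
        # Take the first gap large enough (greedy, not optimal).
        candidate = 0
        for off, sz in busy:
            if candidate + size <= off:
                break
            candidate = max(candidate, off + sz)
        offsets[name] = candidate
        placed.append((candidate, size, start, end))
    peak = max((offsets[n] + sz for n, sz, _, _ in tensors), default=0)
    return offsets, peak

# conv_out and relu_out are both live at step 1, so they get disjoint
# offsets; by step 2 conv_out is dead, so pool_out reuses offset 0.
offs, peak = plan_memory([
    ("conv_out", 1024, 0, 1),
    ("relu_out", 1024, 1, 2),
    ("pool_out", 512, 2, 3),
])
```

The reuse of offset 0 by `pool_out` is exactly the effect that decides whether a model fits in a small SRAM: peak usage here is 2048 bytes, not the 2560 a naive allocator would need.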
Step 5: Code generation. The final step produces executable code — either native machine code, or a serialised plan that a runtime (like ArmNN) interprets at inference time.
Each step involves trade-offs, and each step is an opportunity for optimisation. The best ML compilers make good decisions at every stage. The gap between a good compiler and a great compiler can be 2-5x in inference latency.
Graph-level optimisations
Graph-level optimisations are the "free lunch" of ML compilation — they improve performance without changing the model's behaviour. Here are the most impactful ones:
Operator fusion. This is the big one. A typical neural network graph has sequences like Conv → BatchNorm → ReLU that appear hundreds of times. Naively, each operation reads its input from memory, computes, and writes its output back. Fusion combines these into a single kernel: read once, compute all three operations, write once. For memory-bandwidth-limited devices (which is most edge hardware), fusion can deliver 2-3x speedups.
The art is in knowing which operators can be fused. Simple linear chains are straightforward, but real models have branches, skip connections, and operations with multiple consumers. Modern compilers use pattern-matching to identify fusible subgraphs and cost models to decide which fusions are actually beneficial.
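As a sketch of the pattern-matching idea, here is a toy fusion pass. Real compilers match on a graph with branches and consumer counts; this version only walks a flat list of op names and fuses Conv followed optionally by BatchNorm and ReLU. The op list is illustrative.

```python
# Toy fusion pass: fuse Conv [+ BatchNorm] [+ ReLU] runs in a linear
# sequence of op names into a single fused kernel name.

def fuse(ops):
    fused, i = [], 0
    while i < len(ops):
        if ops[i] == "Conv":
            run, j = ["Conv"], i + 1
            # Optionally absorb BatchNorm, then ReLU, in that order.
            for nxt in ("BatchNorm", "ReLU"):
                if j < len(ops) and ops[j] == nxt:
                    run.append(nxt)
                    j += 1
            if len(run) > 1:  # at least two ops matched: emit one kernel
                fused.append("Fused[" + "+".join(run) + "]")
                i = j
                continue
        fused.append(ops[i])
        i += 1
    return fused

result = fuse(["Conv", "BatchNorm", "ReLU", "Add", "Conv", "ReLU"])
# -> ["Fused[Conv+BatchNorm+ReLU]", "Add", "Fused[Conv+ReLU]"]
```

Note that the partial pattern Conv → ReLU fuses too; handling optional pattern elements like this is a small taste of why production pattern matchers get complicated.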
Constant folding. Any operation whose inputs are all known at compile time can be computed once and replaced with its result. This sounds obvious, but it cascades — folding one operation might make another operation's inputs constant, enabling further folding. Batch normalisation parameters, for example, are constants after training, which means BN can often be folded entirely into the preceding convolution's weights and biases.
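The BN-into-conv fold works because batch normalisation is an affine function of the convolution output, so it collapses into scaled weights and a shifted bias. Here is the arithmetic on a single output channel, with made-up numbers:

```python
import math

# Batch norm after a conv:
#   y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
# is affine in conv(x), so it folds into the conv's weights and bias.

def fold_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta

# One output channel with a 3-tap kernel (illustrative values).
w, b = [0.5, -1.0, 0.25], 0.1
gamma, beta, mean, var = 1.5, -0.2, 0.3, 4.0

x = [2.0, 1.0, -1.0]  # one receptive field
conv = sum(wi * xi for wi, xi in zip(w, x)) + b
bn_out = gamma * (conv - mean) / math.sqrt(var + 1e-5) + beta

wf, bf = fold_bn(w, b, gamma, beta, mean, var)
folded = sum(wi * xi for wi, xi in zip(wf, x)) + bf
# bn_out and folded agree to floating-point precision: the BN op is gone.
```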
Layout transformation. Deep learning frameworks typically use NCHW (batch, channels, height, width) memory layout. But many hardware accelerators prefer NHWC, and some prefer even more exotic blocked layouts like NCHWxc, which tile the channel dimension to match SIMD vector widths. Inserting layout transformation operations at the right points in the graph — and minimising redundant transformations — is a graph-level optimisation with significant performance implications.
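What a layout transform actually does is permute a flat buffer via index arithmetic. A minimal NCHW-to-NHWC example, with an illustrative 1×2×2×2 tensor:

```python
# NCHW -> NHWC over a flat buffer, using raw index arithmetic rather
# than a tensor library.

def nchw_to_nhwc(buf, n, c, h, w):
    out = [0] * (n * c * h * w)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi   # NCHW index
                    dst = ((ni * h + hi) * w + wi) * c + ci   # NHWC index
                    out[dst] = buf[src]
    return out

# Channel-major in, channel-interleaved out.
nchw = [0, 1, 2, 3,   # channel 0
        4, 5, 6, 7]   # channel 1
nhwc = nchw_to_nhwc(nchw, 1, 2, 2, 2)
# -> [0, 4, 1, 5, 2, 6, 3, 7]
```

In NHWC the two channel values for each pixel sit adjacent in memory, which is why channel-wise SIMD operations like depthwise convolution tend to prefer it.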
Dead code elimination and common subexpression elimination. Standard compiler optimisations that apply to ML graphs too. If two branches of a model compute the same thing, compute it once. If a branch's output is never used, remove it. These are less dramatic than fusion but add up across large models.
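Common subexpression elimination reduces to hashing nodes by their structure. A minimal sketch, assuming nodes arrive in topological order:

```python
# Minimal CSE pass: nodes are (op, input_ids) tuples; structurally
# identical nodes collapse onto the first occurrence.

def cse(nodes):
    """nodes: dict id -> (op, tuple_of_input_ids), in topological order.
    Returns a remapping of duplicate node ids onto canonical ones."""
    seen, remap = {}, {}
    for nid, (op, inputs) in nodes.items():
        # Canonicalise inputs through earlier remappings first, so
        # duplicates-of-duplicates still hash equal.
        key = (op, tuple(remap.get(i, i) for i in inputs))
        if key in seen:
            remap[nid] = seen[key]   # duplicate: reuse the existing node
        else:
            seen[key] = nid
    return remap

# Two branches both compute mul(x, x); the second collapses onto the first.
graph = {
    "a": ("mul", ("x", "x")),
    "b": ("mul", ("x", "x")),
    "c": ("add", ("a", "b")),   # effectively add(a, a) after remapping
}
remap = cse(graph)
# -> {"b": "a"}; node b is gone, and DCE would then delete its definition.
```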
Kernel selection and auto-tuning
This is where the compiler makes its most consequential decisions. For a given operation — say, a 3×3 depthwise convolution with 128 channels in INT8 on a Cortex-A76 — there are multiple possible implementations:
Direct convolution loops over the spatial dimensions and accumulates products. Simple, but cache-unfriendly for large inputs.
Im2col + GEMM transforms the convolution into a matrix multiplication, which can leverage highly optimised GEMM kernels. The overhead is the im2col transformation itself, which requires extra memory and compute.
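The im2col trick is easiest to see in one dimension: unroll each receptive field into a row, and the convolution becomes a single matrix-vector product. Sizes below are illustrative.

```python
# im2col for a 1-D convolution: each output position's receptive field
# becomes one row, so the whole conv is one (tuned) GEMM call.

def im2col_1d(x, k):
    return [x[i:i + k] for i in range(len(x) - k + 1)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 0.0, -1.0]

direct = [sum(w[j] * x[i + j] for j in range(3)) for i in range(3)]
gemm = matvec(im2col_1d(x, 3), w)
# The two results match; the cost is the extra memory for the
# unrolled matrix, which is the overhead mentioned above.
```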
Winograd convolution reduces the number of multiplications by using a mathematical transformation, at the cost of more additions and some numerical precision. Particularly effective for 3×3 kernels.
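The smallest Winograd instance, F(2,3), shows the trade: two outputs of a 3-tap convolution with 4 multiplies instead of the direct method's 6, paid for with extra additions. A sketch with illustrative values:

```python
# Winograd F(2,3): 4 multiplies (m1..m4) produce 2 outputs of a
# 3-tap convolution; direct computation would need 6 multiplies.

def winograd_f23(d, g):
    # d: 4 input values, g: 3 filter taps -> 2 outputs.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -1.0]
direct = [sum(g[j] * d[i + j] for j in range(3)) for i in range(2)]
wino = winograd_f23(d, g)
# Both give the same outputs up to floating-point rounding. The divisions
# in m2/m3 hint at the precision cost noted above.
```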
Hardware-specific implementations use dedicated instructions. On Arm, the SDOT and SMMLA instructions perform INT8 matrix operations natively, and hand-tuned kernels built with these instructions can be significantly faster than generic implementations.
The Arm Compute Library (ACL) and KleidiAI maintain libraries of optimised kernels for different operations, data types, and hardware targets. ArmNN's role is to select the right kernel for each operation based on the target hardware, input shapes, and data types.
Auto-tuning goes further by empirically measuring kernel performance on the target hardware rather than relying on cost models. You generate multiple candidate implementations for each operation, run each one, measure latency, and choose the winner. TVM's AutoTVM and Ansor, and Meta's AITemplate take this approach. The downside is compilation time — auto-tuning a full model can take hours — but the results are often worth it for deployment targets where you'll be running millions of inferences.
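The core auto-tuning loop is simple even if the candidate generation is not. A bare-bones sketch, where plain Python functions stand in for real compiled kernel candidates:

```python
import time

# Empirical kernel selection: time each candidate on representative
# input and keep the fastest. The "kernels" here are stand-ins.

def sum_loop(xs):
    total = 0.0
    for v in xs:
        total += v
    return total

def sum_builtin(xs):
    return sum(xs)

def timed(fn, arg):
    start = time.perf_counter()
    fn(arg)
    return time.perf_counter() - start

def autotune(candidates, arg, repeats=5):
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Best-of-N timing reduces noise from the OS scheduler.
        t = min(timed(fn, arg) for _ in range(repeats))
        if t < best_time:
            best_name, best_time = name, t
    return best_name

data = list(range(100_000))
winner = autotune({"loop": sum_loop, "builtin": sum_builtin}, data)
# On CPython the builtin usually wins, but the point is that we measured
# rather than assumed.
```

Real systems like AutoTVM layer search strategies and learned cost models on top of this loop, but measurement-then-selection is the heart of it.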
The profiling feedback loop
Optimisation without measurement is guesswork. The profiling feedback loop is how you turn guesswork into engineering:
Profile the model end-to-end. Use PyTorch Profiler, ArmNN's built-in profiling, or framework-agnostic tools to identify the slowest operations. In my experience, 80% of inference time is typically spent in 20% of the operations. Focus there.
Identify the bottleneck type. Is the slow operation compute-bound or memory-bound? Use tools like Arm Streamline or hardware performance counters to measure IPC (instructions per cycle), cache miss rates, and memory bandwidth utilisation. A compute-bound operation might benefit from a faster kernel or lower precision. A memory-bound operation might benefit from operator fusion or better tiling.
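A quick back-of-envelope way to classify an operation is to compare its arithmetic intensity (FLOPs per byte moved) against the machine balance (peak FLOP/s divided by memory bandwidth). The hardware numbers below are illustrative, not measurements of any specific Arm core:

```python
# Roofline-style check: compute-bound vs memory-bound.

def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

# Hypothetical device: 50 GFLOP/s peak compute, 10 GB/s bandwidth.
machine_balance = 50e9 / 10e9   # 5.0 FLOPs per byte

# A 1x1 conv, 64 -> 64 channels over a 56x56 feature map, FP32.
flops = 2 * 64 * 64 * 56 * 56                    # MAC = 2 FLOPs
bytes_moved = 4 * (64 * 56 * 56 * 2 + 64 * 64)   # in + out + weights

ai = arithmetic_intensity(flops, bytes_moved)
bound = "compute-bound" if ai > machine_balance else "memory-bound"
# Here ai is ~15.8 FLOPs/byte, above the balance point, so this op is
# compute-bound on the hypothetical device: a faster kernel or lower
# precision helps more than fusion would.
```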
Micro-benchmark alternatives. Once you've identified the bottleneck, try alternative implementations. Different kernel, different data layout, different precision. Measure each one. Don't trust intuition — measure. I've been surprised more times than I can count by which implementation turns out to be fastest.
Iterate. Optimise the top bottleneck, then re-profile. The performance landscape shifts as you optimise — fixing one bottleneck often reveals the next one. The profiling loop is never "done," but there's usually a point of diminishing returns where the remaining operations are already near-optimal for the hardware.
Tools I use regularly: PyTorch Profiler for high-level operation timing, Valgrind (Callgrind) for CPU instruction-level profiling, gprof for call-graph analysis, Arm Streamline for hardware performance counter data, and custom timing instrumentation for production latency monitoring.
Why this matters for you
Even if you never write a compiler pass or hand-tune a kernel, understanding this stack makes you a better ML engineer. Here's why:
You'll make better architecture decisions. When you know that depthwise separable convolutions are efficient not just because they have fewer FLOPs, but because they map well to SIMD instructions and have cache-friendly access patterns, you'll make better trade-offs when designing or selecting model architectures.
You'll debug performance issues faster. When your model is slower than expected, you'll know where to look. Is it a memory bandwidth bottleneck? An unfused operation sequence? A layout mismatch? Understanding the compilation pipeline gives you the vocabulary and mental model to diagnose these issues.
You'll write more deployment-friendly models. Models that follow compiler-friendly patterns — regular shapes, standard operations, consistent data types — compile and optimise better. The difference between a model that's "theoretically efficient" and one that's "actually fast on hardware" often comes down to how well it interacts with the compilation pipeline.
Understanding the compiler stack is what separates ML engineers who ship models from ML engineers who ship fast models. It's also what connects my work at Arm to the broader edge ML inference challenge and to the practical deployment constraints I encountered at Dyson. The compiler is the bridge between the model and the metal — and it's a bridge worth understanding.