There's a particular type of pain that comes from watching a vision-language model work perfectly in your dev environment and then completely fall apart when you strap it to an actual robot.
I know this pain intimately. At Dyson, I spent nearly two years integrating VLMs into robotic perception pipelines — first building the classical computer vision foundation, then pushing the boundaries with multimodal models. The demos were impressive. The path to production was humbling.
The CNN foundation
Before VLMs entered the picture, the perception stack was built on classical deep learning. Segmentation models identified surfaces and obstacles. Object detection networks located and classified items in the robot's workspace. These models were the workhorses — not glamorous, but reliable and fast.
Getting these CNNs production-ready required the same techniques I now work with daily at Arm: quantization, pruning, and memory optimisation. We pruned channels that contributed little to accuracy, quantized from FP32 to INT8, and fused batch normalisation layers into convolutions. A model that started at 200MB and 50ms per inference might end up at 25MB and 8ms — fast enough for real-time robotic control.
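As a sketch of what batch-norm folding and INT8 quantization actually do, here is a minimal numpy stand-in (a 1x1 convolution reduced to a matrix multiply; shapes, seeds, and helper names are illustrative, not the production pipeline):

```python
import numpy as np

def fuse_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters into the preceding convolution.

    w: conv weights, shape (out_ch, in_ch) for a 1x1 conv
    b: conv bias, shape (out_ch,); BN params all shape (out_ch,)
    """
    scale = gamma / np.sqrt(var + eps)
    w_fused = w * scale[:, None]
    b_fused = (b - mean) * scale + beta
    return w_fused, b_fused

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: int8 weights + one scale."""
    s = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)); b = rng.normal(size=4)
gamma = rng.uniform(0.5, 1.5, 4); beta = rng.normal(size=4)
mean = rng.normal(size=4); var = rng.uniform(0.5, 2.0, 4)
x = rng.normal(size=8)

# Reference path: convolution followed by batch norm
y_ref = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
# Fused path: a single convolution with folded parameters
wf, bf = fuse_bn_into_conv(w, b, gamma, beta, mean, var)
y_fused = wf @ x + bf

# Dequantized INT8 inference approximates the fused FP32 result
q, s = quantize_int8(wf)
y_int8 = (q.astype(np.float64) * s) @ x + bf
```

The fused path is numerically identical to conv-then-BN; quantization then trades a small, bounded error for a 4x smaller weight tensor and integer arithmetic.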
This work built my intuition for what makes a model "deployable" versus merely "accurate." When a robot arm is in motion, a model with 95% accuracy that runs in 5ms is far more useful than one with 97% accuracy that runs in 500ms. That lesson — that deployment constraints are design constraints — has followed me through every role since. It's the same lesson that applies to satellite imagery classification, where vast amounts of data must be processed under tight compute and time budgets.
The demo-to-deployment gap
Every week there's a new VLM paper showing incredible results on benchmarks. A model that can describe images, answer visual questions, reason about spatial relationships. Cool. Now make it do that at 30fps on embedded hardware while a robot arm is moving and the lighting keeps changing.
The challenges nobody mentions in the papers:
Latency kills. A robot operating in the real world can't wait 500ms for a visual prediction. By the time your model has "reasoned" about the scene, the scene has changed. We had to architect the entire perception stack around async inference with prediction horizons. The model needs to tell you what's about to happen, not what just happened.
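The prediction-horizon idea can be made concrete with the simplest possible stand-in, constant-velocity extrapolation from a timestamped prediction (this is an illustrative sketch, not the actual stack; `Prediction` and its fields are invented for the example):

```python
import time
from dataclasses import dataclass

@dataclass
class Prediction:
    position: float   # object position along one axis (metres)
    velocity: float   # estimated from recent frames (m/s)
    stamp: float      # time the input frame was captured

def extrapolate(pred, now, horizon=0.0):
    """Constant-velocity extrapolation to 'now + horizon'.

    The control loop never consumes the raw (stale) prediction;
    it asks where the object will be when the command takes effect.
    """
    dt = (now - pred.stamp) + horizon
    return pred.position + pred.velocity * dt

# A frame captured 0.5 s ago, object moving at 0.2 m/s:
pred = Prediction(position=1.0, velocity=0.2, stamp=time.monotonic() - 0.5)
# Plan for 100 ms in the future to absorb actuation latency:
target = extrapolate(pred, time.monotonic(), horizon=0.1)
```

In a real system the extrapolation model would be a Kalman filter or learned dynamics, but the principle is the same: every prediction carries its capture timestamp, and consumers compensate for its age.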
Distribution shift is relentless. Your model was trained on internet images. Your robot sees the same workshop from the same angles with the same objects, but under fluorescent lighting with weird shadows and reflections off metal surfaces. Fine-tuning helps. Domain-specific data collection helps more. But you're always fighting drift.
Failure modes are physical. When a chatbot hallucinates, someone screenshots it for Twitter. When a robot hallucinates, it crashes into things. The confidence calibration requirements are completely different. We built multi-layered safety systems that would catch VLM errors before they became physical actions. This same challenge of knowing when the model is wrong shows up in AI agent systems — in both cases, the system needs graceful degradation, not silent failure.
What actually worked
After a lot of iteration, here's what moved the needle:
Hybrid architectures. We didn't replace classical computer vision with VLMs — we layered VLMs on top. Fast, reliable classical CV handles the safety-critical stuff (obstacle detection, workspace boundaries). The VLM handles higher-level semantic understanding (object identification, task planning). Best of both worlds.
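The layering can be sketched in a few lines: the fast classical layer runs first and can veto before the slow semantic layer is ever consulted. Everything here is illustrative — the callables, dictionary keys, and the 5cm margin are assumptions for the example, not the real interfaces:

```python
def perceive(frame, classical_cv, vlm, safety_margin=0.05):
    """Layered perception: fast classical CV gates the slow semantic model.

    classical_cv(frame) -> dict with 'obstacle_distance' (metres)   [assumed]
    vlm(frame)          -> dict with 'objects', 'suggested_action'  [assumed]
    """
    safety = classical_cv(frame)          # runs every frame, a few ms
    if safety["obstacle_distance"] < safety_margin:
        return {"action": "stop", "reason": "safety-layer veto"}
    semantics = vlm(frame)                # slower, advisory only
    return {"action": semantics["suggested_action"],
            "objects": semantics["objects"]}

# Stub sensors for illustration:
classical = lambda f: {"obstacle_distance": f["dist"]}
semantic = lambda f: {"objects": ["cup"], "suggested_action": "grasp"}

blocked = perceive({"dist": 0.01}, classical, semantic)   # safety veto wins
clear = perceive({"dist": 0.50}, classical, semantic)     # VLM output used
```

The design choice that matters: the VLM can only ever refine a decision the safety layer has already approved, never override it.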
Aggressive distillation. The big VLMs are too slow and too hungry for edge deployment. We distilled task-specific capabilities from large models into smaller, faster ones. You lose generality but gain the only thing that matters in production: reliability at speed.
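The core of distillation is a loss that blends the teacher's softened output distribution with the hard labels. A minimal numpy version of the standard (Hinton-style) formulation, with illustrative temperature and mixing values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft targets from the teacher at temperature T, blended with
    the ordinary hard-label cross-entropy."""
    p_t = softmax(teacher_logits / T)
    log_p_s = np.log(softmax(student_logits / T))
    soft = -np.sum(p_t * log_p_s, axis=-1).mean() * (T * T)  # scale by T^2
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard

teacher = np.array([[2.0, 0.5, -1.0]])
labels = np.array([0])
loss_matched = distillation_loss(teacher, teacher, labels)  # student == teacher
loss_random = distillation_loss(np.array([[-1.0, 2.0, 0.5]]), teacher, labels)
```

A student that reproduces the teacher's distribution minimises the soft term, which is what lets a small task-specific model inherit the large model's behaviour on the narrow domain that matters.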
Human-in-the-loop, but smart. Instead of trying to make the system fully autonomous from day one, we built confidence-aware systems that would escalate uncertain decisions. The robot knows what it doesn't know. That's more valuable than a system that's confident and wrong.
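Confidence-aware dispatch can be as simple as a tiered policy (the thresholds and tier names below are illustrative, not the deployed values):

```python
def decide(confidence, action):
    """Tiered, confidence-aware dispatch.

    High confidence   -> act at full speed.
    Medium confidence -> act conservatively (reduced speed, wider margins).
    Low confidence    -> don't act; escalate to a human or fallback policy.
    """
    if confidence >= 0.9:
        return ("execute", action)
    if confidence >= 0.6:
        return ("execute_cautious", action)
    return ("escalate", action)
```

The hard engineering is upstream of this function: calibrating the model so that its confidence scores are actually meaningful, which raw VLM outputs rarely are out of the box.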
The VLM that saved £100K
One project stands out. We were tasked with building a perception system for a new product line — the kind of project where the traditional approach would have meant months of data collection, annotation, model training, and validation. The budget estimate for a conventional computer vision pipeline was north of £100K when you factored in data annotation costs, specialised hardware for training, and the engineering time for a custom solution.
Instead, we built a VLM-based prototype that leveraged transfer learning from a large pre-trained model, fine-tuned on a fraction of the data that a from-scratch approach would have required. The key insight was that VLMs already understand visual concepts at a level that took years of labelled data to teach traditional CV models. We needed to teach the model our specific domain, not teach it how to see.
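The "teach it our domain, not how to see" idea is the logic of frozen-backbone fine-tuning: keep the pre-trained encoder fixed and learn only a small head on its features. A toy numpy sketch, with random vectors standing in for frozen VLM embeddings (all names and hyperparameters are illustrative):

```python
import numpy as np

def train_head(features, labels, n_classes, lr=0.1, steps=200):
    """Fit only a linear classification head on frozen backbone features.

    Because only the head's weights are learned, a few hundred labelled
    examples can be enough — the backbone already 'knows how to see'.
    """
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(features.shape[1], n_classes))
    for _ in range(steps):
        logits = features @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        onehot = np.eye(n_classes)[labels]
        W -= lr * features.T @ (p - onehot) / len(labels)  # gradient step
    return W

# Synthetic stand-in for frozen embeddings of a two-class domain:
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 8))
labels = (features[:, 0] > 0).astype(int)
W = train_head(features, labels, n_classes=2)
accuracy = ((features @ W).argmax(axis=1) == labels).mean()
```

Real fine-tuning would use a deep backbone and possibly unfreeze a few layers, but the data economics are the same: the trainable parameter count, and hence the labelled-data requirement, shrinks by orders of magnitude.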
The prototype worked. It passed internal quality gates, met latency requirements after distillation, and went from concept to working demo in weeks rather than months. The savings weren't just financial — they were temporal. In a fast-moving product development cycle, shipping a working perception system months ahead of schedule changes the entire trajectory of a product.
The CEO presentation
One of the highlights of my time at Dyson was presenting this work directly to the CEO and the senior leadership team. When you can show a robot that genuinely understands its environment — that can look at a scene and reason about what to do next — the reaction is visceral. People get it immediately.
The presentation covered the full arc: the classical CV foundation, the VLM integration, the distillation pipeline that made it run on embedded hardware, and the roadmap for what comes next. I demonstrated the system live, which is always a calculated risk with robotics demos ("demo gods" are a real phenomenon in this field), but the system performed exactly as designed.
What I learned from that experience goes beyond the technical: being able to communicate complex ML work to non-technical leadership is a force multiplier. The best technology in the world doesn't matter if you can't explain why it matters to the people who allocate resources.
From robotics to edge ML
The transition from Dyson to Arm was a natural evolution. At Dyson, I was solving edge ML problems for one specific hardware platform — making models run fast and reliably on Dyson's embedded systems. At Arm, I'm building the tools and optimisations that enable edge ML across the entire Arm ecosystem — smartphones, IoT devices, automotive systems, and everything in between.
The problems are fundamentally the same: quantization, operator fusion, memory planning, hardware-aware optimisation. But the scale is different. At Dyson, I optimised models for one product. At Arm, the work touches billions of devices across every major smartphone manufacturer, cloud provider, and embedded system vendor.
My experience deploying VLMs at Dyson — navigating the gap between research-grade models and production-grade systems, building safety-critical perception pipelines, and presenting technical work to senior leadership — directly shaped the approach I now bring to ML inference optimisation at Arm. The robotics work was the training ground; edge ML at Arm is the scaled application.
The tech is real. The engineering challenges are massive. And the gap between "cool demo" and "production system" is where the interesting work lives.
That's the work I love doing.