What I Learned Building AI Agents at a Stealth Startup

January 10, 2025

Before joining Arm, I spent time at a stealth startup building AI agent systems from the ground up. No existing codebase, no playbook, no "just follow the docs." Pure 0-to-1 engineering.

I'm Usamah Zaheer, and this is what I learned that the Twitter discourse consistently gets wrong.

Agents are not just prompt chains

The most common misconception about AI agents is that they're just LLM calls with tools. Chain some prompts together, give the model access to APIs, and boom — you have an agent.

No. What you have is a very expensive and unreliable script.

Real agent systems need:

State management that actually works. Agents need to maintain context across long-running tasks, handle interruptions gracefully, and resume from failures without losing progress. This is a distributed systems problem, not an AI problem. We spent more time on the state machine than on the prompts.

Reliable tool use. LLMs are probabilistic. Your database is not. The interface between "model thinks it should query the database" and "correct SQL actually executes" is where most agent systems break. We built validation layers, type-safe tool interfaces, and extensive error handling. The boring stuff that makes the cool stuff actually work.
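A validation layer of this kind can be sketched in a few lines. This is a deliberately simplified stand-in (the `Tool` dataclass and its type-map are hypothetical; real systems tend to use JSON Schema or Pydantic), but it shows the principle: reject anything the model hallucinated before it touches a real system.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    """A tool the model may call, behind a typed, validated interface."""
    name: str
    params: dict[str, type]       # expected argument names -> types
    run: Callable[..., Any]

    def invoke(self, raw_args: dict[str, Any]) -> Any:
        # Fail loudly on unknown keys, missing keys, or wrong types
        # *before* executing anything against a real system.
        unknown = set(raw_args) - set(self.params)
        if unknown:
            raise ValueError(f"{self.name}: unexpected arguments {unknown}")
        for key, expected in self.params.items():
            if key not in raw_args:
                raise ValueError(f"{self.name}: missing argument '{key}'")
            if not isinstance(raw_args[key], expected):
                raise TypeError(f"{self.name}: '{key}' must be {expected.__name__}")
        return self.run(**raw_args)
```

The point is that the probabilistic side (the model's proposed call) and the deterministic side (your database) only ever meet through this checkpoint.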

Cost awareness. Running an agent that makes 50 LLM calls to complete a task sounds fine until you multiply that by thousands of users. We built cost-aware routing that would use smaller models for simple subtasks and only escalate to larger models when needed. Your agent architecture is also a business model decision. This same principle — matching compute to complexity — shows up in edge ML inference too, where you route workloads between CPU, GPU, and NPU based on the operation's requirements.
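The routing logic itself can be very simple once you have a complexity signal. The tier names, prices, and thresholds below are invented for illustration — the real system's numbers and its complexity classifier were more involved than this:

```python
# Hypothetical per-1K-token prices; real pricing varies by provider.
MODEL_TIERS = [
    ("small-model", 0.0005),
    ("mid-model", 0.003),
    ("large-model", 0.03),
]


def route(complexity: float) -> str:
    """Pick the cheapest model tier adequate for a subtask.

    `complexity` in [0, 1] would come from a cheap classifier or
    heuristics: prompt length, number of tools involved, past failure
    rate on similar subtasks.
    """
    if complexity < 0.3:
        return MODEL_TIERS[0][0]   # summarise, reformat, extract
    if complexity < 0.7:
        return MODEL_TIERS[1][0]   # multi-step reasoning with tools
    return MODEL_TIERS[2][0]      # open-ended planning, escalations
```

Even a crude router like this cuts the bill dramatically, because most subtasks in a long agent run are the simple kind.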

The infrastructure nobody sees

The sexy part of building agents is the prompt engineering and the tool design. The unsexy part — the part that actually determines whether your system works at scale — is the infrastructure.

Orchestration and deployment. Our agents ran on Kubernetes with Docker containers, managed through Vertex AI pipelines. Each agent had its own resource profile: some were CPU-bound (lots of text processing), others needed GPU access for embedding generation. Getting the autoscaling right — spinning up agent instances in response to demand without burning money on idle compute — took months of iteration.

Vector stores and semantic search. Agents need to retrieve relevant context to do their jobs. We built a retrieval layer on top of vector databases (Pinecone, then Weaviate) with careful attention to chunking strategies, embedding model selection, and reranking. The quality of your retrieval pipeline directly determines the quality of your agent's outputs. Garbage context in, garbage actions out.
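Chunking is a good example of how much the details matter. Here's a bare-bones sketch of overlapping chunks — character-based for simplicity, where our real pipeline worked on tokens and sentence boundaries; the sizes are illustrative defaults, not recommendations:

```python
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks so a fact that straddles a
    boundary still appears whole in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks
```

Get this wrong — chunks too small, no overlap, splits mid-sentence — and no amount of embedding-model quality or reranking will save the retrieval step.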

Observability. When an agent makes a mistake, you need to understand why. We built comprehensive logging with MLflow for experiment tracking and custom dashboards for monitoring agent behaviour in production. Every LLM call, every tool invocation, every decision point was logged with enough context to reconstruct the agent's reasoning. Without this, debugging agent failures is like debugging a distributed system with print statements.
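The shape of that logging layer is easy to show. This sketch uses a plain decorator and Python's standard `logging` module — our production version fed MLflow and structured log pipelines instead, so treat the field names here as placeholders:

```python
import functools
import json
import logging
import time

logger = logging.getLogger("agent.trace")


def traced(kind: str):
    """Log every call (LLM, tool, decision point) with enough context
    to reconstruct the agent's reasoning after the fact."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            except Exception:
                status = "error"
                raise
            finally:
                logger.info(json.dumps({
                    "kind": kind,
                    "name": fn.__name__,
                    "status": status,
                    "duration_ms": round((time.monotonic() - start) * 1000, 2),
                    "args": repr(args)[:500],  # truncate large payloads
                }))
        return wrapper
    return decorator
```

Emitting structured JSON rather than free-form messages is the design choice that matters: it's what makes "show me every failed tool call in the last hour" a query instead of an archaeology project.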

Data pipelines. Agents don't operate in a vacuum — they need access to structured and unstructured data, often from multiple sources. We used Databricks for data engineering, building ETL pipelines that kept the agent's knowledge base fresh. Stale data means stale agents.
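Even a simple staleness guard helps here. The sketch below is hypothetical (source names, the 24-hour window, and the timestamp map are all invented), but it captures the idea: flag sources that missed their refresh window so the agent can warn or fall back rather than answer from stale data.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(
    last_refreshed: dict[str, datetime],
    max_age: timedelta = timedelta(hours=24),
) -> list[str]:
    """Return the names of knowledge-base sources whose last successful
    refresh is older than `max_age`."""
    now = datetime.now(timezone.utc)
    return [name for name, ts in last_refreshed.items() if now - ts > max_age]
```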

RAG vs. fine-tuning: a decision framework

One of the most consequential architectural decisions in any agent system is whether to use retrieval-augmented generation (RAG), fine-tuning, or some combination of both. After building systems with both approaches, here's my framework:

Use RAG when: Your knowledge base changes frequently, you need auditability (users want to see the sources), you have a limited compute budget for training, or you need to support multiple domains without separate models. RAG is also more forgiving of mistakes — you can fix retrieval issues by updating the knowledge base without retraining anything.

Use fine-tuning when: You need the model to deeply internalise a specific style, format, or domain vocabulary; when consistent output structure matters more than factual recall (which a retrieval layer can handle); or when latency is critical and you want to avoid the retrieval round trip.

Use both when: You need a model that speaks your domain's language fluently (fine-tuning) but also needs access to up-to-date information (RAG). This hybrid approach is more complex to maintain but produces the best results for production agent systems.

The trap most teams fall into is starting with fine-tuning because it feels more "real" than RAG. Fine-tuning is expensive, slow to iterate on, and creates a brittle dependency on a specific model version. RAG lets you ship faster and iterate on the knowledge base independently of the model. Start with RAG. Add fine-tuning when you've proven the value and understand the failure modes.
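The framework above collapses into a function small enough to fit in a code review comment. This is just the bullet points restated — the criteria names are mine, and real decisions weigh more inputs than four booleans — but writing it down this way makes the default visible:

```python
def choose_approach(
    knowledge_changes_often: bool,
    needs_source_citations: bool,
    needs_domain_style: bool,
    latency_critical: bool,
) -> str:
    """Collapse the RAG-vs-fine-tuning framework into one decision.

    Returns 'rag', 'fine-tune', or 'both'.
    """
    wants_rag = knowledge_changes_often or needs_source_citations
    wants_finetune = needs_domain_style or latency_critical
    if wants_rag and wants_finetune:
        return "both"
    if wants_finetune:
        return "fine-tune"
    return "rag"  # the default: ship faster, iterate on the knowledge base
```

Note where the function lands when no criterion fires strongly: RAG. That's the point.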

The hype vs. reality gap

Here's my honest assessment of where AI agents are right now:

Overhyped: Fully autonomous agents that can replace knowledge workers. We're not there. The error rates compound over long task chains, and the failure modes are unpredictable enough that you need human oversight for anything consequential. The confidence calibration problem — knowing when an agent is likely to be wrong — is the same challenge I faced when deploying VLMs on robots at Dyson. In both cases, the system needs to know what it doesn't know.

Underhyped: Agents as productivity multipliers for skilled operators. Give an expert a well-built agent and they'll 10x their output. The agent handles the tedious parts, the human handles the judgment calls. This is the real product opportunity right now.

Correctly hyped: The pace of improvement. The models are getting better fast. The tool use capabilities are getting more reliable. What doesn't work today might work in six months. Building in this space means building on a moving foundation, and that's both exciting and terrifying.

Why I chose UT Austin

After the startup experience, I had a choice: keep shipping products or go deeper on the fundamentals. I chose both — joining Arm for the applied work and UT Austin's MS in Artificial Intelligence for the theoretical depth.

The startup taught me that the biggest bottleneck in building agent systems isn't the model — it's understanding the underlying principles well enough to know what's possible and what's a dead end. Too many teams are brute-forcing their way through problems that have elegant solutions in the literature. Courses in natural language processing, reinforcement learning, and probabilistic graphical models directly address the foundations that agent systems are built on.

Pursuing the MS in AI at UT Austin while working full-time at Arm reflects my conviction that the best engineers in the agent space will be the ones who combine deep theoretical knowledge with hands-on systems experience. The agent space is going to be massive. But the winners won't be the ones who move fastest — they'll be the ones who build on the deepest foundations.

I want to be the engineer who's read the papers AND shipped the product. The one who can look at a problem and know whether to reach for a transformer or a finite state machine. That combination is rare, and I think it's where the leverage is.

That's the plan, anyway. So far, no regrets.