Why your AI agent works in the demo and breaks in production.
Most agent demos run for 30 seconds in a sandbox. Production agents have to work for years. Here's what we've learned shipping agents at enterprise scale — and the four patterns that actually hold up.
- Agent demos optimize for the happy path. Production agents have to handle the long tail.
- Four patterns separate working agents from impressive ones: explicit checkpoints, retrieval grounding, evaluation harnesses, and clean recovery paths.
- The biggest failure mode isn't model capability. It's everything around the model.
The demos look great. A natural-language prompt, a multi-step task, a clean answer in 30 seconds. The investor is sold, the executive is sold, the project gets greenlit.
Six months later, the same agent in production is making decisions on real data, with real consequences, and the team that built it is in incident response — trying to figure out why it just routed a $40,000 refund to the wrong customer.
The gap between demo agents and production agents is bigger than most teams expect, and it isn't really about the model. The frontier models we work with — Claude, GPT, Gemini — are remarkable. They're not the bottleneck. The bottleneck is everything around them: how the agent handles ambiguous inputs, how it grounds itself in reality, how it knows when to ask for help, and how the team running it knows when to intervene.
Here's what we've learned shipping production agents at enterprise scale.
1. Explicit checkpoints, not implicit assumptions.
A working agent doesn't try to do the whole task in one go. It breaks the task into discrete steps, makes its current step legible to the system around it, and surfaces what it's doing before it does it.
In practice, this means structured outputs at every step, not free-form text. A working agent says "I'm about to call refund_customer(customer_id=4837, amount=4000)" before it does so. Then a second model — or a deterministic rule — checks that intent against policy, the customer's actual order history, and the agent's own confidence.
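Here's a minimal sketch of what that checkpoint can look like. The names (ProposedAction, check_intent) and the policy thresholds are illustrative assumptions, not a specific framework:

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    """Structured intent the agent emits before it acts, not after."""
    tool: str          # e.g. "refund_customer"
    args: dict         # e.g. {"customer_id": 4837, "amount": 4000}
    confidence: float  # the agent's own estimate, 0.0-1.0

def check_intent(action: ProposedAction, order_total: float) -> bool:
    """Deterministic gate that sits between 'I intend to' and 'I did'."""
    if action.tool == "refund_customer":
        # Illustrative policy: never refund more than the customer paid,
        # and route low-confidence refunds to a human instead of executing.
        if action.args["amount"] > order_total:
            return False
        if action.confidence < 0.8:
            return False
    return True

proposal = ProposedAction(
    tool="refund_customer",
    args={"customer_id": 4837, "amount": 4000},
    confidence=0.72,
)
approved = check_intent(proposal, order_total=125.00)  # False: escalate, don't execute
```

The gate can be a second model call instead of a rule; the point is that the intent exists as data the surrounding system can inspect before anything irreversible happens.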
This pattern is unglamorous and adds latency. It is also the difference between agents that work and agents that don't.
2. Retrieval as grounding, not retrieval as input.
Most teams build RAG pipelines as input plumbing: pull some documents, stuff them in the context, hope the model uses them. That's not what retrieval does in a working agent.
In a production agent, retrieval is the agent's connection to reality. The agent's primary job is to decide what to retrieve, evaluate whether what came back actually answers the question, retrieve more or differently if it didn't, and surface its sources in the final answer.
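As a sketch, that loop looks something like this. The retriever, relevance judge, and answering call are passed in as placeholders for whatever stack you run; the function names and the 0.5 relevance cutoff are assumptions for illustration:

```python
from typing import Callable

def grounded_answer(
    question: str,
    search: Callable[[str], list[dict]],       # your retriever
    judge: Callable[[str, dict], float],       # relevance score for one chunk
    answer: Callable[[str, list[dict]], str],  # model call that cites its inputs
    max_attempts: int = 3,
) -> dict:
    """Retrieve, evaluate, retry. The agent decides whether the evidence is
    good enough before it answers, and citations travel with the answer."""
    query = question
    for attempt in range(max_attempts):
        chunks = search(query)
        # Re-rank / filter: keep only chunks the judge rates as relevant.
        relevant = [c for c in chunks if judge(question, c) >= 0.5]
        if relevant:
            return {
                "answer": answer(question, relevant),
                "sources": [c["doc_id"] for c in relevant],
            }
        # Nothing usable came back: retrieve differently instead of guessing.
        # (In practice this is a model call that reformulates the query;
        # the string below is a naive stand-in.)
        query = f"{question} (rephrased after empty attempt {attempt + 1})"
    # Out of evidence: say so explicitly rather than answering from thin air.
    return {"answer": None, "sources": [], "needs_human": True}
```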
This requires three things most teams skip: a retrieval evaluation harness, structured citations on every output, and a re-ranking step that actually filters out irrelevant chunks. None of these show up in the demo. All of them show up in the postmortem when the demo doesn't translate.
3. Evaluation harnesses that catch the long tail.
The hardest thing about evaluating agents is that the bugs you'll see in production are not the bugs you can write test cases for in advance. The agent will encounter a customer record with an unusual schema. It will get an empty result from a tool call and not handle it. It will get a response from another system that's technically valid but contextually wrong.
The teams that ship agents that work in production build eval harnesses that grow as the system grows: every production failure becomes a regression test, every new tool integration ships with adversarial test cases, and every model swap triggers a full eval run before deploy.
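In code, the "every failure becomes a regression test" part can be as plain as a parametrized suite over replayed incidents. Everything here is a hypothetical layout: the evals/incidents directory, the incident file format, and the my_agent entry point are assumptions about your own stack:

```python
# Hypothetical regression harness: each JSON file under evals/incidents/ is a
# replayed production failure, run on every deploy and every model swap.
import json
import pathlib

import pytest

from my_agent import agent  # assumption: whatever entry point your agent exposes

INCIDENTS = sorted(pathlib.Path("evals/incidents").glob("*.json"))

@pytest.mark.parametrize("incident_path", INCIDENTS, ids=lambda p: p.stem)
def test_incident_does_not_regress(incident_path):
    incident = json.loads(incident_path.read_text())
    result = agent.run(incident["input"])
    # Assertions are structural rather than exact-match: the right tool was
    # (or wasn't) called, and the agent escalated when it should have.
    assert result.tool_called == incident["expected"]["tool"]
    assert result.escalated == incident["expected"]["escalated"]
```

The same suite is what runs on a model swap: a green run is the precondition for deploy, not an afterthought.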
This is operational discipline more than model engineering. It looks like CI/CD for AI. It is the unglamorous work that makes the difference.
4. Clean recovery paths.
Agents fail. Production agents fail in front of users. The question is whether the failure is a controlled failure or an uncontrolled one.
Controlled failure looks like: the agent recognizes it's outside its competence, hands off to a human, logs everything that led up to the handoff, and tells the user what's happening.
Uncontrolled failure looks like: the agent confidently does the wrong thing, the user trusts it, and a week later the operations team is unwinding the damage.
The difference is design. Every production agent we ship has explicit "I don't know" output paths, escalation routes that take the user somewhere useful, and audit trails that let the operations team reconstruct what happened.
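Here's a sketch of what those paths look like as data. The outcome values, the queue name, and the audit fields are illustrative assumptions:

```python
import time
from dataclasses import dataclass, field
from enum import Enum

class Outcome(Enum):
    COMPLETED = "completed"
    NEEDS_HUMAN = "needs_human"  # the explicit "I don't know" path
    FAILED = "failed"

@dataclass
class AgentResult:
    outcome: Outcome
    answer: str | None
    escalation_queue: str | None  # where a human actually picks this up
    audit_trail: list[dict] = field(default_factory=list)  # every step, verbatim

def escalate(steps: list[dict], reason: str) -> AgentResult:
    """Controlled failure: stop, hand off, and leave a reconstructable record."""
    return AgentResult(
        outcome=Outcome.NEEDS_HUMAN,
        answer=None,
        escalation_queue="support-tier-2",  # hypothetical queue name
        audit_trail=steps + [{"ts": time.time(), "event": "escalated", "reason": reason}],
    )
```

The user-facing message and the routing will differ per product; what matters is that "needs a human" is a first-class outcome the agent can return, not an exception it hides.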
What this means for enterprise AI teams
If your agent works in the demo and breaks in production, the model isn't the problem. The architecture around the model is. Treat agent design as systems design — checkpoints, grounding, evaluation, recovery — and the demo-to-production gap closes.
If you're getting pressure to ship something fast and the team's instinct is to skip the harness because it slows down the demo, that's a signal worth listening to. The teams shipping working agents at scale are the teams that built the harness first.
BizzSoftware designs, builds, secures, and runs the internal applications your teams work in every day — with AI features built in. About us →