From Pilot to Fleet: Managing AI Agents in Production

Most AI agent deployments start the same way: one agent, one use case, one enthusiastic project sponsor. The pilot works. Leadership approves a second agent, then a third. Within eighteen months, a mid-sized operations team is running five or six agents across support, sales, finance, and logistics — and nobody has a clear picture of what any of them are doing at any given moment.

That is when AI agent management stops being a technical afterthought and becomes a business-critical discipline.

This article covers what you need to monitor, how to handle versioning and model updates, when to escalate to humans, and the less-discussed question of when to consolidate your fleet rather than keep expanding it.

Why a Pilot That Worked Can Break in Production

A single agent running in a controlled environment is forgiving. You can watch it closely, correct it manually, and tolerate the occasional odd output. Scale that to a fleet operating across departments, and the failure modes multiply in ways that are hard to predict.

Three patterns appear repeatedly once organisations move past the initial deployment:

Prompt drift. An agent’s behaviour changes not because anyone touched the code, but because the underlying model was updated by the provider — or because the data the agent retrieves has quietly shifted. A support agent that handled refund queries correctly for six months starts misclassifying escalations. Nobody changed anything. Everything changed anyway.

Silent failures. Unlike a crashed server, a misbehaving agent often keeps running. It completes tasks, logs success, and passes back outputs that look plausible. The failure is in the quality of those outputs — and it can go unnoticed for weeks unless you are measuring the right things.

Dependency sprawl. Each agent added to the fleet typically connects to one or more external tools: a CRM, a document store, an API. When one of those dependencies changes or goes down, the agent’s behaviour degrades in ways that can be hard to trace back to the root cause.

None of this is insurmountable. But it requires deliberate infrastructure around AI agent operations — not just the agents themselves.

The Four Pillars of AI Agent Management at Scale

1. Observability: Knowing What Your Agents Are Actually Doing

Observability in AI agent management means more than uptime dashboards. You need visibility into three distinct layers:

Task completion rates — not just whether the agent ran, but whether it completed the intended task correctly. A support agent that deflects 80% of tickets but misclassifies 30% of them has a completion rate that looks fine and a quality rate that does not.
Latency and cost per run — especially relevant when agents call external LLMs. Token costs can compound quickly at scale, and latency spikes often signal that a tool call is hanging or a retrieval step is degrading.
Output sampling — a mechanism for regularly reviewing a random sample of agent outputs, either manually or through an automated evaluator. This is the only reliable way to catch quality drift before it becomes a customer-facing problem.
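As a rough illustration of those three layers, the sketch below keeps completion and quality as separate numbers and draws a random sample of runs for review. The `RunRecord` fields, the function names, and the sample data are all assumptions made for this example, not part of any particular monitoring product.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """One agent run. Every field name here is illustrative."""
    run_id: str
    completed: bool            # the agent reported finishing the task
    correct: Optional[bool]    # reviewer verdict; None until the run is reviewed
    latency_s: float
    cost_usd: float


def sample_for_review(runs, k, seed=None):
    """Pick up to k runs uniformly at random for manual or automated review."""
    rng = random.Random(seed)
    return rng.sample(runs, min(k, len(runs)))


def summarise(runs):
    """Report completion and quality separately so one cannot mask the other."""
    if not runs:
        raise ValueError("no runs to summarise")
    reviewed = [r for r in runs if r.correct is not None]
    quality = sum(r.correct for r in reviewed) / len(reviewed) if reviewed else None
    return {
        "completion_rate": sum(r.completed for r in runs) / len(runs),
        "quality_rate": quality,  # computed over reviewed runs only
        "avg_latency_s": sum(r.latency_s for r in runs) / len(runs),
        "avg_cost_usd": sum(r.cost_usd for r in runs) / len(runs),
    }


runs = [
    RunRecord("r1", True, True, 1.0, 0.02),
    RunRecord("r2", True, False, 3.0, 0.04),
    RunRecord("r3", True, None, 2.0, 0.03),
    RunRecord("r4", False, None, 2.0, 0.03),
]
stats = summarise(runs)
to_review = sample_for_review(runs, k=2, seed=1)
```

With this sample data, three of the four runs completed but only one of the two reviewed runs was correct, which is exactly the kind of gap an uptime dashboard would not show.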

Connect these measurements to your broader metrics. The KPIs that prove agents are working should feed into the same dashboards where you track operational performance, rather than living in a separate “AI dashboard” that nobody checks on a Friday afternoon.

2. Versioning and Change Management

Agents in production are not static. Prompts get refined, tools change, model providers release new versions, and business logic evolves. Without version control, you lose the ability to diagnose regressions and roll back safely.

Treat your agents like software. That means:

Prompts and configurations stored in version control alongside application code
Staging environments where changes are tested against representative inputs before reaching production
A clear policy on who can approve changes to agents that touch sensitive data or customer-facing outputs

Model versioning deserves special attention. When a model provider releases a new version, the default is often to upgrade automatically, for example when an agent references a generic model alias rather than a specific dated version. For agents in production, automatic upgrades are a risk. Pin your model versions explicitly and treat upgrades as a deployment event, not a routine update. Run your existing evaluations and test suites against the new version before switching.
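One way to make pinning enforceable is to check it in code before a configuration change is accepted. The sketch below is a minimal example under stated assumptions: the model identifiers are made up, the rule that a pinned identifier ends in a date stamp is a convention chosen for this example (providers name versions differently), and the 0.95 pass-rate threshold is arbitrary.

```python
import re
from dataclasses import dataclass, replace

# Assumption for this sketch: a pinned model id ends in a date stamp,
# e.g. "example-model-2024-06-01". Adapt the pattern to your provider.
DATED_SNAPSHOT = re.compile(r"\d{4}-?\d{2}-?\d{2}$")


@dataclass(frozen=True)
class AgentConfig:
    agent_name: str
    model: str            # hypothetical model identifier
    prompt_version: str   # e.g. a git tag or commit hash


def is_pinned(model: str) -> bool:
    """Return True only if the model id ends in a date-like version stamp."""
    return bool(DATED_SNAPSHOT.search(model))


def approve_upgrade(current, candidate_model, eval_pass_rate, threshold=0.95):
    """Return an updated config only if the candidate is pinned and passes evals."""
    if not is_pinned(candidate_model):
        raise ValueError(f"{candidate_model!r} is not a pinned version")
    if eval_pass_rate < threshold:
        raise ValueError(
            f"eval pass rate {eval_pass_rate:.2f} is below {threshold:.2f}"
        )
    # The original config is frozen; replace() builds a new one.
    return replace(current, model=candidate_model)


current = AgentConfig("support", "example-model-2024-06-01", "v14")
upgraded = approve_upgrade(current, "example-model-2025-01-15", eval_pass_rate=0.97)
```

Because the config is immutable and the upgrade returns a new object, the previous version stays available for rollback.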

3. Human Escalation Paths That Actually Work

Every agent in production needs a defined escalation path — a clear answer to the question: what happens when this agent should not handle a situation on its own?

The failure mode is not usually that escalation logic is absent. It is that the logic exists but the handoff breaks down in practice. Common issues:

The agent escalates to a human queue that nobody monitors consistently
Escalation triggers are set too conservatively (the agent escalates everything ambiguous) or too permissively (it handles things it should not)
Escalated cases arrive without sufficient context, forcing the human to reconstruct what the agent already tried

A working escalation path has three properties: it is triggered reliably, it delivers context alongside the case, and a named person is responsible for handling it. The third point sounds obvious, but in organisations where AI agents have been bolted onto existing workflows, accountability for escalated cases is often unclear.
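The three properties can be sketched as a small data structure plus two functions. Everything here is hypothetical: the `Escalation` fields, the 0.7 confidence threshold, and the `owners` mapping are assumptions for illustration, and a real trigger would depend on the signals your agents actually expose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Escalation:
    """A case handed to a human, carrying what the agent already tried."""
    case_id: str
    agent_name: str
    reason: str
    owner: str                                   # named person accountable for the case
    steps_tried: list = field(default_factory=list)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def should_escalate(confidence, touches_sensitive_data, min_confidence=0.7):
    """Trigger: escalate on low confidence or whenever sensitive data is involved."""
    return touches_sensitive_data or confidence < min_confidence


def build_escalation(case_id, agent_name, reason, steps_tried, owners):
    """Refuse to create an escalation that nobody is accountable for."""
    owner = owners.get(agent_name)
    if not owner:
        raise LookupError(f"no accountable owner registered for {agent_name!r}")
    return Escalation(case_id, agent_name, reason, owner, list(steps_tried))


owners = {"support": "jane.doe"}  # hypothetical owner mapping
case = build_escalation(
    "case-001",
    "support",
    "refund amount exceeds policy limit",
    ["looked up order", "checked refund policy"],
    owners,
)
```

Failing loudly when no owner is registered turns the accountability gap into a deployment-time error instead of an unanswered queue.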

For multi-agent deployments, escalation design becomes more complex. See AI agent orchestration for how routing and fallback logic work when agents hand off between themselves.

4. Governance: Who Owns the Fleet

A fleet of agents without clear ownership is an operational liability. Governance here does not mean bureaucracy — it means answering a handful of practical questions:

Who can approve changes to agent behaviour in production?
Who is responsible when an agent takes an action that causes a problem?
How are new agents reviewed before deployment?
How often are existing agents audited?

A lightweight “agent register” — a living document listing each agent in production, its owner, its scope, its data dependencies, and its last review date — pays for itself the first time something goes wrong at 11pm on a Tuesday.
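The register does not need special tooling. As a minimal sketch, it can be a list of records with a helper that flags agents overdue for review. The field names mirror the list in the paragraph above; the 90-day review interval, the owner names, and the dates are assumptions for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RegisterEntry:
    name: str
    owner: str
    scope: str
    data_dependencies: tuple
    last_review: date


def overdue_for_review(register, today, max_age_days=90):
    """Return the names of agents whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry.name for entry in register if entry.last_review < cutoff]


register = [
    RegisterEntry("client-support", "a.owner", "tier-1 support queries",
                  ("knowledge base", "CRM"), date(2025, 5, 1)),
    RegisterEntry("it-helpdesk", "b.owner", "internal IT tickets",
                  ("ticketing system",), date(2025, 1, 15)),
]

overdue = overdue_for_review(register, today=date(2025, 6, 1))
# overdue == ["it-helpdesk"]
```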

The AI agent governance playbook covers the organisational side of this in more depth, including how to structure oversight without slowing down iteration.

Consolidate vs. Add: The Decision Most Organisations Get Wrong

Once a fleet is running, the natural instinct is to solve new problems by adding agents. The fleet grows; complexity grows faster.

Sometimes a new agent is the right answer. But often, extending an existing agent or consolidating two overlapping ones is cleaner. Signs it is time to consolidate rather than add:

Two agents are querying the same data sources for related tasks
Hand-offs between agents are a frequent source of errors or lost context
Maintenance load is growing faster than business value

The test is straightforward: if merging or extending an existing agent would reduce the total number of moving parts without sacrificing performance, that is usually the right call. A smaller, well-maintained fleet is easier to govern, cheaper to run, and more resilient to dependency changes than a sprawling collection of narrowly scoped agents.
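The first of those signals, agents querying the same data sources, is easy to check mechanically if each agent's dependencies are recorded. The sketch below covers only that one signal, using hypothetical agent and source names; hand-off errors and maintenance load still need human judgement.

```python
from itertools import combinations


def shared_sources(agents):
    """Map each pair of agents to the data sources they both query."""
    overlaps = {}
    # Sort by agent name so the pair keys come out in a stable order.
    for (a, src_a), (b, src_b) in combinations(sorted(agents.items()), 2):
        common = set(src_a) & set(src_b)
        if common:
            overlaps[(a, b)] = common
    return overlaps


agents = {
    "support": {"crm", "knowledge-base"},
    "lead-qualification": {"crm"},
    "summarisation": {"document-store"},
}

candidates = shared_sources(agents)
# candidates == {("lead-qualification", "support"): {"crm"}}
```

An overlap is a prompt to look closer, not a verdict: two agents can share a source and still be better kept separate.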

What a Managed Fleet Looks Like in Practice

Consider a professional services firm with 50 staff running four agents: inbound lead qualification, document summarisation, client support, and IT helpdesk.

Without an operations framework, each agent runs in isolation. Changes are ad hoc, token costs are invisible, and nobody knows which agent is generating the most escalations. With even lightweight management in place (version-controlled configs, weekly output sampling, named owners, a shared cost dashboard), the picture changes fast. The firm spots three things. First, the support agent’s escalation rate jumped two weeks ago because a knowledge base update introduced outdated information. Second, the IT helpdesk agent is handling 40 tickets a week at an estimated cost of $40–$120 (roughly $1–$3 per AI-resolved ticket), compared with $600–$880 in equivalent staff time at industry-average rates of $15–$22 per human-handled ticket (per MetricNet and BMC benchmarks). Third, the summarisation agent is stable enough to take on a second document type.
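The helpdesk comparison is simple multiplication, and it is worth keeping that calculation somewhere it can be rerun as volumes change. In this sketch the ticket volume and per-ticket rates come from the example above; the function name and the savings calculation are illustrative additions.

```python
def weekly_cost_range(tickets_per_week, low_per_ticket, high_per_ticket):
    """Return the (low, high) weekly cost for a given per-ticket cost range."""
    return tickets_per_week * low_per_ticket, tickets_per_week * high_per_ticket


agent_low, agent_high = weekly_cost_range(40, 1, 3)      # (40, 120)
human_low, human_high = weekly_cost_range(40, 15, 22)    # (600, 880)

# Most cautious and most optimistic weekly savings estimates.
saving_low = human_low - agent_high     # 480
saving_high = human_high - agent_low    # 840
```

Even the cautious end of that range assumes the AI-resolved tickets were resolved correctly, which is why the cost figure only means something alongside output sampling.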

That gap — running agents versus managing them — is where most ROI is either realised or lost.

The Operational Maturity You Want Before You Scale Further

Before adding the next agent to your fleet, it is worth asking whether the existing ones are actually managed. A checklist worth running:

Each agent has a named owner responsible for its performance
Prompts and configurations are version-controlled and reviewed before changes
You are sampling and reviewing agent outputs regularly, not just monitoring uptime
Model versions are pinned and upgraded deliberately, not automatically
Escalation paths are tested and have clear human accountability
Total fleet costs are visible in a single place
There is a defined process for retiring an agent that is no longer providing value

If several of these are gaps, building the management layer before the next deployment will save significant remediation effort later. The AI agent development work only pays off when the operational infrastructure can keep it performing.

Ready to Move From Pilots to a Managed Fleet?

If you are running agents in production and the operational picture is less clear than you would like, a focused conversation can identify the highest-priority gaps quickly.

Book a 30-minute call with the Orange ITS team to review your current agent fleet and identify where a management framework would have the most impact. No slide decks — just a practical assessment of where you are and what would actually move the needle.

Frequently asked questions

Why do AI agents fail silently in production?

Unlike a crashed server, a misbehaving agent often keeps running, completing tasks, logging success, and returning plausible-looking outputs while quality quietly degrades. This can go unnoticed for weeks unless you regularly sample and review actual agent outputs rather than only monitoring uptime.

What should you monitor for AI agents in production?

Three layers: task completion rates measured by whether the task was done correctly, not just executed; latency and cost per run, since token costs compound at scale; and regular output sampling, either manual or automated, which is the only reliable way to catch quality drift before it reaches customers.

Should AI agents automatically upgrade to new model versions?

No. Automatic model upgrades are a risk for production agents because behaviour can change without anyone touching your code. Pin model versions explicitly, treat provider upgrades as a deployment event, and run your existing evaluations against the new version before switching.

How do you design a human escalation path for an AI agent?

A working escalation path has three properties: it triggers reliably, it delivers context alongside the case so the human does not have to reconstruct what the agent already tried, and a named person is genuinely accountable for handling it. Most failures happen not because escalation logic is missing but because the handoff breaks down in practice.

When should you consolidate AI agents instead of adding more?

Consolidate when two agents query the same data sources for related tasks, when hand-offs between agents frequently cause errors or lost context, or when maintenance load grows faster than business value. If merging or extending an existing agent reduces moving parts without hurting performance, that is usually the right call.