You’ve deployed an AI agent. It runs. It responds. The vendor demo looked great. But six months later, someone in a management meeting asks the obvious question: “Is this thing actually working?”
Most teams cannot answer that cleanly. Not because the agent failed — it may be performing well — but because nobody agreed up front on what “working” means. Choosing the right AI agent KPIs before (or just after) go-live is the difference between running a useful system and running an expensive demo you’re too embarrassed to turn off.
This article covers the metrics that matter for production AI agents — the ones that tell you whether the agent is earning its keep, where it’s breaking down, and when to intervene.
Why “It Answered the Question” Is Not a KPI
The instinct when an agent launches is to track satisfaction scores or response accuracy at a high level. These aren’t useless, but they don’t give you operational control. An agent can respond fluently and still route every fifth request to a human, drive up your LLM costs, or quietly fail on a whole category of edge cases your test suite never caught.
Operational AI agent performance measurement requires metrics that map directly to business outcomes: time saved, cost per outcome, and the rate at which the agent closes tasks without human intervention.
The framework below groups KPIs into three layers: task completion, escalation and failure, and economics.
Layer 1: Task Completion — Is the Agent Actually Finishing Work?
Containment Rate
Containment rate is the percentage of incoming requests the agent resolves end-to-end without a human stepping in. For a customer support agent, that means the customer got an answer and closed the conversation. For a document-processing agent, it means the document was classified, extracted, and routed without a human reviewer touching it.
Why it’s the headline metric: every point of containment is a unit of human time freed. A support agent handling 400 tickets a week at 60% containment closes 240 without human involvement. At 80%, that’s 320.
There’s no universal “good” rate — it depends on the task. A tightly scoped FAQ agent should hit 70–85% or higher. A document-intake agent dealing with varied formats might be doing well at 50–60%. Set the baseline in week one, then track the trend.
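As a minimal sketch of the calculation, assuming each request is logged with a flag for whether it was resolved and whether a human stepped in (the `Request` fields below are illustrative names, not a platform schema):

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One incoming request. Field names are illustrative assumptions."""
    request_id: str
    resolved: bool       # the request reached its end state
    human_touched: bool  # a human stepped in at any point


def containment_rate(requests: list[Request]) -> float:
    """Share of requests resolved end-to-end with no human involvement."""
    if not requests:
        return 0.0
    contained = sum(1 for r in requests if r.resolved and not r.human_touched)
    return contained / len(requests)


# The 400-tickets-a-week example: 240 closed by the agent alone, 160 with a human.
week = (
    [Request(f"c{i}", resolved=True, human_touched=False) for i in range(240)]
    + [Request(f"h{i}", resolved=True, human_touched=True) for i in range(160)]
)
rate = containment_rate(week)
print(f"containment: {rate:.0%}")
```

Counting only requests that were both resolved and untouched keeps abandoned conversations out of the numerator.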
Task Completion Rate vs. Deflection Rate
These two get conflated. Deflection rate simply measures how often the agent avoids a human touchpoint — it doesn’t confirm the request was resolved. A user who leaves after a non-answer inflates your deflection rate without adding value.
Task completion rate tracks whether the user’s actual goal was met: the booking was made, the refund was processed, the information was found and confirmed. Pairing completion rate with deflection rate tells you if you’re deflecting or just abandoning.
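One way to see the difference is to compute both rates from the same session log. The sketch below assumes two flags per session, `reached_human` and `goal_met`, which are hypothetical field names, and it treats "deflected but goal unmet" as abandonment.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One agent session. Both flags are assumed to come from your own logs."""
    reached_human: bool  # the session was handed to a person
    goal_met: bool       # the user's goal was confirmed as met


def _rate(hits: int, sessions: list[Session]) -> float:
    return hits / len(sessions) if sessions else 0.0


def deflection_rate(sessions: list[Session]) -> float:
    """Share of sessions that never reached a human, resolved or not."""
    return _rate(sum(not s.reached_human for s in sessions), sessions)


def completion_rate(sessions: list[Session]) -> float:
    """Share of sessions where the user's goal was actually met."""
    return _rate(sum(s.goal_met for s in sessions), sessions)


def abandonment_rate(sessions: list[Session]) -> float:
    """Share of sessions that avoided a human but left the goal unmet."""
    hits = sum((not s.reached_human) and (not s.goal_met) for s in sessions)
    return _rate(hits, sessions)


# Made-up sample: 8 of 10 sessions never reach a human, but 2 of those 8 fail.
sessions = (
    [Session(reached_human=False, goal_met=True)] * 6
    + [Session(reached_human=False, goal_met=False)] * 2
    + [Session(reached_human=True, goal_met=True)]
    + [Session(reached_human=True, goal_met=False)]
)
print(deflection_rate(sessions), completion_rate(sessions), abandonment_rate(sessions))
```

In this sample the deflection rate looks healthier than the completion rate, and the abandonment figure accounts for the gap.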
Layer 2: Escalation and Failure — Where Is the Agent Breaking?
Escalation Rate (and Escalation Categories)
Escalation rate is the counterpart to containment: the share of requests that end up with a human. Tracking the number alone isn’t enough. You need to know why escalations happen.
Most agent platforms expose escalation triggers in logs. Common categories:
- Intent not recognised — the agent didn’t understand the request
- Confidence below threshold — understood but not certain enough to act
- Policy boundary — by design, requires human approval
- User-initiated — the user explicitly asked for a human
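The first two are actionable. If 30% of your escalations are “intent not recognised,” you have a prompt design problem. If most fall below the confidence threshold, that threshold may be set too conservatively. If most are policy-driven, the agent is behaving as designed, and the question is whether the approval rules themselves are stricter than they need to be.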
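A breakdown like this can be sketched in a few lines, assuming each escalation is logged with one reason label. The label strings below are illustrative and will differ by platform.

```python
from collections import Counter

# Hypothetical labels for the two categories the team can act on directly.
ACTIONABLE = {"intent_not_recognised", "confidence_below_threshold"}


def escalation_breakdown(reasons: list[str]) -> dict[str, float]:
    """Share of escalations per reason, most frequent first."""
    counts = Counter(reasons)
    total = sum(counts.values())
    return {reason: n / total for reason, n in counts.most_common()}


def actionable_share(reasons: list[str]) -> float:
    """Share of escalations caused by recognition or confidence problems."""
    if not reasons:
        return 0.0
    return sum(r in ACTIONABLE for r in reasons) / len(reasons)


# Made-up week of escalation logs.
reasons = (
    ["intent_not_recognised"] * 3
    + ["confidence_below_threshold"] * 2
    + ["policy_boundary"] * 4
    + ["user_initiated"] * 1
)
breakdown = escalation_breakdown(reasons)
for reason, share in breakdown.items():
    print(f"{reason}: {share:.0%}")
print(f"actionable: {actionable_share(reasons):.0%}")
```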
Hallucination and Error Rate
For agents that retrieve and surface information — knowledge base queries, document summaries, FAQ responses — tracking factual accuracy matters more than most teams realise. Manual spot-checks on a sample of responses, combined with any flagging users do, give you a practical signal here.
Automated evals (LLM-as-judge scoring against a ground-truth set) are more systematic. The “Testing AI Agents: How Evals Keep Automation Trustworthy” article covers how to set that up without it becoming a research project.
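A spot-check only covers a sample, so the observed error rate comes with uncertainty. One common way to express that, offered here as a suggestion and not something the approach above requires, is a Wilson score interval. The sample figures are made up.

```python
import math


def wilson_interval(errors: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an error rate; z=1.96 is roughly 95% confidence."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = errors / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical spot-check: 12 responses judged wrong out of 200 reviewed.
low, high = wilson_interval(errors=12, sample_size=200)
print(f"observed {12 / 200:.1%}, plausible range {low:.1%} to {high:.1%}")
```

A wide range is a hint that the sample is too small to distinguish a real change from noise between two review periods.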
Time-to-Resolution
For workflows with a defined end state — a ticket resolved, a form submitted, an appointment confirmed — time-to-resolution is a clean metric. Compare agent-handled vs. human-handled resolution times. The gap is the efficiency story you tell internally.
One honest caveat: some tasks should go to humans and should take longer there. Time-to-resolution should be measured per task category, not averaged across everything.
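A per-category comparison might look like the sketch below, which assumes each resolved task is logged as a `(category, handler, minutes)` tuple. The median is used here as a design choice so that a few very long cases do not dominate; the figures are invented.

```python
from collections import defaultdict
from statistics import median


def median_ttr(records: list[tuple[str, str, float]]) -> dict[tuple[str, str], float]:
    """Median minutes to resolution, keyed by (task category, handler)."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for category, handler, minutes in records:
        grouped[(category, handler)].append(minutes)
    return {key: median(values) for key, values in grouped.items()}


# Made-up records; "handler" is either "agent" or "human".
records = [
    ("refund", "agent", 3),
    ("refund", "agent", 5),
    ("refund", "human", 20),
    ("refund", "human", 30),
    ("complaint", "human", 45),
]
ttr = median_ttr(records)
for (category, handler), minutes in sorted(ttr.items()):
    print(f"{category:<10} {handler:<6} {minutes} min")
```

Keeping the category in the key means a slow, human-only category such as complaints never drags down the agent comparison for refunds.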
Layer 3: Economics — What Is Each Outcome Costing You?
This is where AI agent metrics connect to the business case. The three metrics that matter most:
Cost Per Completed Task
Take your total agent running cost for a period — LLM API calls, infrastructure, any platform fees, plus a fair allocation of engineering maintenance time — and divide by the number of tasks completed. Compare this to the fully-loaded cost of a human completing the same task.
Illustrative scenario: a mid-sized e-commerce operation processes 3,000 return requests per month. Each takes roughly 8 minutes of human time at a blended employer cost of approximately CHF 0.65–0.85/minute (Swiss median wages for customer-facing staff plus employer social contributions of around 19%, per FSO Earnings Structure Survey data), so roughly CHF 5–7 per request. If the agent handles 70% of those (2,100 requests) at a total running cost of CHF 900/month, the per-task cost on agent-handled requests drops to about CHF 0.43, well under CHF 1. This is illustrative math, and your actual numbers depend on LLM usage, labour costs, and maintenance overhead, but this is the structure of the calculation.
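The same structure can be written as a small calculation so the inputs are easy to swap for your own. The variable names are illustrative; the values are the scenario figures above.

```python
# Inputs from the illustrative scenario (replace with your own figures).
requests_per_month = 3_000
minutes_per_request = 8
human_cost_per_minute_chf = (0.65, 0.85)  # low and high estimate
agent_share = 0.70                        # share of requests the agent handles
agent_monthly_cost_chf = 900.0            # LLM, infrastructure, fees, maintenance

# Human baseline: minutes per request times cost per minute.
human_cost_per_request = tuple(
    minutes_per_request * rate for rate in human_cost_per_minute_chf
)

# Agent side: total running cost divided by tasks the agent completed.
agent_tasks = requests_per_month * agent_share
agent_cost_per_task = agent_monthly_cost_chf / agent_tasks

print(
    f"human: CHF {human_cost_per_request[0]:.2f} to {human_cost_per_request[1]:.2f} per request"
)
print(f"agent: CHF {agent_cost_per_task:.2f} per completed task")
```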
LLM Token Cost per Task
As your agent scales, LLM API costs become a meaningful variable. Track tokens consumed per task, broken out by model. Long, unfocused system prompts and retrieval pipelines that return too much context drive this up unnecessarily — monitoring it flags inefficiency before it compounds.
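A rough sketch of per-task token cost by model follows. The model names and prices are placeholders, not real vendor pricing, and the log format `(task_id, model, input_tokens, output_tokens)` is an assumption about what your platform records.

```python
from collections import defaultdict

# Hypothetical prices per 1,000 tokens as (input, output), in any one currency.
PRICE_PER_1K = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.01, 0.03),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call under the placeholder price table."""
    price_in, price_out = PRICE_PER_1K[model]
    return input_tokens / 1000 * price_in + output_tokens / 1000 * price_out


def cost_per_task_by_model(calls: list[tuple[str, str, int, int]]) -> dict[str, float]:
    """Average spend per distinct task, broken out by model."""
    spend: dict[str, float] = defaultdict(float)
    tasks: dict[str, set[str]] = defaultdict(set)
    for task_id, model, input_tokens, output_tokens in calls:
        spend[model] += call_cost(model, input_tokens, output_tokens)
        tasks[model].add(task_id)
    return {model: spend[model] / len(tasks[model]) for model in spend}


# Made-up call log: task t1 needed two calls to the small model.
calls = [
    ("t1", "small-model", 2000, 500),
    ("t1", "small-model", 1000, 500),
    ("t2", "small-model", 3000, 1000),
    ("t3", "large-model", 4000, 1000),
]
per_task = cost_per_task_by_model(calls)
for model, cost in per_task.items():
    print(f"{model}: {cost:.4f} per task")
```

Grouping by task and not by call matters because a single task often triggers several calls, and it is the task total that feeds cost per completed task.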
Human Time Redirected
What are the humans doing now that the agent handles routine load? If the escalations that reach your team are genuinely higher-complexity or higher-value work, the agent is doing its job. If humans are reformatting things the agent produced imperfectly, you have a quality problem that cost-per-task alone won’t surface.
A Practical KPI Dashboard for Production Agents
Most teams overcomplicate this. For a production agent, start with six numbers tracked weekly:
| Metric | What it tells you | Target direction |
|---|---|---|
| Containment rate | Tasks closed without a human | Up |
| Task completion rate | Goals actually met | Up |
| Escalation rate by category | Where the agent breaks | Intent/confidence categories down |
| Error / hallucination rate | Output quality | Down |
| Cost per completed task | Economics | Down over time |
| Time-to-resolution (agent vs. human) | Efficiency gap | Agent faster |
Review these monthly against the baseline you set at deployment. A flat containment rate after two months is a signal to investigate. Rising cost per task while containment holds usually means your prompts or retrieval pipeline need trimming.
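The two review rules above can be turned into a simple check over the weekly series. The thresholds here (a 2-point tolerance for "flat", a 10% rise in cost, eight weeks for "two months") are assumptions to tune, not recommended values.

```python
def review_flags(
    containment: list[float],
    cost_per_task: list[float],
    flat_tolerance: float = 0.02,  # assumed: change within 2 points counts as flat
    cost_rise: float = 0.10,       # assumed: 10% relative increase counts as rising
    min_weeks: int = 8,            # assumed: roughly two months of weekly data
) -> list[str]:
    """Compare the first and last weekly values and return review prompts."""
    if not containment or len(containment) != len(cost_per_task):
        raise ValueError("series must be non-empty and the same length")
    flags = []
    containment_flat = abs(containment[-1] - containment[0]) <= flat_tolerance
    cost_change = (cost_per_task[-1] - cost_per_task[0]) / cost_per_task[0]
    if containment_flat and len(containment) >= min_weeks:
        flags.append("containment flat for two months: investigate")
    if containment_flat and cost_change >= cost_rise:
        flags.append("cost rising while containment holds: check prompts and retrieval")
    return flags


# Made-up eight weeks: containment stuck near 61%, cost per task creeping up.
containment = [0.61, 0.62, 0.60, 0.62, 0.61, 0.62, 0.61, 0.62]
cost_per_task = [0.40, 0.41, 0.42, 0.43, 0.44, 0.46, 0.47, 0.48]
flags = review_flags(containment, cost_per_task)
for flag in flags:
    print(flag)
```

Comparing only the first and last points is deliberately crude. A fitted trend would be more robust to a noisy week at either end.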
The “Measuring the ROI of AI Agents: A Framework for SMBs” article covers how to build the financial case from these numbers once you have a few months of data.
When KPIs Signal It’s Time to Rethink the Agent
Sometimes the metrics tell you the agent design itself needs to change — not just the prompts or thresholds. Warning signs:
- Containment rate plateaus below 50% despite multiple iterations
- Escalation categories show the same unrecognised intents week after week
- Users systematically override the agent rather than accepting its output
- Cost per task is higher than the human baseline and not improving
These patterns usually point to a scoping problem: the agent was given too broad a task, or was deployed in a context where the input variability is higher than the agent’s design can handle. The “From Pilot to Fleet: Managing AI Agents in Production” article covers how to think about this systematically.
The harder failure mode is an agent that looks fine statistically but is slowly degrading trust. Qualitative signals matter here: support tickets about the agent, user feedback, and the rate at which customers escalate to a human within 24 hours of a “resolved” session.
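The last of those signals is straightforward to compute if you can join "resolved" sessions to later human contacts. The sketch below assumes each resolved session is a `(resolved_at, human_contact_at)` pair, with `None` when the customer never came back; both the pairing and the timestamps are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional

ResolvedSession = tuple[datetime, Optional[datetime]]


def reopen_rate(
    sessions: list[ResolvedSession],
    window: timedelta = timedelta(hours=24),
) -> float:
    """Share of agent-resolved sessions followed by a human contact within the window."""
    if not sessions:
        return 0.0
    reopened = sum(
        1
        for resolved_at, contact_at in sessions
        if contact_at is not None
        and timedelta(0) <= contact_at - resolved_at <= window
    )
    return reopened / len(sessions)


# Made-up sessions: one customer returns after 3 hours, one after 30 hours.
t0 = datetime(2024, 1, 8, 9, 0)
resolved_sessions = [
    (t0, t0 + timedelta(hours=3)),
    (t0, t0 + timedelta(hours=30)),
    (t0, None),
    (t0, None),
]
rate_24h = reopen_rate(resolved_sessions)
print(f"escalated within 24h of 'resolved': {rate_24h:.0%}")
```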
What Good Measurement Looks Like From Day One
The teams that measure AI agents well share one practice: they define success criteria before the agent goes live, not after. What containment rate justifies the investment? What escalation rate triggers a prompt redesign review? What error rate is acceptable for your industry and task type?
These aren’t guesses — they’re negotiated between whoever owns the business outcome and whoever owns the technical build. Without that negotiation, every review meeting becomes an argument about whether 62% containment is good or disappointing.
If you’re in the process of planning or reviewing a deployment and want a clear view of which metrics fit your use case, Orange ITS runs a focused 30-minute session to map the right KPI set to your specific agent and business context. Get in touch to book that call.
For broader operational design questions, our Process Optimisation service covers how we instrument, monitor, and iterate on agents in production — including the measurement frameworks we use for client deployments.
The “Why AI Agent Projects Fail — and How to De-Risk Yours” article also covers measurement gaps as one of the most common failure modes, if you want to see how KPI blind spots contribute to broader project risk.