The KPIs That Prove Your AI Agents Are Working

You’ve deployed an AI agent. It runs. It responds. The vendor demo looked great. But six months later, someone in a management meeting asks the obvious question: “Is this thing actually working?”

Most teams cannot answer that cleanly. Not because the agent failed — it may be performing well — but because nobody agreed up front on what “working” means. Choosing the right AI agent KPIs before (or just after) go-live is the difference between running a useful system and running an expensive demo you’re too embarrassed to turn off.

This article covers the metrics that matter for production AI agents — the ones that tell you whether the agent is earning its keep, where it’s breaking down, and when to intervene.

Why “It Answered the Question” Is Not a KPI

The instinct when an agent launches is to track satisfaction scores or response accuracy at a high level. These aren’t useless, but they don’t give you operational control. An agent can respond fluently and still route every fifth request to a human, drive up your LLM costs, or quietly fail on a whole category of edge cases your test suite never caught.

Measuring an agent operationally requires metrics that map directly to business outcomes: time saved, cost per outcome, and the rate at which the agent closes tasks without human intervention.

The framework below groups KPIs into three layers: task completion, escalation and failure, and economics.

Layer 1: Task Completion — Is the Agent Actually Finishing Work?

Containment Rate

Containment rate is the percentage of incoming requests the agent resolves end-to-end without a human stepping in. For a customer support agent, that means the customer got an answer and closed the conversation. For a document-processing agent, it means the document was classified, extracted, and routed without a human reviewer touching it.

Why it’s the headline metric: every point of containment is a unit of human time freed. A support agent handling 400 tickets a week at 60% containment closes 240 without human involvement. At 80%, that’s 320.

There’s no universal “good” rate — it depends on the task. A tightly scoped FAQ agent should hit 70–85% or higher. A document-intake agent dealing with varied formats might be doing well at 50–60%. Set the baseline in week one, then track the trend.
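As a rough illustration of the arithmetic, containment can be computed from two counts. The function names below are illustrative and not part of any agent platform's API; the figures are the 400-tickets-a-week example from above.

```python
def containment_rate(total_requests: int, resolved_without_human: int) -> float:
    """Share of requests closed end-to-end with no human involvement."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= resolved_without_human <= total_requests:
        raise ValueError("resolved_without_human must be between 0 and total_requests")
    return resolved_without_human / total_requests


def contained_volume(total_requests: int, rate: float) -> int:
    """Number of requests the agent closes on its own at a given containment rate."""
    return round(total_requests * rate)


# The 400-tickets-a-week example from the text.
weekly_tickets = 400
print(contained_volume(weekly_tickets, 0.60))  # 240
print(contained_volume(weekly_tickets, 0.80))  # 320
print(containment_rate(weekly_tickets, 240))   # 0.6
```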

Task Completion Rate vs. Deflection Rate

These two get conflated. Deflection rate simply measures how often the agent avoids a human touchpoint — it doesn’t confirm the request was resolved. A user who leaves after a non-answer inflates your deflection rate without adding value.

Task completion rate tracks whether the user’s actual goal was met: the booking was made, the refund was processed, the information was found and confirmed. Pairing completion rate with deflection rate tells you if you’re deflecting or just abandoning.
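One way to make the distinction concrete is to compute both rates from the same session log. This is a minimal sketch that assumes each session records two flags, `reached_human` and `goal_met`; both field names are hypothetical, and completion is counted here only for sessions the agent handled alone.

```python
from dataclasses import dataclass


@dataclass
class Session:
    reached_human: bool  # True if a human touched the request at any point
    goal_met: bool       # True if the user's actual goal was confirmed as met


def deflection_rate(sessions: list[Session]) -> float:
    """Share of sessions that never reached a human, resolved or not."""
    return sum(not s.reached_human for s in sessions) / len(sessions)


def agent_completion_rate(sessions: list[Session]) -> float:
    """Share of sessions where the agent alone met the user's goal."""
    return sum(s.goal_met and not s.reached_human for s in sessions) / len(sessions)


def abandonment_rate(sessions: list[Session]) -> float:
    """Share of sessions deflected without the goal being met."""
    return sum(not s.reached_human and not s.goal_met for s in sessions) / len(sessions)


# Sample data only: ten sessions with three different outcomes.
sessions = (
    [Session(reached_human=False, goal_met=True)] * 6     # resolved by the agent
    + [Session(reached_human=False, goal_met=False)] * 2  # user left after a non-answer
    + [Session(reached_human=True, goal_met=True)] * 2    # handed to a human
)

print(deflection_rate(sessions))        # 0.8
print(agent_completion_rate(sessions))  # 0.6
print(abandonment_rate(sessions))       # 0.2
```

The gap between the first two numbers is the abandonment share: sessions that count as deflected but delivered nothing.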

Layer 2: Escalation and Failure — Where Is the Agent Breaking?

Escalation Rate (and Escalation Categories)

Escalation rate is the complement of containment: the share of requests that end up with a human. Tracking the number alone isn’t enough. You need to know why escalations happen.

Most agent platforms expose escalation triggers in logs. Common categories:

Intent not recognised — the agent didn’t understand the request
Confidence below threshold — understood but not certain enough to act
Policy boundary — by design, requires human approval
User-initiated — the user explicitly asked for a human

The first two are actionable. If 30% of your escalations are “intent not recognised,” you have a prompt design problem. If most are policy-driven, your policy boundaries may be drawn too conservatively.
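The category breakdown can be sketched with a simple counter over escalation log entries. The category labels below are illustrative stand-ins for whatever your platform's logs expose, and the sample data is made up.

```python
from collections import Counter

# Illustrative labels for the two categories the text calls actionable.
ACTIONABLE = {"intent_not_recognised", "confidence_below_threshold"}


def escalation_shares(categories: list[str]) -> dict[str, float]:
    """Share of all escalations falling into each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def actionable_share(categories: list[str]) -> float:
    """Share of escalations caused by recognition or confidence problems."""
    return sum(c in ACTIONABLE for c in categories) / len(categories)


# Sample data only: one entry per escalated request.
escalations = (
    ["intent_not_recognised"] * 3
    + ["confidence_below_threshold"] * 2
    + ["policy_boundary"] * 4
    + ["user_initiated"] * 1
)

shares = escalation_shares(escalations)
print(shares["intent_not_recognised"])  # 0.3
print(actionable_share(escalations))    # 0.5
```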

Hallucination and Error Rate

For agents that retrieve and surface information — knowledge base queries, document summaries, FAQ responses — tracking factual accuracy matters more than most teams realise. Manual spot-checks on a sample of responses, combined with any flagging users do, give you a practical signal here.

Automated evals — LLM-as-judge scoring against a ground-truth set — are more systematic. The Testing AI Agents: How Evals Keep Automation Trustworthy article covers how to set that up without it becoming a research project.
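The shape of such an eval can be sketched in a few lines. This is an assumption-laden outline, not a full harness: `answer_fn` stands in for your agent, `judge_fn` for the judge, and the exact-match judge and canned answers exist only so the example runs. In an LLM-as-judge setup, `judge_fn` would call a model instead.

```python
from typing import Callable

# Each case pairs a question with its ground-truth answer.
Case = tuple[str, str]


def error_rate(
    cases: list[Case],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], bool],
) -> float:
    """Share of cases where the judge rejects the agent's answer."""
    failures = sum(
        not judge_fn(answer_fn(question), expected) for question, expected in cases
    )
    return failures / len(cases)


def exact_match_judge(answer: str, expected: str) -> bool:
    """Stand-in judge: case- and whitespace-insensitive exact match."""
    return answer.strip().lower() == expected.strip().lower()


# Stub agent with canned answers, used here only so the sketch runs.
CANNED = {
    "What is the return window?": "30 days",
    "Do you ship to Austria?": "Yes",
    "Is gift wrapping free?": "Yes",
}
cases = [
    ("What is the return window?", "30 days"),
    ("Do you ship to Austria?", "yes"),
    ("Is gift wrapping free?", "No"),   # the stub answers "Yes": counted as an error
    ("Can I pay by invoice?", "Yes"),   # the stub has no answer: counted as an error
]

rate = error_rate(cases, lambda q: CANNED.get(q, "I don't know"), exact_match_judge)
print(rate)  # 0.5
```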

Time-to-Resolution

For workflows with a defined end state — a ticket resolved, a form submitted, an appointment confirmed — time-to-resolution is a clean metric. Compare agent-handled vs. human-handled resolution times. The gap is the efficiency story you tell internally.

One honest caveat: some tasks should go to humans and should take longer there. Time-to-resolution should be measured per task category, not averaged across everything.
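A per-category comparison might look like the sketch below. It assumes each resolved task is logged as a category, a handler ("agent" or "human"), and minutes to resolution; the record layout and sample values are invented for illustration. The median is used rather than the mean so a few very slow cases don't dominate.

```python
from collections import defaultdict
from statistics import median

# (task category, handler, minutes to resolution); sample values are made up.
resolutions = [
    ("refund", "agent", 4), ("refund", "agent", 6),
    ("refund", "human", 20), ("refund", "human", 30),
    ("address_change", "agent", 2), ("address_change", "agent", 3),
    ("address_change", "agent", 4), ("address_change", "human", 12),
]


def median_resolution(records):
    """Median minutes to resolution, keyed by (category, handler)."""
    grouped = defaultdict(list)
    for category, handler, minutes in records:
        grouped[(category, handler)].append(minutes)
    return {key: median(values) for key, values in grouped.items()}


medians = median_resolution(resolutions)
for category in sorted({c for c, _, _ in resolutions}):
    agent = medians.get((category, "agent"))
    human = medians.get((category, "human"))
    print(category, "agent:", agent, "human:", human)
```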

Layer 3: Economics — What Is Each Outcome Costing You?

This is where AI agent metrics connect to the business case. The three metrics that matter most:

Cost Per Completed Task

Take your total agent running cost for a period — LLM API calls, infrastructure, any platform fees, plus a fair allocation of engineering maintenance time — and divide by the number of tasks completed. Compare this to the fully-loaded cost of a human completing the same task.

Illustrative scenario: a mid-sized e-commerce operation processes 3,000 return requests per month. Each takes roughly 8 minutes of human time at a blended employer cost of approximately CHF 0.65–0.85/minute (Swiss median wages for customer-facing staff plus employer social contributions of around 19%, per FSO Earnings Structure Survey data), so roughly CHF 5–7 per request. If the agent handles 70% of those (2,100 requests) at a total running cost of CHF 900/month, the per-task cost on agent-handled requests drops to about CHF 0.43, well under CHF 1. This is illustrative math, and your actual numbers depend on LLM usage, labour costs, and maintenance overhead, but this is the structure of the calculation.
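The same calculation, written out so you can swap in your own figures. All inputs are the illustrative numbers from the scenario above; the variable names are just for this sketch.

```python
# Figures from the illustrative returns scenario; replace with your own.
requests_per_month = 3000
minutes_per_request = 8
human_cost_per_minute = (0.65, 0.85)  # CHF, low and high estimate
agent_share = 0.70                    # share of requests the agent handles
agent_running_cost = 900              # CHF per month, all-in

# Human cost per request: minutes of work times cost per minute.
human_cost_low = minutes_per_request * human_cost_per_minute[0]
human_cost_high = minutes_per_request * human_cost_per_minute[1]

# Agent cost per task: monthly running cost spread over agent-handled requests.
agent_tasks = round(requests_per_month * agent_share)
agent_cost_per_task = agent_running_cost / agent_tasks

print(f"Human cost per request: CHF {human_cost_low:.2f} to {human_cost_high:.2f}")
print(f"Agent-handled requests: {agent_tasks}")
print(f"Agent cost per task: CHF {agent_cost_per_task:.2f}")
```

With these inputs the sketch gives CHF 5.20 to 6.80 per human-handled request and about CHF 0.43 per agent-handled one.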

LLM Token Cost per Task

As your agent scales, LLM API costs become a meaningful variable. Track tokens consumed per task, broken out by model. Long, unfocused system prompts and retrieval pipelines that return too much context drive this up unnecessarily — monitoring it flags inefficiency before it compounds.
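A minimal sketch of that tracking, assuming you can log one record per LLM call with a task ID, model name, and token counts. The model names and per-token prices below are hypothetical placeholders, not real vendor pricing.

```python
from collections import defaultdict

# HYPOTHETICAL prices in CHF per million tokens (input, output); not real vendor pricing.
PRICE_PER_MILLION = {
    "small-model": (0.50, 1.50),
    "large-model": (5.00, 15.00),
}

# One record per LLM call: (task_id, model, input_tokens, output_tokens). Sample data only.
calls = [
    ("task-1", "small-model", 2_000, 500),
    ("task-1", "large-model", 6_000, 1_000),
    ("task-2", "small-model", 1_500, 300),
]


def call_cost(model, input_tokens, output_tokens):
    """Cost of a single call at the placeholder prices above."""
    price_in, price_out = PRICE_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def usage_by_task_and_model(records):
    """Total tokens and cost for each (task_id, model) pair."""
    usage = defaultdict(lambda: {"tokens": 0, "cost": 0.0})
    for task_id, model, input_tokens, output_tokens in records:
        entry = usage[(task_id, model)]
        entry["tokens"] += input_tokens + output_tokens
        entry["cost"] += call_cost(model, input_tokens, output_tokens)
    return dict(usage)


usage = usage_by_task_and_model(calls)
for (task_id, model), entry in sorted(usage.items()):
    print(task_id, model, entry["tokens"], f"{entry['cost']:.5f}")
```

A task whose token count climbs week over week while its output stays the same is usually a sign that the prompt or the retrieved context has grown.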

Human Time Redirected

What are the humans doing now that the agent handles routine load? If the escalations that reach your team are genuinely higher-complexity or higher-value work, the agent is doing its job. If humans are reformatting things the agent produced imperfectly, you have a quality problem that cost-per-task alone won’t surface.

A Practical KPI Dashboard for Production Agents

Most teams overcomplicate this. For a production agent, start with six numbers tracked weekly:

Metric	What it tells you	Target direction
Containment rate	Tasks closed without human	Up
Task completion rate	Goals actually met	Up
Escalation rate by category	Where the agent breaks	Intent/confidence categories down
Error / hallucination rate	Output quality	Down
Cost per completed task	Economics	Down over time
Time-to-resolution (agent vs. human)	Efficiency gap	Agent faster

Review these monthly against the baseline you set at deployment. A flat containment rate after two months is a signal to investigate. Rising cost per task while containment holds usually means your prompts or retrieval pipeline need trimming.
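Those two review rules can be sketched as simple checks over weekly snapshots. The 0.02 tolerance for "flat" is an assumption chosen for illustration, as are the sample values; pick a tolerance that reflects the normal week-to-week noise in your own data.

```python
def is_flat(values, tolerance):
    """True if the whole series moves by no more than `tolerance`."""
    return max(values) - min(values) <= tolerance


def review_flags(containment, cost_per_task, tolerance=0.02):
    """Return review flags for a window of weekly snapshots (oldest first)."""
    flags = []
    if is_flat(containment, tolerance):
        flags.append("containment flat: investigate")
        if cost_per_task[-1] > cost_per_task[0]:
            flags.append("cost rising while containment holds: check prompts and retrieval")
    return flags


# Roughly two months of weekly values; sample data only.
containment = [0.61, 0.62, 0.61, 0.62, 0.62, 0.61, 0.62, 0.62]
cost_per_task = [0.40, 0.41, 0.43, 0.44, 0.46, 0.49, 0.52, 0.55]

for flag in review_flags(containment, cost_per_task):
    print(flag)
```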

The Measuring the ROI of AI Agents: A Framework for SMBs article covers how to build the financial case from these numbers once you have a few months of data.

When KPIs Signal It’s Time to Rethink the Agent

Sometimes the metrics tell you the agent design itself needs to change — not just the prompts or thresholds. Warning signs:

Containment rate plateaus below 50% despite multiple iterations
Escalation categories show the same unrecognised intents week after week
Users systematically override the agent rather than accepting its output
Cost per task is higher than the human baseline and not improving

These patterns usually point to a scoping problem: the agent was given too broad a task, or was deployed in a context where the input variability is higher than the agent’s design can handle. The From Pilot to Fleet: Managing AI Agents in Production article covers how to think about this systematically.

The harder failure mode is an agent that looks fine statistically but is slowly degrading trust. Signals outside the core dashboard matter here: support tickets about the agent, user feedback, and the rate at which customers escalate to a human within 24 hours of a “resolved” session.
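The last of those signals is straightforward to compute if you log timestamps. This is a sketch that assumes each "resolved" session is stored as a pair of the resolution time and the time of any later human contact (or `None`); the tuple layout and sample timestamps are invented for illustration.

```python
from datetime import datetime, timedelta


def reopen_rate(sessions, window=timedelta(hours=24)):
    """Share of 'resolved' sessions followed by a human escalation within `window`.

    Each session is (resolved_at, human_contact_at), where human_contact_at is
    None if the customer never reached a human afterwards.
    """
    reopened = sum(
        1
        for resolved_at, human_contact_at in sessions
        if human_contact_at is not None
        and timedelta(0) <= human_contact_at - resolved_at <= window
    )
    return reopened / len(sessions)


# Sample data only.
resolved = datetime(2024, 3, 4, 9, 0)
sessions = [
    (resolved, None),                            # stayed resolved
    (resolved, resolved + timedelta(hours=3)),   # came back the same day
    (resolved, resolved + timedelta(hours=30)),  # came back, but outside the window
    (resolved, None),                            # stayed resolved
]

print(reopen_rate(sessions))  # 0.25
```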

What Good Measurement Looks Like From Day One

The teams that measure AI agents well share one practice: they define success criteria before the agent goes live, not after. What containment rate justifies the investment? What escalation rate triggers a prompt redesign review? What error rate is acceptable for your industry and task type?

These aren’t guesses — they’re negotiated between whoever owns the business outcome and whoever owns the technical build. Without that negotiation, every review meeting becomes an argument about whether 62% containment is good or disappointing.
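Once agreed, the criteria can be written down as data so every review checks the same thresholds. The class, field names, and threshold values below are placeholders for illustration; the real numbers come out of that negotiation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriteria:
    min_containment: float  # below this, the investment case is in question
    max_escalation: float   # above this, trigger a prompt redesign review
    max_error_rate: float   # acceptable error rate for the task type


def review(criteria, containment, escalation, error_rate):
    """Return the list of agreed criteria the agent currently misses."""
    misses = []
    if containment < criteria.min_containment:
        misses.append("containment below agreed minimum")
    if escalation > criteria.max_escalation:
        misses.append("escalation above agreed maximum")
    if error_rate > criteria.max_error_rate:
        misses.append("error rate above agreed maximum")
    return misses


# Placeholder thresholds, not recommendations.
criteria = SuccessCriteria(min_containment=0.60, max_escalation=0.35, max_error_rate=0.02)
result = review(criteria, containment=0.62, escalation=0.38, error_rate=0.01)
print(result)  # ['escalation above agreed maximum']
```

With the minimum written down as 60%, a 62% containment figure stops being a matter of opinion in the review meeting.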

If you’re planning or reviewing a deployment and want a clear view of which metrics fit your use case, Orange ITS runs a focused 30-minute session to map the right KPI set to your specific agent and business context. Get in touch to book that call.

For broader operational design questions, our Process Optimisation service covers how we instrument, monitor, and iterate on agents in production — including the measurement frameworks we use for client deployments.

The Why AI Agent Projects Fail — and How to De-Risk Yours article also covers measurement gaps as one of the most common failure modes, if you want to see how KPI blind spots contribute to broader project risk.

Frequently asked questions

Which KPIs should I track for an AI agent in production?

Start with six numbers tracked weekly: containment rate, task completion rate, escalation rate by category, error or hallucination rate, cost per completed task, and time-to-resolution versus humans. Review monthly against the baseline you set at deployment.

What is a good containment rate for an AI agent?

There is no universal number; it depends on the task. A tightly scoped FAQ agent should hit 70 to 85 percent or higher, while a document-intake agent dealing with varied formats might be doing well at 50 to 60 percent. Set the baseline in week one and track the trend.

What is the difference between deflection rate and task completion rate?

Deflection rate only measures how often the agent avoided a human touchpoint; a user who leaves after a non-answer still counts as deflected. Task completion rate tracks whether the user's actual goal was met, so pairing the two tells you whether you are deflecting or just abandoning users.

How do I calculate cost per completed task for an AI agent?

Take the total running cost for a period (LLM API calls, infrastructure, platform fees, plus a fair share of engineering maintenance time) and divide by tasks completed, then compare against the fully loaded cost of a human doing the same task. In an illustrative returns scenario, the cost per agent-handled request dropped from roughly CHF 5-7 to well under CHF 1.

When do KPIs signal the agent design itself is wrong?

Warning signs include containment plateauing below 50 percent despite iterations, the same unrecognised intents recurring week after week, users systematically overriding the agent, and cost per task staying above the human baseline. These usually point to a scoping problem: the agent was given too broad a task.