Testing AI Agents: How Evals Keep Automation Trustworthy

Most conversations about AI agent evaluation focus on the engineering team. Build a test harness, run a benchmark, ship when green. But if you are the business owner or operations lead signing off on an agent that will handle real customer queries, process real invoices, or route real escalations — the testing question belongs to you too.

This article is about AI agent evaluation as a buyer-side tool: what acceptance criteria to demand before you sign off, and what regression checks should run continuously after go-live to catch the silent failures that emerge over time.

Why “It Worked in the Demo” Is Not a Sign-Off Standard

A demo is a curated run. The agent handles the ten inputs someone already knows it handles well. Production is different: inputs arrive in unexpected formats, users phrase things in ways no one anticipated, connected APIs return edge-case responses, and the underlying language model gets updated by the provider.

Each of those events is a potential regression. Without a formal eval suite, the first signal you get is a complaint — or worse, a downstream error you only discover during an audit.

The gap between “works in demo” and “behaves reliably in production” is exactly where most AI agent projects encounter their first serious problems. Evals close that gap systematically rather than hoping nothing changes.

The Three Layers of a Reliable Eval Suite

A well-structured evaluation programme covers three distinct layers. They are not alternatives — you need all three.

1. Functional correctness: does the agent do what it was designed to do?

This layer tests the core task. For a support triage agent, that means: does it correctly classify ticket priority? Does it route to the right team? Does it handle a missing field without crashing?

Concrete acceptance criteria to demand from your vendor:

A defined test set of at least 50–100 representative inputs (more for high-volume workflows), covering normal cases, edge cases, and known failure modes
A target accuracy rate agreed in the statement of work (SOW) as a number, not “best effort”: for example, ≥92% correct classification on the test set
Documented handling for out-of-scope inputs: what does the agent do when it receives something it was not designed for? Does it fail gracefully or produce a confident wrong answer?

The test set itself should be part of your deliverable. If a vendor cannot show you the test cases, you have no baseline.
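As a rough illustration of how a numeric target turns into a pass/fail check, the sketch below scores an agent against a small labelled test set and compares the result with the 92% example target mentioned above. The names `EvalCase`, `accuracy`, `meets_target`, and `toy_agent` are hypothetical, and the keyword-based agent is only a stand-in for whatever your vendor actually delivers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One labelled input from the agreed test set (hypothetical structure)."""
    ticket_text: str
    expected_priority: str


def accuracy(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the agent's label equals the expected label."""
    if not cases:
        raise ValueError("test set is empty")
    correct = sum(1 for c in cases if agent(c.ticket_text) == c.expected_priority)
    return correct / len(cases)


def meets_target(agent: Callable[[str], str], cases: list[EvalCase],
                 target: float = 0.92) -> bool:
    """True when measured accuracy reaches the contractual target."""
    return accuracy(agent, cases) >= target


def toy_agent(ticket_text: str) -> str:
    """Stand-in for the real agent: naive keyword rules, for illustration only."""
    text = ticket_text.lower()
    if "outage" in text or "down" in text:
        return "high"
    return "low"


cases = [
    EvalCase("Our whole site is down", "high"),
    EvalCase("Outage in the EU region", "high"),
    EvalCase("How do I change my avatar?", "low"),
    EvalCase("Payment failed twice, customers blocked", "high"),  # toy agent misses this
]

score = accuracy(toy_agent, cases)
print(f"accuracy={score:.2f}, meets target: {meets_target(toy_agent, cases)}")
```

With a real deliverable, the four toy cases would be replaced by the full shared test set, and the target would be whatever figure the SOW states.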

2. Tool and integration fidelity: does the agent interact correctly with connected systems?

Agents in production call external tools — CRMs, calendars, databases, payment APIs. Functional correctness in isolation does not guarantee correct behaviour when those integrations are in play.

This layer checks:

Does the agent write data to the right fields, in the right format, under the right conditions?
Does it handle API errors, rate limits, or unexpected response schemas without silently dropping data?
Are there side-effect guardrails — i.e. does the agent refuse to take irreversible actions (delete records, send emails, charge cards) without a human confirmation step where the stakes warrant it?

For complex multi-system workflows, ask your vendor to demonstrate a failure injection test: deliberately return a malformed API response and show you what the agent does. An agent that panics or produces a hallucinated fallback is not production-ready.
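A failure injection test can be sketched in a few lines. The example below assumes a hypothetical agent step, `route_invoice`, that receives its data source as a function, so a test can swap in deliberately broken responses. The field names and the `escalate_to_human` action are illustrative assumptions, not any particular vendor's design.

```python
from typing import Any, Callable

REQUIRED_FIELDS = ("customer_id", "amount")  # hypothetical response schema


def route_invoice(fetch_invoice: Callable[[], Any]) -> dict:
    """Toy agent step: fetch a record, validate it, then route or escalate."""
    try:
        record = fetch_invoice()
    except Exception as exc:  # timeouts, rate limits, connection errors
        return {"action": "escalate_to_human", "reason": f"fetch failed: {exc}"}
    if not isinstance(record, dict):
        return {"action": "escalate_to_human", "reason": "response is not an object"}
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        return {"action": "escalate_to_human", "reason": f"missing fields: {missing}"}
    return {"action": "route_for_approval", "customer_id": record["customer_id"]}


# Injected responses: one healthy, three deliberately broken.
def healthy_response() -> dict:
    return {"customer_id": "C-1", "amount": "12.50"}


def malformed_response() -> dict:
    return {"amount": "12.50"}  # customer_id is missing


def wrong_type_response() -> str:
    return "<html>502 Bad Gateway</html>"


def rate_limited() -> dict:
    raise TimeoutError("429 Too Many Requests")


for fake in (healthy_response, malformed_response, wrong_type_response, rate_limited):
    print(fake.__name__, "->", route_invoice(fake)["action"])
```

The point of the demonstration is the three broken cases: each should end in an explicit, explainable escalation rather than a guessed value or silently dropped data.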

3. Behavioural consistency: does the agent behave predictably across varied phrasing and conditions?

Language models are probabilistic. The same semantic intent expressed ten different ways should produce the same outcome. A support agent that correctly handles “I want to cancel my order” but misroutes “please stop my subscription” has a consistency problem that only shows up at scale.

This layer typically involves:

Paraphrase testing: multiple formulations of the same intent
Adversarial inputs: attempts to manipulate the agent into out-of-scope actions
Persona boundary testing: does the agent stay within its defined role, or can a user coax it into unrelated territory?

This is closely related to the security and prompt injection risk profile of the agent — the two concerns share test infrastructure and should be addressed together.
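Paraphrase testing is straightforward to picture in code. The sketch below sends several formulations of one intent to an agent and reports how often the outcomes agree. `consistency_report` and the keyword-based `toy_agent` are hypothetical names used only for illustration; the toy agent is written to reproduce the cancel-versus-stop misroute described above.

```python
from collections import Counter
from typing import Callable


def consistency_report(agent: Callable[[str], str], paraphrases: list[str]) -> dict:
    """Run every paraphrase through the agent and summarise outcome agreement."""
    if not paraphrases:
        raise ValueError("no paraphrases supplied")
    outcomes = Counter(agent(p) for p in paraphrases)
    _, majority_count = outcomes.most_common(1)[0]
    return {
        "outcomes": dict(outcomes),
        "consistency": majority_count / len(paraphrases),  # share agreeing with majority
        "consistent": len(outcomes) == 1,                   # True only if all agree
    }


def toy_agent(message: str) -> str:
    """Stand-in agent that only recognises the literal word 'cancel'."""
    return "cancellation" if "cancel" in message.lower() else "general_enquiry"


cancel_paraphrases = [
    "I want to cancel my order",
    "Please cancel my subscription",
    "please stop my subscription",
    "Can you cancel this for me?",
]

report = consistency_report(toy_agent, cancel_paraphrases)
print(report)
```

A report with more than one outcome for a single intent is the signal to look for; which outcome is the correct one still has to be checked against the labelled test set.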

What a Minimal Acceptance Checklist Looks Like

Before signing off on a new agent deployment, you should be able to answer yes to each of the following:

A documented test set exists and was shared with us as a project deliverable
Accuracy targets for core tasks are defined and have been met on the test set
Edge case and error handling has been demonstrated, not just described
Tool integrations have been tested with real (or realistic sandbox) connections
The agent’s behaviour on out-of-scope inputs is defined and tested
A baseline performance snapshot has been recorded so regression is detectable

This list is deliberately short. The point is not to run a PhD programme in ML evaluation — it is to establish a defensible baseline that protects you and holds your vendor accountable. A vendor uncomfortable with this list is a vendor to be cautious about.

For a broader view of what to ask when selecting a partner, the AI agent development company guide covers vendor due diligence in more depth.

Regression Checks: Catching Silent Degradation After Go-Live

Passing acceptance tests at launch is necessary but not sufficient. Agents degrade silently. The reasons are structural:

Model updates. The language model powering your agent will be updated by its provider — formal deprecations typically carry advance notice, but capability changes within a running named version and default-alias updates can occur with limited or no per-customer notification. A model update that improves performance on most tasks can regress performance on yours. Without a regression suite running on a fixed cadence, you will not know until users tell you something is wrong.

Data drift. The vocabulary and context of real user requests shift over time. A customer support agent trained and tested on last year’s product catalogue may start misrouting requests after a product line change, even though the underlying model is unchanged.

Integration changes. An API your agent depends on updates its schema. A field name changes. A new required parameter appears. The agent either fails or falls back to an unintended behaviour.

Illustrative scenario: Imagine a document processing agent handling incoming supplier invoices. At launch, it correctly extracts line items and routes for approval at a 94% rate on the test set. Six months later, a key supplier changes their invoice template. Without a weekly regression check against a fixed sample of real invoices, that accuracy could quietly drop to 70% before anyone notices, meaning roughly three in ten invoices land in a manual fallback queue instead of being processed automatically. The monitoring cost of catching this early is trivial compared to the downstream reconciliation effort.

The minimum viable monitoring setup is not complex:

A fixed golden dataset — a curated sample of 20–50 production inputs with known correct outputs, held out of any training or fine-tuning process
A scheduled regression run — weekly or after any infrastructure change, run the agent against the golden dataset and compare outputs to baseline
An alerting threshold — if accuracy on the golden set drops more than X percentage points from baseline, trigger a human review before the issue compounds

This connects directly to the broader question of which KPIs actually prove an agent is working — regression metrics should feed into the same operational dashboard as your business KPIs, not live in a separate engineering silo.
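The golden dataset, scheduled run, and alerting threshold described above fit in one small function. The sketch below is a minimal version under stated assumptions: `golden_accuracy`, `check_regression`, and the template-sensitive `toy_agent` are hypothetical, the 0.90 baseline is invented for the example, and the 5-point threshold is only a placeholder, since the right value depends on your workflow.

```python
from typing import Any, Callable

GoldenSet = list[tuple[Any, str]]  # (input, known correct output) pairs


def golden_accuracy(agent: Callable[[Any], str], golden: GoldenSet) -> float:
    """Share of golden-set items the agent currently gets right."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(1 for item, expected in golden if agent(item) == expected)
    return hits / len(golden)


def check_regression(agent: Callable[[Any], str], golden: GoldenSet,
                     baseline: float, max_drop_points: float) -> dict:
    """Compare current accuracy with the recorded baseline and flag large drops."""
    current = golden_accuracy(agent, golden)
    drop_points = (baseline - current) * 100  # drop in percentage points
    return {
        "baseline": baseline,
        "current": current,
        "drop_points": round(drop_points, 1),
        "alert": drop_points > max_drop_points,
    }


def toy_agent(invoice: dict) -> str:
    """Stand-in agent that only understands the old invoice template."""
    return "auto_route" if invoice["template"] == "v1" else "manual_fallback"


# Ten golden invoices: seven on the old template, three on a changed one.
golden = [({"template": "v1", "total": 100 + i}, "auto_route") for i in range(7)]
golden += [({"template": "v2", "total": 200 + i}, "auto_route") for i in range(3)]

result = check_regression(toy_agent, golden, baseline=0.90, max_drop_points=5.0)
print(result)
if result["alert"]:
    print("Accuracy drop exceeds threshold: trigger human review")
```

In practice this function would be called by whatever scheduler you already use, weekly or after an infrastructure change, with the alert wired into the same channel as your other operational alerts.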

Who Owns Evals — and What to Contractualise

The vendor builds and runs the initial eval suite. The business owns the acceptance criteria and the right to see the results. After go-live, ongoing regression monitoring is a shared responsibility that should be explicit in your contract or service agreement.

Specifically, clarify:

Who runs regression checks, and on what cadence?
Who is notified when a regression threshold is breached?
What is the SLA for investigation and resolution after a detected regression?
Will you receive a summary report, or just an alert?

These are not adversarial questions. A vendor with mature AI agent development practices will have answers ready, because they run these checks for their own quality assurance. The conversation tells you a great deal about operational maturity.

Governance-minded organisations may also want to log eval results as part of a broader AI governance audit trail, which is particularly relevant for regulated sectors or where the EU AI Act applies. Note that most B2B automation agents (support triage, invoice processing, scheduling) fall into the minimal- or limited-risk tiers under the Act, not the high-risk Annex III category that carries the heaviest documentation obligations. For those high-risk obligations, the original August 2026 deadline is likely to be deferred to December 2027 under the Digital Omnibus provisional agreement, though formal adoption was still pending at the time of publication.
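Logging eval results for an audit trail does not need special tooling. As one possible shape, the sketch below appends each regression run to a JSON Lines file with a UTC timestamp. The function name `log_eval_result` and every field name are assumptions made for illustration; nothing here is a format prescribed by any regulation.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_eval_result(log_path: Path, suite: str, accuracy: float,
                    baseline: float, model_version: str) -> dict:
    """Append one eval run to an append-only JSON Lines file and return the entry."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "suite": suite,
        "accuracy": accuracy,
        "baseline": baseline,
        "model_version": model_version,  # whatever identifier your provider exposes
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


# Demonstration in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "eval_audit.jsonl"
    log_eval_result(log_file, "invoice-golden-set", 0.94, 0.94, "model-a")
    log_eval_result(log_file, "invoice-golden-set", 0.70, 0.94, "model-a")
    entries = [json.loads(line)
               for line in log_file.read_text(encoding="utf-8").splitlines()]

print(len(entries), "runs logged; latest accuracy:", entries[-1]["accuracy"])
```

An append-only file is the simplest option; teams with an existing logging or data warehouse setup would more likely write the same fields there.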

Evals Are a Trust Mechanism, Not a Formality

The deeper point is this: an eval suite is not bureaucratic overhead. It is the evidence base that lets you extend trust to an automated system operating at scale in your business. Without it, you are relying on intuition and hoping no edge case surfaces at a bad moment.

Buyers who ask for evals get better agents. The process of defining acceptance criteria forces specificity — about what the agent should do, what it should not do, and how performance will be measured. That specificity tends to surface misaligned expectations early, when they are cheap to fix.

If you are at the stage of evaluating a vendor proposal, defining acceptance criteria for a new agent, or trying to establish monitoring for an agent already in production — a focused 30-minute call with our team can help you identify the specific checks and criteria that fit your use case. Get in touch with Orange ITS and we will structure the conversation around your situation.

Frequently asked questions

What acceptance criteria should I demand before signing off an AI agent?

A documented test set of 50 to 100+ representative inputs delivered to you, a specific accuracy target written into the SOW rather than best effort, demonstrated edge-case and error handling, tested tool integrations, defined out-of-scope behaviour, and a recorded baseline snapshot so regression is detectable.

Why do AI agents degrade after launch even when nothing changed?

Three structural reasons: the underlying model gets updated by its provider (sometimes with limited notice), real user vocabulary and context drift over time, and connected APIs change schemas. Without scheduled regression checks, the first signal is usually a complaint.

What is a golden dataset and why do I need one?

It is a curated sample of 20 to 50 production inputs with known correct outputs, held out of any training process. Running the agent against it weekly or after infrastructure changes, with an alerting threshold on accuracy drops, is the minimum viable regression monitoring setup.

What is a failure injection test for an AI agent?

Before sign-off, the vendor deliberately returns a malformed API response and shows you what the agent does. An agent that panics or produces a hallucinated fallback is not production-ready; graceful handling of integration failures is a core acceptance criterion.

Who is responsible for running AI agent evals after go-live?

It is a shared responsibility that should be explicit in the contract: who runs regression checks and on what cadence, who is notified on a threshold breach, what the investigation SLA is, and whether you receive reports. A vendor with mature practices will have ready answers.

