Most conversations about AI agent evaluation focus on the engineering team. Build a test harness, run a benchmark, ship when green. But if you are the business owner or operations lead signing off on an agent that will handle real customer queries, process real invoices, or route real escalations — the testing question belongs to you too.
This article is about AI agent evaluation as a buyer-side tool: what acceptance criteria to demand before you sign off, and what regression checks should run continuously after go-live to catch the silent failures that emerge over time.
Why “It Worked in the Demo” Is Not a Sign-Off Standard
A demo is a curated run. The agent handles the ten inputs someone already knows it handles well. Production is different: inputs arrive in unexpected formats, users phrase things in ways no one anticipated, connected APIs return edge-case responses, and the underlying language model gets updated by the provider.
Each of those events is a potential regression. Without a formal eval suite, the first signal you get is a complaint — or worse, a downstream error you only discover during an audit.
The gap between “works in demo” and “behaves reliably in production” is exactly where most AI agent projects encounter their first serious problems. Evals close that gap systematically rather than hoping nothing changes.
The Three Layers of a Reliable Eval Suite
A well-structured evaluation programme covers three distinct layers. They are not alternatives — you need all three.
1. Functional correctness: does the agent do what it was designed to do?
This layer tests the core task. For a support triage agent, that means: does it correctly classify ticket priority? Does it route to the right team? Does it handle a missing field without crashing?
Concrete acceptance criteria to demand from your vendor:
- A defined test set of at least 50–100 representative inputs (more for high-volume workflows), covering normal cases, edge cases, and known failure modes
- A target accuracy rate agreed in the statement of work (SOW): not “best effort” but a number, e.g. ≥92% correct classification on the test set
- Documented handling for out-of-scope inputs: what does the agent do when it receives something it was not designed for? Does it fail gracefully or produce a confident wrong answer?
The test set itself should be part of your deliverable. If a vendor cannot show you the test cases, you have no baseline.
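To make the accuracy criterion concrete, here is a minimal sketch of how a shared test set can be scored against an agreed target. The names `EvalCase`, `accuracy`, `meets_target` and the keyword-based `toy_classifier` are illustrative assumptions standing in for your vendor's real agent and harness; the 0.92 default simply mirrors the example figure above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    text: str      # input as a customer would send it
    expected: str  # label agreed with the business


def accuracy(classify: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the agent's label matches the expected one."""
    if not cases:
        raise ValueError("test set is empty")
    correct = sum(1 for case in cases if classify(case.text) == case.expected)
    return correct / len(cases)


def meets_target(classify: Callable[[str], str], cases: list[EvalCase],
                 target: float = 0.92) -> bool:
    """True when measured accuracy reaches the contractually agreed target."""
    return accuracy(classify, cases) >= target


# Hypothetical stand-in for the real agent: a naive keyword rule.
def toy_classifier(text: str) -> str:
    return "high" if "outage" in text.lower() else "low"


CASES = [
    EvalCase("Production outage since 9am", "high"),
    EvalCase("How do I change my avatar?", "low"),
    EvalCase("OUTAGE on the billing page", "high"),
    EvalCase("Site is down for all users", "high"),  # the toy rule misses this one
]

score = accuracy(toy_classifier, CASES)
print(f"accuracy={score:.2f}, target met: {meets_target(toy_classifier, CASES)}")
```

The useful property for a buyer is that the cases and the target live in one reviewable artefact, so the same run can be repeated later as a baseline.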
2. Tool and integration fidelity: does the agent interact correctly with connected systems?
Agents in production call external tools — CRMs, calendars, databases, payment APIs. Functional correctness in isolation does not guarantee correct behaviour when those integrations are in play.
This layer checks:
- Does the agent write data to the right fields, in the right format, under the right conditions?
- Does it handle API errors, rate limits, or unexpected response schemas without silently dropping data?
- Are there side-effect guardrails — i.e. does the agent refuse to take irreversible actions (delete records, send emails, charge cards) without a human confirmation step where the stakes warrant it?
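A side-effect guardrail of the kind described in the last check can be sketched as a simple gate in front of tool calls. The action names, the `ActionRequest` type and the `gate` function are hypothetical; a real deployment would wire the confirmation flag to an actual approval step.

```python
from dataclasses import dataclass

# Assumed list of actions that cannot be undone once executed.
IRREVERSIBLE_ACTIONS = frozenset({"delete_record", "send_email", "charge_card"})


@dataclass(frozen=True)
class ActionRequest:
    name: str
    confirmed_by_human: bool = False


def gate(request: ActionRequest) -> str:
    """Hold irreversible actions until a human has confirmed them."""
    if request.name in IRREVERSIBLE_ACTIONS and not request.confirmed_by_human:
        return "held_for_confirmation"
    return "allowed"


decisions = [
    gate(ActionRequest("update_note")),
    gate(ActionRequest("charge_card")),
    gate(ActionRequest("charge_card", confirmed_by_human=True)),
]
print(decisions)
```

An acceptance test for this layer would assert that no irreversible action ever reaches the connected system in the "held" state.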
For complex multi-system workflows, ask your vendor to demonstrate a failure injection test: deliberately return a malformed API response and show you what the agent does. An agent that panics or produces a hallucinated fallback is not production-ready.
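A failure injection test can be as small as the sketch below. It assumes a hypothetical agent step, `route_ticket`, that reads a ticket from a connected system through an injected `fetch` function, so the test can swap in malformed or failing responses. The field names and the escalation behaviour are assumptions for illustration, not a description of any particular product.

```python
from typing import Any, Callable

REQUIRED_FIELDS = {"ticket_id", "status"}  # assumed response schema


def route_ticket(fetch: Callable[[], Any]) -> dict:
    """Hypothetical agent step: read a ticket, then decide the next action."""
    try:
        payload = fetch()
    except Exception as exc:  # timeouts, rate limits, connection errors
        return {"action": "escalate_to_human", "reason": f"api_error: {exc}"}
    if not isinstance(payload, dict) or not REQUIRED_FIELDS.issubset(payload):
        # Refuse to guess missing values; hand over instead.
        return {"action": "escalate_to_human", "reason": "malformed_response"}
    return {"action": "route", "ticket_id": payload["ticket_id"]}


# Injected responses: one healthy, three deliberately broken.
def healthy():
    return {"ticket_id": "T-1", "status": "open"}


def missing_field():
    return {"status": "open"}


def wrong_type():
    return "<html>502 Bad Gateway</html>"


def rate_limited():
    raise TimeoutError("rate limit exceeded")


results = {
    fake.__name__: route_ticket(fake)["action"]
    for fake in (healthy, missing_field, wrong_type, rate_limited)
}
print(results)
```

What you want to see in the vendor's demonstration is the same pattern: every broken response ends in an explicit, logged handover rather than an invented value.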
3. Behavioural consistency: does the agent behave predictably across varied phrasing and conditions?
Language models are probabilistic. The same semantic intent expressed ten different ways should produce the same outcome. A support agent that correctly handles “I want to cancel my order” but misroutes “please stop my subscription” has a consistency problem that only shows up at scale.
This layer typically involves:
- Paraphrase testing: multiple formulations of the same intent
- Adversarial inputs: attempts to manipulate the agent into out-of-scope actions
- Persona boundary testing: does the agent stay within its defined role, or can a user coax it into unrelated territory?
This is closely related to the security and prompt injection risk profile of the agent — the two concerns share test infrastructure and should be addressed together.
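Paraphrase testing can be sketched as follows: several phrasings of one intent are sent through the agent and every phrasing that lands somewhere other than the expected outcome is reported. The `toy_router` here is a deliberately naive keyword rule standing in for the real agent, and the queue names are assumptions.

```python
from typing import Callable


def inconsistent_phrasings(route: Callable[[str], str], expected: str,
                           paraphrases: list[str]) -> list[str]:
    """Return the phrasings whose outcome differs from the expected one."""
    return [text for text in paraphrases if route(text) != expected]


# Hypothetical stand-in for the agent: routes on a single keyword.
def toy_router(text: str) -> str:
    return "cancellations" if "cancel" in text.lower() else "general"


CANCEL_INTENT = [
    "I want to cancel my order",
    "Please cancel my subscription",
    "please stop my subscription",
]

misses = inconsistent_phrasings(toy_router, "cancellations", CANCEL_INTENT)
print(misses)
```

The same helper works for adversarial and persona-boundary inputs if the expected outcome is set to the agent's defined refusal or handover behaviour.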
What a Minimal Acceptance Checklist Looks Like
Before signing off on a new agent deployment, you should be able to answer yes to each of the following:
- A documented test set exists and was shared with us as a project deliverable
- Accuracy targets for core tasks are defined and have been met on the test set
- Edge case and error handling has been demonstrated, not just described
- Tool integrations have been tested with real (or realistic sandbox) connections
- The agent’s behaviour on out-of-scope inputs is defined and tested
- A baseline performance snapshot has been recorded so regression is detectable
This list is deliberately short. The point is not to run a PhD programme in ML evaluation — it is to establish a defensible baseline that protects you and holds your vendor accountable. A vendor uncomfortable with this list is a vendor to be cautious about.
For a broader view of what to ask when selecting a partner, the AI agent development company guide covers vendor due diligence in more depth.
Regression Checks: Catching Silent Degradation After Go-Live
Passing acceptance tests at launch is necessary but not sufficient. Agents degrade silently. The reasons are structural:
Model updates. The language model powering your agent will be updated by its provider — formal deprecations typically carry advance notice, but capability changes within a running named version and default-alias updates can occur with limited or no per-customer notification. A model update that improves performance on most tasks can regress performance on yours. Without a regression suite running on a fixed cadence, you will not know until users tell you something is wrong.
Data drift. The vocabulary and context of real user requests shift over time. A customer support agent trained and tested on last year’s product catalogue may start misrouting requests after a product line change, even though the underlying model is unchanged.
Integration changes. An API your agent depends on updates its schema. A field name changes. A new required parameter appears. The agent either fails or falls back to an unintended behaviour.
Illustrative scenario: Imagine a document processing agent handling incoming supplier invoices. At launch, it correctly extracts line items and routes for approval at a 94% rate on the test set. Six months later, a key supplier changes their invoice template. Without a weekly regression check against a fixed sample of real invoices, that accuracy could quietly drop to 70% before anyone notices, meaning roughly three in ten invoices would land in a manual fallback queue that was supposed to be automated. The monitoring cost of catching this early is trivial compared to the downstream reconciliation effort.
The minimum viable monitoring setup is not complex:
- A fixed golden dataset: a curated sample of 20–50 production inputs with known correct outputs, held out of any training or fine-tuning process
- A scheduled regression run: weekly and after any infrastructure change, run the agent against the golden dataset and compare outputs to the recorded baseline
- An alerting threshold: if accuracy on the golden set drops more than X percentage points from baseline, trigger a human review before the issue compounds
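The three pieces above fit together in a few lines. This is a minimal sketch under stated assumptions: the golden set is a list of input and known-correct output pairs, `toy_agent` is a hypothetical stand-in for the deployed agent, and the 5-point threshold is a placeholder for whatever value of X you agree with your vendor.

```python
from typing import Callable

GoldenSet = list[tuple[str, str]]  # (input, known-correct output)


def golden_accuracy(agent: Callable[[str], str], golden: GoldenSet) -> float:
    """Share of golden inputs for which the agent returns the known output."""
    if not golden:
        raise ValueError("golden dataset is empty")
    return sum(agent(item) == expected for item, expected in golden) / len(golden)


def regression_report(baseline: float, current: float,
                      max_drop_points: float) -> dict:
    """Compare current accuracy to baseline, measured in percentage points."""
    drop_points = round((baseline - current) * 100, 1)
    return {"drop_points": drop_points, "alert": drop_points > max_drop_points}


# Hypothetical agent that fails on a supplier's changed invoice template.
def toy_agent(invoice: str) -> str:
    return "manual_review" if "new-template" in invoice else "approve"


GOLDEN = [
    ("invoice-001", "approve"),
    ("invoice-002", "approve"),
    ("invoice-003-new-template", "approve"),
    ("invoice-004", "approve"),
    ("invoice-005", "approve"),
]

MAX_DROP_POINTS = 5.0  # placeholder for the agreed X

current = golden_accuracy(toy_agent, GOLDEN)
report = regression_report(baseline=1.0, current=current,
                           max_drop_points=MAX_DROP_POINTS)
print(report)
```

Applied to the invoice scenario above, `regression_report(0.94, 0.70, MAX_DROP_POINTS)` reports a 24-point drop and raises the alert, which is the signal a weekly run is meant to surface.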
This connects directly to the broader question of which KPIs actually prove an agent is working — regression metrics should feed into the same operational dashboard as your business KPIs, not live in a separate engineering silo.
Who Owns Evals — and What to Contractualise
The vendor builds and runs the initial eval suite. The business owns the acceptance criteria and the right to see the results. After go-live, ongoing regression monitoring is a shared responsibility that should be explicit in your contract or service agreement.
Specifically, clarify:
- Who runs regression checks, and on what cadence?
- Who is notified when a regression threshold is breached?
- What is the SLA for investigation and resolution after a detected regression?
- Will you receive a summary report, or just an alert?
These are not adversarial questions. A vendor with mature AI agent development practices will have answers ready, because they run these checks for their own quality assurance. The conversation tells you a great deal about operational maturity.
Governance-minded organisations may also want to log eval results as part of a broader AI governance audit trail, which is particularly relevant for regulated sectors or where the EU AI Act applies. Note that most B2B automation agents (support triage, invoice processing, scheduling) fall into the minimal- or limited-risk tiers of the Act, not the high-risk Annex III category that carries the heaviest documentation obligations. For those high-risk obligations, the original August 2026 deadline is likely to be deferred to December 2027 under the Digital Omnibus provisional agreement, though formal adoption was still pending at the time of publication.
Evals Are a Trust Mechanism, Not a Formality
The deeper point is this: an eval suite is not bureaucratic overhead. It is the evidence base that lets you extend trust to an automated system operating at scale in your business. Without it, you are relying on intuition and hoping no edge case surfaces at a bad moment.
Buyers who ask for evals get better agents. The process of defining acceptance criteria forces specificity — about what the agent should do, what it should not do, and how performance will be measured. That specificity tends to surface misaligned expectations early, when they are cheap to fix.
If you are at the stage of evaluating a vendor proposal, defining acceptance criteria for a new agent, or trying to establish monitoring for an agent already in production — a focused 30-minute call with our team can help you identify the specific checks and criteria that fit your use case. Get in touch with Orange ITS and we will structure the conversation around your situation.