A Production-Readiness Test for AI Agent Frameworks

Most AI agent projects don’t fail because the prototype was bad. They fail because the team picked a framework that worked fine in a Jupyter notebook and discovered its gaps three months into a production build — when the cost of switching is high and the pressure to ship is higher.

LangGraph, CrewAI, AutoGen (now in maintenance mode — Microsoft redirects new projects to Microsoft Agent Framework), Mastra, VoltAgent, Smolagents — each has a legitimate story, a growing community, and demos that look compelling. The problem is that demos are not the right unit of evaluation. The right unit is: will this hold up under real workload, with real failure modes, inside an organisation that needs to audit, govern and maintain it?

Below are the eight criteria we apply on every client engagement before committing to any agent framework. Work through them and you can disqualify most weak candidates in an afternoon.

Why Framework Production Readiness Is Its Own Discipline

A language model calling a function is not an agent. An agent is an automated system that perceives state, chooses actions, calls tools, and loops — sometimes for minutes, sometimes across multiple hops, sometimes with money or data on the line. The gap between “it works in a demo” and “safe to run unsupervised at scale” is larger than most teams expect.

Production readiness covers concerns tutorials skip: what happens when an LLM call times out mid-loop? Who audits the agent’s decisions? Where does conversation state live if your container crashes? These aren’t edge cases — they’re the normal operating conditions of any system running at volume.

The Eight-Criterion Checklist

1. Observability: Can You See What the Agent Actually Did?

This is the first thing to probe and the most commonly underbuilt.

A production agent needs structured, exportable traces: step-level records showing which tool was called with which arguments, what the LLM returned, how long each step took, and where the loop branched. Without this, debugging a failure is archaeology.

Questions to ask:

Does the framework emit OpenTelemetry-compatible traces natively, or do you bolt them on?
Can you replay a specific run for post-mortem analysis?
Are token counts and latency tracked at the step level?

Frameworks that treat logging as an afterthought force you to instrument everything yourself, which typically costs about a sprint of engineering time on any serious deployment.
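To make “step-level records” concrete, here is a minimal sketch of the data a trace needs to carry. It uses only the Python standard library and is a stand-in for a real tracing backend, not an OpenTelemetry integration. The names StepTrace, RunTracer and lookup_order are hypothetical and exist only for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Callable


@dataclass
class StepTrace:
    """One step of an agent run (hypothetical schema, for illustration only)."""
    run_id: str
    step: int
    tool: str
    arguments: dict[str, Any]
    output: str
    latency_ms: float
    prompt_tokens: int = 0       # Left at 0 here; a real tracer would fill these in.
    completion_tokens: int = 0


@dataclass
class RunTracer:
    run_id: str
    steps: list[StepTrace] = field(default_factory=list)

    def record(self, tool: str, fn: Callable[..., Any], /, **arguments: Any) -> Any:
        """Call a tool and append a step-level record of what happened."""
        start = time.perf_counter()
        result = fn(**arguments)
        latency_ms = (time.perf_counter() - start) * 1000
        self.steps.append(StepTrace(
            run_id=self.run_id,
            step=len(self.steps) + 1,
            tool=tool,
            arguments=arguments,
            output=str(result),
            latency_ms=latency_ms,
        ))
        return result

    def export_jsonl(self) -> str:
        """Serialise the run as JSON Lines so it can be stored and inspected later."""
        return "\n".join(json.dumps(asdict(step)) for step in self.steps)


def lookup_order(order_id: str) -> str:
    """Hypothetical tool used only to exercise the tracer."""
    return f"order {order_id}: shipped"


tracer = RunTracer(run_id="run-001")
tracer.record("lookup_order", lookup_order, order_id="A17")
print(tracer.export_jsonl())
```

If a framework cannot give you at least this much per step without custom code, score it accordingly. The sketch assumes tool arguments are JSON-serialisable.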

2. Failure Handling: What Breaks Gracefully and What Explodes

LLM calls fail. APIs return rate-limit errors. Tools throw exceptions. The question is not whether failures happen but whether the framework has a principled response to them.

Look for:

Retry policies with configurable backoff, distinguishing retryable from terminal failures
Partial-run recovery — can an interrupted workflow resume from its last good state, or restart from zero?
Tool error propagation — does the agent get a useful signal, or silently loop into a bad state?

A framework with no retry logic and no state checkpointing will cause production incidents at 2am.
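As a rough illustration of the first item, the sketch below retries transient failures with exponential backoff and jitter while letting terminal failures surface immediately. The exception classes, the call_with_retry helper and its default values are assumptions made for this example, not any framework’s API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Transient failure such as a timeout or rate limit (hypothetical class)."""


class TerminalError(Exception):
    """Failure that retrying cannot fix, such as bad credentials (hypothetical class)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on RetryableError with exponential backoff plus jitter.

    Any other exception, including TerminalError, is not caught and
    propagates to the caller on the first attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts:
                raise  # Retry budget exhausted: surface the error.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the policy can be tested without waiting. When you evaluate a framework, look for where the equivalent of these knobs lives and whether tool errors are classified at all.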

3. State and Memory Persistence: Who Owns the Context?

Short demo agents are stateless. Real agents are not.

For anything that spans multiple turns, user sessions, or long-running tasks, you need to understand exactly where agent state lives:

Is it in-process memory (gone on restart), in a database, or in an external store?
Who manages schema migrations when your agent’s state shape changes?
Can state be inspected and corrected manually if an agent gets stuck?

Some frameworks default to in-memory state with no persistence layer. That is fine for prototyping. For a customer-facing agent handling support tickets or sales enquiries, it means losing context on every deploy — or building a persistence layer from scratch.
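If you do end up building that layer yourself, its smallest useful form looks something like the sketch below: state keyed by session, stored outside the process, with a schema version so that a change in state shape is detected rather than silently misread. The CheckpointStore class, the table layout and the use of SQLite are assumptions chosen for illustration.

```python
import json
import sqlite3
from typing import Any, Optional

SCHEMA_VERSION = 1  # Bump whenever the shape of the state dict changes.


class CheckpointStore:
    """Persist agent state per session in SQLite (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " session_id TEXT PRIMARY KEY,"
            " schema_version INTEGER NOT NULL,"
            " state TEXT NOT NULL)"
        )

    def save(self, session_id: str, state: dict[str, Any]) -> None:
        with self.conn:  # Commits on success, rolls back on error.
            self.conn.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (session_id, SCHEMA_VERSION, json.dumps(state)),
            )

    def load(self, session_id: str) -> Optional[dict[str, Any]]:
        row = self.conn.execute(
            "SELECT schema_version, state FROM checkpoints WHERE session_id = ?",
            (session_id,),
        ).fetchone()
        if row is None:
            return None
        version, raw = row
        if version != SCHEMA_VERSION:
            # Refuse to guess: an old state shape needs an explicit migration.
            raise RuntimeError(f"state for {session_id} needs migration from v{version}")
        return json.loads(raw)


# ":memory:" keeps the example side-effect free; use a file path for real persistence.
store = CheckpointStore(":memory:")
store.save("session-42", {"step": 3, "history": ["Where is my order?"]})
print(store.load("session-42"))
```

Because the state is stored as plain JSON in a table you control, it can also be inspected and corrected by hand when an agent gets stuck.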

Related: for multi-agent architectures, shared state coordination becomes its own problem. Check whether the framework has a documented pattern for this, or leaves it entirely to you. See AI Agent Orchestration: Making Agents Work as a System for a deeper treatment.

4. Evaluation Tooling: How Do You Know It’s Getting Better?

This is the criterion that separates teams who ship reliable agents from teams who “monitor it manually.”

Production agents need regression testing: the ability to run a defined suite of inputs and assert that the outputs meet a quality bar. Such a suite is usually called an eval suite, or simply “evals”, and it is non-trivial to build from scratch.

Ask:

Does the framework ship eval tooling, or does it assume you’ll integrate an external tool?
Can you run evals locally before a deploy?
Is there a dataset format for capturing good/bad examples from production to feed back into testing?

Without eval infrastructure, every framework update or prompt change becomes a roll of the dice. See Testing AI Agents: How Evals Keep Automation Trustworthy for a deeper treatment.
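For a sense of scale, the simplest possible regression suite is sketched below: a list of cases, a substring check per case, and a pass rate you can gate a deploy on. EvalCase, run_evals and stub_agent are hypothetical names, and substring matching is a deliberately crude stand-in for whatever quality bar you actually need.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    prompt: str
    must_contain: list[str]  # Substrings the answer has to include to pass.


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and record pass/fail per case name."""
    results: dict[str, bool] = {}
    for case in cases:
        answer = agent(case.prompt).lower()
        results[case.name] = all(s.lower() in answer for s in case.must_contain)
    return results


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def stub_agent(prompt: str) -> str:
    """Hypothetical agent that always gives the same answer."""
    return "Refunds are processed within 5 business days."


cases = [
    EvalCase("refund_window", "How long do refunds take?", ["5 business days"]),
    EvalCase("escalation", "I want to talk to a human.", ["transfer"]),
]
results = run_evals(stub_agent, cases)
print(results, pass_rate(results))
```

In practice the cases would come from captured production examples, and the check would be richer than a substring match. The point is that the suite runs locally and produces a number you can compare before and after a change.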

5. Security Posture: Tool Execution and Prompt Injection

Agents execute code, call APIs, and write to databases. The security surface is materially different from a chatbot.

Critical things to verify:

Tool permissioning — can you restrict which tools an agent can call based on context or role? Or does every tool have the same access level?
Input sanitisation — is there any protection against prompt injection attacks, where hostile content in a retrieved document hijacks agent behaviour?
Secrets management — are API keys and credentials handled at the framework level, or is each developer managing them ad hoc?

A framework with no concept of tool-level permissions is not suitable for any workflow where an agent touches customer data or external systems with write access.
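Tool-level permissions do not need to be elaborate to be useful. The sketch below shows a deny-by-default registry in which each tool declares the roles allowed to call it. ToolRegistry, ToolPermissionError and the role names are assumptions for this example; a real framework may express the same idea quite differently.

```python
from typing import Any, Callable, Iterable


class ToolPermissionError(Exception):
    """Raised when a role calls a tool it has not been granted (hypothetical class)."""


class ToolRegistry:
    """Deny-by-default tool registry keyed by role (hypothetical helper)."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[Callable[..., Any], frozenset[str]]] = {}

    def register(self, name: str, fn: Callable[..., Any], allowed_roles: Iterable[str]) -> None:
        self._tools[name] = (fn, frozenset(allowed_roles))

    def call(self, role: str, name: str, /, **arguments: Any) -> Any:
        if name not in self._tools:
            raise ToolPermissionError(f"unknown tool: {name}")
        fn, allowed = self._tools[name]
        if role not in allowed:
            raise ToolPermissionError(f"role {role!r} may not call {name!r}")
        return fn(**arguments)


registry = ToolRegistry()
registry.register("read_ticket", lambda ticket_id: f"ticket {ticket_id}", {"support", "admin"})
registry.register("issue_refund", lambda amount: f"refunded {amount}", {"admin"})

print(registry.call("support", "read_ticket", ticket_id=7))
```

The check happens in code the model cannot influence, which is what makes it a control rather than a suggestion. A prompt instructing the agent not to issue refunds is not an equivalent.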

6. Governance Hooks: Human-in-the-Loop When It Matters

Not every decision should be autonomous. Your governance team — and eventually regulators — will want documented points where a human could intervene or review.

Look for:

Interrupt / pause mechanics — can the agent surface a decision to a human operator before taking an action above a defined risk threshold?
Approval workflows — is there a built-in pattern for human sign-off on specific tool calls (e.g., sending an email, issuing a refund)?
Audit log — is there an immutable record of what the agent decided and why, at a granularity sufficient for compliance review?

Governance is increasingly a contractual requirement for enterprise clients and a live regulatory obligation under the EU AI Act, which has been in force since August 2024 with multiple tranches already binding. See AI Agent Governance: A Practical Playbook for SMEs for the broader governance picture.
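The interrupt and audit requirements can be pictured as a single gate that every tool call passes through. The sketch below is one possible shape under stated assumptions: ApprovalGate, ActionRejected, the numeric risk score and the approver callback are all hypothetical, and the in-memory list is append-only by convention rather than truly immutable.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class ActionRejected(Exception):
    """Raised when the human reviewer declines an action (hypothetical class)."""


@dataclass
class ApprovalGate:
    risk_threshold: float
    approver: Callable[[str, dict[str, Any]], bool]  # Stands in for a human decision.
    audit_log: list[str] = field(default_factory=list)

    def execute(self, action: str, risk: float, fn: Callable[..., Any], /, **arguments: Any) -> Any:
        """Run fn, pausing for approval first when risk meets the threshold."""
        needs_review = risk >= self.risk_threshold
        approved = self.approver(action, arguments) if needs_review else True
        # Log the decision before acting, so rejected actions are recorded too.
        self.audit_log.append(json.dumps({
            "timestamp": time.time(),
            "action": action,
            "arguments": arguments,
            "risk": risk,
            "needs_review": needs_review,
            "approved": approved,
        }))
        if not approved:
            raise ActionRejected(f"{action} was declined by the reviewer")
        return fn(**arguments)


gate = ApprovalGate(risk_threshold=0.5, approver=lambda action, arguments: False)
print(gate.execute("lookup_order", 0.1, lambda order_id: f"order {order_id}", order_id="A17"))
```

In a real deployment the approver would block on a ticket, a chat message or a review UI, and the log would go to storage the agent cannot rewrite. What matters for evaluation is whether the framework gives you a documented place to put this gate.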

7. Community Health and Maintenance Trajectory

OSS frameworks are only as durable as the communities behind them. Before committing, check:

Commit frequency — is the repo actively maintained or drifting?
Issue response time — how long before a maintainer acknowledges a bug report?
Breaking change policy — does the project document API stability and migration paths?
Backer dependency — is this a single-company project where one corporate pivot ends support?

A framework that was trending on GitHub six months ago with falling contribution velocity is a risk worth pricing in before you spend months of engineering time on it.

8. Exit Cost: What Does Migration Look Like?

This is the question almost nobody asks during selection and everyone asks eighteen months later.

If you need to move off this framework — because it stopped being maintained, because your requirements outgrew it, or because a better option emerged — what does that cost?

Consider:

How much of your business logic is in framework-specific constructs versus portable Python/TypeScript?
Are your tool definitions reusable outside the framework?
Is your agent state schema in a format you own, or in an opaque format that only the framework can read?

Frameworks with high abstraction and opaque internals tend to have high exit costs. That’s not automatically disqualifying, but it should be priced in. See AI Agent Platform Lock-In: The Risks Nobody Prices In for a detailed treatment.
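One way to keep exit cost down is to write business logic as plain functions and generate the framework-facing description from them, so only a thin adapter has to change on migration. The sketch below illustrates the idea; calculate_refund, its refund rule and describe_tool are invented for this example, and the output format is a neutral dictionary rather than any specific framework’s tool schema.

```python
import inspect
from typing import Any, Callable


def calculate_refund(order_total: float, days_since_purchase: int) -> float:
    """Return the refund amount for an order (hypothetical policy, for illustration)."""
    # Plain Python with no framework imports, so it survives a migration unchanged.
    if days_since_purchase > 30:
        return 0.0
    return round(order_total * 0.9, 2)


def describe_tool(fn: Callable[..., Any]) -> dict[str, Any]:
    """Build a framework-neutral description that a thin adapter can translate."""
    signature = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            name: parameter.annotation.__name__
            for name, parameter in signature.parameters.items()
        },
    }


print(describe_tool(calculate_refund))
```

The sketch assumes every parameter carries a plain class annotation such as float or int. The test to apply to a candidate framework is how much code sits on the wrong side of that adapter line.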

How to Score a Framework Quickly

Score each criterion on a three-point scale, from 0 to 2:

Score	Meaning
2	Native, documented, works out of the box
1	Possible with integration or custom code
0	Absent — you’re building it yourself

A total below 10 out of 16 is a yellow flag. Below 8 is a hard pass for anything customer-facing or compliance-sensitive. The specific zeros matter more than the total: a zero on security posture or governance hooks is a blocker regardless of the overall score.
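The scoring rule is simple enough to write down as a function, which also makes the thresholds explicit. The sketch below treats security posture and governance hooks as blocking criteria, as the FAQ at the end of this article does. The verdict labels, the assess function and the example scores are assumptions for illustration.

```python
CRITERIA = (
    "observability",
    "failure_handling",
    "state_persistence",
    "evaluation_tooling",
    "security_posture",
    "governance_hooks",
    "community_health",
    "exit_cost",
)
# A zero on any of these disqualifies the framework regardless of the total.
BLOCKERS = ("security_posture", "governance_hooks")


def assess(scores: dict[str, int]) -> tuple[int, str]:
    """Return (total out of 16, verdict) for one framework's scores."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    if any(scores[c] not in (0, 1, 2) for c in CRITERIA):
        raise ValueError("each score must be 0, 1 or 2")

    total = sum(scores[c] for c in CRITERIA)
    if any(scores[c] == 0 for c in BLOCKERS):
        return total, "blocker"
    if total < 8:
        return total, "hard pass"
    if total < 10:
        return total, "yellow flag"
    return total, "candidate"


example = {
    "observability": 2,
    "failure_handling": 2,
    "state_persistence": 2,
    "evaluation_tooling": 1,
    "security_posture": 2,
    "governance_hooks": 2,
    "community_health": 1,
    "exit_cost": 1,
}
print(assess(example))  # (13, 'candidate')
```

Checking the blockers before the total reflects the point above: a framework can score well overall and still be unusable.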

What a Framework Vetting Session Actually Looks Like

A typical session runs like this. You score the eight criteria against your shortlist in one structured sitting. Two frameworks are eliminated quickly because they lack interrupt/approval mechanics. One scores well on governance but has a thin maintenance history, so it is flagged as a dependency risk. The remaining candidate scores 13/16 and has a clear migration path. That’s the one you build on.

The criteria are the same regardless of which frameworks land on your shortlist. For broader context on what typically makes the cut, see Choosing an Open-Source AI Agent Framework: A CTO’s Shortlist and CrewAI vs LangGraph: Choosing the Right Agent Framework.

When to Get External Eyes

If your team is about to commit budget to an agent framework without a structured evaluation, that’s the highest-leverage moment to bring in someone who has done this before. Framework selection affects your security posture, compliance documentation, ops overhead, and future migration cost — not just your sprint velocity.

At Orange ITS, this evaluation runs as part of every AI agent development engagement. We’ve seen frameworks that look excellent in demos and collapse under real workloads, and frameworks that are rough to start but durable at scale. That pattern-matching is what a good evaluation partner brings.

Ready to pressure-test your framework shortlist before the build starts? Book a 30-minute call with Orange ITS — we’ll score your candidates against these criteria and tell you honestly which ones we’d build on.

Frequently asked questions

How do I know if an AI agent framework is production-ready?

Score it against eight criteria: observability, failure handling, state persistence, evaluation tooling, security posture, governance hooks, community health, and exit cost. Rate each 0 to 2; a total below 10 out of 16 is a yellow flag, and below 8 is a hard pass for customer-facing or compliance-sensitive work.

Why do AI agent projects fail after a successful prototype?

Most fail because the chosen framework worked fine in a notebook but broke down under real production workloads, where failures like LLM timeouts, lost state, and missing audit trails are normal operating conditions. The gap between a demo and unsupervised operation at scale is larger than most teams expect.

Which framework evaluation criteria are hard blockers?

A zero score on security posture or governance hooks disqualifies a framework regardless of the overall total. Without tool-level permissions, prompt injection protection, or human-in-the-loop approval mechanics, a framework is unsuitable for workflows touching customer data or write access.

What is exit cost when choosing an AI agent framework?

Exit cost is what migrating off the framework would take if it stops being maintained or your needs outgrow it. Frameworks with high abstraction and opaque internals, framework-specific business logic, and proprietary state formats carry high exit costs that are rarely priced in at selection time.

Is AutoGen still a good choice for new agent projects?

AutoGen is now in maintenance mode, and Microsoft redirects new projects to the Microsoft Agent Framework. Maintenance trajectory and backer dependency are exactly the kind of risks the production-readiness checklist is designed to surface before you commit.