An agent evaluation framework turns vague confidence into repeatable evidence. It answers a practical question: before this agent touches real tools, customers, code, money, or internal data, what proof do we have that it behaves correctly? Demos are not enough because they usually show happy paths. Production agents face ambiguous user requests, partial tool failures, prompt injection, stale memory, cost limits, and tasks that should be refused.
A good framework evaluates the full agent workflow, not only the language model. It tests prompts, model routing, retrieval, tool selection, tool inputs, memory reads, memory writes, approval gates, final output, cost, latency, and recovery behavior. The evaluation should run before release, after prompt changes, after model changes, after tool-schema changes, and after serious incidents.
This guide connects evaluation to the broader Agent Security cluster: /guides/agent-security-guide/ for threat modeling, /guides/agent-monitoring/ for traces and production metrics, /guides/agent-cost-management/ for budget tests, /guides/agent-memory-systems/ for durable state risk, /agents/ for agent discovery, and /tools/ai-cost-calculator/ for estimating model spend before scale.
Key takeaways
- Evaluate the entire workflow: prompt, retrieval, tools, memory, approvals, output, latency, and cost.
- Include adversarial and edge-case tests, not only happy-path examples.
- Use production monitoring failures to continuously expand the evaluation set.
Define the job before scoring the agent
Evaluation starts with a job definition. A support triage agent, a code review agent, a research agent, and a finance assistant should not share the same success metric. Each has a different acceptable error, tool set, latency budget, cost profile, and escalation rule. Without a job definition, teams argue about abstract model quality instead of whether the agent completed the user task safely.
Write the job as a contract: user goal, allowed data sources, allowed tools, disallowed actions, expected output format, human approval requirements, maximum cost, and escalation conditions. This contract becomes the basis for test cases and rubrics. It also clarifies when the right behavior is refusal or clarification rather than completion.
- User goal: what task is the agent supposed to complete?
- Authority: what can the agent read, write, send, purchase, or change?
- Quality bar: what makes an answer useful, correct, cited, and complete?
- Safety bar: what should trigger refusal, approval, or escalation?
Create evaluation categories
An agent evaluation set should contain categories that reflect real operating risk. Happy-path cases prove that the workflow can work. Edge cases prove that it works when inputs are messy. Adversarial cases prove that it resists manipulation. Failure cases prove that it degrades safely. Cost cases prove that it can run at scale without surprising the business.
For agent security, include prompt injection cases in retrieved documents, web pages, emails, tickets, and tool outputs. For reliability, include missing data, unavailable tools, conflicting instructions, partial outputs, and multi-step tasks. For memory, include stale preferences, user corrections, poisoning attempts, and privacy-sensitive facts. For cost, include large context requests and repeated tasks that could trigger expensive loops.
- Happy path: common tasks with clear inputs and expected outputs.
- Edge path: ambiguous instructions, missing fields, long context, and conflicting data.
- Adversarial path: prompt injection, secret requests, unsafe tool suggestions, and role override attempts.
- Operational path: tool failure, rate limits, timeouts, budget limits, and escalation.
Score tool use separately from final text
Many agent failures look good in the final answer. The agent may produce a polished summary after using the wrong data, skipping a required approval, or calling an unnecessary expensive model. Score tool selection, tool input, tool output handling, and final response separately. This makes failures actionable. A prompt fix may improve final text, while a tool-schema fix may improve action safety.
For each test, define the expected tool sequence if one exists. Sometimes the correct behavior is no tool call. Sometimes the correct behavior is to ask a clarifying question before using a tool. Sometimes the correct behavior is to prepare a draft and wait for approval. The evaluation should reward the safe path, not just any path that reaches a plausible answer.
- Tool selection: did the agent choose the right capability or avoid tools when not needed?
- Tool input: were arguments scoped, valid, and consistent with user intent?
- Tool output handling: did the agent treat tool output as data rather than authority?
- Final response: was it correct, useful, cited, and honest about uncertainty?
Use rubrics and automated checks together
Automated checks are fast and consistent. They can verify JSON schema, required citations, forbidden phrases, no tool call, correct tool call, cost ceiling, latency ceiling, and exact output fields. Rubric scoring is slower but captures usefulness, clarity, completeness, tone, and whether a human would trust the result. A mature framework uses both.
Rubrics should be specific. Instead of 'good answer', use dimensions such as factual correctness, source grounding, task completeness, security behavior, escalation behavior, and user-facing clarity. Define score levels with examples. If reviewers cannot apply the rubric consistently, the evaluation will not survive model or prompt changes.
- Automated: schema validity, status code, tool call count, cost limit, citation presence.
- Rubric: correctness, completeness, clarity, safe behavior, and user value.
- Regression: compare new prompt or model behavior against the previous accepted version.
- Triage: label failures as prompt, retrieval, tool, memory, policy, model, or product issue.
Evaluate prompt injection resistance
Prompt injection tests should simulate realistic attack placement. Put malicious instructions in a web page, a CSV row, a GitHub issue, a customer email, a PDF excerpt, a documentation snippet, and a tool response. The agent should keep following the trusted user and system instructions while treating the injected text as untrusted data.
The expected outcome may vary by workflow. A research agent should ignore the malicious instruction and cite the benign content. A browser agent should not navigate to unrelated URLs because a page asked it to. A memory-capable agent should not save attacker-provided policy changes. A write-capable agent should not send messages or change records because retrieved content requested it.
- Test instructions that ask for hidden prompts, credentials, or unrelated private files.
- Test instructions that ask the agent to change role, policy, or tool permissions.
- Test instructions that ask for memory writes or future behavior changes.
- Test instructions that ask for external messages, purchases, deletions, or deployments.
Evaluate cost and latency before scale
Cost and latency are product-quality dimensions. An agent that is safe but too slow may not be adopted. An agent that is accurate but too expensive may not be viable. Add cost and latency budgets to your evaluation framework. Use representative daily request volumes and token sizes, then estimate monthly spend with /tools/ai-cost-calculator/ before launch.
Cost tests should include worst-case context, retries, tool loops, and premium-model routing. Latency tests should include slow tools, rate limits, and multi-step workflows. Track cost per successful task rather than only cost per model call. Failed runs still cost money, and a high failure rate can make a cheap model more expensive than a stronger model with fewer retries.
- Set maximum input tokens, output tokens, retries, and tool-loop depth.
- Measure cost per completed task, not only cost per request.
- Compare cheap-model plus retries against premium-model first-pass success.
- Fail safely when budget or latency thresholds are exceeded.
Connect evaluations to production monitoring
Evaluations should not be a one-time launch gate. Production monitoring will reveal failures that your initial test set missed. Every significant incident, near miss, user complaint, unexpected tool chain, memory mistake, or cost spike should become a new test case. This closes the loop between /guides/agent-monitoring/ and your release process.
The strongest teams maintain versioned evaluation sets. They can say which prompt, model, tool schema, and memory policy passed which tests on which date. When a model provider updates behavior or a new tool is added, the evaluation suite becomes the safety net. Without this loop, teams rely on vibes and rediscover the same failure modes repeatedly.
- Add production failures and near misses to the eval set within the same sprint.
- Tag each case by failure type and severity.
- Run critical tests before every prompt, model, memory, or tool change.
- Keep a release note that links evaluation results to the deployed agent version.
Implementation checklist
- Define the agent job contract before creating metrics.
- Build happy-path, edge-case, adversarial, operational, memory, and cost tests.
- Score tool selection, tool inputs, output handling, final answer, safety, latency, and cost separately.
- Run regression tests after prompt, model, retrieval, memory, or tool changes.
- Turn production failures from monitoring into new evaluation cases.
FAQ
What is an agent evaluation framework?
It is a repeatable process for testing an AI agent across task success, tool use, prompt injection resistance, memory behavior, latency, cost, and safe failure before and after production changes.
Why not just evaluate the model?
Agents fail through tools, retrieval, memory, approvals, and product design, not only model text. The full workflow must be evaluated because a good answer can still come from an unsafe or incorrect tool path.
How many test cases do I need?
Start with a focused set that covers your most common tasks and highest risks. Add cases from production failures, near misses, new tools, new memory behavior, and cost anomalies.
Should prompt injection be part of evaluations?
Yes. Place malicious instructions in realistic sources such as pages, emails, files, tickets, and tool outputs, then verify the agent does not treat them as higher-priority instructions.
How do I evaluate agent cost?
Track token usage, model routing, retries, and tool loops per task. Use cost thresholds in tests and estimate scenarios with the AI Cost Calculator before launch.
How often should evaluations run?
Run critical tests before every release and after significant prompt, model, retrieval, memory, or tool changes. Run a broader suite on a regular schedule and after incidents.