AI Agent Evaluation Metrics That Matter
Task Completion Rate vs Accuracy
Most benchmark leaderboards report accuracy: the percentage of tasks the agent answered correctly out of the tasks it attempted. Task completion rate is a different and often more important number: the percentage of all assigned tasks the agent successfully finished, where the denominator also counts the tasks it abandoned, timed out on, or crashed during.
The distinction matters because production agents cannot skip tasks. When a customer support agent encounters a difficult ticket, it cannot simply ignore it the way a benchmark runner ignores a failed test case. When a coding agent crashes halfway through a complex refactoring task, the partially applied changes might leave the codebase in a broken state. An agent with 90% accuracy but a 70% completion rate is failing at least 30% of all assigned work, which makes it unreliable for any workflow that requires consistent throughput.
Measuring completion rate requires tracking several failure modes separately. Timeout failures occur when the agent exceeds its time budget without producing a result. Crash failures occur when the agent encounters an unhandled error and stops working. Abandonment failures occur when the agent determines it cannot complete the task and gives up. Partial completion occurs when the agent produces output that addresses some but not all aspects of the task. Each failure mode suggests a different remediation strategy, making disaggregated tracking more useful than a single completion number.
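The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: the outcome labels and the summarize function are hypothetical, and it assumes a task counts as completed when the agent delivers a final result, whether correct or incorrect, so accuracy is computed over finished tasks only.

```python
from collections import Counter

# Hypothetical outcome labels: exactly one label per assigned task.
FINISHED = ("correct", "incorrect")
FAILURE_MODES = ("timeout", "crash", "abandoned", "partial")


def summarize(outcomes):
    """Return (accuracy, completion_rate, failure_breakdown) for a list of labels."""
    counts = Counter(outcomes)
    assigned = len(outcomes)
    finished = sum(counts[label] for label in FINISHED)
    # Accuracy only looks at tasks that produced a final result.
    accuracy = counts["correct"] / finished if finished else 0.0
    # Completion rate uses every assigned task as the denominator.
    completion_rate = finished / assigned if assigned else 0.0
    # Share of assigned tasks lost to each failure mode, tracked separately.
    breakdown = {mode: counts[mode] / assigned for mode in FAILURE_MODES} if assigned else {}
    return accuracy, completion_rate, breakdown


# Made-up results for 100 assigned tasks.
outcomes = (["correct"] * 63 + ["incorrect"] * 7 + ["timeout"] * 12
            + ["crash"] * 8 + ["abandoned"] * 6 + ["partial"] * 4)
accuracy, completion_rate, breakdown = summarize(outcomes)
print(f"accuracy={accuracy:.2f} completion={completion_rate:.2f}")
print(breakdown)
```

With these made-up numbers the agent shows 90% accuracy and a 70% completion rate, and the breakdown shows that timeouts are the largest single source of lost work.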
In practice, completion rate is often easier to improve than accuracy. Timeout failures respond to better planning strategies that avoid unproductive exploration. Crash failures respond to better error handling and input validation. Abandonment can be reduced by providing the agent with fallback strategies and escalation paths. Teams that focus on completion rate improvements often see larger practical gains than those that focus exclusively on accuracy.
Cost Per Task
Every AI agent task incurs compute costs from LLM inference, tool execution, and infrastructure overhead. Cost per task measures the total expense to complete a single unit of work, and it varies by orders of magnitude across different agent architectures and approaches.
A simple single-pass agent that reads a prompt, generates one response, and returns it might cost $0.01-0.05 per task. A multi-step agent with planning, tool use, and verification might cost $0.10-1.00 per task. A multi-agent system with specialized roles, iterative refinement, and consensus mechanisms might cost $1.00-10.00 per task. The cost depends on how many LLM calls the architecture makes, which model it uses for each call, how much context it includes, and how many retries it performs on failure.
The relationship between cost and accuracy is not linear. Moving from 70% to 80% accuracy might require doubling the cost, while moving from 80% to 85% might require tripling it on top of that. This diminishing-returns curve means that the economically optimal accuracy level depends entirely on the value of each correct completion and the cost of each error. For a customer support agent handling $20 tickets, the optimal point might be 80% accuracy at $0.10 per task. For a legal document review agent where errors carry significant liability, 95% accuracy at $5.00 per task might be the right tradeoff.
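One way to make this tradeoff concrete is to compute an expected net value per task for each candidate configuration. The sketch below is an assumption-laden illustration: the formula (accuracy times value per success, minus error rate times cost per error, minus cost per task) is one simple model among many, and every dollar figure and configuration in it is hypothetical.

```python
def net_value_per_task(accuracy, cost_per_task, value_per_success, cost_per_error):
    """Expected value of running one task, under a simple linear model."""
    return (accuracy * value_per_success
            - (1 - accuracy) * cost_per_error
            - cost_per_task)


# Hypothetical (accuracy, cost per task in dollars) points on a diminishing-returns curve.
CONFIGS = [(0.70, 0.05), (0.80, 0.10), (0.85, 0.30)]


def best_config(value_per_success, cost_per_error):
    """Pick the configuration with the highest expected net value per task."""
    return max(
        CONFIGS,
        key=lambda cfg: net_value_per_task(cfg[0], cfg[1], value_per_success, cost_per_error),
    )


# Low-stakes workload: a mistake costs about as much as a success is worth.
low_stakes = best_config(value_per_success=2.00, cost_per_error=1.00)
# High-stakes workload: a mistake is far more expensive than a success is valuable.
high_stakes = best_config(value_per_success=2.00, cost_per_error=50.00)
print(low_stakes, high_stakes)
```

Under these assumed numbers the low-stakes workload favors the 80% configuration, while the high-stakes workload justifies paying for the 85% one, which mirrors the support versus legal review contrast above.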
Cost measurement should include all components, not just LLM inference. Tool execution costs like API calls to external services, compute for code execution, and storage for intermediate results can exceed the LLM costs for complex tasks. Infrastructure costs like orchestration servers, message queues, and monitoring systems add overhead that is real even if it does not appear in the per-task LLM bill. Total cost of ownership provides a more honest basis for ROI calculations than LLM cost alone.
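A total-cost calculation along these lines might look like the following sketch. The TaskCost class, its field names, and all figures are hypothetical; infrastructure cost is amortized evenly over monthly task volume, which is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskCost:
    """Hypothetical per-task cost record, in dollars."""
    llm: float              # LLM inference for this task
    tools: float            # external API calls, code execution, storage
    infra_monthly: float    # orchestration, queues, monitoring (whole month)
    tasks_per_month: int    # volume used to amortize infrastructure

    @property
    def infra_per_task(self):
        return self.infra_monthly / self.tasks_per_month

    @property
    def total(self):
        return self.llm + self.tools + self.infra_per_task


cost = TaskCost(llm=0.40, tools=0.55, infra_monthly=3000.0, tasks_per_month=20000)
print(f"LLM only: ${cost.llm:.2f}  total: ${cost.total:.2f}")
```

In this made-up case the LLM bill is well under half of the total cost per task, so an ROI estimate based on LLM cost alone would be far too optimistic.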
Latency and Throughput
Latency measures the wall-clock time from when a task is assigned to when the result is delivered. For interactive applications like customer support, code assistance, and real-time analysis, latency determines whether the agent feels responsive or frustrating. For batch applications like document processing, data analysis, and content generation, throughput (tasks completed per unit of time) matters more than individual task latency.
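As a rough sketch, both numbers can be derived from the same run log. The latency samples and the batch wall-clock time below are made up, and the percentile helper uses the nearest-rank method, which is one of several common percentile definitions.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-task latencies in seconds for one batch.
latencies = [1.2, 0.8, 2.5, 1.0, 9.4, 1.1, 0.9, 1.3, 1.0, 1.4]
# Assumed time from first assignment to last result; shorter than the sum
# of latencies because some tasks ran concurrently.
batch_wall_clock = 12.0

p50 = median(latencies)
p95 = percentile(latencies, 95)
throughput = len(latencies) / batch_wall_clock  # tasks per second

print(f"p50={p50:.2f}s p95={p95:.2f}s throughput={throughput:.2f} tasks/s")
```

The single 9.4 second outlier barely moves the median but dominates the 95th percentile, which is why interactive applications usually track both.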
Agent latency has several components. Planning time is how long the agent spends deciding what to do. Execution time is how long each tool call or LLM inference takes. Waiting time is the time spent in queues or waiting for rate-limited API responses. Overhead time includes context assembly, state management, and communication between agent components. Each component contributes differently depending on the agent architecture and the specific task.
Multi-step agents inherently trade latency for accuracy. Each additional step of planning, verification, or reflection adds time. A coding agent that generates a patch, reviews it, tests it, and revises it based on test results will be slower than one that generates a patch and submits it immediately. The slower agent is also likely to produce more correct results. Choosing the right balance requires understanding the latency tolerance of your specific application.
Throughput optimization follows different principles than latency optimization. For batch workloads, running multiple agent instances in parallel increases throughput even though it does not reduce per-task latency. Batching multiple tasks into a single agent session can amortize context loading costs. Prioritizing tasks by complexity and routing simple tasks to faster, cheaper agents while reserving complex tasks for more capable systems can optimize overall system throughput.
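The parallel-instances idea can be illustrated with a small thread pool. This is only a sketch: run_task is a hypothetical stand-in whose sleep simulates time spent waiting on LLM and tool calls, and threads are assumed to be appropriate because that waiting is I/O-bound.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id):
    """Stand-in for one agent run; the sleep simulates waiting on LLM and tool calls."""
    time.sleep(0.05)
    return f"result-{task_id}"


def run_batch(task_ids, workers):
    """Run all tasks with the given number of parallel workers and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_task, task_ids))  # map preserves input order
    elapsed = time.perf_counter() - start
    return results, elapsed


tasks = list(range(8))
_, sequential = run_batch(tasks, workers=1)
results, parallel = run_batch(tasks, workers=4)
print(f"1 worker: {len(tasks) / sequential:.1f} tasks/s, "
      f"4 workers: {len(tasks) / parallel:.1f} tasks/s")
```

Each simulated task still takes about the same time to finish, but the batch as a whole completes sooner with four workers, so throughput rises while per-task latency is unchanged.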
Token Efficiency
Token efficiency measures how many input and output tokens the agent consumes per task. Since LLM costs scale with token usage (input and output tokens are typically billed at different rates), and since context window limits constrain how much information the agent can process, token efficiency is both a cost metric and a capability metric.
Input tokens include the system prompt, task description, tool results, conversation history, and any retrieved context. Agents that load large amounts of context for every tool call burn through tokens quickly. Agents that carefully manage their context, loading only relevant information and pruning unnecessary history, accomplish the same work with fewer tokens.
Output tokens include the agent's reasoning, tool call requests, intermediate responses, and final answers. Verbose reasoning strategies like chain-of-thought and explicit planning generate more output tokens but often produce better results. The tradeoff between reasoning thoroughness and token cost is one of the key design decisions in agent architecture.
Inefficient token usage shows up in several recognizable patterns: re-reading the same files or documents multiple times because the agent loses track of what it has already seen, generating lengthy internal monologues that do not contribute to task progress, making redundant tool calls that retrieve information the agent already has, and carrying over irrelevant context from previous tasks that pollutes the current working memory. Each of these patterns is an optimization opportunity that can reduce costs significantly without sacrificing quality.
Measuring token efficiency requires comparing total tokens consumed against task complexity. Simple tasks should consume proportionally fewer tokens than complex ones. If a simple question-answering task consumes as many tokens as a complex multi-step analysis, the agent's context management needs improvement. Plotting token consumption against task complexity reveals patterns that single aggregate numbers miss.
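Short of a full plot, a simple way to start is to group runs by a complexity tier and compare medians. In the sketch below the tier labels, token counts, and the flagging rule (a simple task that uses at least as many tokens as the median complex task) are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (complexity tier, total tokens consumed) records, one per task run.
runs = [("simple", 1200), ("simple", 1500), ("simple", 9800),
        ("complex", 8000), ("complex", 11000), ("complex", 9500)]


def median_tokens_by_tier(runs):
    """Group token counts by complexity tier and take the median of each group."""
    by_tier = defaultdict(list)
    for tier, tokens in runs:
        by_tier[tier].append(tokens)
    return {tier: median(values) for tier, values in by_tier.items()}


def flag_heavy_simple_runs(runs, medians):
    """Return token counts of simple tasks that cost as much as a typical complex one."""
    threshold = medians["complex"]
    return [tokens for tier, tokens in runs if tier == "simple" and tokens >= threshold]


medians = median_tokens_by_tier(runs)
flagged = flag_heavy_simple_runs(runs, medians)
print(medians, flagged)
```

The flagged runs are the ones worth inspecting by hand, since they usually point at repeated reads or context that was never pruned.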
Error Recovery Rate
Production environments are inherently unpredictable. APIs return unexpected errors, data arrives in malformed formats, network connections time out, and tool outputs contain results the agent did not anticipate. Error recovery rate measures how often the agent handles these situations gracefully rather than failing entirely.
The measurement is straightforward: of all the errors encountered during task execution, what percentage did the agent recover from and continue working? An agent with a 90% error recovery rate that encounters ten errors during a task recovers from nine of them on average. An agent with a 50% error recovery rate fails on the first unexpected error half the time.
Error recovery strategies range from simple to sophisticated. Retry logic handles transient failures like network timeouts and rate limit responses. Alternative tool selection handles cases where the primary tool for a task is unavailable. Plan revision handles cases where an unexpected result invalidates the agent's current approach, requiring it to form a new plan. Graceful degradation handles cases where the agent cannot achieve the full task objective but can deliver a partial result that still provides value.
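The first two strategies, retry logic and alternative tool selection, can be combined in one small wrapper. The following is a minimal sketch: TransientError, call_with_recovery, and make_flaky are hypothetical names, and the retry count and backoff delays are arbitrary defaults.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as timeouts or rate limits."""


def call_with_recovery(primary, fallback, retries=3, base_delay=0.01, sleep=time.sleep):
    """Try the primary tool with exponential backoff, then switch to the fallback tool."""
    for attempt in range(retries):
        try:
            return primary()
        except TransientError:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # delay doubles after each failed attempt
    # Retries exhausted: use the alternative tool instead of failing the task.
    return fallback()


def make_flaky(failures_before_success):
    """Build a test tool that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def tool():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("simulated timeout")
        return "primary result"

    return tool


recovered = call_with_recovery(make_flaky(2), lambda: "fallback result")
degraded = call_with_recovery(make_flaky(5), lambda: "fallback result")
print(recovered, "|", degraded)
```

Errors that are not marked transient propagate to the caller in this sketch, which is where plan revision or graceful degradation would take over.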
This metric is rarely reported in public benchmarks because benchmark environments are typically stable and predictable. The gap between benchmark stability and production volatility means that error recovery rate, measured in your own environment against your own error conditions, is one of the most valuable internal metrics for predicting production reliability. Building error injection into your evaluation pipeline, deliberately introducing failures to see how the agent responds, provides data that no public benchmark captures.
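A bare-bones error injection harness might look like the sketch below. Everything in it is hypothetical: the InjectedFault exception, the toy retry_once_agent, and the simplification that recovery is counted per episode (an episode that hit at least one injected fault either recovered or did not).

```python
import random


class InjectedFault(Exception):
    """Raised by the harness to simulate a tool failure."""


def with_faults(tool, failure_rate, rng):
    """Wrap a tool so that each call fails with probability failure_rate."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated tool failure")
        return tool(*args, **kwargs)
    return wrapped


def retry_once_agent(tool):
    """Toy agent step. Returns (hit_error, recovered)."""
    try:
        tool()
        return False, False
    except InjectedFault:
        try:
            tool()  # single retry
            return True, True
        except InjectedFault:
            return True, False


def recovery_rate(agent_step, tool, failure_rate, episodes=1000, seed=0):
    """Fraction of fault-hit episodes the agent recovered from, or None if no faults fired."""
    rng = random.Random(seed)  # seeded so evaluation runs are repeatable
    faulty_tool = with_faults(tool, failure_rate, rng)
    hit = recovered = 0
    for _ in range(episodes):
        had_error, did_recover = agent_step(faulty_tool)
        hit += had_error
        recovered += did_recover
    return recovered / hit if hit else None


rate = recovery_rate(retry_once_agent, lambda: "ok", failure_rate=0.3)
print(f"recovery rate under 30% injected faults: {rate:.2f}")
```

In a real pipeline the wrapped tool would be one of the agent's actual tools and the injected faults would mirror the error conditions seen in your own logs.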
Consistency and Variance
Language models are stochastic by design, which means the same agent given the same task can produce different results on different runs. Consistency measures how much variation exists across repeated executions, and it directly affects how much you can trust the agent's output.
High variance manifests as unpredictable quality. An agent might solve a task correctly on one run and fail completely on the next, with no change in the input. For production applications, this unpredictability is worse than consistently mediocre performance because it makes it impossible to know when human review is needed. A consistent agent with 80% accuracy is more useful than an inconsistent one with 85% average accuracy but high variance, because the consistent agent's errors are more predictable.
Measuring consistency requires running the same tasks multiple times with identical inputs. The standard approach is to run each evaluation task five to ten times and compute the standard deviation of success/failure outcomes. Tasks where the agent succeeds on all runs are reliably within its capability. Tasks where it fails on all runs are reliably beyond it. Tasks where it succeeds sometimes and fails other times represent the zone of uncertainty where the agent's behavior is least predictable.
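The three zones described above can be computed directly from repeated-run results. This sketch assumes each run is recorded as 1 for success and 0 for failure; the task names, results, and classify function are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical outcomes of five repeated runs per task (1 = success, 0 = failure).
results = {
    "task_a": [1, 1, 1, 1, 1],
    "task_b": [0, 0, 0, 0, 0],
    "task_c": [1, 0, 1, 1, 0],
}


def classify(outcomes):
    """Place a task in one of the three zones based on its repeated-run outcomes."""
    if all(outcomes):
        return "reliable"
    if not any(outcomes):
        return "beyond capability"
    return "uncertain"


report = {
    task: {
        "pass_rate": mean(outcomes),
        "std_dev": pstdev(outcomes),  # population standard deviation of the 0/1 outcomes
        "zone": classify(outcomes),
    }
    for task, outcomes in results.items()
}
for task, row in report.items():
    print(task, row)
```

Tasks in the uncertain zone are the natural candidates for human review or for the consistency techniques that follow.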
Several techniques improve consistency without changing the underlying model. Temperature reduction makes the model's outputs more deterministic. Structured prompting that constrains the agent's response format reduces the space of possible outputs. Ensemble methods that run the task multiple times and select the most common answer trade cost for consistency. Planning architectures that separate the plan from the execution allow the agent to commit to a specific approach rather than making ad-hoc decisions at each step.
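The ensemble approach is the easiest of these to sketch. The example below assumes answers can be compared for exact equality, and stub_agent is a hypothetical stand-in that replays canned answers in place of real agent runs.

```python
from collections import Counter


def majority_answer(run_agent, task, runs=5):
    """Run the agent several times and return (most common answer, agreement ratio)."""
    answers = [run_agent(task) for _ in range(runs)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / runs


# Canned answers standing in for five stochastic agent runs.
canned = iter(["42", "41", "42", "42", "40"])


def stub_agent(task):
    return next(canned)


answer, agreement = majority_answer(stub_agent, "What is 6 * 7?")
print(answer, agreement)
```

The agreement ratio doubles as a cheap confidence signal: a low ratio suggests the task sits in the zone of uncertainty and may deserve human review. The cost is roughly proportional to the number of runs, which is the tradeoff named above.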
For high-stakes applications, consistency matters more than peak performance. A legal document review agent that catches the same issues reliably across repeated runs is more trustworthy than one that sometimes catches additional issues but sometimes misses critical ones. Consistency is what allows teams to calibrate their trust in the agent and design appropriate human oversight processes.
Building a Composite Evaluation
No single metric captures production readiness. The most effective evaluation approach combines multiple metrics into a composite view that matches the priorities of your specific use case.
A practical evaluation scorecard includes accuracy on your specific task types, completion rate across all assigned tasks, median and 95th percentile cost per task, median and 95th percentile latency, token efficiency relative to task complexity, error recovery rate under realistic failure conditions, and consistency across repeated runs. Weighting these dimensions according to your use case priorities produces an overall score that predicts production performance far better than any single benchmark number.
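One possible form for such a scorecard is a weighted average over normalized metrics. The sketch below assumes every metric has already been scaled to the range 0 to 1 with higher meaning better (so cost and latency are inverted against a budget beforehand); the metric values and weights are hypothetical.

```python
# Hypothetical normalized metrics: each is in [0, 1] and higher is better.
metrics = {
    "accuracy": 0.85,
    "completion_rate": 0.90,
    "cost": 0.70,             # e.g. 1 - (p95 cost / cost budget)
    "latency": 0.60,          # e.g. 1 - (p95 latency / latency budget)
    "token_efficiency": 0.75,
    "error_recovery": 0.80,
    "consistency": 0.90,
}

# Hypothetical weights reflecting one team's priorities.
weights = {
    "accuracy": 0.25,
    "completion_rate": 0.20,
    "cost": 0.10,
    "latency": 0.10,
    "token_efficiency": 0.05,
    "error_recovery": 0.15,
    "consistency": 0.15,
}


def composite_score(metrics, weights):
    """Weighted average of normalized metrics; weights need not sum to exactly 1."""
    if metrics.keys() != weights.keys():
        raise ValueError("metrics and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


score = composite_score(metrics, weights)
print(f"composite score: {score:.3f}")
```

A single score is convenient for tracking regressions over time, but the per-dimension values should be kept alongside it so that a drop can be traced to its cause.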
Running this evaluation as an automated pipeline that executes on a regular schedule catches regressions from model updates, framework changes, and shifts in your workload. The upfront investment in building the pipeline pays off within weeks for any team running agents at meaningful scale.
Production agent performance depends on six dimensions beyond accuracy: completion rate, cost, latency, token efficiency, error recovery, and consistency. Measuring all six against your specific workload is the only way to predict whether an agent will deliver real value.