AI Agent Cost Benchmarks: Efficiency Metrics
What Drives Agent Costs
Agent costs are dominated by LLM inference, which is priced per token for both input and output. Input tokens include the system prompt, task description, tool results, conversation history, and any retrieved context. Output tokens include the agent's reasoning, tool call requests, and final responses. Every step in a multi-step agent workflow generates both input and output tokens, and the cost accumulates across all steps.
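As a rough illustration of per-token pricing, the sketch below adds up the cost of a short multi-step run. The helper `call_cost`, the token counts, and the prices are all assumptions chosen for the example, not any provider's actual rates.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one LLM call, with prices quoted per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Assumed placeholder prices: $3 per million input tokens, $15 per million output tokens.
INPUT_PRICE, OUTPUT_PRICE = 3.0, 15.0

# (input_tokens, output_tokens) for each step of a hypothetical three-step task.
steps = [(2_000, 300), (4_000, 300), (6_000, 300)]

# Cost accumulates across every step of the workflow.
total = sum(call_cost(i, o, INPUT_PRICE, OUTPUT_PRICE) for i, o in steps)
print(f"total cost: ${total:.4f}")
```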
The number of steps is the primary cost multiplier. A single-pass agent makes one LLM call. A planning agent that thinks before acting makes two or three. An agent with tool use, verification, and reflection might make ten to twenty calls per task. A multi-agent system where multiple specialized agents collaborate might make fifty or more. Each additional step adds at least the cost of one more inference call, so total cost scales at least linearly with the number of steps.
Context accumulation is the second major cost driver. In a multi-step workflow, each subsequent LLM call typically includes the results of all previous steps in its input context. This means the input token count grows with each step: the first call might include 2,000 input tokens, the fifth call might include 10,000, and the fifteenth might include 40,000. This expanding context is why long-running agent tasks can become disproportionately expensive.
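A minimal sketch of why accumulated context makes long tasks disproportionately expensive, assuming (purely for illustration) that every step appends a fixed number of tokens to the context. Real agents rarely grow this evenly, and the function names and defaults here are hypothetical.

```python
def input_tokens_at_step(step: int, base: int = 2_000, added_per_step: int = 2_000) -> int:
    """Input tokens for a 1-indexed step, assuming linear context growth."""
    return base + (step - 1) * added_per_step

def total_input_tokens(n_steps: int, base: int = 2_000, added_per_step: int = 2_000) -> int:
    """Input tokens billed across the whole task: every call re-sends the history."""
    return sum(input_tokens_at_step(s, base, added_per_step)
               for s in range(1, n_steps + 1))

for n in (5, 15):
    print(n, "steps:", total_input_tokens(n), "input tokens billed in total")
```

Under this linear assumption, tripling the number of steps from 5 to 15 raises the total billed input tokens eightfold, because the per-call input grows while the number of calls grows too.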
Model choice creates the widest cost variation for the same architecture. Running identical agent logic with Claude Opus versus Claude Haiku can produce a 15-30x cost difference per task. GPT-4o versus GPT-4o-mini shows a similar spread. The performance difference between these tiers is real but not proportional to the cost difference, which means using a cheaper model for simpler subtasks within an agent workflow is one of the most effective cost optimization strategies available.
Cost Ranges by Agent Architecture
Single-pass agents represent the cheapest architecture. The agent receives a task, generates a response in one LLM call, and returns the result. Typical costs range from $0.005 to $0.05 per task depending on the model and context length. This architecture works for tasks that do not require tool use, planning, or verification, like simple classification, summarization, and question answering with provided context.
ReAct-style agents alternate between reasoning and acting, making a series of LLM calls interspersed with tool executions. A typical task involves 3-8 reasoning steps, each followed by a tool call. Costs range from $0.05 to $0.50 per task. The variance depends heavily on how many steps the task requires and how quickly the agent converges on a solution. Tasks where the agent explores unproductive paths before finding the right approach cost significantly more than tasks where the agent proceeds efficiently.
Planning agents add an explicit planning phase before execution. The agent first generates a plan, then executes each step, and may revise the plan based on intermediate results. This adds 1-3 additional LLM calls compared to a purely reactive agent. Costs typically range from $0.10 to $1.00 per task. The planning overhead pays for itself when it prevents wasted execution steps, but adds cost without benefit for simple tasks that do not require planning.
Multi-agent systems assign different roles to specialized agents that communicate and coordinate. A coding task might involve a planner agent, a coder agent, a reviewer agent, and a tester agent. Each agent makes its own LLM calls, and inter-agent communication generates additional tokens. Costs range from $0.50 to $10.00 per task, with complex tasks at the higher end requiring extensive collaboration between agents. The accuracy improvements from multi-agent architectures must justify this cost premium, which is straightforward for high-value tasks but difficult for high-volume, low-value ones.
Reflection and verification loops add cost by deliberately duplicating work for quality assurance. An agent that generates an answer and then critiques its own answer before finalizing roughly doubles the cost. An agent that generates multiple candidate answers and selects the best one multiplies cost by the number of candidates. These patterns reliably improve accuracy but at a direct cost multiplier that must be weighed against the value of the accuracy improvement.
Cost Efficiency Metrics
Cost per successful completion is the most useful efficiency metric because it accounts for both the cost of completing tasks and the cost of failed attempts. It is the cost per attempt divided by the success rate. If an agent costs $0.50 per task attempt but only succeeds 60% of the time, the effective cost per successful completion is $0.83 ($0.50 / 0.60). A more expensive agent at $0.80 per attempt with a 90% success rate has an effective cost of $0.89 per success ($0.80 / 0.90), which is comparable despite the higher per-attempt cost.
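The arithmetic can be written as a one-line helper. This is a sketch that assumes failed attempts cost the same as successful ones and produce nothing of value; `cost_per_success` is a name introduced here for illustration.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful completion, counting failed attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in the interval (0, 1]")
    return cost_per_attempt / success_rate

cheap = cost_per_success(0.50, 0.60)    # cheaper agent, lower success rate
strong = cost_per_success(0.80, 0.90)   # pricier agent, higher success rate
print(f"${cheap:.2f} vs ${strong:.2f} per success")
```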
Tokens per step measures how efficiently the agent uses its context at each reasoning step. An efficient agent that loads only relevant information and produces concise reasoning might use 1,000-2,000 tokens per step. An inefficient agent that loads excessive context and produces verbose reasoning might use 5,000-10,000 tokens per step. Optimizing tokens per step through better prompting and context management is often the fastest path to meaningful cost reduction.
Steps per task measures how many LLM calls the agent makes to complete a task. Fewer steps at the same accuracy means better efficiency. This metric reveals whether the agent is exploring unnecessary paths, making redundant tool calls, or failing to plan effectively before acting. Comparing steps per task across different agent configurations shows which architectural decisions improve efficiency.
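Both metrics fall out of a per-step token log. The sketch below assumes a hypothetical trace format in which each task is a list of `Step` records; the record shape and the sample numbers are invented for the example.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Step:
    input_tokens: int
    output_tokens: int

def tokens_per_step(tasks: list[list[Step]]) -> float:
    """Average input plus output tokens across every step of every task."""
    return mean(s.input_tokens + s.output_tokens for task in tasks for s in task)

def steps_per_task(tasks: list[list[Step]]) -> float:
    """Average number of LLM calls made per task."""
    return mean(len(task) for task in tasks)

# Two hypothetical task traces.
traces = [
    [Step(1_000, 200), Step(1_500, 300)],
    [Step(800, 200), Step(1_200, 200), Step(1_600, 200), Step(2_000, 400)],
]
print(tokens_per_step(traces), steps_per_task(traces))
```

Tracking the two numbers separately helps show whether a cost change came from leaner calls or from fewer calls.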
Cost-accuracy ratio normalizes cost against quality. Divide the cost per task by the accuracy percentage to get a cost-per-unit-of-accuracy figure. This metric lets you compare architectures that trade cost for accuracy on a common scale. A system achieving 80% accuracy at $0.20 per task has a cost-accuracy ratio of $0.0025 per accuracy point, while one achieving 95% accuracy at $0.80 has a ratio of $0.0084 per accuracy point. Whether the higher accuracy justifies the higher ratio depends on the value of correct completions in your specific application.
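A small sketch of the ratio as described, with accuracy expressed in percentage points. The function name is an assumption made for the example.

```python
def cost_accuracy_ratio(cost_per_task: float, accuracy_pct: float) -> float:
    """Dollars spent per percentage point of accuracy."""
    if accuracy_pct <= 0:
        raise ValueError("accuracy_pct must be positive")
    return cost_per_task / accuracy_pct

basic = cost_accuracy_ratio(0.20, 80)     # 80% accuracy at $0.20 per task
premium = cost_accuracy_ratio(0.80, 95)   # 95% accuracy at $0.80 per task
print(f"${basic:.4f} vs ${premium:.4f} per accuracy point")
```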
Optimization Strategies
Model routing assigns different subtasks to different model tiers based on complexity. Planning and analysis tasks that require strong reasoning go to capable, expensive models. Simple tool calls, formatting, and data extraction go to cheaper, faster models. This approach typically reduces total cost by 40-60% compared to running everything on the most capable model, with minimal impact on accuracy because the cheaper models are adequate for the simpler subtasks.
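One way to picture routing is a lookup from subtask kind to model tier. In the sketch below the tier names, the per-call costs, and the set of "simple" subtask kinds are all hypothetical; a production router would more likely classify subtasks dynamically.

```python
# Assumed per-call costs in dollars for two hypothetical model tiers.
TIER_COST = {"large": 0.030, "small": 0.002}

# Subtask kinds assumed simple enough for the cheaper tier.
SIMPLE_KINDS = {"tool_call", "formatting", "extraction"}

def route(kind: str) -> str:
    """Send simple subtasks to the small tier and everything else to the large tier."""
    return "small" if kind in SIMPLE_KINDS else "large"

def workflow_cost(kinds, router=route) -> float:
    """Total cost of a workflow given the subtask kinds and a routing function."""
    return sum(TIER_COST[router(k)] for k in kinds)

kinds = ["planning", "tool_call", "extraction", "analysis", "formatting"]
routed = workflow_cost(kinds)
baseline = workflow_cost(kinds, router=lambda k: "large")  # everything on the large tier
saving = 1 - routed / baseline
print(f"routed ${routed:.3f}, baseline ${baseline:.3f}, saving {saving:.0%}")
```

With these made-up numbers the saving lands inside the 40-60% range quoted above, but the actual figure depends entirely on the mix of subtasks and the price gap between tiers.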
Prompt caching reduces costs by reusing computation for repeated prompt prefixes. When multiple tasks share the same system prompt and instructions, caching allows the model to process the shared prefix once and reuse it across subsequent calls. Anthropic's prompt caching bills cache reads at roughly 10% of the base input price, a 90% reduction on cached input tokens, although writing the prefix to the cache carries a surcharge on the first call. This makes it one of the most effective optimizations for agents that process many tasks with similar setups. The impact is largest for agents with long system prompts and short task-specific inputs.
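The effect on input cost can be estimated with a simple model. This sketch assumes one cache write followed by cache reads on every later call, with all calls arriving before the cache expires. The read multiplier of 0.10 reflects the 90% reduction mentioned above; the write multiplier of 1.25, the base price, and the token counts are assumptions for illustration.

```python
def input_cost(prefix_tokens: int, task_tokens: int, n_calls: int,
               price_per_mtok: float, cached: bool = False,
               read_multiplier: float = 0.10,     # cached reads billed at 10% of base
               write_multiplier: float = 1.25):   # assumed surcharge on the cache write
    """Total input cost in dollars for n_calls that share the same prompt prefix."""
    if not cached:
        billed = n_calls * (prefix_tokens + task_tokens)
    else:
        # The first call writes the prefix to the cache; later calls read it.
        prefix_billed = prefix_tokens * (write_multiplier
                                         + (n_calls - 1) * read_multiplier)
        billed = prefix_billed + n_calls * task_tokens
    return billed * price_per_mtok / 1_000_000

# Hypothetical workload: 8,000-token shared prefix, 500-token task input, 100 calls.
uncached = input_cost(8_000, 500, 100, price_per_mtok=3.0)
with_cache = input_cost(8_000, 500, 100, price_per_mtok=3.0, cached=True)
print(f"uncached ${uncached:.4f}, cached ${with_cache:.4f}")
```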
Context pruning keeps the input context lean by removing information that is no longer relevant as the task progresses. Instead of including the full history of all previous steps in every LLM call, the agent includes only the information relevant to the current step. This requires careful management of what to keep and what to drop, but the cost savings from reduced input tokens are substantial for tasks with many steps.
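A minimal sketch of one simple pruning policy: keep the system prompt plus as many of the most recent history entries as fit a token budget. Recency is a stand-in for the relevance-based selection described above, and counting tokens by whitespace splitting is a crude assumption; a real agent would use the model's tokenizer.

```python
def prune_context(system_prompt: str, history: list[str], max_tokens: int,
                  count_tokens=lambda text: len(text.split())) -> list[str]:
    """Return the system prompt plus the newest history entries that fit the budget."""
    budget = max_tokens - count_tokens(system_prompt)
    kept: list[str] = []
    for entry in reversed(history):        # walk from newest to oldest
        cost = count_tokens(entry)
        if cost > budget:
            break                          # older entries are dropped
        kept.append(entry)
        budget -= cost
    return [system_prompt] + kept[::-1]    # restore chronological order

history = ["step one result a b c", "step two result", "step three"]
context = prune_context("You are an agent", history, max_tokens=10)
print(context)
```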
Early termination saves cost by stopping the agent when it determines that further effort is unlikely to improve the result. If an agent recognizes after three steps that a task is beyond its capability, terminating early rather than spending ten more steps on a futile attempt saves significant cost. Implementing effective early termination requires the agent to have accurate self-assessment, which itself is a capability that varies across models.
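The control flow might look like the loop below. It is a sketch under the assumption that each step reports a self-assessed confidence score; `step_fn`, the thresholds, and the default step budget are hypothetical, and the scheme is only as good as that self-assessment.

```python
from typing import Callable, Optional

def run_with_early_stop(step_fn: Callable[[int], tuple[Optional[str], float]],
                        max_steps: int = 13,
                        min_confidence: float = 0.2,
                        check_after: int = 3) -> tuple[Optional[str], int]:
    """Run steps until an answer appears, confidence collapses, or the budget ends.

    step_fn(step) returns (answer_or_None, self_assessed_confidence).
    Returns (answer_or_None, steps_used).
    """
    for step in range(1, max_steps + 1):
        answer, confidence = step_fn(step)
        if answer is not None:
            return answer, step
        # Only give up once a few steps have been tried.
        if step >= check_after and confidence < min_confidence:
            return None, step
    return None, max_steps

# A task the agent judges hopeless stops after three steps instead of thirteen.
hopeless = lambda step: (None, 0.1)
print(run_with_early_stop(hopeless))
```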
Batch processing groups multiple similar tasks into a single session, amortizing the cost of context loading across multiple completions. This is most effective when tasks share common context like the same codebase, the same document set, or the same tool configuration. Processing ten tasks in a batch that shares context can cost 30-50% less than processing them individually.
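The amortization effect can be estimated from token counts alone. This sketch considers input tokens only and assumes the shared context is loaded once per batch while each task's own context does not carry over to the next task; the token counts are invented for the example.

```python
def batch_saving(n_tasks: int, shared_tokens: int, task_tokens: int) -> float:
    """Fraction of input tokens saved by loading shared context once per batch."""
    individual = n_tasks * (shared_tokens + task_tokens)   # shared context re-sent each time
    batched = shared_tokens + n_tasks * task_tokens        # shared context sent once
    return 1 - batched / individual

# Ten tasks sharing 4,000 tokens of context, each with 5,000 tokens of its own.
saving = batch_saving(10, shared_tokens=4_000, task_tokens=5_000)
print(f"{saving:.0%} fewer input tokens")
```

The saving grows with the share of each call that is common context, so batching helps least when tasks have little in common.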
Measuring ROI
The ultimate cost benchmark is return on investment: does the agent save more than it costs? This calculation requires comparing the total cost of agent operation against the cost of the alternative, which is usually human labor or existing conventional automation.
For a customer support agent handling tier-one tickets, the calculation might be: the agent costs $0.15 per ticket and resolves 70% of tickets without human intervention. The human cost of handling the same tickets is $8.00 per ticket. For every 100 tickets, the agent attempts all 100 at $0.15 each ($15.00) and humans handle the 30 it cannot resolve at $8.00 each ($240.00), totaling $255.00. Without the agent, all 100 tickets cost $800.00. The agent saves $545.00 per 100 tickets, a clear positive ROI.
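The support-ticket comparison can be sketched as follows. The flag `charge_all_attempts` is an assumption introduced here: it controls whether the agent's per-ticket cost is counted for every ticket it attempts or only for the tickets it resolves.

```python
def blended_cost(n_tickets: int, agent_cost: float, resolve_rate: float,
                 human_cost: float, charge_all_attempts: bool = True) -> float:
    """Total cost when the agent goes first and humans take the unresolved tickets."""
    resolved = n_tickets * resolve_rate
    escalated = n_tickets - resolved
    billed = n_tickets if charge_all_attempts else resolved
    return billed * agent_cost + escalated * human_cost

human_only = 100 * 8.00
saving_all = human_only - blended_cost(100, 0.15, 0.70, 8.00)
saving_resolved = human_only - blended_cost(100, 0.15, 0.70, 8.00,
                                            charge_all_attempts=False)
print(f"saving per 100 tickets: ${saving_all:.2f} or ${saving_resolved:.2f}")
```

Billing every attempt gives a saving of $545.00 per 100 tickets, and billing only resolved tickets gives $549.50. Either way, human handling of the unresolved tickets dominates the blended cost.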
For a coding agent, the ROI calculation includes both time savings and quality impact. If the agent reduces the time to resolve a routine bug from two hours to twenty minutes (including human review of the agent's patch), and the developer's loaded cost is $100 per hour, the agent saves about $167 per bug (100 minutes at $100 per hour). If the agent costs $2.00 per bug resolution attempt and succeeds 60% of the time, the agent cost per successful resolution is about $3.33, so the net savings per successful resolution is about $163, ignoring any developer time spent on failed attempts.
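A sketch of the coding-agent arithmetic, using the inputs from the paragraph above. It assumes a failed attempt costs only the agent fee and no developer time, which is optimistic; the function name is hypothetical.

```python
def net_saving_per_success(hours_before: float, hours_after: float,
                           hourly_rate: float, cost_per_attempt: float,
                           success_rate: float) -> float:
    """Labor saved on one resolved bug minus the agent spend needed to get it."""
    labor_saving = (hours_before - hours_after) * hourly_rate
    agent_cost_per_success = cost_per_attempt / success_rate
    return labor_saving - agent_cost_per_success

net = net_saving_per_success(hours_before=2.0, hours_after=20 / 60,
                             hourly_rate=100.0, cost_per_attempt=2.00,
                             success_rate=0.60)
print(f"net saving per resolved bug: ${net:.2f}")
```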
Negative ROI scenarios exist and are worth understanding. They include agents deployed on tasks that are already automated efficiently, tasks where the error cost exceeds the labor savings, and tasks that require so much human oversight that the agent adds workflow complexity without reducing labor. Measuring ROI honestly, including all costs and accounting for error rates, prevents deployment of agents that create more problems than they solve.
Agent costs range from under $0.01 to $10.00 per task depending on architecture and model choice. Measure cost per successful completion rather than cost per attempt, optimize through model routing and prompt caching, and validate positive ROI against the real alternative cost before scaling deployment.