AI Agent SDK Pricing and Token Costs
Base Token Pricing (May 2026)
Claude (Anthropic) offers three model tiers. Haiku 4.5 at per million input tokens and per million output tokens is the economy option for simple tasks. Sonnet 4.6 at /5 is the workhorse model for most agent applications. Opus 4.7 at /5 is the flagship for complex reasoning and multi-step problem solving.
OpenAI also offers tiered pricing. GPT-5.5 at /0 is the most capable model with the highest output cost. GPT-5.2-Codex at .75/4 is optimized for coding and agentic workflows. Smaller models are available at lower price points for lightweight tasks.
Google provides the most price-competitive options. Gemini 3.1 Pro at /2 in standard mode (doubling beyond 200K tokens) is the mid-tier flagship. Gemini 3 Flash at /bin/bash.50/ offers the best price-to-performance ratio in the market. Flash-Lite variants push costs even lower at /bin/bash.10//bin/bash.40 for the simplest tasks.
Vercel AI SDK has no per-token cost of its own since it is free and open source. Your costs are entirely determined by which model provider you connect to. This means Vercel users have the most flexibility to optimize costs by routing different tasks to different providers.
Why Agent Workloads Cost More
A single API call sends a prompt and receives a response. An agent makes many API calls in a loop, and each subsequent call includes the growing conversation history. A typical agent task might involve 10 to 30 model calls, with each call including the system prompt, the conversation so far, and the latest tool results. This means the later calls in the loop can be consuming 50,000+ tokens of input just to maintain context, even if the new information being added is only a few hundred tokens.
Tool call formatting adds overhead too. Each tool call requires a structured JSON schema describing the tool, the model's formatted arguments, and the tool's response. For an agent with 20 available tools, the tool descriptions alone might consume 3,000 to 5,000 tokens per call.
Multi-agent systems multiply these costs because each agent in the system maintains its own conversation context and makes its own model calls. A three-agent system doing the same work as a single agent might cost 2 to 3 times more in tokens, though the results are often higher quality because each agent can focus on its specialty.
Prompt Caching
Prompt caching is the single most impactful cost optimization for agent workloads. All three major providers (Anthropic, OpenAI, Google) offer prompt caching that reduces the cost of repeated input tokens by approximately 90%. Cached tokens are billed at 10% of the standard input price.
In an agent loop, the system prompt and tool descriptions are identical in every call. With caching enabled, these tokens are only billed at full price on the first call and at the 90% discount on all subsequent calls. For an agent that makes 20 calls with a 4,000-token system prompt, caching reduces the system prompt cost from 80,000 input tokens at full price to 4,000 at full price plus 76,000 at 10%, saving roughly 72,000 tokens worth of cost.
Caching also applies to the conversation history that overlaps between calls. Each call typically shares most of its conversation history with the previous call, so the overlapping portion is served from cache. The new information (the latest tool result or model response) is the only part billed at full input price.
Batch API Discounts
Both Anthropic and OpenAI offer Batch APIs that provide a 50% discount on both input and output tokens in exchange for higher latency (results may take minutes to hours rather than seconds). For agent workloads that do not require real-time interaction, batch processing can cut costs in half.
Batch APIs are well-suited for background processing tasks like document analysis, code review, data extraction, and report generation. They are not suitable for interactive agents that need to respond to users in real time. A common pattern is using the standard API for user-facing interactions and the Batch API for scheduled or background agent tasks.
Real-World Cost Estimates
A simple agent that performs a 10-step task using Claude Sonnet 4.6 with prompt caching enabled typically consumes 150,000 to 300,000 tokens total (input plus output combined). At Sonnet pricing with caching, this costs approximately /bin/bash.50 to .50 per task.
A complex agent that performs a 50-step code review using Claude Opus 4.7 with prompt caching might consume 800,000 to 1.5 million tokens. At Opus pricing with caching, this costs approximately to 5 per review.
The same 10-step task on Google Gemini 3 Flash costs approximately /bin/bash.10 to /bin/bash.30, making it 3 to 5 times cheaper than Sonnet for comparable quality on routine tasks. For tasks where Flash's capabilities are sufficient, the cost savings are substantial at scale.
Teams running hundreds of agent tasks per day should expect monthly API costs in the range of 00 to ,000 depending on model choice, task complexity, and caching effectiveness. Enterprise deployments with thousands of daily tasks should budget ,000 to 0,000 monthly and invest in cost monitoring and optimization from the start.
Cost Optimization Strategies
Use the cheapest model that works for each task. Not every agent step requires a flagship model. Triage, routing, and simple classification tasks can often be handled by economy models at a fraction of the cost. Reserve flagship models for steps that require complex reasoning.
Minimize context window usage. Keep system prompts concise. Remove tool descriptions that are not relevant to the current task. Implement context compaction (as Claude's SDK does automatically) to prevent the conversation history from growing unboundedly.
Enable prompt caching on every request. There is no downside to caching, and the cost savings for agent workloads are substantial. Ensure your SDK configuration has caching enabled, and structure your prompts so that the cacheable portion (system prompt, tool descriptions) appears at the beginning of the message.
Use batch processing for non-interactive tasks. If the agent does not need to respond in real time, the 50% batch discount is essentially free money. Restructure background workflows to use the Batch API wherever possible.
Monitor and set budgets. All three providers offer usage dashboards and spending alerts. Set per-project and per-day budgets to prevent unexpected cost spikes from runaway agent loops or misconfigured retry logic.
Subscription Plans and Credits
Anthropic introduced dual-bucket billing for Claude subscription holders in June 2026. Claude Pro, Team, and Enterprise subscribers receive a monthly Agent SDK credit that can be applied to agent API usage. Agent SDK usage no longer counts toward plan limits, effectively separating interactive Claude usage from programmatic agent usage. This makes the Claude Agent SDK more accessible for individual developers who want to build agents without committing to a separate API billing account.
OpenAI offers a similar approach through its API platform, with usage-based billing and configurable spending limits. Enterprise customers can negotiate custom pricing based on volume commitments. The Codex model tier provides a cost-effective option specifically for agentic coding tasks.
Google Cloud customers can leverage committed use discounts and negotiated enterprise pricing for Gemini API access. The Gemini Flash tier at /bin/bash.50/ already represents the lowest entry point among the three major providers, and volume discounts can reduce costs further.
For teams evaluating SDKs, the total cost calculation should include the model provider pricing, any platform fees (Google Cloud usage, Vercel hosting), and the engineering time required to implement and maintain the agent system. The cheapest model is not always the cheapest solution if it requires significantly more engineering work to achieve acceptable results.
Agent workloads cost 5 to 50 times more than single-turn API calls, but prompt caching (90% savings on repeated input), batch processing (50% off), and model tiering can reduce costs by 60 to 80% compared to naive usage patterns.