AI API Costs: What Every Model Charges
Anthropic Claude Pricing
Anthropic offers three model tiers under the Claude brand, each targeting different price-performance tradeoffs for agent workloads. All pricing is per million tokens, and all models support prompt caching at significant discounts.
Claude Opus 4 sits at the top of the lineup at $15 per million input tokens and $75 per million output tokens. This is the frontier reasoning model, designed for complex multi-step tasks, nuanced judgment calls, and creative work that demands the highest quality. For agent workloads, Opus makes sense as the final decision-maker in a routing architecture, handling the 10 to 20 percent of requests that require deep reasoning while cheaper models handle routine tasks.
Claude Sonnet 4 occupies the mid-tier at $3 per million input tokens and $15 per million output tokens. Sonnet delivers strong performance across coding, analysis, and conversational tasks at one-fifth the cost of Opus. For most agent builders, Sonnet represents the best balance of capability and cost. It handles the vast majority of agent tasks with quality that users cannot distinguish from Opus in routine interactions.
Claude Haiku 4.5 provides the budget tier at $1 per million input tokens and $5 per million output tokens. Haiku excels at classification, extraction, routing, and simple response generation. Its speed advantage makes it ideal for the lightweight preprocessing steps that agents perform before engaging more capable models for complex reasoning.
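The routing idea mentioned above can be put into numbers with a short sketch. It uses the per-million-token prices quoted in this section; the 1,500-input and 500-output token request size and the 15 percent Opus share are illustrative assumptions, and the dictionary keys are plain labels rather than API model identifiers.

```python
# Per-million-token prices (input, output) as quoted in this section.
PRICES = {
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (1.00, 5.00),
}


def cost_per_request(model, input_tokens, output_tokens):
    """Dollar cost of one request on the given tier."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def blended_cost(shares, input_tokens, output_tokens):
    """Average cost per request when traffic is split across tiers.

    `shares` maps a tier label to its fraction of traffic (must sum to 1).
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(
        share * cost_per_request(model, input_tokens, output_tokens)
        for model, share in shares.items()
    )


# Assumed request size: 1,500 input tokens, 500 output tokens.
all_opus = cost_per_request("opus", 1500, 500)
# Assumed split: 15% of requests go to Opus, the rest to Sonnet.
routed = blended_cost({"opus": 0.15, "sonnet": 0.85}, 1500, 500)

print(f"all Opus: ${all_opus:.4f} per request")
print(f"routed:   ${routed:.4f} per request")
```

Under these assumptions, sending everything to Opus costs about 6 cents per request, while the routed mix costs just under 2 cents. The actual saving depends on how much traffic truly needs the top tier.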
Anthropic's prompt caching reduces input token costs by 90 percent for cached content. Cached reads on Opus cost $1.50 per million tokens instead of $15. For agents that reuse system prompts, tool definitions, and context across calls, caching can cut total API costs by 50 to 70 percent when the cached prefix makes up most of each request; workloads dominated by output tokens see smaller savings. By default the cache persists for five minutes after the last use, making it effective for agents handling steady traffic.
OpenAI GPT Pricing
OpenAI maintains a broad model portfolio with pricing that spans from budget to frontier tiers. Their pricing structure has evolved to include input, output, cached, and reasoning token categories.
GPT-5.5 is the current frontier model at $5 per million input tokens and $30 per million output tokens. It competes directly with Claude Opus on complex reasoning tasks and represents OpenAI's highest-capability offering. When the model reasons through a complex problem step by step, the reasoning tokens it generates are billed in addition to the visible response, so heavy reasoning raises the effective cost per request.
GPT-5.2 offers strong capability at $1.75 per million input tokens and $14 per million output tokens. This model handles most agent workloads effectively and provides a compelling middle ground between cost and capability. For teams already invested in the OpenAI ecosystem, GPT-5.2 serves a similar role to Claude Sonnet.
GPT-4o remains widely used at approximately $2.50 per million input tokens and $10 per million output tokens. Its multimodal capabilities, handling text, images, and audio in a single model, make it useful for agents that need to process diverse input types. The model's maturity means extensive community support, pre-built integrations, and well-documented behavior patterns.
GPT-4o Mini provides the budget option at $0.15 per million input tokens and $0.60 per million output tokens. This model handles simple tasks like classification, extraction, and template filling at costs comparable to Gemini Flash, making it suitable for high-volume preprocessing steps in agent pipelines.
Google Gemini Pricing
Google's Gemini family offers some of the most competitive pricing in the market, particularly for teams that can use the free tier or the ultra-low-cost Flash models for routine agent tasks.
Gemini 2.5 Pro costs $1.25 per million input tokens for prompts up to 200,000 tokens, increasing to $2.50 for longer prompts up to the model's one-million-token window. Output tokens cost $10 per million for standard contexts and $15 per million for the longer ones. The model's large context window makes it particularly cost-effective for agents that need to process large documents, codebases, or conversation histories without expensive summarization steps.
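A minimal sketch of the tiered input pricing described above. It assumes the whole prompt is billed at the higher rate once it exceeds the 200,000-token threshold, rather than only the tokens above the threshold; the function name and parameters are hypothetical.

```python
def tiered_input_cost(prompt_tokens, threshold=200_000,
                      base_price=1.25, long_price=2.50):
    """Dollar cost of the input side of one request.

    Prices are per million tokens. Assumption: a prompt longer than
    `threshold` is billed entirely at `long_price`.
    """
    price = base_price if prompt_tokens <= threshold else long_price
    return prompt_tokens * price / 1_000_000


short_prompt = tiered_input_cost(150_000)  # below the threshold
long_prompt = tiered_input_cost(400_000)   # above the threshold

print(f"150k-token prompt: ${short_prompt:.4f}")
print(f"400k-token prompt: ${long_prompt:.4f}")
```

Under that assumption, a prompt just over the threshold costs about twice as much as one just under it, so trimming context to stay below 200,000 tokens can matter more than the raw token count suggests.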
Gemini 2.5 Flash provides exceptional value at $0.15 per million input tokens and $0.60 per million output tokens in non-thinking mode. Thinking mode, where the model reasons through problems explicitly, raises the output price rather than the input price, because reasoning tokens are billed as output; Google's preview pricing listed thinking-mode output at $3.50 per million tokens. Flash delivers strong performance on most agent tasks at a fraction of frontier model pricing.
Gemini Flash-Lite represents the floor of commercial API pricing at $0.10 per million input tokens and $0.40 per million output tokens. At these prices, even high-volume agent workloads cost just a few dollars per day: 10,000 interactions at 1,500 input and 500 output tokens each come to about $3.50. The model handles classification, simple extraction, and routing tasks adequately, making it an excellent choice for the first stage in a multi-model agent pipeline.
Google offers a generous free tier for Gemini models, providing enough daily requests for development, testing, and light personal use without any API charges. This makes Gemini an attractive option for prototyping agent architectures before committing to a paid model.
Open Source Model Costs
Open source models eliminate per-token API charges entirely, replacing them with infrastructure costs for running the model yourself. Whether this saves money depends on your usage volume, the model size you need, and your willingness to manage GPU infrastructure.
Running a capable open source model like Llama 3, Mistral, or DeepSeek on a cloud GPU instance typically costs $150 to $2,000 per month for the compute alone, depending on the GPU. An NVIDIA T4 instance suitable for smaller models runs $150 to $300 per month on major cloud providers. An A100 instance capable of running larger models costs $800 to $2,000 per month. These costs are fixed regardless of usage volume, up to the throughput the instance can serve, which makes self-hosted models increasingly economical as usage grows.
The breakeven point where self-hosting becomes cheaper than API calls depends on the commercial model you are comparing against and your daily interaction volume. For teams using mid-tier models like Claude Sonnet or GPT-4o, self-hosting typically becomes cost-effective at around 50,000 to 100,000 interactions per day. For teams using budget models like Haiku or Flash, the breakeven point is much higher, often at 200,000 or more daily interactions.
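The breakeven logic can be sketched as a single division: the fixed daily cost of self-hosting over the API cost of one interaction. The function below is a rough illustration, not a sizing tool. The request size of 1,500 input and 500 output tokens and both monthly cost figures are assumptions, and it ignores whether the hardware can actually serve the resulting load.

```python
def breakeven_interactions_per_day(monthly_fixed_cost, input_tokens,
                                   output_tokens, input_price, output_price,
                                   days_per_month=30):
    """Daily interaction count at which self-hosting matches API spend.

    Prices are per million tokens. `monthly_fixed_cost` should cover
    everything self-hosting requires, not only a GPU instance.
    """
    api_cost_per_interaction = (
        input_tokens * input_price + output_tokens * output_price
    ) / 1_000_000
    return monthly_fixed_cost / days_per_month / api_cost_per_interaction


# Compared against Sonnet-level pricing ($3 input, $15 output).
# Hypothetical all-in cost: several GPUs plus operations, $18,000/month.
all_in = breakeven_interactions_per_day(18_000, 1500, 500, 3.00, 15.00)
# A single $2,000/month instance, compute only.
single_gpu = breakeven_interactions_per_day(2_000, 1500, 500, 3.00, 15.00)

print(f"all-in estimate: {all_in:,.0f} interactions/day")
print(f"single GPU only: {single_gpu:,.0f} interactions/day")
```

With these inputs, the hypothetical $18,000 all-in figure breaks even at about 50,000 interactions per day, while a single $2,000 instance on its own would break even near 5,600. The result is very sensitive to what is counted as fixed cost, so engineering time, redundancy, and the number of GPUs needed for the target throughput belong in that number.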
DeepSeek models deserve special mention for their aggressive pricing on the hosted API. DeepSeek V3 charges approximately $0.27 per million input tokens and $1.10 per million output tokens, with cached input tokens billed at about $0.07 per million, a discount of roughly 75 percent. These prices are lower than most commercial offerings while delivering competitive quality on many agent tasks.
Calculating Your Monthly API Bill
Estimating your monthly API costs requires three data points: the average number of tokens per interaction (both input and output), the number of interactions per day, and the per-token price of your chosen model.
A typical agent interaction consumes 500 to 2,000 input tokens (system prompt, context, user message) and generates 200 to 800 output tokens (the agent's response). Complex tasks with large context windows can consume 10,000 to 100,000 input tokens per interaction.
For a concrete example, consider a customer support agent handling 1,000 conversations per day with an average of 1,500 input tokens and 500 output tokens per conversation. That is 1.5 million input tokens and 0.5 million output tokens per day. On Claude Sonnet at $3/$15 per million tokens, the daily cost would be $4.50 for input and $7.50 for output, totaling $12 per day or roughly $360 per month. The same workload on Claude Haiku would cost $1.50 for input and $2.50 for output, totaling $4 per day or $120 per month. On Gemini Flash at $0.15/$0.60, the cost drops to about $0.53 per day, or roughly $16 per month.
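The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch using the prices quoted in this article; the dictionary keys are plain labels rather than API model identifiers, and a 30-day month is assumed.

```python
# Per-million-token prices (input, output) as quoted in this article.
PRICES = {
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku": (1.00, 5.00),
    "gemini-flash": (0.15, 0.60),
}


def daily_cost(model, conversations, input_tokens, output_tokens):
    """Dollar cost per day for a fixed number of equally sized conversations."""
    in_price, out_price = PRICES[model]
    input_cost = conversations * input_tokens * in_price / 1_000_000
    output_cost = conversations * output_tokens * out_price / 1_000_000
    return input_cost + output_cost


# 1,000 conversations per day, 1,500 input and 500 output tokens each.
for model in PRICES:
    per_day = daily_cost(model, 1000, 1500, 500)
    print(f"{model}: ${per_day:.2f}/day, about ${per_day * 30:.0f}/month")
```

Swapping in your own conversation volume and token averages gives a first-order estimate; real traffic varies in length, so measured token counts from logs are a better input than guesses.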
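These estimates assume no caching. With prompt caching enabled and a stable system prompt of 1,000 tokens reused across all interactions, the cached portion of each request is billed at a 90 percent discount, while the remaining 500 input tokens and all output tokens are billed at the normal rate. For the Sonnet example, daily input cost falls from $4.50 to about $1.80, which cuts the monthly bill from $360 to roughly $279 and saves close to $1,000 per year. This ignores the surcharge on cache writes and assumes every request arrives within the cache lifetime.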
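The effect of caching on the Sonnet example can be worked through with a short sketch. It assumes every request hits the cache, a flat 90 percent discount on cached input tokens, and no surcharge for cache writes, so it gives a best-case figure; the function name and parameters are hypothetical.

```python
def daily_cost_with_cache(conversations, input_tokens, cached_tokens,
                          output_tokens, input_price, output_price,
                          cache_discount=0.90):
    """Dollar cost per day when part of every prompt is read from cache.

    Prices are per million tokens. Assumes a 100% cache hit rate and
    ignores any extra charge for writing to the cache.
    """
    uncached_tokens = input_tokens - cached_tokens
    cached_price = input_price * (1 - cache_discount)
    per_conversation = (
        uncached_tokens * input_price
        + cached_tokens * cached_price
        + output_tokens * output_price
    ) / 1_000_000
    return conversations * per_conversation


# Sonnet pricing, 1,000 conversations/day, 1,500 input tokens of which
# 1,000 are a cached system prompt, and 500 output tokens.
with_cache = daily_cost_with_cache(1000, 1500, 1000, 500, 3.00, 15.00)
without_cache = daily_cost_with_cache(1000, 1500, 0, 500, 3.00, 15.00)

print(f"without cache: ${without_cache:.2f}/day")
print(f"with cache:    ${with_cache:.2f}/day")
print(f"monthly saving: ${(without_cache - with_cache) * 30:.0f}")
```

Under these assumptions the daily cost drops from $12.00 to about $9.30, or roughly $279 per month. Because output tokens are not discounted and dominate this workload, the overall saving is closer to a quarter of the bill than to the 90 percent discount on the cached tokens themselves.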
Model selection is the single most impactful cost decision for any AI agent. The price difference between frontier and budget models is 50x or more (Opus at $15 per million input tokens versus Flash-Lite at $0.10 is a 150x gap), and most agent tasks do not require frontier-tier reasoning. Start with a mid-tier model, measure quality, and downgrade where you can without user-visible impact.