Token Costs Explained: What You Pay Per AI Call
What Tokens Are
A token is a chunk of text that the AI model processes as a single unit. Tokenization breaks text into pieces that do not always align with word boundaries. Common English words like "the," "and," and "is" are single tokens. Longer or less common words get split into multiple tokens. The word "understanding" might be two tokens ("under" and "standing"), while "AI" is typically one token.
The general rule of thumb is that one token equals approximately four characters of English text. This means 100 tokens is roughly 75 words, 1,000 tokens is roughly 750 words, and one million tokens is roughly 750,000 words, equivalent to about 10 average-length novels. These approximations hold well for standard English prose but vary for code, technical content, and non-English languages.
Code tokenization tends to be less efficient than prose. Whitespace, brackets, operators, and variable names each consume tokens, and indentation in languages like Python generates tokens that carry no semantic meaning. A 100-line Python function might consume 500 to 1,000 tokens depending on complexity and formatting. JSON structures are particularly token-heavy because quotation marks, colons, commas, and brackets each take one or more tokens.
Non-English languages generally require more tokens per word than English because tokenizers are trained primarily on English text. Chinese, Japanese, and Korean text can consume two to three times more tokens per character than English. Arabic, Hindi, and other scripts also tokenize less efficiently. Agents serving multilingual users should account for this when estimating costs.
Input vs Output Token Pricing
Every AI API charges different rates for input tokens and output tokens, with output tokens always costing more. This pricing asymmetry reflects the computational reality: generating new text requires more GPU compute than processing existing text.
Input tokens include everything you send to the model in a single API call. This encompasses the system prompt (instructions that define the agent's behavior), the conversation history (previous messages in the interaction), any retrieved context (documents, knowledge base entries, tool outputs), tool definitions (descriptions of tools the agent can use), and the current user message. For a typical agent call, the user's actual message might be only 50 to 200 tokens while the total input reaches 1,000 to 5,000 tokens due to all the surrounding context.
Output tokens include everything the model generates in response. This covers the agent's visible response to the user, any tool calls the model decides to make, reasoning tokens in thinking mode (which appear in the output token count but may not be visible to the user), and structured metadata like function call arguments. Output is typically shorter than input, often one-quarter to one-half the input length for conversational tasks.
The pricing ratio between input and output tokens varies by provider and model. Anthropic charges five times more for output than input across all Claude models ($3 input vs $15 output for Sonnet, for example). OpenAI's ratio ranges from four to eight times depending on the model. Google's Gemini models typically charge four to seven times more for output than input.
This pricing structure means that agent responses are more expensive per token than the prompts that generate them. An agent that generates verbose, detailed responses costs significantly more than one that provides concise, focused answers, even when the input is identical. Controlling output length through system prompt instructions and max token settings is a direct cost optimization lever.
How Agent Architecture Affects Token Consumption
The way you build your agent determines how many tokens it consumes per interaction, often more than the model choice itself. Two agents performing the same task can differ by 5x or more in token consumption based purely on architectural decisions.
System prompt design has the largest architectural impact. Every API call includes the system prompt, so its size directly multiplies across all interactions. A 500-token system prompt versus a 3,000-token system prompt means an extra 2,500 tokens per call. At 10,000 calls per day, that difference amounts to 25 million extra tokens daily, costing $75 per day on Claude Sonnet or $375 per day on Opus.
Conversation history management determines how much context accumulates over a multi-turn conversation. Without any management, the full history is sent with every call, growing linearly with each turn. A 20-turn conversation with 300 tokens per turn sends 6,000 tokens of history on the 20th call. Sliding window approaches that keep only the last N turns, summarization that compresses older history, and selective inclusion that sends only relevant prior turns all reduce this accumulation.
Tool definitions consume tokens proportional to the number and verbosity of tool descriptions. Each tool definition typically requires 100 to 300 tokens for the name, description, and parameter schema. An agent with 20 tools sends 2,000 to 6,000 extra input tokens on every call, regardless of whether any tools are relevant to the current request. Dynamic tool selection, where only the tools relevant to the detected intent are included, reduces this overhead by 60 to 80 percent.
Multi-step reasoning chains multiply total token consumption by the number of steps. An agent that makes three sequential API calls to handle a single user request consumes three times the tokens of a single-call approach. Each subsequent call typically includes growing context from previous steps, making later calls progressively more expensive. Minimizing the number of calls while maintaining quality is a constant engineering tradeoff.
Estimating Your Token Costs
To estimate your monthly token costs, you need three numbers: average tokens per interaction, interactions per day, and price per token for your chosen model.
Average tokens per interaction breaks down into input and output. For a typical conversational agent, expect 1,000 to 3,000 input tokens (system prompt plus context plus user message) and 200 to 800 output tokens (agent response). For a coding agent, expect 3,000 to 15,000 input tokens (system prompt plus code context plus instructions) and 500 to 3,000 output tokens (generated code). For a research agent, expect 5,000 to 50,000 input tokens (system prompt plus retrieved documents) and 1,000 to 5,000 output tokens (analysis and synthesis).
The monthly cost formula is straightforward: (daily interactions multiplied by average input tokens multiplied by input price per token) plus (daily interactions multiplied by average output tokens multiplied by output price per token) multiplied by 30 days. Apply a 0.5 to 0.7 multiplier if you use prompt caching effectively, and add a 1.1 to 1.2 multiplier for retries and overhead.
A worked example: a customer support agent handling 2,000 interactions per day, averaging 2,000 input tokens and 500 output tokens per interaction, running on Claude Sonnet with effective caching. Input cost per day is 2,000 interactions multiplied by 2,000 tokens multiplied by $3 per million tokens multiplied by 0.5 caching factor, equaling $6 per day. Output cost per day is 2,000 interactions multiplied by 500 tokens multiplied by $15 per million tokens, equaling $15 per day. Daily total is $21, monthly total is approximately $630 with a 1.15 retry overhead factor.
Token Optimization Techniques
Reducing token consumption directly reduces costs. The most effective optimization techniques target the largest sources of token waste without compromising agent quality.
Prompt compression involves rewriting system prompts and instructions to convey the same information in fewer tokens. Removing redundant instructions, replacing verbose descriptions with concise ones, and eliminating examples that do not improve model behavior typically reduces system prompt size by 30 to 50 percent. The key is measuring quality before and after compression to ensure no regression.
Context windowing limits the conversation history sent with each call. Instead of sending the full history, include only the last three to five turns plus a compressed summary of earlier turns. This approach keeps context manageable and reduces per-call input tokens by 50 to 80 percent in long conversations without significantly affecting response quality.
Response length control through max_tokens settings and explicit instructions to be concise reduces output token consumption. Setting max_tokens to 500 instead of the default 4,096 prevents the model from generating unnecessarily long responses. Adding instructions like "respond in under 100 words" to the system prompt further constrains output length for tasks where brevity is acceptable.
Caching at multiple levels reduces the number of API calls entirely. Application-level caching stores complete responses for identical or nearly identical queries. Semantic caching identifies when a new query is sufficiently similar to a cached query to reuse the cached response. These approaches can eliminate 20 to 50 percent of API calls for agents with repetitive workloads, removing those tokens from the bill entirely.
Output tokens cost 2 to 5 times more than input tokens, and your agent architecture determines total token consumption more than the task itself. Optimize system prompt size, manage conversation history, and control response length to reduce costs by 40 to 60 percent without changing models.