Retry Strategies for AI Agents
When to Retry vs. When to Fail
The most important retry decision is not how to retry, but whether to retry at all. Retrying a transient error (network timeout, rate limit, temporary server overload) is productive. Retrying a permanent error (invalid API key, malformed request, deprecated endpoint) is wasteful and delays error reporting.
Classify errors into three categories before deciding on retry behavior. Definitely retryable: HTTP 429 (rate limited), 503 (service unavailable), network timeout, connection reset. Possibly retryable: HTTP 500 (internal server error), which could be transient or persistent. Never retryable: HTTP 400 (bad request), 401 (unauthorized), 403 (forbidden), 404 (not found).
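The three categories above can be sketched as a small lookup. This is a minimal illustration, not a complete policy: the names RetryClass, classify_status, and classify_exception are hypothetical, and treating every unlisted status code as permanent is an assumption made here to keep the default conservative.

```python
from enum import Enum


class RetryClass(Enum):
    RETRY = "retry"  # definitely retryable
    MAYBE = "maybe"  # possibly retryable; the caller decides how often
    NEVER = "never"  # permanent; fail immediately


# Status codes taken from the categories described in the text.
RETRYABLE_STATUS = {429, 503}
MAYBE_STATUS = {500}


def classify_status(status_code: int) -> RetryClass:
    """Map an HTTP status code to a retry category."""
    if status_code in RETRYABLE_STATUS:
        return RetryClass.RETRY
    if status_code in MAYBE_STATUS:
        return RetryClass.MAYBE
    # Assumption: 400, 401, 403, 404 and anything unlisted are permanent.
    return RetryClass.NEVER


def classify_exception(exc: Exception) -> RetryClass:
    """Treat network timeouts and connection resets as transient."""
    if isinstance(exc, (TimeoutError, ConnectionResetError)):
        return RetryClass.RETRY
    return RetryClass.NEVER
```

Keeping classification in one place means the rest of the retry logic only has to ask a single question before sleeping.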
For AI agents specifically, certain model-level errors also need classification. A response that exceeds the output token limit might succeed with a shorter prompt (retryable with modification). A content filter rejection will produce the same result on retry (not retryable). A model overload error is transient (retryable with backoff).
Exponential Backoff
Exponential backoff is the standard retry strategy for distributed systems. Instead of retrying immediately, the agent waits for increasing intervals between attempts: 1 second, then 2 seconds, then 4 seconds, then 8 seconds, and so on. Each retry waits twice as long as the previous one.
The mathematical formula is straightforward: wait time = base * (2 ^ attempt), where base is the initial wait time and attempt is the retry number (starting from 0). With a base of 1 second, the waits are 1, 2, 4, 8, 16, 32 seconds for attempts 0 through 5.
Exponential backoff works because transient failures usually resolve quickly (within seconds), so the first few retries catch most recoveries. If the failure persists, the increasing wait times prevent the agent from hammering the service and making the problem worse. Because each wait doubles, the cumulative time spent waiting stays just under twice the most recent wait, so even a long outage is covered with only a handful of requests.
Always set a maximum wait time (cap) to prevent unreasonably long delays. A cap of 60 to 120 seconds is typical for LLM API calls. Without a cap and with a base of 1 second, attempt 10 (the eleventh retry) would wait 1024 seconds, which is over 17 minutes and rarely useful.
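The formula and the cap can be combined in a few lines. This is a minimal sketch: the function name backoff_delay and the default values (a 1-second base, a 60-second cap) are assumptions chosen to match the numbers used in the text.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Exponential growth, clamped so the wait never exceeds the cap.
    return min(cap, base * (2 ** attempt))


delays = [backoff_delay(n) for n in range(8)]
print(delays)  # prints [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

From attempt 6 onward the uncapped value (64 seconds and up) exceeds the cap, so every later wait is 60 seconds.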
Jitter
Pure exponential backoff has a subtle problem: when multiple agents experience the same failure simultaneously (like an API outage), they all retry at exactly the same intervals. The first retry at 1 second creates a burst. The second retry at 2 seconds creates another burst. These synchronized retry waves, known as a thundering herd, can overwhelm a recovering service and prevent it from stabilizing.
Jitter adds randomness to the wait time to desynchronize retries across agents. There are several jitter strategies.
Full jitter: wait time = random(0, base * 2^attempt). The actual wait is a random value between 0 and the exponential backoff time. This provides maximum spread but sometimes retries too quickly (when the random value is near 0).
Equal jitter: wait time = (base * 2^attempt / 2) + random(0, base * 2^attempt / 2). The wait is half the exponential time plus a random component. This guarantees a minimum wait while still spreading retries.
Decorrelated jitter: wait time = random(base, previous_wait * 3). Each wait is random between the base and three times the previous wait. This produces good spread without strict exponential growth and tends to work well in practice.
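The three strategies translate almost directly into code. The sketch below is illustrative only: the function names, the default base and cap, and the rng parameter (added so the randomness can be seeded in tests) are assumptions, and applying a cap to each strategy is a choice made here rather than part of the formulas themselves.

```python
import random


def exp_delay(attempt: int, base: float, cap: float) -> float:
    """Capped exponential backoff, used as the upper bound for jitter."""
    return min(cap, base * (2 ** attempt))


def full_jitter(attempt, base=1.0, cap=60.0, rng=random):
    # Anywhere between 0 and the exponential delay.
    return rng.uniform(0, exp_delay(attempt, base, cap))


def equal_jitter(attempt, base=1.0, cap=60.0, rng=random):
    # Half the exponential delay is fixed, the other half is random.
    half = exp_delay(attempt, base, cap) / 2
    return half + rng.uniform(0, half)


def decorrelated_jitter(previous_wait, base=1.0, cap=60.0, rng=random):
    # Depends on the previous wait rather than the attempt number.
    return min(cap, rng.uniform(base, previous_wait * 3))
```

Note that decorrelated jitter carries state between calls: the caller passes in the previous wait, starting from the base on the first retry.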
For AI agent systems with multiple concurrent agents sharing the same API provider, jitter is not optional. Without it, a provider outage followed by recovery will be met with a synchronized wall of retry requests that may immediately cause another outage.
Retry Budgets
A retry budget limits the total resources an agent can spend on retries across all operations within a time window. Instead of configuring per-operation retry limits, you set a global budget. For example, the agent can retry up to 10% of its total requests within any 60-second window.
Retry budgets solve a problem that per-operation limits cannot: when many operations fail simultaneously, per-operation retries multiply. If 100 operations each retry 3 times, the system sends 300 retry requests in addition to the 100 original requests, quadrupling the load on the failing service. A retry budget of 10% would allow only 10 retries total, keeping the load manageable.
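One way to implement such a budget is a sliding window over request and retry timestamps. The sketch below is a single-threaded illustration under stated assumptions: the class name RetryBudget and its methods are hypothetical, only original requests (not retries) count toward the total, and the clock is injectable so the behavior can be tested without waiting.

```python
import time
from collections import deque


class RetryBudget:
    """Allow retries up to `ratio` of the requests seen in the last `window` seconds."""

    def __init__(self, ratio=0.10, window=60.0, clock=time.monotonic):
        self.ratio = ratio
        self.window = window
        self.clock = clock
        self._requests = deque()  # timestamps of original requests
        self._retries = deque()   # timestamps of granted retries

    def _prune(self, now):
        # Drop timestamps that have aged out of the window.
        for queue in (self._requests, self._retries):
            while queue and now - queue[0] >= self.window:
                queue.popleft()

    def record_request(self):
        self._requests.append(self.clock())

    def try_acquire_retry(self) -> bool:
        """Return True and consume budget if one more retry is allowed."""
        now = self.clock()
        self._prune(now)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False
        self._retries.append(now)
        return True
```

With 100 recorded requests and a 10% ratio, the first 10 calls to try_acquire_retry succeed and every further call in the same window is refused, which matches the arithmetic in the text.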
For AI agents, retry budgets are particularly useful because agents often make many sequential API calls within a single task. A budget prevents a cascade of retries from consuming the entire task execution time. If the budget is exhausted, the agent fails fast and reports the error rather than spending minutes on futile retries.
Retry with Modification
Some failures can be resolved not by retrying the same request, but by modifying the request before retrying. This is particularly relevant for AI agents, where the request content (prompt, parameters, context) can be adjusted.
Context reduction: if a request fails because the context window is too large, the agent can summarize or truncate the context and retry with a shorter prompt. This addresses context overflow errors without losing the task.
Model downgrade: if the primary model is unavailable or overloaded, the agent can retry with a different model. Switching from GPT-4 to GPT-3.5, or from Claude Opus to Claude Haiku, provides a lower-quality but functional response.
Parameter adjustment: if a request fails with specific parameter values (like a temperature that produces invalid output or a max_tokens that is too large), the agent can adjust the parameters and retry.
Request splitting: if a request fails because it is too complex, the agent can split it into smaller sub-requests and retry each one independently. A query that asks for analysis of 50 items can be split into 5 queries of 10 items each.
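The four modifications above can be sketched as a function that maps an error kind to an adjusted request. Everything here is hypothetical: the Request fields, the error-kind strings, the placeholder model names, and the halving heuristics are assumptions for illustration. In particular, real context reduction would normally summarize rather than cut the prompt in half.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Request:
    model: str
    prompt: str
    max_tokens: int


# Hypothetical fallback chain; the model names are placeholders.
FALLBACK_MODEL = {"primary-model": "fallback-model"}


def modify_for_retry(req: Request, error_kind: str) -> Optional[Request]:
    """Return an adjusted request, or None if no modification applies."""
    if error_kind == "context_overflow":
        # Context reduction: keep only the most recent half of the prompt.
        keep = len(req.prompt) // 2
        return replace(req, prompt=req.prompt[-keep:]) if keep else None
    if error_kind == "model_unavailable":
        # Model downgrade: move to the next model in the chain, if any.
        fallback = FALLBACK_MODEL.get(req.model)
        return replace(req, model=fallback) if fallback else None
    if error_kind == "max_tokens_too_large":
        # Parameter adjustment: halve the output limit.
        if req.max_tokens <= 1:
            return None
        return replace(req, max_tokens=req.max_tokens // 2)
    return None  # e.g. a content filter rejection: retrying will not help


def split_items(items, chunk_size):
    """Request splitting: break a large batch into smaller sub-requests."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
```

Returning None for unrecognized errors keeps the default safe: the caller fails instead of retrying a request that would produce the same result.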
Implementing Retries in Practice
A well-implemented retry wrapper for AI agent API calls includes several components working together. The wrapper classifies the error to determine if retry is appropriate. It applies the backoff algorithm (exponential with jitter) to calculate the wait time. It checks the retry budget to ensure retries are not exhausted. It optionally modifies the request based on the error type. And it logs each retry attempt with the error, wait time, and attempt number for debugging and monitoring.
The retry loop should be separate from the business logic. The agent code should make a simple function call and receive either a successful result or a final error. The retry logic handles the complexity of repeated attempts, backoff timing, and budget management internally. This separation makes the agent code cleaner and the retry logic reusable across different operations.
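A wrapper along these lines might look like the following. It is a sketch under several assumptions: the exception classes PermanentError and TransientError stand in for whatever error classification the agent uses, allow_retry is a hypothetical hook for a retry budget check, and capped full jitter is one reasonable choice of backoff among several. The sleep and rng parameters are injectable only to make the function testable.

```python
import logging
import random
import time

logger = logging.getLogger("retry")


class PermanentError(Exception):
    """Hypothetical marker for errors that must not be retried."""


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying."""


def call_with_retry(operation, max_attempts=5, base=1.0, cap=60.0,
                    allow_retry=lambda: True, sleep=time.sleep, rng=random):
    """Run operation(); retry transient failures with capped full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            result = operation()
            logger.info("succeeded on attempt %d", attempt + 1)
            return result
        except PermanentError:
            logger.error("permanent error on attempt %d; not retrying", attempt + 1)
            raise
        except TransientError as exc:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not allow_retry():
                logger.error("giving up after %d attempts: %s", attempt + 1, exc)
                raise
            wait = rng.uniform(0, min(cap, base * (2 ** attempt)))
            logger.warning("attempt %d failed (%s); waiting %.2fs",
                           attempt + 1, exc, wait)
            sleep(wait)
```

The agent code then reduces to a single call such as call_with_retry(lambda: client_call(request)), where client_call is whatever function performs the API request, and it receives either a result or the final exception.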
Always log the final outcome: whether the operation succeeded on retry (and on which attempt), or whether it failed after exhausting retries. This data is essential for tuning retry parameters. If most successes happen on the first retry, the failures are short-lived and your initial wait might be longer than it needs to be. If operations frequently exhaust all retries without success, the underlying dependency might have a persistent problem that retries cannot fix.
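Once outcomes are logged, a small aggregation makes the tuning signals visible. The sketch below assumes each logged outcome has been reduced to a (succeeded, attempts) pair; that record shape and the function name summarize_outcomes are hypothetical.

```python
from collections import Counter


def summarize_outcomes(outcomes):
    """Summarize an iterable of (succeeded: bool, attempts: int) pairs."""
    success_by_attempt = Counter()
    exhausted = 0
    total = 0
    for succeeded, attempts in outcomes:
        total += 1
        if succeeded:
            success_by_attempt[attempts] += 1
        else:
            exhausted += 1
    return {
        "total": total,
        "success_by_attempt": dict(success_by_attempt),
        # Share of operations that failed even after all retries.
        "exhausted_rate": exhausted / total if total else 0.0,
    }
```

A high count at attempt 2 (the first retry) and a rising exhausted_rate are the two patterns worth watching when adjusting the base wait and the attempt limit.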
Use exponential backoff with jitter for all retryable operations. Classify errors before retrying, and never retry permanent errors. Set retry budgets to prevent retry storms during widespread failures. Consider retry with modification (model downgrade, context reduction) for AI-specific failure modes.