Circuit Breaker Pattern for AI Pipelines
The Three States
A circuit breaker exists in one of three states: closed, open, or half-open. The current state determines what happens to each incoming request.
Closed is the normal operating state. All requests pass through the breaker to the downstream service. The breaker monitors the results, counting successes and failures. As long as the failure rate stays below the configured threshold, the breaker remains closed and requests flow normally.
Open is the protection state. When the failure rate reaches the threshold, the breaker trips open. All subsequent requests are immediately rejected without contacting the downstream service. This prevents the agent from wasting time and resources on calls that will almost certainly fail, and prevents the failing service from being overwhelmed by continued requests. The breaker stays open for a configured timeout period.
Half-open is the recovery probe state. After the timeout expires, the breaker transitions to half-open, allowing a limited number of test requests through. If these test requests succeed, the breaker closes and normal traffic resumes. If they fail, the breaker opens again for another timeout period. This gradual recovery prevents a flood of requests from crashing a service that is just coming back online.
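The three states can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the names CircuitBreaker, CircuitOpenError, failure_threshold, and open_timeout are made up for this example, the breaker trips on consecutive failures rather than a failure rate to keep the code short, and a single successful probe closes it.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the service."""


class CircuitBreaker:
    # Simplification: trips on consecutive failures, not a failure rate.
    def __init__(self, failure_threshold=3, open_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.open_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout expired: let this request through as a probe.
            self.state = State.HALF_OPEN
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.failures = 0

    def _on_success(self):
        self.failures = 0
        self.state = State.CLOSED
```

Passing the clock in as a parameter is a design choice for testability: the open timeout can be exercised without sleeping.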
Why AI Agents Need Circuit Breakers
AI agents depend on external services more heavily than typical applications. Every reasoning step requires an LLM API call. Tool execution often involves external APIs, databases, or web services. Memory retrieval calls vector databases. Each of these dependencies can fail, and without circuit breakers, failures cascade through the pipeline.
Consider an agent processing a queue of tasks one at a time. The LLM API starts returning 503 errors due to capacity issues. Without a circuit breaker, the agent sends request after request, each timing out after 30 seconds. A queue of 100 tasks that would normally take 5 minutes now takes 50 minutes (100 tasks at 30 seconds each) just to fail, burning API credits on timeout retries and blocking the pipeline for other work.
With a circuit breaker, the first few failures trigger the breaker to open. Subsequent requests fail immediately (in milliseconds, not seconds). The agent can handle the failure gracefully: queue the tasks for later, switch to a fallback model, or notify an operator. When the API recovers, the half-open state probes detect the recovery, and normal processing resumes automatically.
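One way to handle the failure gracefully is to stop draining the queue as soon as the breaker rejects a call and keep the remaining tasks for a later run. The sketch below assumes a hypothetical CircuitOpenError raised by whatever breaker wraps the LLM call; drain_queue and call_llm are illustrative names, not part of any library.

```python
from collections import deque


class CircuitOpenError(Exception):
    """Hypothetical error raised by a breaker that is rejecting calls."""


def drain_queue(tasks, call_llm):
    """Process tasks until the breaker opens, then defer the remainder."""
    pending = deque(tasks)
    done, deferred = [], []
    while pending:
        task = pending.popleft()
        try:
            done.append(call_llm(task))
        except CircuitOpenError:
            # Fail fast: stop calling and keep the rejected task and
            # everything behind it for a later run.
            deferred.append(task)
            deferred.extend(pending)
            pending.clear()
    return done, deferred
```

The deferred list can be re-queued once the breaker closes, or handed to a fallback model instead.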
Configuring Thresholds
The circuit breaker configuration determines how sensitive it is to failures and how quickly it recovers. The key parameters are failure threshold, timeout duration, and success threshold.
Failure threshold is the number or percentage of failures that trips the breaker open. A threshold of 50% over a window of 10 requests means the breaker opens once at least 5 of the last 10 requests have failed. Lower thresholds make the breaker more sensitive (it trips faster) but risk false positives from normal error rates. Higher thresholds are more tolerant but allow more failed requests through before tripping.
For LLM API calls, a failure threshold of 50% over a 20-request window is a reasonable starting point. This allows for occasional transient errors without tripping, while catching sustained outages quickly. For tool calls with higher natural failure rates (like web scraping, which might fail on 10-20% of requests normally), a higher threshold of 70-80% prevents unnecessary tripping.
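A percentage threshold over a request window can be tracked with a fixed-size queue of recent outcomes. The following is a minimal sketch; FailureWindow and its method names are invented for illustration, and waiting for a full window before tripping is an assumption of this example rather than something the pattern requires.

```python
from collections import deque


class FailureWindow:
    """Tracks the failure rate over the most recent `size` requests."""

    def __init__(self, size=20, threshold=0.5):
        self.size = size
        self.threshold = threshold
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.results = deque(maxlen=size)  # True means the request failed

    def record(self, failed):
        self.results.append(bool(failed))

    def failure_rate(self):
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_trip(self):
        # Assumption: require a full window so one early error cannot trip it.
        return len(self.results) == self.size and self.failure_rate() >= self.threshold
```

With size=10 and threshold=0.5, should_trip() first returns True when 5 of the last 10 recorded requests are failures.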
Timeout duration is how long the breaker stays open before transitioning to half-open. Short timeouts (10-30 seconds) work for services that recover quickly. Long timeouts (1-5 minutes) work for services that take time to recover, like overloaded LLM APIs during peak traffic.
Success threshold in the half-open state is how many consecutive successful requests are required before the breaker fully closes. A threshold of 3-5 consecutive successes prevents premature closure when the service is still unstable.
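The three parameters can be grouped into one configuration object per dependency. The sketch below is illustrative only: BreakerConfig, HalfOpenProbe, and the two presets are hypothetical names, and the preset values are simply picked from the ranges mentioned above (the 20-request window for the scraper is an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_rate_threshold: float  # fraction of failures that trips the breaker
    window_size: int               # number of recent requests considered
    open_timeout_s: float          # seconds to stay open before probing
    success_threshold: int         # consecutive probe successes needed to close

    def __post_init__(self):
        if not 0.0 < self.failure_rate_threshold <= 1.0:
            raise ValueError("failure_rate_threshold must be in (0, 1]")
        if self.window_size < 1 or self.success_threshold < 1:
            raise ValueError("window_size and success_threshold must be >= 1")
        if self.open_timeout_s <= 0:
            raise ValueError("open_timeout_s must be positive")


# Starting points taken from the ranges in the text; tune per dependency.
LLM_API = BreakerConfig(0.5, 20, 60.0, 3)
WEB_SCRAPER = BreakerConfig(0.75, 20, 30.0, 3)


class HalfOpenProbe:
    """Counts consecutive probe successes while the breaker is half-open."""

    def __init__(self, config):
        self.config = config
        self.successes = 0

    def record(self, ok):
        """Return 'closed', 'open', or 'half_open' after one probe result."""
        if not ok:
            self.successes = 0
            return "open"  # any failed probe reopens the breaker
        self.successes += 1
        if self.successes >= self.config.success_threshold:
            return "closed"
        return "half_open"
```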
Circuit Breakers for LLM APIs
LLM API calls deserve their own circuit breaker with configuration tuned to the specific failure patterns of model providers. Key considerations include rate limit handling, model fallback, and cost protection.
Rate limit errors (HTTP 429) should be treated differently from server errors (HTTP 500/503). Rate limits are more predictable: a 429 response often includes a Retry-After header that says how long to wait before trying again. When that header is present, a circuit breaker for rate limits should respect it rather than using its own timeout, and fall back to the configured timeout when it is absent. Server errors are less predictable and benefit from standard exponential backoff.
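A sketch of how the open duration might be derived from a response follows. Retry-After can carry either a number of seconds or an HTTP date, so both forms are parsed. The function names and the 60-second default are assumptions for this example, and headers is assumed to be a plain dict keyed exactly as "Retry-After" (real HTTP clients usually offer case-insensitive header lookups).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None):
    """Return the Retry-After delay in seconds, or None if absent or invalid."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        # Form 1: a delay in whole seconds, e.g. "120".
        return max(0.0, float(int(value)))
    except ValueError:
        pass
    try:
        # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def open_duration(status, headers, default_timeout=60.0):
    """How long the breaker should stay open after a failed response."""
    if status == 429:
        delay = retry_after_seconds(headers)
        if delay is not None:
            return delay
    # Server errors, or a 429 without a usable header: use the configured timeout.
    return default_timeout
```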
When the circuit breaker for the primary LLM opens, the agent can fall back to an alternative model. Many applications can switch between providers (OpenAI to Anthropic, or GPT-4 to a smaller model) with acceptable quality degradation. The circuit breaker for the primary provider stays open while the fallback handles requests, and the half-open probe periodically checks whether the primary is back.
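Fallback can be expressed as trying providers in priority order and skipping any whose breaker is open. In this sketch, CircuitOpenError is a hypothetical exception raised by each provider's breaker, and each provider is just a (name, callable) pair; no real provider SDK is used.

```python
class CircuitOpenError(Exception):
    """Hypothetical error raised when a provider's breaker is open."""


def complete_with_fallback(prompt, providers):
    """Try (name, callable) pairs in order; return (name, result) from the first that works.

    Only open-circuit rejections trigger fallback here. Other exceptions
    propagate so the caller's breaker can record them.
    """
    skipped = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except CircuitOpenError:
            skipped.append(name)
    raise RuntimeError("all providers unavailable: " + ", ".join(skipped))
```

Returning the provider name alongside the result lets the caller log which model actually served the request, which helps when judging quality degradation later.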
Circuit breakers also provide natural cost protection. An agent stuck in a retry loop against a failing API can consume significant credits before anyone notices. When the circuit breaker trips, it immediately stops the credit burn and alerts operators to investigate.
Circuit Breakers for Tool Calls
Each tool an agent uses should have its own circuit breaker. A web scraping tool and a database tool fail for different reasons, at different rates, and recover at different speeds. Sharing a circuit breaker between them means that a web scraping failure could block database queries, even though the database is perfectly healthy.
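Keeping one breaker per tool can be as simple as a registry keyed by tool name. The sketch below uses a deliberately tiny stand-in breaker that only counts consecutive failures and omits the recovery timeout; ToolBreaker and BreakerRegistry are hypothetical names for illustration.

```python
class ToolBreaker:
    """Tiny stand-in breaker: open after N consecutive failures (no recovery logic)."""

    def __init__(self, name, failure_threshold):
        self.name = name
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def record(self, ok):
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


class BreakerRegistry:
    """Lazily creates one independent breaker per tool name."""

    def __init__(self, default_threshold=5, overrides=None):
        self.default_threshold = default_threshold
        self.overrides = dict(overrides or {})  # per-tool thresholds
        self._breakers = {}

    def get(self, tool_name):
        if tool_name not in self._breakers:
            threshold = self.overrides.get(tool_name, self.default_threshold)
            self._breakers[tool_name] = ToolBreaker(tool_name, threshold)
        return self._breakers[tool_name]
```

Because each name maps to its own object, failures recorded for one tool never affect another tool's state.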
Tool circuit breakers should track both errors and timeouts. A tool that returns errors quickly is less damaging than a tool that hangs, consuming a thread or connection for minutes before timing out. Timeouts should be counted as failures for circuit breaker purposes, and every tool call should have an explicit timeout limit, so that a hung call becomes a failure the breaker can count instead of blocking indefinitely.
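For async tools, an explicit timeout can be enforced with asyncio.wait_for, with timeouts and errors both recorded as failures. This is a sketch under assumptions: call_with_timeout and the stats dictionary are illustrative stand-ins for whatever bookkeeping the breaker uses.

```python
import asyncio


async def call_with_timeout(tool, timeout_s, stats):
    """Run an async tool call; count both errors and timeouts as failures."""
    try:
        result = await asyncio.wait_for(tool(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # A hung call is cancelled and counted like any other failure.
        stats["timeouts"] += 1
        stats["failures"] += 1
        raise
    except Exception:
        stats["errors"] += 1
        stats["failures"] += 1
        raise
    stats["successes"] += 1
    return result
```

Tracking timeouts separately from errors, while still counting both as failures, keeps the distinction available for diagnosis.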
For tools with natural retry semantics (like web scraping, where a page might load on the second try), the circuit breaker should sit outside the retry loop. Let the tool retry internally 2-3 times, and only count a failure at the circuit breaker level if all retries are exhausted. This prevents the circuit breaker from tripping on normal retry behavior.
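The layering described above can be sketched by wrapping the retrying function in the breaker, so the breaker sees one outcome per logical call. FailureCounter is a hypothetical stand-in for the breaker's bookkeeping, and call_with_retries is an illustrative helper, not a library function.

```python
import time


def call_with_retries(fn, attempts=3, delay_s=0.0, sleep=time.sleep):
    """Call fn up to `attempts` times; re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(delay_s)  # no sleep after the final attempt
    raise last_exc


class FailureCounter:
    """Stand-in for a breaker: records one outcome per wrapped call."""

    def __init__(self):
        self.failures = 0
        self.successes = 0

    def call(self, fn):
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.successes += 1
        return result
```

A call such as counter.call(lambda: call_with_retries(fetch_page)) records a single failure only when every attempt fails; fetch_page here stands for any tool function.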
Monitoring Circuit Breaker State
Circuit breaker state is one of the most valuable operational signals in an AI agent system. A breaker transitioning from closed to open means a dependency is failing. A breaker staying open for extended periods means the dependency is not recovering. A breaker oscillating between open and closed means the dependency is unstable.
Log every state transition with the circuit name, the new state, the error count that triggered the transition, and the time. Alert on transitions to open state so operators can investigate. Track the percentage of time each breaker spends in each state as a reliability metric for the corresponding dependency.
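Transition logging and time-in-state tracking can live in one small helper. The sketch below uses Python's standard logging module; StateTracker, the logger name, and the log message layout are assumptions for illustration. The timestamp comes from the logging formatter, while durations use a monotonic clock.

```python
import logging
import time
from collections import defaultdict

logger = logging.getLogger("circuit_breaker")


class StateTracker:
    """Logs state transitions and accumulates time spent in each state."""

    def __init__(self, circuit_name, clock=time.monotonic):
        self.circuit_name = circuit_name
        self.clock = clock
        self.state = "closed"
        self.entered_at = clock()
        self.seconds_in_state = defaultdict(float)

    def transition(self, new_state, error_count):
        now = self.clock()
        self.seconds_in_state[self.state] += now - self.entered_at
        # Transitions to open are logged at WARNING so alerting can key on them.
        level = logging.WARNING if new_state == "open" else logging.INFO
        logger.log(level, "circuit=%s state=%s->%s errors=%d",
                   self.circuit_name, self.state, new_state, error_count)
        self.state = new_state
        self.entered_at = now

    def percent_in_state(self):
        """Percentage of tracked time per state, including the current one."""
        totals = dict(self.seconds_in_state)
        totals[self.state] = totals.get(self.state, 0.0) + (self.clock() - self.entered_at)
        total = sum(totals.values())
        if total == 0:
            return {}
        return {state: 100.0 * secs / total for state, secs in totals.items()}
```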
Dashboard the current state of all circuit breakers in one view. During an incident, this dashboard immediately shows which dependencies are healthy (closed), which are down (open), and which are recovering (half-open). This saves valuable diagnosis time compared to searching through logs.
Circuit breakers protect AI agents from cascading failures by automatically stopping calls to failing services. Configure separate breakers for each dependency with thresholds tuned to their specific failure patterns. Monitor breaker state transitions as a primary operational health signal.