How to Set Up Model Fallbacks
Every cloud AI provider experiences occasional disruptions. Rate limits hit during traffic spikes, regional outages take services offline for minutes or hours, and API changes introduce unexpected errors. For agent systems that need to stay operational, a single provider is a single point of failure. Fallback chains reduce this risk by automatically redirecting requests to alternative providers when the primary option fails.
Identify Failure Scenarios
Before configuring fallbacks, understand the types of failures your system needs to handle. The most common scenarios are rate limiting (the provider throttles your requests because you exceeded your quota), server errors (the provider returns 500-series HTTP errors indicating a temporary problem), timeouts (the provider does not respond within your configured window), and full outages (the provider is completely unreachable).
Each failure type has different characteristics. Rate limits are predictable and often come with retry-after headers telling you when to try again. Server errors are usually transient and resolve within minutes. Timeouts may indicate either provider issues or an unusually complex request. Full outages are rare but can last for extended periods.
Your fallback strategy should handle all of these scenarios, but the response to each may differ. A rate limit might trigger a brief wait before retrying the same provider, while a full outage should immediately redirect to an alternative. Document your expected behavior for each scenario before you start configuring.
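One way to document the expected behavior is as a small lookup function. The sketch below is illustrative only: the `Action` names and the `choose_action` helper are hypothetical, and treating timeouts the same way as server errors is an assumption rather than a rule from any particular provider.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT_AND_RETRY = "wait, then retry the same provider"
    RETRY_THEN_FALLBACK = "retry briefly, then fall back"
    FALLBACK_NOW = "fall back immediately"


def choose_action(status: Optional[int], timed_out: bool = False,
                  unreachable: bool = False) -> Action:
    """Map an observed failure to the response documented for it."""
    if unreachable:
        return Action.FALLBACK_NOW           # full outage
    if timed_out:
        return Action.RETRY_THEN_FALLBACK    # assumption: treat like a transient error
    if status == 429:
        return Action.WAIT_AND_RETRY         # rate limit
    if status is not None and 500 <= status <= 599:
        return Action.RETRY_THEN_FALLBACK    # transient server error
    raise ValueError(f"no fallback policy defined for status {status!r}")
```

Raising on anything unlisted keeps undocumented failure types visible instead of silently routing them somewhere.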
Choose Fallback Models
Select fallback models from different providers than your primary choice. The goal is provider diversity. If your primary workhorse model is Claude Sonnet, choose GPT-5 or Gemini 2.5 Pro as fallbacks, not a different Claude model. Provider-level outages affect all models from that provider, so same-provider fallbacks do not protect against the most impactful failure scenario.
Match fallback models to the capability level of the primary model as closely as possible. If your workhorse tier uses Claude Sonnet, the fallback should be a workhorse-tier model from another provider, not an economy model. Downgrading capability during fallback can cause quality issues that are harder to detect than a simple error message.
For the economy tier, consider adding a local Ollama model as the last fallback option. Local models are not affected by cloud provider outages, rate limits, or internet connectivity issues. They are slower and less capable, but for simple tasks they provide a reliable last resort that keeps your agent system functional even when every cloud provider is unreachable.
You do not need three or four fallbacks for every tier. A primary plus two fallbacks provides strong resilience, because the probability of all three providers failing simultaneously is negligible. Additional fallback levels add complexity without meaningful reliability improvement.
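A rough calculation shows why the returns diminish. The sketch below assumes provider failures are independent and uses a hypothetical 1 percent chance that any one provider is down at a given moment; both are simplifications, since real outages can be correlated (for example, through a shared network path).

```python
def chain_unavailability(per_provider_failure: float, providers: int) -> float:
    """Probability that every provider in the chain is down at once,
    assuming independent failures (an idealization)."""
    if not 0.0 <= per_provider_failure <= 1.0:
        raise ValueError("per_provider_failure must be a probability")
    return per_provider_failure ** providers


# Hypothetical 1% failure probability per provider.
for n in (1, 2, 3, 4):
    print(f"{n} provider(s): {chain_unavailability(0.01, n):.0e}")
```

Under these assumptions, three providers are all down about one time in a million, so a fourth provider has very little left to improve.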
Configure Fallback Chains
In LiteLLM, fallback chains are configured as ordered lists of model deployments. When a request to the first model in the chain fails, LiteLLM automatically tries the next model. The chain continues until a request succeeds or all options are exhausted.
Order your fallback chain by preference. Put the primary model first (the one you prefer for cost, quality, or both), followed by the best alternative, followed by the last resort. LiteLLM always tries models in the configured order.
Configure separate fallback chains for each tier. The workhorse fallback chain should contain workhorse-class models. The economy fallback chain should contain economy-class models. Mixing tiers in a single chain can cause unexpected cost spikes when a cheap primary model fails and traffic shifts to an expensive fallback.
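The per-tier rule can be checked mechanically before a configuration is deployed. The sketch below does not use LiteLLM's own configuration schema; the model labels, the `MODEL_TIER` table, and the `mixed_tier_entries` helper are all hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical tier assignments; the model names are placeholder labels.
MODEL_TIER: Dict[str, str] = {
    "primary-workhorse": "workhorse",
    "alt-workhorse": "workhorse",
    "primary-economy": "economy",
    "alt-economy": "economy",
    "local-economy": "economy",
}

# One ordered chain per tier: preferred model first, last resort last.
FALLBACK_CHAINS: Dict[str, List[str]] = {
    "workhorse": ["primary-workhorse", "alt-workhorse"],
    "economy": ["primary-economy", "alt-economy", "local-economy"],
}


def mixed_tier_entries(chains: Dict[str, List[str]],
                       model_tier: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (chain_tier, model) pairs where a model sits in another tier's chain."""
    return [
        (tier, model)
        for tier, chain in chains.items()
        for model in chain
        if model_tier.get(model) != tier
    ]
```

An empty result means no chain crosses tiers; a model missing from `MODEL_TIER` is also reported, which catches typos in chain definitions.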
If your routing layer is not LiteLLM, the same principles apply. Implement the fallback logic as a try/catch wrapper around your model calls. Catch provider-specific exceptions, log the failure, and retry with the next model in the chain. Most provider SDKs throw distinct exceptions for rate limits, server errors, and timeouts, so you can tailor your retry behavior to each error type.
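A minimal sketch of such a wrapper follows. `ProviderError` is a hypothetical stand-in for whatever exception classes your provider SDKs actually raise, and each entry in the chain is assumed to be a plain callable that takes a prompt and returns text.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("fallback")


class ProviderError(Exception):
    """Hypothetical base class standing in for provider SDK exceptions."""


class AllProvidersFailed(Exception):
    """Raised when every model in the chain has failed."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order.

    Returns (name, response) from the first model that succeeds.
    """
    errors: List[Tuple[str, Exception]] = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Log the failure, then move on to the next model in the chain.
            logger.warning("model %s failed: %r", name, exc)
            errors.append((name, exc))
    raise AllProvidersFailed(errors)
```

Returning the model name alongside the response makes it easy to record which model actually served each request. Catching only `ProviderError` is deliberate: programming errors in your own code should surface rather than trigger a fallback.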
Set Timeout and Retry Policies
Timeouts determine how long to wait before declaring a request failed and moving to the fallback. Set them based on the expected response time for each model tier. Economy models typically respond in 1 to 5 seconds, workhorse models in 5 to 30 seconds, and frontier models in 10 to 60 seconds for complex tasks. Set your timeout at roughly 2 to 3 times the expected response time to allow for variability without waiting excessively.
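The multiplier rule is simple enough to encode directly. In the sketch below, the `TYPICAL_SECONDS` values are hypothetical picks from within the ranges quoted above, not measurements; substitute latencies observed in your own traffic.

```python
def timeout_seconds(expected_seconds: float, multiplier: float = 2.5) -> float:
    """Timeout as a multiple (2x to 3x) of the expected response time."""
    if expected_seconds <= 0:
        raise ValueError("expected_seconds must be positive")
    if not 2.0 <= multiplier <= 3.0:
        raise ValueError("multiplier should stay between 2 and 3")
    return expected_seconds * multiplier


# Hypothetical typical latencies per tier, in seconds.
TYPICAL_SECONDS = {"economy": 2.0, "workhorse": 10.0, "frontier": 30.0}

TIMEOUTS = {tier: timeout_seconds(s) for tier, s in TYPICAL_SECONDS.items()}
```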
Retry policies control how many times to retry the same model before falling back. For transient errors (server 500, connection reset), one or two retries with exponential backoff are reasonable. For rate limits, respect the retry-after header if provided, or back off for 30 to 60 seconds. For authentication errors and other 400-series client errors (apart from the 429 rate-limit response), do not retry the same provider, because the same request will fail again.
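These rules can be expressed as one decision function. The sketch below is a set of assumptions rather than a standard: the `retry_delay` name, the 1-second base delay, and capping rate-limit retries at the same `MAX_RETRIES` as server errors are all illustrative choices.

```python
from typing import Optional

MAX_RETRIES = 2  # "one or two retries" before falling back


def retry_delay(status: int, attempt: int,
                retry_after: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before retrying the same model, or None to stop retrying.

    `attempt` is the number of retries already made (0 before the first retry).
    """
    if attempt >= MAX_RETRIES:
        return None
    if status == 429:
        # Respect the server's Retry-After value when present;
        # otherwise use the low end of the 30 to 60 second range.
        return retry_after if retry_after is not None else 30.0
    if 500 <= status <= 599:
        # Exponential backoff: 1 s, then 2 s (base delay is an assumption).
        return 1.0 * (2 ** attempt)
    # Authentication and other client errors: the same request will fail again.
    return None
```

A `None` result is the signal to hand the request to the next model in the chain.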
Set a total timeout for the entire fallback chain, not just individual requests. If your chain has three models with 30-second timeouts each, the worst case is 90 seconds of waiting before the user gets an error. For latency-sensitive applications, reduce individual timeouts or limit the chain length to keep total response time acceptable.
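One way to enforce a total budget is to give each attempt the smaller of its own timeout and whatever time remains. The sketch below assumes each call accepts a timeout in seconds and raises the built-in `TimeoutError` when it expires; the `run_chain_with_deadline` name and the injectable `clock` parameter are illustrative.

```python
import time
from typing import Callable, Optional, Sequence


def run_chain_with_deadline(
    calls: Sequence[Callable[[float], str]],
    per_request_timeout: float,
    total_budget: float,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Try each call in order without exceeding `total_budget` seconds overall."""
    deadline = clock() + total_budget
    last_error: Optional[BaseException] = None
    for call in calls:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent: skip the rest of the chain
        try:
            # Never give a single attempt more time than the chain has left.
            return call(min(per_request_timeout, remaining))
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("fallback chain exhausted or total budget spent") from last_error
```

With three models, 30-second timeouts, and a 45-second budget, the second attempt gets only 15 seconds and the third is skipped, so the caller waits 45 seconds at most instead of 90.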
Avoid aggressive retries that can make the problem worse. If a provider is overloaded and rate limiting you, sending more requests faster does not help. Exponential backoff (waiting progressively longer between retries) reduces pressure on the struggling provider and is more likely to succeed.
Test Failover Behavior
Testing fallbacks requires simulating failures, which is harder than testing normal operation. The simplest approach is to configure an intentionally invalid API key for your primary model and verify that requests fall through to the fallback model correctly. Then invalidate the key for the second model as well and verify that requests reach the last model in the chain. Restore the valid keys when you are done.
Test each failure type separately. Simulate a timeout by setting an unrealistically short timeout (100 milliseconds). Simulate a server error by pointing at an endpoint that returns 500 responses. Simulate a rate limit by sending more requests than your quota allows in a short burst. Each failure type should trigger the appropriate fallback behavior.
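Before testing against real providers, the same checks can run against in-process fakes. The sketch below is a simplified harness: the `Simulated*` exception classes, the `failing` factory, and `first_success` (a minimal stand-in for your routing layer) are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple


class SimulatedTimeout(Exception):
    pass


class SimulatedServerError(Exception):
    pass


class SimulatedRateLimit(Exception):
    pass


def failing(exc_type: type) -> Callable[[str], str]:
    """Build a fake model call that always raises the given failure."""
    def call(prompt: str) -> str:
        raise exc_type("simulated failure")
    return call


def healthy(prompt: str) -> str:
    return "fallback answer"


def first_success(prompt: str,
                  chain: Sequence[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Minimal stand-in for the routing layer under test."""
    for name, call in chain:
        try:
            return name, call(prompt)
        except (SimulatedTimeout, SimulatedServerError, SimulatedRateLimit):
            continue
    raise RuntimeError("all models failed")


def check_failover() -> Dict[str, str]:
    """Run one scenario per failure type; report which model served each."""
    served = {}
    for exc_type in (SimulatedTimeout, SimulatedServerError, SimulatedRateLimit):
        chain = [("primary", failing(exc_type)), ("fallback", healthy)]
        served[exc_type.__name__] = first_success("ping", chain)[0]
    return served
```

Every scenario should report `fallback` as the serving model. Fakes like these verify routing logic only; they do not replace the tests against real endpoints described above.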
Verify that fallback responses are usable, not just that the system does not crash. Compare the output quality from your fallback models against your primary model for representative tasks. If the fallback model produces significantly worse output for certain task types, you need to account for that in your routing logic or choose a better fallback model.
Run failover tests periodically, not just during initial setup. Provider APIs change, model capabilities evolve, and your routing configuration may drift from the tested state. A quarterly failover test catches configuration issues before they matter in a real outage.
Monitor Fallback Frequency
Once fallbacks are live, track how often each fallback level triggers and why. A well-functioning system should use fallbacks infrequently, perhaps 1 to 5 percent of total requests under normal conditions. If fallback frequency is consistently higher, investigate whether the primary provider has reliability issues or whether your timeout and retry policies need adjustment.
Log every fallback event with the reason (timeout, rate limit, server error), the primary model that failed, and the fallback model that handled the request. This data reveals patterns: maybe one provider is consistently slower at certain times of day, or your rate limits are too low for your traffic volume.
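A structured record makes these logs easy to aggregate later. The sketch below uses only the standard library; the `FallbackEvent` fields mirror the three items listed above, and the field and function names are illustrative.

```python
import json
import logging
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FallbackEvent:
    reason: str        # "timeout", "rate_limit", or "server_error"
    failed_model: str  # the model that failed
    served_by: str     # the fallback model that handled the request


def log_event(event: FallbackEvent, logger: logging.Logger) -> None:
    """Emit one JSON line per fallback so logs can be parsed later."""
    logger.info(json.dumps(asdict(event)))


def fallback_rate(events: List[FallbackEvent], total_requests: int) -> float:
    """Fraction of all requests that needed a fallback."""
    return len(events) / total_requests if total_requests else 0.0


def reasons(events: Iterable[FallbackEvent]) -> Counter:
    """Count fallbacks by reason to reveal the dominant failure type."""
    return Counter(event.reason for event in events)
```

Grouping the same records by hour of day or by `failed_model` surfaces the time-of-day and per-provider patterns described above.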
Monitor cost impact during fallback periods. If your primary model is cheaper than your fallback, extended fallback periods can increase costs. If this happens frequently, consider adjusting your model assignments so that the primary and fallback models are at similar price points, or budget for occasional fallback cost overruns.
Set up alerts for abnormal fallback rates. A sudden spike in fallbacks often indicates a provider issue that may require action, such as upgrading your API tier to get higher rate limits or contacting the provider about recurring errors. Catching these patterns early prevents them from becoming persistent problems.
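A sliding window over recent requests is one simple way to detect an abnormal rate. In the sketch below, the `FallbackRateAlert` class, the 200-request window, and the 5 percent threshold are all assumptions to tune for your own traffic.

```python
from collections import deque


class FallbackRateAlert:
    """Flag when the fallback rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 200, threshold: float = 0.05) -> None:
        self.outcomes = deque(maxlen=window)
        self.window = window
        self.threshold = threshold

    def record(self, used_fallback: bool) -> bool:
        """Record one request; return True if an alert should fire."""
        self.outcomes.append(used_fallback)
        if len(self.outcomes) < self.window:
            return False  # not enough data yet to judge the rate
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold
```

Waiting for a full window avoids false alarms from the first few requests after startup, at the cost of a short blind period.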
Effective fallbacks require provider diversity (not same-provider alternatives), appropriate timeouts, and regular testing. Configure separate fallback chains per tier, and monitor fallback frequency to catch reliability issues early.