Graceful Degradation: When AI Models Go Down
The Degradation Spectrum
Degradation is not binary. Between "fully functional" and "completely down" lies a spectrum of reduced capability that a well-designed system can navigate deliberately.
Full capability: all models, tools, and data sources are available. The agent operates at peak quality and speed.
Reduced quality: the primary model is unavailable, so the agent uses a smaller or older model. Responses are less nuanced but still useful. For LLM-dependent agents this is typically the degradation level encountered most often.
Reduced scope: some tools or data sources are unavailable. The agent can still handle tasks that do not require the missing capabilities. A customer support agent without access to the order database can still answer product questions and FAQs.
Read-only mode: the agent can provide information and answer questions but cannot perform actions or make changes. This prevents damage during periods of uncertain system state while still serving users who need information.
Cached responses: the agent serves pre-computed or previously generated responses for common queries. This can cover a large share of real traffic, because query frequencies often follow Zipf-like distributions in which a small number of distinct questions account for a large share of volume.
Queue and defer: the agent accepts tasks and queues them for processing when the service recovers, rather than rejecting them outright. Users receive confirmation that their request was received and will be processed, which is significantly better than an error message.
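The levels above can be written down as an explicit enumeration so that the rest of the system can branch on them. The following is a minimal sketch, not a prescribed policy: the `SystemHealth` fields and the priority order inside `choose_level` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    FULL = "full capability"
    REDUCED_QUALITY = "reduced quality"
    REDUCED_SCOPE = "reduced scope"
    READ_ONLY = "read-only mode"
    CACHED_RESPONSES = "cached responses"
    QUEUE_AND_DEFER = "queue and defer"


@dataclass
class SystemHealth:
    # Hypothetical health flags; a real system would derive these from probes.
    primary_model_up: bool
    fallback_model_up: bool
    all_tools_up: bool
    state_trusted: bool      # False when it is unsafe to perform writes
    cache_available: bool


def choose_level(h: SystemHealth) -> Level:
    """Pick one degradation level (assumed priority order, for illustration)."""
    if not (h.primary_model_up or h.fallback_model_up):
        # No model at all: serve from cache if possible, otherwise defer.
        return Level.CACHED_RESPONSES if h.cache_available else Level.QUEUE_AND_DEFER
    if not h.state_trusted:
        return Level.READ_ONLY
    if not h.all_tools_up:
        return Level.REDUCED_SCOPE
    if not h.primary_model_up:
        return Level.REDUCED_QUALITY
    return Level.FULL
```

Making the level an explicit value also gives logging and alerting a single field to record.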
Model Fallback Chains
A model fallback chain defines a priority-ordered list of models the agent can use, switching to the next model when the current one is unavailable. A typical chain might be: Claude Opus (primary) -> Claude Sonnet (fast fallback) -> GPT-4o (cross-provider fallback) -> a local open-source model (offline fallback).
Each step down the chain trades quality for availability. Claude Opus provides the best reasoning for complex tasks, but if the Anthropic API is down, Claude Sonnet served from a different endpoint or GPT-4o from OpenAI can still provide good results. If all cloud APIs are down, a local model such as Llama provides basic functionality without any external dependency.
The fallback decision should be automatic, triggered by circuit breaker state changes. When the circuit breaker for the primary model opens, the agent immediately starts using the next model in the chain. After a cooldown period the breaker moves to half-open and lets a trial request through; if that request succeeds, the breaker closes and the agent switches back to the primary model. No human intervention is needed.
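A fallback chain driven by circuit breakers might look roughly like the sketch below. The class names, the failure threshold, and the reset timeout are all illustrative assumptions, and the model callables stand in for real API clients.

```python
import time
from typing import Callable, List, Tuple


class CircuitBreaker:
    """Minimal breaker: closed -> open after N failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        # A failed trial request in half-open re-opens the breaker immediately.
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


class FallbackChain:
    """Tries models in priority order, skipping any whose breaker is open."""

    def __init__(self, models: List[Tuple[str, Callable[[str], str]]], **breaker_kwargs):
        self.models = models
        self.breakers = {name: CircuitBreaker(**breaker_kwargs) for name, _ in models}

    def complete(self, prompt: str) -> Tuple[str, str]:
        for name, call in self.models:
            breaker = self.breakers[name]
            if not breaker.allow_request():
                continue
            try:
                result = call(prompt)
            except Exception:
                breaker.record_failure()
                continue
            breaker.record_success()
            return name, result
        raise RuntimeError("all models in the fallback chain are unavailable")
```

Because every call walks the chain from the top, the agent returns to the primary model as soon as its breaker allows a request and that request succeeds. Returning the model name alongside the result lets callers log which tier served each response.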
Important: test your fallback chain regularly. A fallback that has never been exercised in production is a fallback that might not work when you need it. Schedule periodic tests where you deliberately disable the primary model and verify that the fallback chain activates correctly, produces acceptable results, and recovers cleanly.
Capability-Based Degradation
Instead of switching entire models, capability-based degradation selectively disables features that depend on the failing component while keeping everything else running.
An AI agent might have these capabilities: natural language understanding (requires LLM), web search (requires search API), document retrieval (requires vector database), email sending (requires SMTP), and structured data lookup (requires SQL database). If the vector database fails, the agent loses document retrieval but retains all other capabilities. Users who need document search get an appropriate message; users who need other services are unaffected.
This approach requires the agent to understand its own capability graph, knowing which features depend on which services. At startup or configuration change, the agent builds this map. When a service fails, it consults the map to determine which capabilities are affected and adjusts its behavior accordingly. It might update its system prompt to avoid suggesting actions it cannot perform, or proactively inform users about temporarily unavailable features.
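One simple way to represent the capability graph is a mapping from each capability to the set of services it requires. The sketch below uses the capabilities listed above; the service identifiers and helper names are hypothetical.

```python
from typing import Dict, Set

# Capability -> services it depends on (identifiers are illustrative).
CAPABILITY_DEPENDENCIES: Dict[str, Set[str]] = {
    "natural_language_understanding": {"llm"},
    "web_search": {"search_api"},
    "document_retrieval": {"vector_db"},
    "email_sending": {"smtp"},
    "structured_data_lookup": {"sql_db"},
}


def available_capabilities(deps: Dict[str, Set[str]], down: Set[str]) -> Set[str]:
    """Return capabilities whose dependencies are all currently up."""
    return {cap for cap, needs in deps.items() if not (needs & down)}


def system_prompt_note(deps: Dict[str, Set[str]], down: Set[str]) -> str:
    """Build a sentence for the system prompt listing what is unavailable."""
    unavailable = sorted(set(deps) - available_capabilities(deps, down))
    if not unavailable:
        return ""
    return ("Temporarily unavailable: " + ", ".join(unavailable)
            + ". Do not offer these actions.")
```

With `down={"vector_db"}`, only `document_retrieval` is removed, which matches the vector database example in the text.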
Cached and Pre-Computed Responses
Caching previous responses provides a powerful degradation layer for agents that handle repetitive queries. When the LLM is unavailable, the agent can serve cached responses for queries that match or closely resemble previous queries.
Semantic caching, using embeddings to match queries by meaning rather than exact text, extends the coverage of the cache significantly. "How do I reset my password?" and "I forgot my password, how do I change it?" are different strings but the same question, and a semantic cache can serve the same response for both.
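The lookup logic of a semantic cache can be sketched as follows. Note the loud assumption: `toy_embed` is a bag-of-words stand-in so the example runs without external services; a real deployment would plug in an actual embedding model, and the 0.7 similarity threshold is arbitrary and would need tuning.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. NOT a real semantic embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Counter] = toy_embed, threshold: float = 0.7):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the response of the most similar cached query, if similar enough."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The linear scan is fine for a small cache; a larger one would typically use a vector index instead. A threshold set too low risks serving an answer to a different question, so it is worth validating against real query pairs.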
Cache freshness is critical. A cached response about product pricing that is six months old is worse than no response, because it provides confident but incorrect information. Implement time-to-live (TTL) policies that expire cached responses based on the volatility of the underlying information. Static information (documentation, procedures) can be cached for weeks. Dynamic information (pricing, availability) should expire within hours.
When serving a cached response, be transparent about it. A subtle indicator that the response comes from cache rather than a live model allows users to evaluate the freshness and reliability of the information. This honesty preserves trust even during degraded operation.
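Both the TTL policy and the cache indicator can be handled in the cache layer itself. In this sketch the two volatility classes and their TTL values (two weeks and two hours) are assumptions chosen to match the "weeks" and "hours" guidance above, and `with_cache_notice` is a hypothetical helper.

```python
import time
from typing import Dict, Optional, Tuple

# Assumed TTLs per volatility class; tune these for your own data.
TTL_SECONDS = {
    "static": 14 * 24 * 3600,   # documentation, procedures
    "dynamic": 2 * 3600,        # pricing, availability
}


class TTLCache:
    def __init__(self, clock=time.time):
        self.clock = clock
        self._store: Dict[str, Tuple[str, float, float]] = {}

    def put(self, key: str, value: str, volatility: str) -> None:
        self._store[key] = (value, self.clock(), TTL_SECONDS[volatility])

    def get(self, key: str) -> Optional[Tuple[str, float]]:
        """Return (value, age_in_seconds), or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at, ttl = entry
        age = self.clock() - stored_at
        if age >= ttl:
            del self._store[key]  # expired entries are never served
            return None
        return value, age


def with_cache_notice(value: str, age_seconds: float) -> str:
    """Append a short indicator so users can judge freshness."""
    minutes = int(age_seconds // 60)
    return f"{value}\n(Cached answer, generated {minutes} minutes ago.)"
```

Returning the age together with the value keeps the freshness decision and the user-facing notice in one place.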
Queue-Based Deferral
For tasks that are not time-critical, the best degradation strategy is to accept the task, queue it, and process it when the service recovers. This applies to batch processing, report generation, data analysis, and other background work where a delay of minutes or hours is acceptable.
The key design consideration is setting expectations correctly. When accepting a deferred task, tell the user the expected delay and provide a way to check status. "Your request has been received and will be processed within 2 hours. You can check status at any time." This is vastly better than "Service unavailable, try again later."
Queue-based deferral also provides natural load smoothing during partial outages. If the fallback model can handle 50% of normal throughput, half the tasks are processed immediately and half are queued for later. When the primary service returns, draining the backlog at a controlled rate avoids hitting it with a thundering herd of deferred requests.
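A minimal in-memory sketch of deferral is shown below: it accepts a task, returns a confirmation with an expected delay, exposes a status check, and drains in bounded batches. The class and method names are hypothetical, and a production system would use a durable queue rather than process memory.

```python
import uuid
from collections import deque
from typing import Any, Callable, Dict


class DeferredQueue:
    def __init__(self, expected_delay_minutes: int = 120):
        self.expected_delay_minutes = expected_delay_minutes
        self._pending = deque()
        self._status: Dict[str, str] = {}
        self._results: Dict[str, Any] = {}

    def accept(self, payload: Any) -> Dict[str, str]:
        """Queue the task and return a confirmation the user can act on."""
        task_id = uuid.uuid4().hex
        self._pending.append((task_id, payload))
        self._status[task_id] = "queued"
        return {
            "task_id": task_id,
            "message": (
                "Your request has been received and will be processed within "
                f"{self.expected_delay_minutes} minutes."
            ),
        }

    def status(self, task_id: str) -> str:
        return self._status.get(task_id, "unknown")

    def result(self, task_id: str) -> Any:
        return self._results.get(task_id)

    def drain(self, handler: Callable[[Any], Any], max_tasks: int) -> int:
        """Process at most max_tasks so recovery does not flood the service."""
        processed = 0
        while self._pending and processed < max_tasks:
            task_id, payload = self._pending.popleft()
            try:
                self._results[task_id] = handler(payload)
                self._status[task_id] = "done"
            except Exception:
                self._status[task_id] = "failed"
            processed += 1
        return processed
```

Calling `drain` on a timer with a small `max_tasks` is one way to spread the backlog over time once the service is healthy again.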
Designing for Degradation
Graceful degradation does not happen by accident. It requires deliberate architectural decisions made early in the design process.
Decouple components: tightly coupled systems fail atomically. When every feature depends on the same model API call, a single API failure disables everything. Decouple features so that each one can fail independently.
Define degradation levels: for each component, define what happens when it fails. Document the degradation explicitly: "when the vector database is down, the agent uses keyword search instead of semantic search." This documentation becomes the specification for implementing degradation handling.
Implement health-aware routing: the agent should know the current health status of all its dependencies and route requests accordingly. If the primary model is healthy, use it. If it is degraded, route only simple queries to it. If it is down, use the fallback.
Monitor degradation duration: track how long the system spends in each degradation level. Prolonged degradation indicates that recovery is not working, and the team should investigate. Use reliability metrics to set alerts on degradation duration thresholds.
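Health-aware routing and duration monitoring can share one small piece of state. The sketch below assumes three health states and a caller-supplied notion of a "simple" query; the names `route` and `DegradationTracker` and the alert threshold are illustrative, not taken from any particular library.

```python
import time
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"


def route(primary_health: Health, is_simple_query: bool) -> str:
    """Healthy: use primary. Degraded: only simple queries. Down: fallback."""
    if primary_health is Health.HEALTHY:
        return "primary"
    if primary_health is Health.DEGRADED and is_simple_query:
        return "primary"
    return "fallback"


class DegradationTracker:
    """Records how long the primary dependency spends in each health state."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.health = Health.HEALTHY
        self.entered_at = clock()
        self.totals = {h: 0.0 for h in Health}

    def set_health(self, health: Health) -> None:
        now = self.clock()
        self.totals[self.health] += now - self.entered_at
        self.health = health
        self.entered_at = now

    def current_duration(self) -> float:
        return self.clock() - self.entered_at

    def should_alert(self, threshold_seconds: float) -> bool:
        """True when a non-healthy state has lasted at least the threshold."""
        return (self.health is not Health.HEALTHY
                and self.current_duration() >= threshold_seconds)
```

A periodic job could call `should_alert` with the team's chosen threshold and page someone when it returns true, while `totals` feeds longer-term reliability reporting.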
Graceful degradation turns binary failures into a spectrum of reduced capability. Model fallback chains, capability-based feature disabling, semantic response caching, and queue-based deferral help ensure that users still receive some value, even when the system is not operating at full capacity.