AI Agent Crash Recovery: Automatic Restart Patterns

Updated May 2026
Crash recovery is the process of automatically detecting when an AI agent has failed, cleaning up any resources it left behind, restoring its state from the most recent checkpoint, and restarting it to continue its work. Effective crash recovery transforms agent failures from catastrophic events into minor interruptions, measured in seconds rather than hours of lost productivity.

The Recovery Lifecycle

Every crash recovery follows the same four phases: detection, cleanup, restoration, and restart. The speed and reliability of each phase determine how effectively the system handles failures.

Detection happens through one of several mechanisms. Process monitoring (as in supervision trees) detects crashes immediately when the process exits. Health check polling detects hung or degraded processes at configurable intervals. Timeout watchdogs detect processes that stop making progress within expected timeframes. The fastest detection comes from process monitoring, which provides near-instant notification of crashes.
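A timeout watchdog can be sketched in a few lines. This is a minimal illustration, not a production monitor: the class name `ProgressWatchdog`, its `beat` and `is_stalled` methods, and the injectable `clock` parameter are all assumptions made for this example.

```python
import time
from typing import Callable


class ProgressWatchdog:
    """Flags an agent as stalled when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout
        self._clock = clock              # injectable so tests can control time
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the agent each time it completes a unit of work."""
        self._last_beat = self._clock()

    def is_stalled(self) -> bool:
        """True when the last heartbeat is older than the timeout."""
        return self._clock() - self._last_beat > self.timeout
```

A supervisor would poll `is_stalled()` on an interval and trigger recovery when it returns true. A monotonic clock is used so that wall-clock adjustments do not cause false alarms.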

Cleanup releases any resources the crashed process was holding. This includes closing database connections, releasing file locks, canceling pending API requests, and removing temporary files. Without proper cleanup, the restarted process may fail immediately because the resources it needs are still locked by the ghost of the previous instance.

Restoration loads the saved state from the most recent checkpoint. This includes task progress, conversation history summaries, intermediate results, and configuration. The quality of restoration depends entirely on the quality and recency of the checkpoint. A checkpoint from five seconds ago loses five seconds of work. A checkpoint from an hour ago loses an hour.

Restart initializes a new process instance with the restored state and resumes execution. The agent picks up its task from the checkpoint position, re-establishes connections to external services, and continues working. From the outside, the interruption may be invisible if the recovery completes quickly enough.
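The four phases can be sketched as a small supervision loop. This is an illustration under simplifying assumptions: an in-process exception stands in for a process crash, and `supervise`, `run_agent`, `cleanup`, and `load_checkpoint` are hypothetical names chosen for this example.

```python
from typing import Any, Callable, Dict


def supervise(
    run_agent: Callable[[Dict[str, Any]], Any],
    cleanup: Callable[[], None],
    load_checkpoint: Callable[[], Dict[str, Any]],
    max_restarts: int = 3,
) -> Any:
    """Run `run_agent(state)` to completion, recovering from crashes along the way."""
    state = load_checkpoint()
    for attempt in range(max_restarts + 1):
        try:
            return run_agent(state)      # restart (or the first start)
        except Exception:                # detection: the crash surfaces here
            if attempt == max_restarts:
                raise                    # give up and report the failure
            cleanup()                    # cleanup: release locks, temp files
            state = load_checkpoint()    # restoration: most recent saved state
```

The agent is expected to write its own checkpoints as it works, so each restart resumes from the latest saved position rather than from the beginning.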

Cold Start vs. Warm Start

A cold start initializes the agent from scratch with no prior state. The agent loads its configuration, connects to services, builds its context from scratch, and starts the task from the beginning. Cold starts are simple and reliable, but they waste all work done before the crash.

A warm start loads the agent from a saved checkpoint, restoring its state to a recent point in time. The agent skips the work it has already completed and resumes from where it left off. Warm starts are more complex because they require checkpoint infrastructure, state serialization, and validation of the restored state.

The choice between cold and warm start depends on the cost of lost work. For short tasks (under a minute), cold start is usually fine because the cost of repeating the work is low and the complexity of checkpointing is not justified. For long tasks (minutes to hours), warm start is essential because losing an hour of completed work to repeat it from scratch is unacceptable.

A hybrid approach uses cold start for process initialization and warm start for task state. The agent process itself starts fresh with clean connections and empty caches, but loads the task state from a checkpoint. This combines the reliability of cold starts (no stale connections or corrupted process state) with the efficiency of warm starts (no lost task progress).
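A hybrid start might look like the following sketch. The function names, the JSON checkpoint file, and the dictionary layout are assumptions made for illustration; a real agent would hold actual connection and cache objects.

```python
import json
from pathlib import Path


def fresh_task_state() -> dict:
    """Task state for a cold start: nothing completed yet."""
    return {"completed_steps": [], "results": {}}


def start_agent(checkpoint_path: Path) -> dict:
    # Cold part: process-level resources are always rebuilt, never restored.
    process = {"connections": {}, "cache": {}}

    # Warm part: task progress comes from the checkpoint when one exists.
    if checkpoint_path.exists():
        task = json.loads(checkpoint_path.read_text())
        warm = True
    else:
        task = fresh_task_state()
        warm = False

    return {"process": process, "task": task, "warm": warm}
```

Only the task state crosses the crash boundary here. Everything tied to the old process is discarded on purpose.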

Restart Policies

Immediate restart starts the replacement process as quickly as possible. This minimizes downtime but risks restart storms if the crash is caused by a persistent condition. If the same bug crashes the process immediately after restart, it will crash and restart repeatedly, consuming resources without making progress.

Delayed restart waits a configurable period before starting the replacement. This gives transient conditions (like API rate limits or network congestion) time to resolve before the agent tries again. A fixed delay of 5 to 30 seconds is often enough for transient failures of this kind.

Exponential backoff restart doubles the delay after each consecutive restart. The first restart waits 1 second, the second waits 2 seconds, the third waits 4 seconds, the fourth waits 8 seconds, and so on. This prevents restart storms while still recovering quickly from isolated failures. A maximum delay cap (typically 5 to 10 minutes) prevents the delay from growing indefinitely.
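The doubling schedule with a cap reduces to a short function. This is a sketch: the name `restart_delay` and the default values (1 second base, 300 second cap) are assumptions chosen to match the figures above.

```python
def restart_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Delay in seconds before restart number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Bound the exponent so the power stays small; 2**32 seconds is
    # far beyond any realistic cap.
    exponent = min(attempt - 1, 32)
    return min(cap, base * 2 ** exponent)


delays = [restart_delay(n) for n in range(1, 5)]
# delays == [1.0, 2.0, 4.0, 8.0]
```

With these defaults the delay reaches the 300 second cap at the tenth consecutive restart and stays there.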

Conditional restart examines the crash reason before deciding whether to restart. A crash due to a network timeout warrants automatic restart. A crash due to an invalid API key does not, because the same error will occur immediately after restart. Conditional restart requires structured error types that distinguish between retryable and permanent failures.
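One way to structure the error types is a small exception hierarchy that carries a `retryable` flag. All class names below are hypothetical, and treating unknown exceptions as permanent is an assumption of this sketch, not a universal rule.

```python
class AgentError(Exception):
    """Base class for crashes the supervisor knows how to classify."""
    retryable = False


class TransientError(AgentError):
    """The condition may clear on its own, so a restart is worthwhile."""
    retryable = True


class PermanentError(AgentError):
    """The same failure would recur immediately after a restart."""
    retryable = False


class NetworkTimeout(TransientError):
    pass


class InvalidApiKey(PermanentError):
    pass


def should_restart(exc: BaseException) -> bool:
    # Assumption: unclassified exceptions are treated as permanent, so an
    # unknown bug cannot trigger a restart storm.
    return isinstance(exc, AgentError) and exc.retryable
```

A supervisor would call `should_restart` on the crash reason and escalate to an operator when it returns false.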

State Restoration Strategies

The most challenging part of crash recovery is restoring agent state accurately. Several strategies exist, each with different tradeoffs.

Last checkpoint restoration loads the most recent checkpoint from persistent storage. This is the standard approach and works well when checkpoints are frequent and complete. The risk is that the checkpoint may be outdated if checkpointing intervals are long, or incomplete if the checkpoint did not capture all relevant state.

Event replay stores a log of all events (tool calls, model responses, state changes) and replays them to reconstruct the agent state. When the log is complete, this restores the state exactly, but it can be slow for agents with long event histories. It also requires either that all operations be deterministic or that their outputs be captured in the log, because re-executing a tool call might produce a different result the second time.

Incremental reconstruction loads a base checkpoint and then replays only the events that occurred since that checkpoint. This combines the speed of checkpoint restoration with the accuracy of event replay, providing a practical middle ground for most production systems.
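Incremental reconstruction can be sketched as follows, assuming each event carries a sequence number and the checkpoint records the last sequence number it includes. The `Event` type, the key/value state model, and the field names are simplifications invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Event:
    seq: int      # position in the event log
    key: str      # which piece of state the event updates
    value: Any    # recorded output, so replay never re-runs the tool


def reconstruct(checkpoint: Dict[str, Any], log: List[Event]) -> Dict[str, Any]:
    """Apply only the events newer than the checkpoint to the checkpointed state."""
    state = dict(checkpoint["state"])
    last_seq = checkpoint["last_seq"]
    for event in sorted(log, key=lambda e: e.seq):
        if event.seq <= last_seq:
            continue                     # already reflected in the checkpoint
        state[event.key] = event.value
        last_seq = event.seq
    return {"state": state, "last_seq": last_seq}
```

Because the log stores recorded outputs rather than instructions to re-execute, replay is deterministic even when the original tool calls were not.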

Partial restoration with re-derivation loads only the essential state (task definition, completed steps, key results) and re-derives the rest. Conversation summaries can be regenerated from key milestones. Tool configurations can be reloaded from their sources. This approach accepts some quality loss in exchange for simpler checkpointing and faster recovery.

Handling In-Flight Operations

The trickiest aspect of crash recovery is dealing with operations that were in progress when the crash occurred. An agent might have sent a request to an external API but crashed before receiving the response. Did the request succeed? Was it even received?

For idempotent operations (operations that leave the system in the same state regardless of how many times they are executed), the safest approach is to simply retry them. Repeating a GET request does not change anything on the server. Looking up a record in a database twice has no side effects. The repeated operation has no negative effect.

For non-idempotent operations (operations that have side effects on each execution), you need either idempotency keys or verification steps. An idempotency key is a unique identifier sent with the request that the receiving system uses to detect duplicates. If the same key is sent twice, the second request returns the result of the first without executing again. Many payment APIs and email services support idempotency keys.
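For a retry after a crash to be recognized as a duplicate, the recovered agent must send the same key as the crashed one. One option is to derive the key deterministically from the task and step, as in this sketch. `FakePaymentService` is an in-memory stand-in invented for this example and does not model any real payment API.

```python
import hashlib
from typing import Dict


def idempotency_key(task_id: str, step: int) -> str:
    """Deterministic key: the same task step always maps to the same key."""
    return hashlib.sha256(f"{task_id}:{step}".encode()).hexdigest()


class FakePaymentService:
    """In-memory stand-in for a service that deduplicates requests by key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.charges = 0

    def charge(self, key: str, amount: int) -> str:
        if key in self._results:
            return self._results[key]    # duplicate: return the stored result
        self.charges += 1
        receipt = f"receipt-{self.charges}"
        self._results[key] = receipt
        return receipt
```

An alternative is to generate a random key and persist it in the checkpoint before sending the request. Either way, the key has to survive the crash.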

When idempotency keys are not available, the recovered agent must verify the outcome of the in-flight operation before deciding whether to retry it. Check whether the email was actually sent, whether the database record was actually created, whether the payment was actually processed. Only retry if verification confirms the operation did not complete.
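The verify-before-retry rule can be expressed as a small helper. This is a sketch: `resume_in_flight` and its two callables are hypothetical, and the verification itself (querying a sent-mail log, looking up the record, checking the payment status) depends entirely on the external system.

```python
from typing import Callable


def resume_in_flight(
    already_done: Callable[[], bool],
    perform: Callable[[], None],
) -> str:
    """Retry an interrupted operation only if verification shows it did not complete."""
    if already_done():
        return "skipped"     # the operation completed before the crash
    perform()
    return "retried"
```

If the verification step cannot give a definite answer, retrying risks a duplicate side effect and skipping risks a lost operation, so that case usually needs an explicit policy of its own.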

Recovery Testing

Crash recovery should be tested as thoroughly as the happy path. Many systems implement crash recovery that has never actually been exercised, and failures in the recovery path are discovered only during real outages, the worst possible time to find bugs.

Chaos engineering techniques apply directly to recovery testing. Randomly kill agent processes during task execution and verify that recovery produces correct results. Inject failures at different points in the task lifecycle (during API calls, during tool execution, and during state updates) to ensure that recovery handles every failure point.

Test recovery with corrupted checkpoints to verify that the system degrades gracefully when the checkpoint is damaged. Test recovery with missing checkpoints to verify that the system falls back to cold start. Test recovery under load to verify that the recovery mechanism itself does not become a bottleneck when many agents crash simultaneously.
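The corrupted and missing checkpoint cases can be exercised with a short test. The sketch below assumes a JSON checkpoint wrapped in an envelope with a SHA-256 checksum; the file format and the function names are illustrative choices, not a prescribed design.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state together with a checksum of its serialized form."""
    payload = json.dumps(state, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()
    path.write_text(json.dumps({"sha256": digest, "payload": payload}))


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved state, or None when it is missing or damaged."""
    try:
        envelope = json.loads(path.read_text())
        payload = envelope["payload"]
        if hashlib.sha256(payload.encode()).hexdigest() != envelope["sha256"]:
            return None
        return json.loads(payload)
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        return None


def check_recovery_paths(tmp_dir: Path) -> None:
    path = tmp_dir / "agent.ckpt"
    assert load_checkpoint(path) is None            # missing: fall back to cold start
    save_checkpoint(path, {"step": 7})
    assert load_checkpoint(path) == {"step": 7}     # intact: warm start
    path.write_text(path.read_text()[:-10])         # simulate a torn write
    assert load_checkpoint(path) is None            # corrupted: fall back to cold start
```

A caller that receives `None` falls back to a cold start, so a damaged checkpoint costs lost progress but does not block recovery.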

Key Takeaway

Effective crash recovery follows four phases: detection, cleanup, restoration, and restart. The choice between cold start and warm start depends on the cost of lost work, which grows with task duration. Exponential backoff restart policies prevent restart storms. And recovery should be tested as rigorously as the primary functionality, because untested recovery is unreliable recovery.