State Checkpointing: Saving Agent Progress
What to Checkpoint
Not all agent state is worth saving. Effective checkpointing distinguishes between essential state that must survive a crash and ephemeral state that can be reconstructed or discarded.
Essential state includes the task definition and parameters, the list of completed steps and their results, accumulated intermediate data that is expensive to recompute, conversation history summaries, and a record of any external side effects (emails sent, API calls made, files created). Losing this state means repeating the work that produced it. Losing the side-effect record is worse, because the resumed agent may repeat actions that should happen only once, such as sending the same email twice.
Ephemeral state includes cached model responses that can be regenerated, open network connections that will be re-established, in-memory indexes that can be rebuilt from stored data, and temporary computation buffers. Saving this state increases checkpoint size and complexity without meaningful benefit.
The key question for each piece of state: if this is lost, how much work must be repeated to reconstruct it? If the answer is "none" or "trivially little," it is ephemeral. If the answer is "significant work," it is essential.
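One way to make this split explicit is to list the essential fields by name and serialize only those. The sketch below is illustrative: `AgentState`, its fields, and `ESSENTIAL_FIELDS` are hypothetical names chosen for this example, not part of any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical list of fields that must survive a crash.
ESSENTIAL_FIELDS = ("task", "completed_steps", "side_effects")


@dataclass
class AgentState:
    # Essential: expensive or impossible to reconstruct.
    task: dict[str, Any]
    completed_steps: list[dict[str, Any]] = field(default_factory=list)
    side_effects: list[str] = field(default_factory=list)
    # Ephemeral: can be regenerated after recovery, so it is never saved.
    response_cache: dict[str, str] = field(default_factory=dict)


def to_checkpoint(state: AgentState) -> dict[str, Any]:
    """Return only the essential fields, ready for serialization."""
    return {name: getattr(state, name) for name in ESSENTIAL_FIELDS}


def from_checkpoint(data: dict[str, Any]) -> AgentState:
    """Rebuild the state; ephemeral fields start empty."""
    return AgentState(**data)
```

After recovery the cache is simply empty and refills as the agent runs, which is the behavior the "how much work must be repeated" test predicts for ephemeral state.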
When to Checkpoint
Checkpoint frequency is a tradeoff between safety and performance. More frequent checkpoints mean less lost work after a crash but more overhead during normal operation; less frequent checkpoints lower the overhead but widen the window of work that a crash can destroy.
Step-based checkpointing, which saves after each major step, is the most common strategy. When the agent completes a discrete unit of work (finishing a tool call, completing a reasoning cycle, producing an intermediate result), it saves a checkpoint. Using these natural boundaries aligns checkpoints with meaningful progress points and tends to keep each checkpoint at a manageable size.
Time-based checkpointing saves at fixed intervals (every 30 seconds, every 2 minutes) regardless of what the agent is doing. This guarantees a maximum data loss window but may checkpoint in the middle of an operation, creating checkpoints that represent partial, potentially inconsistent state.
Event-triggered checkpointing saves before risky operations (before calling an external API, before executing a potentially destructive tool) so that a failure during the operation can roll back to a known good state. This provides targeted protection for the most dangerous parts of execution.
For most AI agent systems, a combination works best: checkpoint after each major step as the baseline, with additional event-triggered checkpoints before high-risk operations.
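The combined strategy can be sketched as a small step loop. This is a minimal illustration under assumptions: `run_steps`, the `(name, fn, risky)` tuple shape, and the `save_checkpoint(state, reason=...)` callback are hypothetical and stand in for whatever step and storage abstractions the agent actually uses.

```python
def run_steps(steps, state, save_checkpoint):
    """Run steps in order, checkpointing at step boundaries.

    steps: list of (name, fn, risky) tuples, where fn(state) returns a result
    and risky marks operations such as external API calls.
    """
    for name, fn, risky in steps:
        if risky:
            # Event-triggered: known good state to fall back to on failure.
            save_checkpoint(state, reason=f"before:{name}")
        state["results"][name] = fn(state)
        state["completed"].append(name)
        # Baseline: the step is fully complete, so the state is consistent.
        save_checkpoint(state, reason=f"after:{name}")
    return state
```

With one ordinary step followed by one risky step, this produces three checkpoints: after the first step, before the risky step, and after it.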
Where to Store Checkpoints
The local file system is the simplest option. The agent writes the checkpoint as a JSON or binary file to local disk. This is fast to write and read and simple to implement, but the checkpoint is lost if the disk or the entire machine fails. It suits single-machine deployments where machine failure is rare.
Redis provides fast in-memory storage with optional persistence. Checkpoints are available across processes on the same machine and across machines in a cluster. Redis with AOF (Append Only File) persistence provides a good balance of speed and durability, although how much recent data a crash can lose depends on the configured fsync policy. Memcached offers similar speed but has no persistence, so it is appropriate only for checkpoints that can be lost on a restart.
A database provides the strongest durability guarantees. With a server database such as PostgreSQL, particularly one configured with replication, checkpoints survive machine failures, can be queried and analyzed, and benefit from transactions. SQLite also gives transactional writes, but it stores data in a local file and so shares the local disk's exposure to machine failure. The tradeoff in both cases is higher write latency compared to plain file or Redis storage.
Object storage (S3, GCS) provides durable, scalable storage at low cost. It suits large checkpoints that do not need sub-second write latency, and it works well for checkpoints saved at coarse intervals (every few minutes) rather than after every operation.
The right choice depends on checkpoint frequency and size. High-frequency, small checkpoints (every few seconds, a few KB each) work well with Redis. Low-frequency, large checkpoints (every few minutes, many MB) work well with S3 or a database.
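For the local file system option, the main pitfall is a crash in the middle of a write, which leaves a truncated file where the last good checkpoint used to be. A common way to avoid this is to write to a temporary file in the same directory and then rename it over the target. The sketch below uses only the Python standard library; the function names are illustrative.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state as JSON so that path always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same file system for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomically swap in the new checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A reader that opens the checkpoint at any moment sees either the previous version or the new one, never a partial file.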
Checkpoint Format and Versioning
Checkpoints should be self-describing, including enough metadata that the recovery process can interpret them without external context. At minimum, include a version number, a timestamp, the agent type and task identifier, and a hash or checksum for integrity verification.
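A self-describing checkpoint can be modeled as an envelope around the serialized state. The following is a minimal sketch, assuming JSON-serializable state; the field names (`version`, `created_at`, `agent_type`, `task_id`, `sha256`, `payload`) and the functions `wrap` and `unwrap` are choices made for this example.

```python
import hashlib
import json
import time

FORMAT_VERSION = 2  # hypothetical current format version


def wrap(state, agent_type, task_id):
    """Wrap state in an envelope carrying metadata and a checksum."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return {
        "version": FORMAT_VERSION,
        "created_at": time.time(),
        "agent_type": agent_type,
        "task_id": task_id,
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload": payload,
    }


def unwrap(envelope):
    """Verify the checksum, then return the decoded state."""
    payload = envelope["payload"]
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if digest != envelope["sha256"]:
        raise ValueError("checkpoint checksum mismatch")
    return json.loads(payload)
```

Storing the payload as a string means the checksum is computed over exactly the bytes that will be parsed on recovery, so re-serialization differences cannot cause a false mismatch.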
Version numbers are critical because agent code evolves. A checkpoint created by version 1 of the agent may not be compatible with version 2. The recovery process should check the version and either load the checkpoint natively, migrate it to the current format, or fall back to a cold start if migration is not possible.
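The load, migrate, or cold-start decision can be expressed as a chain of single-version migrations. The sketch below is illustrative: the version numbers and the field changes inside `_v1_to_v2` and `_v2_to_v3` are invented for the example.

```python
CURRENT_VERSION = 3  # hypothetical current checkpoint format


def _v1_to_v2(cp):
    # Example change: the "steps" field was renamed to "completed_steps".
    cp = dict(cp)
    cp["completed_steps"] = cp.pop("steps", [])
    cp["version"] = 2
    return cp


def _v2_to_v3(cp):
    # Example change: version 3 added a side-effect log.
    cp = dict(cp)
    cp.setdefault("side_effects", [])
    cp["version"] = 3
    return cp


MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def load_or_cold_start(cp):
    """Return cp in the current format, or None to signal a cold start."""
    version = cp.get("version")
    while version != CURRENT_VERSION:
        step = MIGRATIONS.get(version)
        if step is None:
            return None  # unknown or newer version: no migration path
        cp = step(cp)
        version = cp["version"]
    return cp
```

Each migration handles exactly one version bump, so supporting a new format means adding one function rather than rewriting every older path.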
Use a serialization format that balances compactness and inspectability. JSON is human-readable and widely supported but verbose for large state. MessagePack is more compact and, like JSON, needs no schema, but it is a binary format that cannot be read without tooling. Protocol Buffers are also compact but require schema definitions. For most AI agent systems, JSON with optional gzip compression provides the best balance of debuggability and efficiency.
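JSON with optional gzip compression needs only the standard library. In this sketch the reader detects compression from the two gzip magic bytes, so compressed and uncompressed checkpoints can coexist; the function names are illustrative.

```python
import gzip
import json


def dumps_checkpoint(state, compress=True):
    """Serialize state to JSON bytes, optionally gzip-compressed."""
    raw = json.dumps(state).encode("utf-8")
    return gzip.compress(raw) if compress else raw


def loads_checkpoint(blob):
    """Load a checkpoint written by dumps_checkpoint, compressed or not."""
    # gzip streams begin with 0x1f 0x8b; valid JSON text never does.
    if blob[:2] == b"\x1f\x8b":
        blob = gzip.decompress(blob)
    return json.loads(blob.decode("utf-8"))
```

During debugging, writing with `compress=False` keeps the file readable in any text editor without changing the loading code.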
Conversation History Checkpoints
Conversation history deserves special treatment because it grows continuously and can become very large. A simple approach of checkpointing the full conversation history quickly becomes impractical for long-running agents.
Summarization checkpoints periodically compress the conversation history into a summary that captures the key decisions, results, and context. The summary replaces the raw history in the checkpoint, which can reduce its size dramatically for long conversations. On recovery, the agent loads the summary and continues with reduced but functional context.
Sliding window checkpoints keep only the most recent N messages or N tokens of conversation, discarding older messages. This bounds the checkpoint size but loses early context. Combining a sliding window with a summary of the discarded messages provides both recency and historical context.
Milestone checkpoints save the full conversation only at key decision points (task start, major step completion, error recovery) and do not checkpoint the messages exchanged between them. Recovery loads the most recent milestone and redoes only the work performed since then.
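The combination of a sliding window with a summary of the discarded messages can be sketched as a single compaction function. The `summarize` callback is an assumption: in a real agent it would probably be a model call, and `naive_summarize` below is only a placeholder so the example runs.

```python
def compact_history(messages, summary, keep_last, summarize):
    """Keep the last keep_last messages; fold older ones into the summary.

    summarize(previous_summary, dropped_messages) returns the new summary.
    """
    if len(messages) <= keep_last:
        return summary, list(messages)
    cut = len(messages) - keep_last
    dropped, recent = messages[:cut], messages[cut:]
    return summarize(summary, dropped), recent


def naive_summarize(previous, dropped):
    # Placeholder only: records how many messages were dropped.
    note = f"[{len(dropped)} earlier messages omitted]"
    return f"{previous} {note}".strip()
```

The checkpoint then stores the pair (summary, recent messages), so its size is bounded by the window plus the length of the summary.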
Checkpoint Consistency
A checkpoint is consistent if it represents a state from which the agent can safely resume. Inconsistent checkpoints, those that capture a partial state from the middle of an operation, can cause bugs that are worse than losing the checkpoint entirely.
The simplest way to ensure consistency is to take checkpoints only at natural transaction boundaries: after a step is fully complete, after all side effects are recorded, after all in-memory state is synchronized. Never checkpoint in the middle of a tool call, API request, or state update.
For complex agents with multiple concurrent operations, consider using copy-on-write semantics for checkpointing. Take a snapshot of the state without blocking ongoing operations. This requires the state to be structured so that a consistent snapshot can be taken without stopping all activity.
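One simple way to get copy-on-write behavior is to treat each version of the state as immutable and replace it wholesale on update. The sketch below is a shallow illustration under a stated assumption: the values stored in the state are themselves immutable (tuples, strings, numbers). `StateStore` is a hypothetical class, not an existing library API.

```python
import threading
from types import MappingProxyType


class StateStore:
    """Holds agent state as a sequence of read-only versions."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._state = MappingProxyType(dict(initial))

    def update(self, **changes):
        # Writers copy the current mapping, modify the copy, then swap it in.
        with self._lock:
            new_state = dict(self._state)
            new_state.update(changes)
            self._state = MappingProxyType(new_state)

    def snapshot(self):
        # Readers take a reference to the current version without locking.
        # Later updates create new versions and never touch this one.
        return self._state
```

A checkpointing thread can call `snapshot()` and serialize the result at its own pace while other operations continue to call `update()`.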
Checkpoint Lifecycle Management
Checkpoints accumulate over time and consume storage. Without lifecycle management, an agent that runs for days or weeks can produce gigabytes of checkpoints. Implement retention policies that automatically clean up old checkpoints.
A practical retention policy keeps the most recent 3 to 5 checkpoints for immediate recovery, plus one checkpoint per hour for the last 24 hours for investigation purposes. Older checkpoints are deleted automatically. For tasks that complete successfully, all checkpoints for that task can be deleted immediately unless archival is needed for audit purposes.
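This retention policy can be sketched as a pure function over checkpoint timestamps. The assumptions here are that checkpoints are identified by their creation time in seconds and that "one per hour" means the newest checkpoint in each clock-hour bucket; `select_for_deletion` and its parameters are illustrative names.

```python
def select_for_deletion(timestamps, now, keep_recent=5, hourly_window=24 * 3600):
    """Return the timestamps of checkpoints that the policy would delete."""
    ordered = sorted(timestamps, reverse=True)  # newest first
    keep = set(ordered[:keep_recent])           # most recent N for recovery
    seen_hours = set()
    for ts in ordered:
        if now - ts > hourly_window:
            continue  # older than the investigation window
        hour = int(ts // 3600)
        if hour not in seen_hours:
            # First hit per bucket is the newest checkpoint in that hour.
            seen_hours.add(hour)
            keep.add(ts)
    return [ts for ts in ordered if ts not in keep]
```

Because the function only decides what to delete, the same logic applies whether the checkpoints live in files, Redis keys, database rows, or object storage.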
Tag checkpoints with the task identifier so that cleanup policies can target completed tasks independently. A checkpoint for a completed task has no recovery value and can be deleted as soon as the task result is confirmed.
Checkpoint essential state at natural step boundaries, store it in durable storage appropriate to your checkpoint frequency, and manage checkpoint lifecycle to prevent unbounded growth. For conversation history, use summarization checkpoints that compress context without losing critical information.