AI Agent Production Deployment Checklist
This checklist is organized by category. Each item can be marked as complete, partially complete, or not applicable. Items marked as critical must be completed before production deployment. Items marked as recommended should be completed as soon as practical after initial deployment.
Process Architecture (Critical)
Verify that each agent runs in an isolated process (container, OS process, or BEAM process) with its own memory space. Confirm that a crash in one agent does not affect other agents. Test this by deliberately crashing one agent while others are processing tasks.
Verify that supervision is configured for every agent process. Confirm that supervisors detect crashes within seconds and restart the failed agent automatically. Verify restart intensity limits are set (maximum 5 restarts per 60 seconds is a reasonable default).
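The restart intensity limit above can be sketched as a sliding-window counter. This is a minimal illustration, not a full supervisor; the class name `RestartIntensityLimiter` and its parameters are hypothetical, and the injectable `clock` exists only to make the behavior testable.

```python
import time
from collections import deque


class RestartIntensityLimiter:
    """Allow at most max_restarts restarts within any window_seconds span."""

    def __init__(self, max_restarts=5, window_seconds=60.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._clock = clock
        self._restarts = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self):
        now = self._clock()
        # Forget restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # too many restarts: escalate instead of looping
        self._restarts.append(now)
        return True
```

When `allow_restart()` returns `False`, a supervisor would typically stop restarting and escalate (for example, by paging an operator) rather than restart in a tight loop.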
Verify that all inter-agent communication uses message passing or API calls, not shared memory. Confirm that every inter-agent call has an explicit timeout configured. Test what happens when an agent is slow to respond or not responding at all.
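One way to enforce an explicit timeout on an inter-agent call, assuming the agents communicate through an async function, is `asyncio.wait_for`. The names `call_agent`, `send`, and `AgentTimeout` are assumptions made for this sketch.

```python
import asyncio


class AgentTimeout(Exception):
    """Raised when another agent does not reply in time."""


async def call_agent(send, message, timeout_seconds=10.0):
    """Await a reply from another agent, failing fast instead of hanging.

    `send` is a hypothetical async callable that delivers the message
    and returns the reply.
    """
    try:
        return await asyncio.wait_for(send(message), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Surface a distinct error type so callers can classify it.
        raise AgentTimeout(f"no reply within {timeout_seconds}s") from None
```

`asyncio.wait_for` cancels the pending call when the timeout expires, so a slow or unresponsive agent cannot hold the caller indefinitely.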
Error Handling (Critical)
Verify that every external API call (LLM, tools, databases) has retry with exponential backoff configured. Confirm that retries use jitter so that many clients do not retry at the same moment (the thundering herd problem). Verify that non-retryable client errors (HTTP 400, 401, 403) are not retried.
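A minimal sketch of retry with exponential backoff and full jitter follows. The `HttpError` class and the `retry_with_backoff` signature are hypothetical; a real client library will have its own exception types. The `sleep` and `rng` parameters are injectable only so the behavior can be tested without waiting.

```python
import random
import time

NON_RETRYABLE_STATUSES = {400, 401, 403}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(), retrying retryable failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except HttpError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status in NON_RETRYABLE_STATUSES or last_attempt:
                raise  # permanent error, or out of attempts
            # Full jitter: uniform delay in [0, min(max_delay, base * 2^attempt)).
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
```

Full jitter is one of several jitter strategies; it spreads retries across the whole backoff interval instead of clustering them at the ceiling.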
Verify that circuit breakers are configured for each external dependency. Confirm that breakers trip when the error rate exceeds a configured threshold. Verify that breakers recover through a half-open state when the service returns. Test by temporarily blocking network access to each dependency.
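The closed, open, and half-open states can be sketched as below. As a simplification, this version trips on a count of consecutive failures rather than a measured error rate; the class and parameter names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, operation):
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_seconds:
                raise CircuitOpenError("dependency circuit is open")
            self.state = "half_open"  # recovery period over: allow one trial call
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self.state = "closed"
        return result
```

This sketch is not thread-safe; a shared breaker would need a lock around the state transitions.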
Verify that all operations have explicit timeouts. No operation should be able to hang indefinitely. Confirm timeout values are appropriate for each operation type (2 to 5 seconds for health checks, 30 to 120 seconds for model API calls, 5 to 30 seconds for tool calls).
Verify that error messages are logged with sufficient context for debugging. Confirm that errors are classified (retryable vs. permanent, transient vs. systemic). Verify that error rates are tracked as metrics.
State Management (Critical)
Verify that state checkpointing is implemented and active. Confirm that checkpoints are saved after each significant step. Verify that checkpoint storage is durable (survives process restarts and, ideally, machine failures).
Verify that checkpoint restoration works correctly. Test by killing an agent mid-task and confirming it resumes from the checkpoint. Verify that restored agents produce the same results as uninterrupted agents for the remaining steps.
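Checkpoint save and restore can be sketched with an atomic file write, assuming JSON-serializable state and local-disk storage. Surviving machine failures would need replicated or remote storage instead. The function names and the state layout (`next_step`, `results`) are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run_task(path, steps):
    """Run steps in order, resuming after the last checkpointed step."""
    state = load_checkpoint(path) or {"next_step": 0, "results": []}
    for i in range(state["next_step"], len(steps)):
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state["results"]
```

A step that crashes after its side effect but before the checkpoint is saved will run again on resume, which is why the idempotency item in this section matters.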
Verify that conversation history is managed to prevent unbounded growth. Confirm that summarization or sliding window compression keeps context within model limits. Test long-running tasks (50+ steps) to verify that context management works at scale.
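A sliding window over the conversation history might look like the sketch below. It assumes messages are dicts with `role` and `content` keys and uses a whitespace word count as a stand-in for a real tokenizer; both are assumptions for illustration.

```python
def trim_history(messages, max_tokens,
                 count_tokens=lambda m: len(m["content"].split())):
    """Keep system messages plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count_tokens(m) for m in system)
    kept = []
    # Walk backward from the newest message until the budget runs out.
    for message in reversed(rest):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

A summarization strategy would replace the dropped messages with a short summary message instead of discarding them outright.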
Verify that operations with side effects are idempotent or guarded by deduplication keys. Confirm that a crash followed by a restart does not cause duplicate emails, duplicate payments, or duplicate database writes.
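A deduplication key can be derived from the task, the step, and the payload, then checked before the side effect runs. The sketch below uses an in-memory dict as the store purely for illustration; a real deployment would need a durable store, and the class name `IdempotentExecutor` is hypothetical.

```python
import hashlib
import json


class IdempotentExecutor:
    def __init__(self):
        # Illustrative in-memory store; it does not survive a restart.
        self._completed = {}

    @staticmethod
    def key_for(task_id, step, payload):
        """Build a stable key from the task, step, and canonicalized payload."""
        canonical = json.dumps(payload, sort_keys=True)
        raw = f"{task_id}:{step}:{canonical}".encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def run_once(self, key, side_effect):
        """Run side_effect() unless this key has already completed."""
        if key in self._completed:
            return self._completed[key]
        result = side_effect()
        self._completed[key] = result
        return result
```

A crash between running the side effect and recording the key can still cause a duplicate, so where the downstream service accepts an idempotency key, passing the same key to it closes that gap.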
Graceful Degradation (Recommended)
Verify that model fallback chains are configured. Disable the primary model and confirm the agent switches to the fallback automatically. Verify that the fallback model's output is of acceptable quality for your use case.
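A fallback chain can be sketched as an ordered list of model callables tried in turn. The function name `call_with_fallback` and the `on_degraded` hook are hypothetical; the hook is where a degradation event would be logged or counted.

```python
def call_with_fallback(models, prompt, on_degraded=None):
    """Try each (name, callable) pair in order and return the first success.

    Returns a (model_name, result) tuple so callers know which model answered.
    """
    errors = []
    for index, (name, call) in enumerate(models):
        try:
            result = call(prompt)
        except Exception as err:
            errors.append((name, err))
            continue
        if index > 0 and on_degraded is not None:
            # A fallback answered: report it so degradation can be monitored.
            on_degraded(name, errors)
        return name, result
    failed = [name for name, _ in errors]
    raise RuntimeError(f"all models failed: {failed}")
```

Returning the model name alongside the result makes it straightforward to tag responses produced in degraded mode.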
Verify that the agent handles tool unavailability gracefully. Disable each tool individually and confirm the agent either uses an alternative, skips the capability, or informs the user, rather than crashing.
Verify that degradation events are logged and monitored. Confirm that prolonged degradation triggers alerts. Track degradation duration as a reliability metric.
Monitoring and Alerting (Critical)
Verify that health check endpoints (liveness and readiness) are implemented and connected to the orchestration platform. Confirm that failed health checks trigger automatic restarts.
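The difference between liveness and readiness can be sketched as two handler functions that return a status code and a body. How they are wired into an HTTP server or orchestration platform is left out, and the function names and response fields are assumptions.

```python
def liveness():
    """The process is up and able to answer; no dependency checks here."""
    return 200, {"status": "alive"}


def readiness(dependency_checks):
    """Report ready only when every dependency probe passes.

    dependency_checks maps a dependency name to a zero-argument probe
    that returns True when the dependency is usable.
    """
    failures = {}
    for name, probe in dependency_checks.items():
        try:
            if not probe():
                failures[name] = "probe returned false"
        except Exception as err:
            failures[name] = str(err)
    if failures:
        return 503, {"status": "not_ready", "failures": failures}
    return 200, {"status": "ready"}
```

Keeping dependency checks out of the liveness handler avoids restarting a healthy agent just because a downstream service is temporarily unavailable.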
Verify that key metrics are instrumented: model API latency and error rates, tool execution success rates, task completion rates, and resource usage (CPU, memory, connections).
Verify that alerting rules are configured for both warning and critical thresholds. Test each alert by deliberately triggering the condition. Confirm that alerts reach the right people through the right channels.
Verify that a health dashboard exists showing business metrics, system metrics, and infrastructure metrics. Confirm that the dashboard loads quickly and updates in near-real-time during incidents.
Verify that structured logging is implemented with consistent fields across all components. Confirm that logs are sent to a centralized system and can be searched by agent ID, task ID, and time range.
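Structured logging with consistent fields can be sketched with the standard `logging` module and a JSON formatter. The field names `agent_id` and `task_id` mirror the search keys above; the class and function names are hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of fields."""

    FIELDS = ("agent_id", "task_id")

    def format(self, record):
        entry = {
            # Local time by default; set Formatter.converter for UTC.
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Always emit the same keys, even when a value is missing.
        for field in self.FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to stream (stderr if None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("agent").info("step done", extra={"agent_id": "a1", "task_id": "t9"})` attaches the identifiers through the standard `extra` argument.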
Security (Critical)
Verify that API keys and credentials are stored in a secrets manager (not in code, not in environment variables on shared systems). Confirm that credentials are rotatable without downtime.
Verify that agent inputs are validated and sanitized. Confirm that prompt injection attacks cannot cause the agent to perform unauthorized actions. Test with known prompt injection patterns.
Verify that agent outputs are filtered for sensitive data (PII, credentials, internal system information). Confirm that logs do not contain sensitive data. Verify that tool calls are scoped to the minimum necessary permissions.
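Output filtering can be sketched as a set of regular-expression redactions applied before text is returned or logged. The two patterns below are illustrative assumptions (a simple email pattern and a guessed `sk-`/`key-`/`token-` credential shape) and are far from complete PII coverage.

```python
import re

# Illustrative patterns only; real deployments need a broader, tested set.
REDACTION_PATTERNS = [
    ("email", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("api_key", re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9_-]{16,}\b")),
]


def redact(text):
    """Replace every match of each pattern with a labeled placeholder."""
    for label, pattern in REDACTION_PATTERNS:
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```

Applying the same function in the logging path helps keep sensitive values out of centralized logs as well as out of agent responses.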
Verify that rate limits and spend caps are configured to prevent runaway cost. Set maximum API spend per task, per agent, and per day. Confirm that exceeding a limit triggers an alert and stops the agent rather than silently accumulating charges.
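A spend cap that both alerts and stops the agent can be sketched as below. It tracks per-task and per-day totals only; a per-agent cap would follow the same pattern, and the daily reset is omitted. The class name `SpendGuard` and the `alert` callback are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push spend past a configured cap."""


class SpendGuard:
    def __init__(self, per_task_usd, per_day_usd, alert=print):
        self.per_task_usd = per_task_usd
        self.per_day_usd = per_day_usd
        self.alert = alert  # hypothetical hook, e.g. a pager or chat notifier
        self._task_spend = {}
        self._day_spend = 0.0

    def charge(self, task_id, cost_usd):
        """Record a cost, or alert and raise if it would exceed a cap."""
        task_total = self._task_spend.get(task_id, 0.0) + cost_usd
        day_total = self._day_spend + cost_usd
        if task_total > self.per_task_usd:
            self._stop(f"task {task_id} would reach ${task_total:.2f}")
        if day_total > self.per_day_usd:
            self._stop(f"daily spend would reach ${day_total:.2f}")
        self._task_spend[task_id] = task_total
        self._day_spend = day_total

    def _stop(self, reason):
        self.alert(reason)
        raise BudgetExceeded(reason)
```

Calling `charge` before each model call, with an estimated cost, means a rejected charge stops the work before the money is spent.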
Load and Chaos Testing (Recommended)
Run load tests at expected peak traffic (2 to 3 times average load). Verify that the system handles peak load without degradation, or degrades gracefully within defined parameters.
Run chaos tests that randomly kill agent processes, block network access to dependencies, and corrupt checkpoint files. Verify that the system recovers automatically from each failure type.
Run sustained load tests for at least 24 hours to detect resource leaks, memory growth, and connection pool exhaustion. Monitor resource metrics during the test and set alerts for any that show upward trends.
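An upward trend in a resource metric can be detected with a least-squares slope over evenly spaced samples. This is a simple sketch; the function name `growth_per_sample` is hypothetical, and the threshold that counts as a leak is left to the operator.

```python
def growth_per_sample(samples):
    """Return the least-squares slope of samples taken at even intervals."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A slope that stays positive across the full 24-hour run, for memory or open connections, is the kind of trend worth alerting on.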
Document the results of all tests, including any failures discovered and the remediations applied. This documentation forms the baseline for future regression testing.
Production readiness requires verification across seven categories: process architecture, error handling, state management, graceful degradation, monitoring and alerting, security, and load and chaos testing. Complete the critical items before deployment and the recommended items as soon as practical afterward. Re-run this checklist after every major system change.