Supervisor Pattern: Managing Agent Lifecycles
Origins in Erlang/OTP
The supervisor pattern did not originate in AI. It comes from Erlang, a language Ericsson created in the 1980s for telecommunications systems that require extreme reliability, and it was formalized in the OTP framework during the 1990s. Ericsson's AXD 301 ATM switch, one of the first major Erlang deployments, is widely reported to have achieved nine nines of availability (99.9999999%) using supervision trees. The pattern has since been adopted by Elixir, Akka, and numerous distributed systems frameworks.
The core insight behind the pattern is that failures are inevitable, so the system should be designed to handle failures rather than prevent them. Instead of writing defensive code that tries to anticipate every possible error condition, you let processes crash when they encounter problems and rely on supervisors to detect the crash and take corrective action. This approach produces simpler, more maintainable code because individual workers do not need complex error handling logic. The error handling is centralized in the supervisor, where it can be implemented once and applied consistently to all workers.
Applied to AI agents, this philosophy is especially powerful because agent failures are unpredictable. A language model might generate a malformed tool call. An API might return unexpected data. A reasoning chain might enter a loop. These failures are difficult to anticipate and even harder to handle within the agent's own logic. A supervisor that monitors for these conditions and responds with well-defined recovery strategies is more robust than an agent that tries to handle every possible failure internally.
Supervisor Responsibilities
A supervisor agent has four primary responsibilities: creation, monitoring, recovery, and lifecycle management.
Creation involves spawning worker agents with the right configuration for the task at hand. The supervisor decides which type of worker to create, what prompt to use, which tools to enable, what resource limits to apply, and what initial context to provide. In a sophisticated system, the supervisor may choose different worker configurations based on the task type, current system load, or historical performance data. A task that requires careful reasoning might get a worker configured with a larger model and a generous token budget. A routine task might get a smaller, faster, cheaper worker.
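As a minimal sketch of configuration selection, the snippet below picks a worker profile from the task type and current load. The profile names, model identifiers, tool names, and numeric limits are all hypothetical placeholders, not values from any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerConfig:
    model: str                   # hypothetical model identifier
    max_tokens: int              # token budget for the whole task
    tools: tuple[str, ...] = ()  # tools the worker is allowed to call
    timeout_s: float = 300.0     # wall-clock limit before the supervisor intervenes


# Hypothetical profiles; tune the names and numbers for your own system.
PROFILES = {
    "careful_reasoning": WorkerConfig("large-model", 32_000, ("search", "code_exec"), 900.0),
    "routine": WorkerConfig("small-model", 4_000, ("search",), 120.0),
}


def choose_config(task_type: str, system_load: float) -> WorkerConfig:
    """Pick a profile by task type; under heavy load, use the cheap profile."""
    config = PROFILES.get(task_type, PROFILES["routine"])
    if system_load > 0.9:  # assumed load threshold on a 0.0-1.0 scale
        config = PROFILES["routine"]
    return config
```

Keeping the profiles in one table means the supervisor's choice stays a lookup plus a few overrides, which is easy to audit when historical performance data suggests a change.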
Monitoring is the continuous observation of worker health and progress. The supervisor tracks whether workers are making progress toward their assigned goals, whether they are consuming resources within expected bounds, whether they are producing outputs that meet quality standards, and whether they are still responsive. Monitoring can be passive (the supervisor checks worker state periodically) or active (workers send heartbeat messages and progress reports to the supervisor). Active monitoring detects problems faster but adds communication overhead. Passive monitoring is simpler but may have longer detection latency.
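The heartbeat style of monitoring can be sketched as follows. The class name, the timeout value, and the injectable clock are assumptions made for illustration; the clock parameter exists only so the logic can be exercised without waiting in real time.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and reports workers that went quiet."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: dict[str, float] = {}

    def beat(self, worker_id: str) -> None:
        # Called whenever a worker sends a heartbeat or progress report.
        self.last_seen[worker_id] = self.clock()

    def unresponsive(self) -> list[str]:
        # Workers whose most recent heartbeat is older than the timeout.
        now = self.clock()
        return sorted(w for w, t in self.last_seen.items() if now - t > self.timeout_s)
```

A supervisor would call `unresponsive()` on its own schedule and hand the result to its recovery logic.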
Recovery is the corrective action the supervisor takes when monitoring detects a problem. The recovery strategy depends on the type of failure and the supervisor's configuration. Common recovery strategies include restarting the failed worker with the same task, restarting with modified parameters (a different model, more context, relaxed constraints), reassigning the task to a different worker type, retrying after a delay, and escalating to a human when automated recovery fails. The supervisor may also need to clean up state left behind by the failed worker, notify other agents that depend on the failed worker's output, and adjust downstream plans to account for the delay.
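One way to make the recovery decision explicit is a small policy table, sketched below. The failure kinds and the attempt limit are hypothetical; a real system would derive them from its own failure taxonomy.

```python
from enum import Enum


class Recovery(Enum):
    RESTART_SAME = "restart_same"
    RESTART_MODIFIED = "restart_modified"
    REASSIGN = "reassign"
    RETRY_AFTER_DELAY = "retry_after_delay"
    ESCALATE = "escalate"


# Hypothetical mapping from failure kind to the first recovery action to try.
POLICY = {
    "crash": Recovery.RESTART_SAME,
    "malformed_tool_call": Recovery.RESTART_MODIFIED,
    "rate_limited": Recovery.RETRY_AFTER_DELAY,
    "wrong_worker_type": Recovery.REASSIGN,
}


def choose_recovery(failure_kind: str, attempts: int, max_attempts: int = 3) -> Recovery:
    """Escalate once automated recovery has been tried max_attempts times."""
    if attempts >= max_attempts:
        return Recovery.ESCALATE
    # Unknown failure kinds go straight to a human rather than being retried blindly.
    return POLICY.get(failure_kind, Recovery.ESCALATE)
```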
Lifecycle management encompasses the full arc from worker creation through task completion to clean shutdown. The supervisor ensures that workers are properly initialized before receiving tasks, that their resources are properly released when they finish, and that in-flight work is handled appropriately during system shutdown. Graceful shutdown is particularly important: when the system needs to stop, the supervisor should give active workers time to complete their current tasks or checkpoint their progress rather than terminating them immediately and losing work.
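Graceful shutdown can be sketched with the standard asyncio library, assuming each worker runs as an asyncio task. The function name and the grace period are illustrative choices.

```python
import asyncio


async def shutdown(tasks: list[asyncio.Task], grace_s: float) -> tuple[int, int]:
    """Wait up to grace_s for workers to finish, then cancel the rest.

    Returns (finished, cancelled).
    """
    if not tasks:
        return (0, 0)  # asyncio.wait rejects an empty collection
    done, pending = await asyncio.wait(tasks, timeout=grace_s)
    for task in pending:
        task.cancel()
    # Give cancelled workers a chance to run cleanup (for example, a final
    # checkpoint in an `except asyncio.CancelledError` block) before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    return (len(done), len(pending))
```

Workers that want to checkpoint on cancellation can catch `asyncio.CancelledError`, save their state, and re-raise.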
Restart Strategies
The restart strategy determines how the supervisor responds when a worker fails. Different strategies suit different situations, and choosing the right one is a critical design decision.
One-for-one restart restarts only the failed worker. Other workers continue running unaffected. This strategy is appropriate when workers are independent and a failure in one worker does not affect the others. If three agents are processing three unrelated tasks, a failure in any one of them should not disrupt the others.
One-for-all restart restarts all workers when any single worker fails. This strategy is appropriate when workers have shared state or tight dependencies that make it unsafe to continue with a partial set. If a research agent, a synthesis agent, and a writing agent are all working on the same document and the research agent fails, the synthesis and writing agents may be working with incomplete or corrupted information. Restarting all of them ensures a clean slate.
Rest-for-one restart restarts the failed worker and all workers that were started after it. This strategy is appropriate for ordered dependencies: if worker C depends on worker B which depends on worker A, and worker B fails, then worker C (started after B) is also restarted, but worker A (started before B) continues. This strategy is common in pipeline architectures where downstream stages depend on upstream stages.
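The three strategies differ only in which workers they select for restart. The sketch below makes that concrete, assuming the supervisor keeps its workers in start order; the function and enum names are illustrative.

```python
from enum import Enum


class Strategy(Enum):
    ONE_FOR_ONE = "one_for_one"
    ONE_FOR_ALL = "one_for_all"
    REST_FOR_ONE = "rest_for_one"


def workers_to_restart(strategy: Strategy, start_order: list[str], failed: str) -> list[str]:
    """Return the workers to restart, in start order, when `failed` crashes."""
    idx = start_order.index(failed)  # raises ValueError for an unknown worker
    if strategy is Strategy.ONE_FOR_ONE:
        return [failed]
    if strategy is Strategy.ONE_FOR_ALL:
        return list(start_order)
    # REST_FOR_ONE: the failed worker plus everything started after it.
    return start_order[idx:]
```

With workers started in the order A, B, C and a failure in B, one-for-one selects only B, one-for-all selects A, B, and C, and rest-for-one selects B and C.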
Each restart strategy can also include intensity limits that prevent infinite restart loops. If a worker fails and is restarted five times within one minute, the supervisor should stop restarting and escalate the problem rather than continuing to restart a fundamentally broken worker. The intensity limit (maximum restarts within a time window) is tuned based on the expected failure rate and the cost of restarting.
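An intensity limit is essentially a sliding-window counter. A minimal sketch follows; the class name is hypothetical and the clock is injectable so the behavior can be checked without real delays.

```python
import time
from collections import deque


class RestartIntensity:
    """Allows at most max_restarts restarts within any window of window_s seconds."""

    def __init__(self, max_restarts: int, window_s: float, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.clock = clock
        self._restarts: deque[float] = deque()  # timestamps of recent restarts

    def allow_restart(self) -> bool:
        now = self.clock()
        # Drop restarts that have aged out of the window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # limit reached: the caller should escalate instead
        self._restarts.append(now)
        return True
```

With `RestartIntensity(5, 60.0)`, five restarts inside a minute are allowed and the sixth is refused until the oldest one leaves the window.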
Supervision Trees
In complex systems, supervisors themselves can be supervised, creating a tree structure. A top-level supervisor manages several mid-level supervisors, each of which manages a group of workers. This hierarchy provides isolation between subsystems: a failure in one branch of the tree does not affect other branches.
Consider an agent system that handles both customer support and content creation. A top-level supervisor manages two mid-level supervisors: one for support agents and one for content agents. If the content subsystem encounters a catastrophic failure that requires restarting all content workers, the support subsystem continues operating normally. The mid-level content supervisor handles the restart within its branch, and the top-level supervisor only intervenes if the mid-level supervisor itself fails.
Supervision trees also enable different restart strategies at different levels. The top-level supervisor might use one-for-one restart because the support and content subsystems are independent. The content supervisor might use one-for-all restart because its workers share state. The support supervisor might use one-for-one restart because each support agent handles independent conversations. The tree structure lets you apply the right strategy at the right level.
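The support and content example can be sketched as a small tree in which each supervisor carries its own strategy. This is an illustrative data structure, not a full supervisor: it only answers which workers a given failure would restart, and it models just the one-for-one and one-for-all strategies.

```python
from dataclasses import dataclass, field


@dataclass
class Supervisor:
    name: str
    strategy: str  # "one_for_one" or "one_for_all"
    children: list = field(default_factory=list)  # worker names (str) or Supervisors

    def workers(self) -> list[str]:
        """All worker names in this branch, in declaration order."""
        out: list[str] = []
        for child in self.children:
            out.extend(child.workers() if isinstance(child, Supervisor) else [child])
        return out

    def affected_by(self, failed_worker: str) -> list[str]:
        """Workers restarted when failed_worker crashes somewhere in this branch."""
        for child in self.children:
            if isinstance(child, Supervisor):
                if failed_worker in child.workers():
                    # The failure is contained in, and handled by, that branch.
                    return child.affected_by(failed_worker)
            elif child == failed_worker:
                return self.workers() if self.strategy == "one_for_all" else [failed_worker]
        return []


content = Supervisor("content", "one_for_all", ["research", "synthesis", "writing"])
support = Supervisor("support", "one_for_one", ["support-1", "support-2"])
root = Supervisor("root", "one_for_one", [support, content])
```

Here `root.affected_by("research")` selects all three content workers, while `root.affected_by("support-1")` selects only that one support worker.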
The depth of the supervision tree should reflect the actual structure of your system. Deep trees provide fine-grained failure isolation but add complexity. Shallow trees are simpler but provide coarser isolation. Two to three levels cover most production systems. Going beyond four levels typically indicates a system that would benefit from decomposition into separate services rather than a deeper tree.
Implementing Supervisors for AI Agents
AI agent supervisors differ from traditional process supervisors in several important ways. Traditional supervisors monitor process health through simple mechanisms like heartbeats and exit codes. AI agent supervisors need to assess the quality and progress of cognitive work, which is fundamentally harder.
Progress monitoring requires domain-specific heuristics. A code agent that has not committed any changes in 15 minutes might be stuck, or it might be working on a particularly complex function. A research agent that has made 50 search queries without synthesizing any results is probably in a loop, but there is no universal rule for what number of queries is too many. Supervisors need configurable thresholds that can be tuned based on experience with specific task types.
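Configurable thresholds of this kind can be sketched as a per-task-type table. The 15-minute and 50-query figures echo the examples above; the remaining numbers and all names are placeholders to be tuned from experience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressThresholds:
    max_idle_s: float                # time allowed without a visible artifact
    max_actions_without_output: int  # e.g. search queries with no synthesis


# Hypothetical starting points, to be tuned per task type.
THRESHOLDS = {
    "code": ProgressThresholds(max_idle_s=15 * 60, max_actions_without_output=200),
    "research": ProgressThresholds(max_idle_s=10 * 60, max_actions_without_output=50),
}


def looks_stuck(task_type: str, idle_s: float, actions_since_output: int) -> bool:
    """Heuristic only: a True result is a reason to investigate, not proof of a loop."""
    limits = THRESHOLDS[task_type]
    return (
        idle_s > limits.max_idle_s
        or actions_since_output >= limits.max_actions_without_output
    )
```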
Quality assessment is even harder than progress monitoring. A supervisor needs to determine whether a worker's output meets the required standard. For some tasks, quality can be checked programmatically: does the code compile, do the tests pass, is the output valid JSON. For others, quality assessment requires another LLM call, which adds cost and introduces its own potential for errors. Some production systems use a lightweight model for quality checks to keep costs manageable while reserving larger models for the actual work.
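The programmatic checks are the cheapest to implement. A minimal sketch using only the Python standard library follows; the function names are illustrative, and the compile check assumes the worker's output is Python source.

```python
import json


def check_json_output(text: str, required_keys: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = sorted(required_keys - set(data))
    return [f"missing key: {key}" for key in missing]


def check_python_compiles(source: str) -> list[str]:
    """Syntax check only: compile() parses the code but does not run it."""
    try:
        compile(source, "<worker-output>", "exec")
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}"]
    return []
```

Returning a list of problems rather than a boolean gives the supervisor something concrete to feed back to a restarted worker.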
Context preservation during restarts is critical for AI agents. When a traditional process is restarted, it typically begins with a clean state. When an AI agent is restarted, it should ideally resume from where it left off rather than starting the entire task over. This requires checkpointing: periodically saving the agent's progress so that a restarted agent can load the checkpoint and continue. The granularity of checkpointing involves a tradeoff between restart efficiency (more checkpoints mean less rework after a restart) and runtime overhead (each checkpoint costs time and storage).
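Checkpointing at step granularity can be sketched as below, assuming the task decomposes into named steps and that progress fits in a JSON file. The file layout, the `completed_steps` key, and the `execute` callback are assumptions for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    if not path.exists():
        return {"completed_steps": []}
    return json.loads(path.read_text(encoding="utf-8"))


def run_task(steps: list[str], path: Path, execute: Callable[[str], None]) -> list[str]:
    """Run steps in order, skipping any that a previous attempt already finished."""
    state = load_checkpoint(path)
    for step in steps:
        if step in state["completed_steps"]:
            continue  # finished before the restart
        execute(step)  # may raise; the last checkpoint stays on disk
        state["completed_steps"].append(step)
        save_checkpoint(path, state)  # one checkpoint per completed step
    return state["completed_steps"]
```

Checkpointing after every step minimizes rework at the cost of one file write per step; coarser checkpoints trade the other way.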
Supervisor Anti-Patterns
Several common mistakes undermine the effectiveness of the supervisor pattern.
Supervisors that do work. A supervisor that also processes tasks mixes management concerns with execution concerns. While it is busy with a task, it cannot monitor its workers, and when a worker fails it cannot respond until that task is finished. Keep supervisors focused exclusively on management: creating, monitoring, and recovering workers, and managing their lifecycles.
Missing escalation paths. A supervisor that restarts workers indefinitely without escalating persistent failures will consume resources without making progress. Every supervisor needs a defined escalation path: after N failed restarts within a time window, stop restarting and alert a human or a higher-level supervisor.
Poorly tuned monitoring intervals. Checking worker health every second creates communication overhead that slows down actual work. Checking every ten minutes means failures go undetected for far too long. The monitoring interval should be proportional to the expected task duration and the cost of delayed detection. For agents handling real-time customer interactions, checking every few seconds is appropriate. For agents running batch analysis jobs, checking every minute or two is sufficient.
Restarting without diagnosis. Blindly restarting a failed worker without understanding why it failed often leads to the same failure repeating. Supervisors should log the failure context, and if the same failure recurs, they should adjust the restart strategy: provide additional context, change model parameters, or reassign to a different worker type.
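A minimal sketch of diagnosis-aware restarts follows: the supervisor counts how often each failure signature has occurred and changes the next attempt accordingly. The parameter keys, the fallback model name, and the specific ladder of adjustments are hypothetical.

```python
from collections import Counter


class FailureLog:
    """Counts how many times each (worker type, error) signature has occurred."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, worker_type: str, error: str) -> int:
        key = (worker_type, error)
        self.counts[key] += 1
        return self.counts[key]


def next_attempt(params: dict, worker_type: str, error: str, log: FailureLog) -> dict:
    """Return the parameters for the next attempt, adjusted if the failure repeats."""
    seen = log.record(worker_type, error)
    if seen == 1:
        return dict(params)  # first occurrence: plain restart
    if seen == 2:
        # Same failure again: tell the worker what went wrong last time.
        return {**params, "extra_context": f"Previous attempts failed with: {error}"}
    if seen == 3:
        return {**params, "model": "fallback-model"}  # hypothetical alternative model
    return {**params, "escalate": True}  # give up on automated recovery
```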
The supervisor pattern turns agent failures from system-ending events into routine incidents that are detected and recovered automatically. Borrowed from decades of Erlang/OTP production experience, it provides the fault-tolerance structure that makes multi-agent systems viable for production workloads.