Supervision Trees for Multi-Agent Coordination
The Actor Model Foundation
Supervision trees originate from Erlang/OTP, whose actor-style process model has powered some of the most reliable distributed systems in production, including telephone switches reported to achieve 99.9999999 percent ("nine nines") availability and messaging platforms that handle billions of messages daily. The core insight is that failures in distributed systems are inevitable, so the system should be designed to recover from them gracefully rather than to prevent them entirely.
In the actor model, each process (or in the case of AI systems, each agent) is an isolated unit that communicates with other units through message passing. If a process crashes, only that process is affected because it shares no state with other processes. Its supervisor detects the crash and decides how to respond, typically by restarting the failed process with a clean state. This isolation and supervision structure translates directly to multi-agent AI systems, where each agent runs as an independent LLM invocation that can fail without bringing down the entire system.
The key principle is "let it crash." Rather than wrapping every operation in defensive error handling, agents are allowed to fail fast when they encounter unexpected conditions. The supervision layer handles recovery, which keeps the agent code simple and focused on its primary task rather than cluttered with error-handling logic. This philosophy produces cleaner agent prompts and more predictable behavior because agents are designed to either succeed completely or fail clearly, which shrinks the problematic middle ground where an agent partially succeeds and produces subtly incorrect output.
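The let-it-crash idea can be illustrated with a minimal sketch. It assumes an agent is just a callable from a task string to an output string; `supervise`, `flaky_agent_factory`, and the `max_attempts` parameter are hypothetical names chosen for illustration, not part of any framework.

```python
from typing import Callable, Optional

Agent = Callable[[str], str]


def supervise(make_agent: Callable[[], Agent], task: str, max_attempts: int = 3) -> str:
    """Run a fresh agent on `task`, rebuilding it from scratch after each crash."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        agent = make_agent()  # new instance, so no agent state survives a crash
        try:
            return agent(task)  # the agent itself contains no error handling
        except Exception as exc:  # any crash is the supervisor's responsibility
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


def flaky_agent_factory() -> Callable[[], Agent]:
    """Hypothetical agent that crashes on the first invocation only."""
    calls = {"count": 0}  # lives outside the agent to simulate a transient external fault

    def make_agent() -> Agent:
        def agent(task: str) -> str:
            calls["count"] += 1
            if calls["count"] == 1:
                raise ValueError("unexpected tool response")
            return f"done: {task}"
        return agent

    return make_agent


result = supervise(flaky_agent_factory(), "summarize report")
```

The agent function has no try/except of its own; all recovery logic sits in `supervise`, which is the separation the paragraph above describes.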
Restart Strategies
When an agent fails, its supervisor must decide how to respond. The three standard restart strategies from the Erlang model are one-for-one, one-for-all, and rest-for-one. Each strategy is appropriate for different relationships between the supervised agents.
One-for-one restarts only the failed agent, leaving all other agents under the same supervisor untouched. This is the default strategy and works well when agents are independent and do not share state. If a research agent fails while gathering information from one source, the supervisor restarts it with the same task while other research agents continue working on their respective sources undisturbed. One-for-one is the most common strategy in multi-agent AI systems because most agent architectures deliberately minimize shared state between agents.
One-for-all restarts all agents under the same supervisor when any one of them fails. This strategy is appropriate when agents share state that becomes inconsistent after a partial failure. If three agents are collaboratively editing a shared document and one of them crashes mid-edit, the document may be in an inconsistent state. Restarting all three agents and rolling back to the last consistent checkpoint ensures the system recovers to a known good state. This strategy is more disruptive but necessary when agent work is tightly coupled and partial results cannot be trusted.
Rest-for-one restarts the failed agent and all agents that were started after it, maintaining initialization order dependencies. This is useful when agents have startup dependencies: if Agent C depends on Agent B which depends on Agent A, and Agent B fails, both Agent B and Agent C must be restarted in order while Agent A continues running. This strategy preserves the work of agents that do not depend on the failed agent while ensuring dependent agents are properly reinitialized.
Choosing the right restart strategy requires understanding the state dependencies between agents. Map out which agents share state, which agents depend on the output of other agents, and which agents are fully independent. Independent agents use one-for-one. Agents with shared mutable state use one-for-all. Agents with ordered initialization dependencies use rest-for-one.
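The three strategies differ only in which agents get restarted after a failure. A small sketch, assuming the supervisor keeps its children as a list of names in start order (the `Strategy` enum and `agents_to_restart` function are hypothetical names for illustration):

```python
from enum import Enum
from typing import List


class Strategy(Enum):
    ONE_FOR_ONE = "one_for_one"
    ONE_FOR_ALL = "one_for_all"
    REST_FOR_ONE = "rest_for_one"


def agents_to_restart(children: List[str], failed: str, strategy: Strategy) -> List[str]:
    """Return the agents to restart, given `children` listed in start order."""
    position = children.index(failed)  # raises ValueError for an unknown agent
    if strategy is Strategy.ONE_FOR_ONE:
        return [failed]  # only the failed agent
    if strategy is Strategy.ONE_FOR_ALL:
        return list(children)  # every agent under this supervisor
    return children[position:]  # the failed agent plus everything started after it


children = ["agent_a", "agent_b", "agent_c"]
one = agents_to_restart(children, "agent_b", Strategy.ONE_FOR_ONE)
everyone = agents_to_restart(children, "agent_b", Strategy.ONE_FOR_ALL)
rest = agents_to_restart(children, "agent_b", Strategy.REST_FOR_ONE)
```

With Agent B failing, one-for-one selects only Agent B, one-for-all selects all three, and rest-for-one selects Agent B and Agent C while leaving Agent A running.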
Output Validation
Beyond crash detection, supervisors can also validate the quality of agent outputs before accepting them. A supervisor agent might check whether a research agent's output contains factual claims with cited sources, whether a writing agent's output meets length and formatting requirements, or whether a code agent's output compiles and passes basic tests.
Output validation catches a category of failures that crash detection misses: cases where the agent completes successfully but produces incorrect or low-quality output. An agent might hallucinate facts, produce output in the wrong format, or generate content that contradicts previous decisions made by other agents. These silent failures can be more damaging than crashes because they propagate incorrect information through the system without triggering any error signals. A research agent that confidently cites a non-existent study creates a downstream chain of errors that is difficult to trace back to its origin.
Implementing output validation requires defining clear quality criteria for each agent type. Research agents might be validated on source citation quality, consistency with established facts, and completeness of coverage. Writing agents might be validated on word count, readability score, adherence to style guidelines, and logical coherence between sections. Code agents might be validated through automated testing, static analysis, and linting. The validation criteria should be specific, measurable, and automated wherever possible to avoid introducing human bottlenecks into the workflow.
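One way to keep criteria specific and automated is to express each one as a named predicate. The sketch below is an assumption about how that might look; the `Check` type, the `[source: ...]` citation marker, and the five-word minimum are invented for illustration and would be replaced by real criteria.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    passed: Callable[[str], bool]


def validate(output: str, checks: List[Check]) -> List[str]:
    """Return the names of failed checks; an empty list means the output is accepted."""
    return [check.name for check in checks if not check.passed(output)]


# Hypothetical criteria for a research-style agent.
research_checks = [
    Check("min_words", lambda text: len(text.split()) >= 5),
    Check("has_citation", lambda text: "[source:" in text),
]

rejected = validate("Too short", research_checks)
accepted = validate("The launch slipped two weeks [source: status report]", research_checks)
```

Returning the names of the failed checks, rather than a single boolean, gives the supervisor concrete feedback to pass back to the agent on a retry.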
When validation fails, the supervisor has several options: retry the same agent with feedback about what was wrong, escalate to a more capable agent using a higher-tier model, or route the task to a fallback agent with a different approach. The retry option is usually tried first because many quality failures are non-deterministic and a second attempt with feedback often produces acceptable results. If repeated retries fail, the supervisor escalates to a stronger model or a different agent, and surfaces the failure to its own supervisor if that also fails.
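The retry-then-escalate sequence can be sketched as a loop over an ordered list of agents. This assumes each agent accepts the task plus optional feedback, and that the validator returns a problem description or `None`; `run_with_validation`, `tiers`, and `retries_per_tier` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

Agent = Callable[[str, Optional[str]], str]  # (task, feedback) -> output
Validator = Callable[[str], Optional[str]]   # output -> problem description, or None


def run_with_validation(
    task: str, tiers: List[Agent], validator: Validator, retries_per_tier: int = 2
) -> Tuple[str, int]:
    """Try each tier in order, feeding validation problems back on each retry."""
    feedback: Optional[str] = None
    for tier_index, agent in enumerate(tiers):
        for _ in range(retries_per_tier):
            output = agent(task, feedback)
            problem = validator(output)
            if problem is None:
                return output, tier_index  # accepted; report which tier succeeded
            feedback = problem  # carried into the next attempt or the next tier
    raise RuntimeError(f"no agent produced valid output; last problem: {feedback}")


def needs_citation(output: str) -> Optional[str]:
    return None if "[source:" in output else "missing citation"


def basic_agent(task: str, feedback: Optional[str]) -> str:
    # Hypothetical agent that only adds a citation once it is told one is missing.
    return f"{task}: answer" + (" [source: notes]" if feedback else "")


def stubborn_agent(task: str, feedback: Optional[str]) -> str:
    return f"{task}: answer"  # ignores feedback entirely


def stronger_agent(task: str, feedback: Optional[str]) -> str:
    return f"{task}: answer [source: archive]"


retried = run_with_validation("q1", [basic_agent], needs_citation)
escalated = run_with_validation("q2", [stubborn_agent, stronger_agent], needs_citation)
```

In the first call the basic agent succeeds on its second attempt at tier 0; in the second call the stubborn agent exhausts its retries and the task escalates to tier 1.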
Health Monitoring
Proactive health monitoring allows supervisors to detect degradation before it becomes a failure. Four health indicators are particularly useful for AI agents. Response latency shows whether the agent is taking longer than usual to respond, which may indicate that it is struggling with the task. Token consumption shows whether the agent is using more tokens than expected, which suggests unnecessarily verbose output or circular reasoning. Output quality trends show whether quality is declining over consecutive invocations, which can indicate prompt drift or model degradation. Error rates show whether the agent is encountering more tool call failures or API errors than its baseline.
When health indicators cross predefined thresholds, the supervisor can take preemptive action: switching the agent to a more capable model, reducing its workload by routing some tasks to other agents, adjusting the agent's prompt to provide more guidance, or proactively restarting it before a full failure occurs. This proactive approach reduces the frequency and impact of agent failures compared to purely reactive supervision that only responds after a crash has occurred.
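A threshold check of this kind can be very small. The sketch below assumes one health sample per invocation; the field names and the default limits (30 seconds, 4000 tokens, 5 percent errors) are placeholder values for illustration, not recommendations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSample:
    latency_s: float
    tokens_used: int
    error_rate: float  # fraction of tool or API calls that failed


@dataclass
class Thresholds:
    # Placeholder limits; real values would come from each agent's measured baseline.
    max_latency_s: float = 30.0
    max_tokens: int = 4000
    max_error_rate: float = 0.05


def breached_indicators(sample: HealthSample, limits: Thresholds) -> List[str]:
    """Return the names of the indicators that exceed their limits."""
    breaches = []
    if sample.latency_s > limits.max_latency_s:
        breaches.append("latency")
    if sample.tokens_used > limits.max_tokens:
        breaches.append("tokens")
    if sample.error_rate > limits.max_error_rate:
        breaches.append("error_rate")
    return breaches


healthy = breached_indicators(HealthSample(12.0, 1500, 0.01), Thresholds())
struggling = breached_indicators(HealthSample(45.0, 5200, 0.01), Thresholds())
```

The supervisor can then map each breached indicator to one of the preemptive actions above, for example switching models on a latency breach or restarting on a rising error rate.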
Trend-based monitoring is particularly important for detecting gradual degradation that does not trigger absolute threshold alerts. An agent's average quality score might decline by one percent per day, which is invisible in individual task reviews but compounds to a drop of roughly a quarter within a month. Monitoring quality trends over rolling windows catches this gradual degradation early, allowing prompt adjustments or model changes before the cumulative impact becomes noticeable to end users.
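One simple way to implement a rolling-window check is to compare the mean of the most recent window with the mean of the window before it. This is a sketch of that approach under assumed parameters (a 7-sample window and a 3 percent tolerated drop); `quality_is_degrading` is a hypothetical name, and other trend tests would work as well.

```python
from statistics import mean
from typing import Sequence


def quality_is_degrading(
    scores: Sequence[float], window: int = 7, max_relative_drop: float = 0.03
) -> bool:
    """Flag a decline between the previous window and the most recent one."""
    if len(scores) < 2 * window:
        return False  # not enough history to compare two full windows
    previous = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    if previous <= 0:
        return False  # avoid dividing by zero on degenerate data
    return (previous - recent) / previous > max_relative_drop


# Synthetic data: a score that starts at 0.90 and loses one percent per day.
declining = [0.90 * 0.99 ** day for day in range(14)]
stable = [0.90] * 14

declining_flagged = quality_is_degrading(declining)
stable_flagged = quality_is_degrading(stable)
```

On the synthetic declining series, each daily step is only one percent, yet the week-over-week comparison shows a drop of almost seven percent, which is what makes the trend visible.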
Implementing Supervision in Practice
Most multi-agent frameworks provide primitives that can be composed into supervision trees, even where they do not use that term. LangGraph supports conditional edges that route on agent outputs, enabling validation and retry logic within the execution graph. CrewAI provides task-level callbacks that can be used for output validation and retry. AutoGen supports supervisory agents that can interrupt, redirect, or restart group conversations.
For production systems, the supervision tree should mirror the agent hierarchy. The top-level orchestrator is supervised by a system-level monitor that handles orchestrator failures. Department-level managers are supervised by the orchestrator. Worker agents are supervised by their respective managers. This hierarchical supervision ensures that failures at any level have a clear escalation path and recovery mechanism.
A practical starting point is to implement two layers of supervision: a worker supervisor for each agent team that handles retry logic and quality validation, and a system supervisor that monitors overall system health and handles failures in the worker supervisors themselves. This two-layer structure provides robust fault tolerance without the complexity of deeply nested supervision hierarchies. As your system grows and failure patterns become better understood, you can add additional supervision layers for specific agent teams that need more sophisticated failure handling.
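The two-layer structure can be sketched with two small classes. This is an illustrative outline under simplifying assumptions (synchronous calls, one agent per team); `WorkerSupervisor`, `SystemSupervisor`, and their parameters are hypothetical names rather than any framework's API.

```python
from typing import Callable


class WorkerSupervisor:
    """First layer: retries a single agent and validates its output."""

    def __init__(self, agent: Callable[[str], str], is_valid: Callable[[str], bool],
                 max_attempts: int = 2) -> None:
        self.agent = agent
        self.is_valid = is_valid
        self.max_attempts = max_attempts

    def run(self, task: str) -> str:
        for _ in range(self.max_attempts):
            try:
                output = self.agent(task)
            except Exception:
                continue  # crash: try again
            if self.is_valid(output):
                return output
        raise RuntimeError("worker supervisor exhausted its attempts")


class SystemSupervisor:
    """Second layer: rebuilds a worker supervisor when it gives up."""

    def __init__(self, make_worker_supervisor: Callable[[], WorkerSupervisor],
                 max_restarts: int = 1) -> None:
        self.make_worker_supervisor = make_worker_supervisor
        self.max_restarts = max_restarts

    def run(self, task: str) -> str:
        for _ in range(self.max_restarts + 1):
            team = self.make_worker_supervisor()  # fresh supervisor, clean state
            try:
                return team.run(task)
            except RuntimeError:
                continue  # the worker supervisor itself failed; restart it
        raise RuntimeError("system supervisor could not recover; escalate to an operator")


attempts = {"count": 0}


def unreliable_agent(task: str) -> str:
    # Hypothetical agent that crashes on its first three invocations.
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ValueError("transient failure")
    return f"report for {task}"


system = SystemSupervisor(
    lambda: WorkerSupervisor(unreliable_agent, is_valid=lambda out: out.startswith("report"))
)
result = system.run("Q3 sales")
```

Here the first worker supervisor gives up after two crashes, the system supervisor replaces it, and the second worker supervisor succeeds on its second attempt, so the task completes on the fourth invocation overall.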
Design supervision trees with the let-it-crash philosophy: keep agents simple and focused, use one-for-one restart as the default strategy, implement output validation to catch silent failures, monitor agent health proactively to prevent degradation, and structure the supervision hierarchy to mirror your agent organization.