Supervision Trees: The Foundation of Reliable AI
How Supervision Trees Work
A supervision tree has two types of nodes: supervisors and workers. Workers do the actual work: processing tasks, calling APIs, and executing tools. Supervisors do nothing except watch their children and respond when something goes wrong. This separation of concerns is critical. The code that does the work does not need to worry about recovery, and the code that handles recovery does not need to understand the work.
When a worker process crashes, its supervisor receives a notification containing the process identifier and the reason for the crash. The supervisor then decides what to do based on its configured restart strategy. In the simplest case, it starts a new instance of the worker with a clean initial state. The new worker takes over where the old one left off, or starts its current task from the beginning, depending on whether state checkpointing is in place.
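The crash-and-restart cycle described above can be sketched in a few lines of Python. This is a minimal in-process illustration, not a production supervisor: exceptions stand in for process crashes, and the names Supervisor, WorkerCrash, and flaky_worker_factory are hypothetical ones chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class WorkerCrash(Exception):
    """Raised by a worker when it cannot continue (stands in for a process crash)."""


@dataclass
class Supervisor:
    # Factory that builds a fresh worker with clean initial state.
    make_worker: Callable[[], Callable[[], str]]
    max_restarts: int = 3
    # One (worker_id, reason) pair per crash notification received.
    crash_log: List[Tuple[int, str]] = field(default_factory=list)

    def run(self) -> str:
        for worker_id in range(self.max_restarts + 1):
            worker = self.make_worker()  # every start gets clean state
            try:
                return worker()
            except WorkerCrash as exc:
                # The supervisor only records the failure and restarts;
                # it knows nothing about the work itself.
                self.crash_log.append((worker_id, str(exc)))
        raise RuntimeError("restart limit reached; escalate to parent")


def flaky_worker_factory(failures: int) -> Callable[[], Callable[[], str]]:
    """Hypothetical worker that crashes `failures` times, then succeeds."""
    remaining = {"crashes": failures}

    def make_worker() -> Callable[[], str]:
        def work() -> str:
            if remaining["crashes"] > 0:
                remaining["crashes"] -= 1
                raise WorkerCrash("unexpected tool response")
            return "task complete"
        return work

    return make_worker


supervisor = Supervisor(make_worker=flaky_worker_factory(failures=2))
result = supervisor.run()
print(result, supervisor.crash_log)
```

Note that the worker contains no retry logic and the supervisor contains no task logic, which is the separation the section describes.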
Supervisors can themselves be children of higher-level supervisors, creating a tree structure. A root supervisor manages subsystem supervisors, each of which manages a group of workers. This creates a layered defense: if a worker crashes, its direct supervisor handles it. If the supervisor itself crashes (because too many children are failing), its parent handles it. Failures propagate upward only as far as necessary to find a supervisor that can resolve the situation.
Restart Strategies
The restart strategy is the policy that determines how a supervisor responds when a child fails. Three standard strategies cover most use cases.
One-for-one restarts only the child that crashed, leaving all other children running. This is the most common strategy and is appropriate when children are independent of each other. In an AI agent system, if you have five parallel agents processing different tasks, a crash in one should not affect the others.
One-for-all restarts all children when any single child crashes. This strategy is used when children share state or have mutual dependencies that become inconsistent after a partial failure. For example, if three agents share an in-memory coordination structure, a crash in one agent might corrupt that structure, making it unsafe for the others to continue.
Rest-for-one restarts the crashed child and all children that were started after it, leaving earlier children running. This is used when children have ordered dependencies, where child B depends on child A, and child C depends on both. If B crashes, C must also restart because its dependency is gone, but A can continue.
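The three strategies differ only in which children they select for restart. A small sketch makes the difference concrete, assuming children are kept in start order; the function name children_to_restart and the strategy strings are illustrative choices, not part of any particular library.

```python
from typing import List


def children_to_restart(strategy: str, children: List[str], crashed: str) -> List[str]:
    """Return the children a supervisor would restart, in start order."""
    index = children.index(crashed)  # raises ValueError for an unknown child
    if strategy == "one_for_one":
        return [crashed]
    if strategy == "one_for_all":
        return list(children)
    if strategy == "rest_for_one":
        # The crashed child plus everything started after it.
        return children[index:]
    raise ValueError(f"unknown strategy: {strategy}")


children = ["A", "B", "C"]  # started in this order; B depends on A, C on both
one_for_one = children_to_restart("one_for_one", children, "B")
one_for_all = children_to_restart("one_for_all", children, "B")
rest_for_one = children_to_restart("rest_for_one", children, "B")
print(one_for_one, one_for_all, rest_for_one)
```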
Restart Intensity and Escalation
Supervisors implement restart intensity limits to prevent infinite restart loops. A typical configuration might allow a maximum of five restarts within sixty seconds. If a sixth restart would be needed inside that window, the supervisor concludes that something is fundamentally wrong, not a transient issue that restarts can fix.
When the restart intensity limit is exceeded, the supervisor itself terminates. This causes the supervisor's parent to detect the failure and apply its own restart strategy. The escalation continues upward until either a supervisor successfully handles the situation or the root supervisor terminates, shutting down the entire system.
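One way to track restart intensity is a sliding window of restart timestamps. The sketch below assumes the five-restarts-in-sixty-seconds configuration from the text and takes the current time as an argument so the behavior is easy to follow; the class name RestartIntensity and its method are hypothetical.

```python
from collections import deque
from typing import Deque


class RestartIntensity:
    """Sliding-window counter for restarts (illustrative, not a library API)."""

    def __init__(self, max_restarts: int = 5, window_seconds: float = 60.0) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts: Deque[float] = deque()

    def allow_restart(self, now: float) -> bool:
        """Record a restart request at `now`; False means the supervisor should terminate."""
        # Forget restarts that have fallen out of the window.
        while self._restarts and now - self._restarts[0] > self.window_seconds:
            self._restarts.popleft()
        self._restarts.append(now)
        return len(self._restarts) <= self.max_restarts


# Six crashes in fifty seconds: the sixth request exceeds the limit.
burst = RestartIntensity()
burst_decisions = [burst.allow_restart(t) for t in (0, 10, 20, 30, 40, 50)]

# Six crashes spread over 150 seconds never exceed five per window.
spread = RestartIntensity()
spread_decisions = [spread.allow_restart(t) for t in (0, 30, 60, 90, 120, 150)]
print(burst_decisions, spread_decisions)
```

When allow_restart returns False, the supervisor would stop restarting and terminate, which hands the decision to its parent.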
This escalation mechanism prevents localized problems from consuming unlimited resources while ensuring that truly unrecoverable failures are surfaced to operators. If the root supervisor shuts down, external process monitors (like systemd, Docker restart policies, or Kubernetes liveness probes) can restart the entire application.
Applying Supervision to AI Agents
In an AI agent system, the supervision tree maps naturally onto the orchestration architecture. A typical structure might include a root supervisor managing three subsystem supervisors: one for task management, one for tool execution, and one for communication.
The task management supervisor oversees individual agent worker processes, each handling a separate task. The tool execution supervisor manages tool adapter processes for web scraping, database queries, file operations, and API integrations. The communication supervisor manages input/output channels including API endpoints, webhook handlers, and notification services.
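The structure above can be written down as data before any process is started. The following sketch is one possible representation: the Node type, the worker names, and the choice of one-for-one at every level are assumptions made for illustration, since the text does not prescribe them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    strategy: Optional[str] = None  # set for supervisors, None for workers
    children: List["Node"] = field(default_factory=list)


def escalation_path(root: Node, worker_name: str) -> Optional[List[str]]:
    """Supervisors that would handle a failure, from direct parent up to the root."""
    if root.name == worker_name:
        return []
    for child in root.children:
        path = escalation_path(child, worker_name)
        if path is not None:
            return path + [root.name]
    return None  # worker is not in this subtree


tree = Node("root", "one_for_one", [
    Node("task_management", "one_for_one", [Node("agent_1"), Node("agent_2")]),
    Node("tool_execution", "one_for_one", [
        Node("web_scraper"), Node("db_query"), Node("file_ops"), Node("api_adapter"),
    ]),
    Node("communication", "one_for_one", [
        Node("api_endpoint"), Node("webhook_handler"), Node("notifier"),
    ]),
])

scraper_path = escalation_path(tree, "web_scraper")
print(scraper_path)
```

Here a scraper failure is handled by the tool execution supervisor first and reaches the root only if that supervisor gives up.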
When the web scraper process crashes because a target website returned unexpected HTML, the tool execution supervisor restarts it with a clean browser session. The task management agents continue working, using cached results or waiting briefly for the scraper to return. No human intervention is needed, and no other subsystem is affected.
When an agent worker gets stuck in an infinite loop and exceeds its step limit, the task management supervisor terminates it and starts a fresh agent for the same task. The new agent can load the task checkpoint and resume from the last successful step, or start over with a modified approach based on the failure reason.
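Resuming from a checkpoint after a step-limit failure might look like the sketch below. It assumes a checkpoint is simply the index of the next step to run, stored in a dictionary keyed by task; a real system would persist this somewhere durable. All names here (run_agent, StepLimitExceeded, stuck_step) are hypothetical.

```python
from typing import Dict, List, Optional


class StepLimitExceeded(Exception):
    """Raised when an agent uses up its step budget without finishing."""


def run_agent(task_id: str, steps: List[str], checkpoints: Dict[str, int],
              log: List[str], step_limit: int,
              stuck_step: Optional[int] = None) -> None:
    index = checkpoints.get(task_id, 0)  # resume from the last checkpoint
    budget = step_limit
    while index < len(steps):
        if budget == 0:
            raise StepLimitExceeded(f"{task_id} stuck at step {index}")
        budget -= 1
        if index == stuck_step:
            continue  # simulated infinite loop: budget burns, no progress
        log.append(steps[index])
        index += 1
        checkpoints[task_id] = index  # checkpoint after each successful step


steps = ["plan", "search", "summarize", "report"]
checkpoints: Dict[str, int] = {}
log: List[str] = []

try:
    run_agent("task-42", steps, checkpoints, log, step_limit=10, stuck_step=2)
except StepLimitExceeded:
    # Supervisor response: discard the stuck agent, start a fresh one.
    run_agent("task-42", steps, checkpoints, log, step_limit=10)

print(log, checkpoints)
```

The replacement agent skips the two steps that already succeeded, so no work is repeated.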
Designing Supervisor Hierarchies
The key design decision in supervision trees is how to group processes under supervisors. The grouping determines the blast radius of a failure, that is, which processes are affected when something goes wrong.
Group by failure domain: processes that share a common failure cause should be under the same supervisor. If all agents that use the OpenAI API might fail simultaneously during an API outage, they belong under a supervisor that can handle API-wide failures, perhaps by switching all agents to an alternative provider.
Group by restart cost: processes that are expensive to restart (because they carry significant state or require lengthy initialization) should be separated from processes that are cheap to restart. This prevents a cheap process crash from forcing an expensive restart through a one-for-all strategy.
Keep the tree shallow: deep nesting adds latency to failure detection and recovery. Three to four levels are sufficient for most systems. Each level should represent a meaningful architectural boundary, not an arbitrary subdivision.
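The restart-cost guideline above can be checked with simple arithmetic. The sketch below assumes a one-for-all group restarts its children one after another, and the process names and millisecond costs are invented for illustration.

```python
from typing import Dict, List


def one_for_all_restart_cost(group: List[str], cost_ms: Dict[str, int]) -> int:
    """Total restart time when any child in a one-for-all group crashes."""
    return sum(cost_ms[name] for name in group)


# Hypothetical restart costs in milliseconds.
cost_ms = {"embedding_index": 120_000, "http_client": 500, "rate_limiter": 200}

# Everything under one supervisor: a cheap crash pays for the expensive restart.
mixed_cost = one_for_all_restart_cost(
    ["embedding_index", "http_client", "rate_limiter"], cost_ms
)

# Expensive process moved to its own supervisor.
split_cost = one_for_all_restart_cost(["http_client", "rate_limiter"], cost_ms)
print(mixed_cost, split_cost)
```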
Supervision in Different Languages
Erlang and Elixir provide supervision trees as a built-in language feature with decades of production hardening. The BEAM virtual machine supports lightweight processes (millions per machine) with isolated memory spaces, making supervision both natural and efficient.
In Python, supervision must be implemented manually or through frameworks. The multiprocessing module provides process isolation, and libraries like pykka offer actor-model abstractions, although restart and supervision logic generally has to be built on top of them. Python processes are also heavyweight compared to BEAM processes, which limits the granularity of supervision.
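A manual implementation in Python can be quite small. The sketch below uses the standard-library subprocess module rather than multiprocessing, purely to keep the example self-contained: the child is a separate interpreter, and the supervisor restarts it whenever it exits with a non-zero status. The function name supervise_subprocess and the counter-file trick used to simulate a flaky child are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List

# Hypothetical child program: fails on its first two runs, then succeeds.
CHILD_CODE = """
import sys
from pathlib import Path
counter = Path(sys.argv[1])
attempts = int(counter.read_text()) + 1
counter.write_text(str(attempts))
sys.exit(0 if attempts >= 3 else 1)
"""


def supervise_subprocess(code: str, args: List[str], max_restarts: int = 3) -> List[int]:
    """Run `code` in a child interpreter, restarting it after a non-zero exit."""
    exit_codes: List[int] = []
    for _ in range(max_restarts + 1):
        completed = subprocess.run([sys.executable, "-c", code, *args])
        exit_codes.append(completed.returncode)
        if completed.returncode == 0:
            break  # child finished cleanly; nothing left to supervise
    return exit_codes


with tempfile.TemporaryDirectory() as tmp:
    counter_file = Path(tmp) / "attempts.txt"
    counter_file.write_text("0")
    exit_codes = supervise_subprocess(CHILD_CODE, [str(counter_file)])

print(exit_codes)
```

Each restart here launches a full interpreter, which illustrates why process-level supervision in Python is coarser than in BEAM.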
In Go, goroutines provide lightweight concurrency, and supervision patterns can be implemented using goroutine monitoring, context cancellation, and channel-based communication. Libraries like suture provide Erlang-style supervision for Go programs.
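In Kubernetes environments, the pod restart policy and controller reconciliation loops provide infrastructure-level supervision. The kubelet restarts failed containers according to the pod's restartPolicy, and controllers such as Deployments (through their ReplicaSets) replace pods that are lost, so each pod behaves like a supervised worker with the controller as its supervisor. This is coarser-grained than process-level supervision but requires no application-level implementation.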
Common Mistakes
The most common mistake is putting too much logic in the supervisor. Supervisors should do nothing except monitor and restart. The moment a supervisor starts processing data, managing state, or making business decisions, it becomes a single point of failure.
Another common mistake is using overly aggressive restart policies. Restarting a process immediately after every crash, with no cooldown or backoff, can create restart storms that consume CPU and memory without making progress. Restart intensity limits cap how long such a storm can last, and a growing delay between restarts reduces the load while it does.
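A cooldown between restarts is often implemented as capped exponential backoff. The sketch below is one common form; the base delay of half a second and the thirty-second cap are arbitrary values chosen for illustration.

```python
def backoff_delay(restart_count: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before restart number `restart_count` (1-based)."""
    return min(cap, base * 2 ** (restart_count - 1))


delays = [backoff_delay(n) for n in range(1, 9)]
print(delays)
```

The delay doubles after each crash until it reaches the cap. Many implementations also add random jitter so that several crashed workers do not restart at the same instant.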
A third mistake is failing to clean up resources before restarting. When a process crashes, it may leave behind open file handles, network connections, temporary files, or database locks. The supervisor should ensure these resources are released before starting the replacement process.
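For in-process workers, one way to guarantee cleanup is to register every resource with a context manager so it is released even when the worker crashes. The sketch below uses contextlib.ExitStack from the standard library; the "db_lock" entry is a stand-in for releasing a real lock or connection, and the function names are hypothetical.

```python
import os
import tempfile
from contextlib import ExitStack
from typing import Callable, List, Tuple


def run_worker_once(work: Callable[[str], str], released: List[str]) -> str:
    with ExitStack() as stack:
        # Temporary directory is deleted when the stack unwinds.
        scratch_dir = stack.enter_context(tempfile.TemporaryDirectory())
        # Stand-in for releasing a database lock or closing a connection.
        stack.callback(released.append, "db_lock")
        return work(scratch_dir)


def supervise(work: Callable[[str], str], max_restarts: int = 2) -> Tuple[str, List[str]]:
    released: List[str] = []
    for _ in range(max_restarts + 1):
        try:
            return run_worker_once(work, released), released
        except RuntimeError:
            continue  # resources were already released by the ExitStack
    raise RuntimeError("restart limit reached; escalate to parent")


seen_dirs: List[str] = []


def flaky(scratch_dir: str) -> str:
    seen_dirs.append(scratch_dir)
    if len(seen_dirs) == 1:
        raise RuntimeError("crashed mid-task")
    return "ok"


result, released = supervise(flaky)
print(result, released, [os.path.exists(d) for d in seen_dirs])
```

When workers are separate operating-system processes, the operating system closes their file handles on exit, but temporary files and external locks may still need explicit cleanup of this kind.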
Supervision trees separate the concerns of doing work from handling failure. By organizing agent processes into monitored hierarchies with clear restart strategies, you create systems that automatically recover from crashes, contain failures within boundaries, and escalate only when truly necessary.