From Telecom to AI: The Erlang Reliability Story
The Problem That Created Erlang
Telephone switches in the 1980s had requirements that no existing programming language could satisfy. They needed to handle tens of thousands of simultaneous phone calls, each as an independent process. They needed to stay online 24/7 because telephone service cannot have scheduled maintenance windows. They needed to survive hardware failures, software bugs, and operator errors without affecting active calls.
Joe Armstrong, Robert Virding, and Mike Williams at the Ericsson Computer Science Laboratory began experimenting with different approaches in 1986. They tried Prolog, Lisp, and several other languages before concluding that no existing language had the right concurrency and fault tolerance primitives. They needed something new.
The language they created was named Erlang, partly after Danish mathematician Agner Krarup Erlang (who developed the theory of telephone traffic) and partly as a contraction of "Ericsson Language." The first version was implemented in Prolog in 1986. Work on the BEAM virtual machine, which is still in use today, began in 1992 as the language matured.
The Key Insight: Let It Crash
The most counterintuitive and most important principle in Erlang philosophy is "let it crash." Instead of trying to handle every possible error condition within the code that encounters the error, Erlang programs are designed to crash on unexpected conditions and let a supervisor process handle the recovery.
This principle emerged from a practical observation: error handling code is often buggy itself. When engineers write complex try-catch blocks to handle every edge case, they introduce new bugs in the error handling code. The error handling becomes more complex than the business logic it protects, and it is tested far less thoroughly because error conditions are rare.
The Erlang approach is simpler: write the happy path clearly and correctly. If anything unexpected happens, crash. The supervisor detects the crash and starts a new process with clean state. The new process handles the next request correctly because it starts fresh, without any corrupted state from the failed attempt.
This works because Erlang processes are isolated. A crash in one process cannot corrupt memory in another process, because each process has its own heap. There is no shared mutable state to corrupt. The worst that can happen is that the crashed process loses its in-flight work, which is acceptable because the supervisor immediately starts a replacement.
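The restart loop described above can be sketched outside Erlang as well. The following is a minimal Python illustration, not how the BEAM implements supervision: the worker source, the simulated first-attempt bug, and the supervise function are all hypothetical names invented for this example. Each attempt runs in a separate operating system process, so a crash takes its memory with it and the next attempt starts clean.

```python
import subprocess
import sys

# Hypothetical worker: happy path only, with a simulated bug on attempt 0.
WORKER_SOURCE = """
import sys
attempt = int(sys.argv[1])
if attempt == 0:
    raise RuntimeError("simulated bug on first attempt")
print("handled request on attempt", attempt)
"""


def supervise(max_restarts: int = 3) -> tuple[int, str]:
    """Run the worker in its own OS process and restart it if it crashes.

    Returns (number_of_restarts, worker_output).
    """
    for attempt in range(max_restarts + 1):
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, str(attempt)],
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            return attempt, proc.stdout.strip()
        # Crash detected. The child's heap is gone, so no corrupted state
        # carries over; loop around and start a fresh process.
    raise RuntimeError(f"worker still failing after {max_restarts} restarts")


if __name__ == "__main__":
    restarts, output = supervise()
    print(f"restarts={restarts} output={output!r}")
```

The restart limit matters: a supervisor that restarts forever hides a persistent bug, so this sketch gives up and raises after a fixed number of attempts, loosely mirroring the restart intensity limits that Erlang supervisors enforce.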
The AXD 301: Nine Nines in Production
The strongest evidence that Erlang's approach works is the Ericsson AXD 301 ATM switch, released in 1998. This system is widely reported to have achieved 99.9999999% availability in production, a figure so extreme that it corresponds to about 31 milliseconds of downtime per year. The system ran for years without being shut down, handling telecommunications traffic for major carriers.
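The downtime figure follows from simple arithmetic: an availability with n nines leaves a fraction of 10^-n of the year unavailable. A short Python sketch makes the conversion explicit. The function name is hypothetical, and the year length of 365.25 days is an assumption of this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def downtime_per_year_seconds(nines: int) -> float:
    """Expected yearly downtime for an availability with `nines` nines.

    For example, nines=5 means 99.999% availability.
    """
    unavailable_fraction = 10 ** -nines
    return SECONDS_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 5, 9):
        print(f"{n} nines: {downtime_per_year_seconds(n):.6f} seconds per year")
```

Nine nines comes out to roughly 0.0316 seconds per year, while the more common target of five nines allows a little over five minutes.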
The AXD 301 consisted of over a million lines of Erlang code and ran on multiple hardware nodes. When a hardware node failed, the software automatically redistributed its load to other nodes. When a software process crashed, its supervisor restarted it within milliseconds. When engineers needed to deploy a code update, they used hot code reloading to update running processes without stopping the system.
This level of reliability was not achieved through perfect code. The AXD 301 had bugs, like all large software systems. It achieved reliability through architectural patterns that contained and recovered from bugs automatically, faster than any human could respond.
From Telecom to Internet Scale
For years, Erlang remained a niche language known primarily within the telecom industry. That changed when internet companies discovered that their problems (massive concurrency, always-on requirements, and tolerance for component failures) were structurally identical to telecom problems.
WhatsApp was the most prominent early adopter. When Facebook acquired WhatsApp in 2014, the messaging service was handling 600 million users with a team of about 35 engineers. Their Erlang-based server architecture could handle 2 million TCP connections per server, a number that seemed unreasonable to engineers accustomed to Java or Python servers handling thousands.
Discord adopted Elixir (a modern language running on the BEAM virtual machine) for its real-time messaging infrastructure. Elixir gave Discord engineers the reliability and concurrency of Erlang with a more familiar syntax and a growing ecosystem of libraries. Discord handles millions of concurrent users with consistent low latency, powered by the same process model that ran telephone switches.
RabbitMQ, one of the most widely deployed message brokers, is written in Erlang. Its reliability and ability to handle massive message throughput come directly from the BEAM runtime properties. CouchDB, the distributed database, is also built on Erlang for the same reasons.
Why These Lessons Apply to AI Agents
AI agent systems share the same fundamental challenges that drove the creation of Erlang. They run many concurrent tasks (like simultaneous phone calls). They depend on unreliable external services (like unreliable network connections). They need to maintain state across failures (like call state during switch restarts). And they need to update their behavior without downtime (like deploying switch software updates).
The mapping is direct. Each AI agent is analogous to a call-handling process: independent, isolated, and supervised. The LLM API is analogous to the trunk line: an external dependency that can fail at any time. The orchestration layer is analogous to the switch controller: responsible for routing, coordination, and recovery. The tool system is analogous to supplementary services: features that enhance the core functionality but can fail independently.
Telecom engineers learned that you cannot prevent all failures in a system this complex. They learned that trying to prevent all failures makes the system more complex and therefore more failure-prone. Instead, they designed systems that expect failures and recover from them automatically, quickly, and predictably.
Applying the Lessons Without Erlang
You do not need to write your AI agent system in Erlang or Elixir to benefit from these lessons. The principles are language-agnostic, even if the implementation is more work in other languages.
Isolate components. Run each agent or subsystem in its own process, container, or at minimum its own thread with independent error handling. Do not let a crash in one component corrupt state in another.
Supervise everything. Every process should have something watching it that can restart it on failure. In Kubernetes, the kubelet restarts failed containers and the Deployment controller replaces failed pods. Under systemd, a service's restart policy does the same job. In application code, this is an explicit supervisor pattern.
Let it crash. Do not write complex error recovery code for every edge case. Handle expected errors (like rate limits) explicitly. For unexpected errors, crash cleanly and let the supervisor restart with fresh state. This produces more reliable systems than attempting comprehensive error handling.
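The split between expected and unexpected errors can be sketched in a few lines of Python. This is an illustration under stated assumptions: RateLimitError and call_with_backoff are hypothetical names, not part of any real client library, and the retry counts and delays are arbitrary.

```python
import time


class RateLimitError(Exception):
    """Hypothetical expected error: the provider asked us to slow down."""


def call_with_backoff(call, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Retry only the expected error; let everything else propagate."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of retries: escalate to whoever supervises us.
            sleep(base_delay * 2 ** attempt)  # Exponential backoff.
        # Deliberately no `except Exception`: an unexpected error ends this
        # task so a supervisor can restart it with fresh state.
```

The absence of a catch-all handler is the point of the sketch. A rate limit is anticipated and gets a specific, simple response, while a KeyError or any other surprise is treated as a bug and allowed to crash the task.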
Design for hot updates. Separate configuration from code. Use external configuration files, environment variables, or configuration services that can be updated without restarting the application. Design agent prompts, tool definitions, and model endpoints to be changeable at runtime.
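One modest way to approximate hot updates is a configuration object that re-reads its file whenever the file changes. The Python sketch below assumes a JSON configuration file; the ReloadableConfig class and the key names in the usage note are hypothetical. It updates settings only and does not reload code the way the BEAM does.

```python
import json
import os


class ReloadableConfig:
    """Re-reads a JSON file whenever its modification time changes."""

    def __init__(self, path):
        self._path = path
        self._mtime_ns = None  # None forces a load on first access.
        self._data = {}

    def get(self, key, default=None):
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # File changed since the last read: load the new contents.
            with open(self._path, encoding="utf-8") as f:
                self._data = json.load(f)
            self._mtime_ns = mtime_ns
        return self._data.get(key, default)
```

An agent would call something like config.get("model_endpoint") on each request rather than caching the value at startup. Writers should replace the file atomically (for example, write a temporary file and then call os.replace) so that a reader never sees a half-written document.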
These principles are not theoretical ideals. They are the distilled experience of forty years of building systems that must never go down, proven in production at scales that dwarf most AI agent deployments. The engineers who built the AXD 301 solved your reliability problems before your problems existed.
Erlang's "let it crash" philosophy, proven by decades of telecom deployment and by systems such as the AXD 301 with its reported 99.9999999% availability, applies directly to AI agent systems. The core lesson is that reliability comes not from preventing failures, but from designing systems that recover from failures automatically through isolation, supervision, and clean restarts.