Erlang and Elixir to build fault-tolerant systems

Erlang is a concurrency-oriented programming language, that aims to make programming distributed, fault-tolerant systems feasible and enjoyable. This sounded interesting to me because we are living in a more and more distributed (microservices) world and I started looking at the language and its modern cousin Elixir. Here is what I learned about creating distributed, fault-tolerant systems.

Distributed systems are a pain in the ass. My favorite definition of a distributed system is this one from Leslie Lamport from 1987:

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Distribution brings with it problems like random failures in parts of the system, that would never show up in a centralized system. Erlang is a language that promises to make build fault-tolerant systems easier, so I thought there probably are some lessons in there I can transfer to more commonly-used languages.

After going through the book Programming Elixir I was left wondering about some of the design decisions in Erlang, most notably

How are immutable data structures connected to fault-tolerance?
Why Actors to model the world?
Why should you “Let it crash” instead of handling errors where they occur?

Luckily the designer of the language, Joe Armstrong, wrote it all up in his doctoral thesis that was written years after Erlang became popular (2003) and there is an SE Radio interview with Joe Armstrong. Here are my insights:

A boring language
Building a fault-tolerant system in the presence of errors
Modeling a concurrent world
Ensuring a valid state to recover from
Ensuring recoverability from outside
Why my process should crash so my program is more reliable
- Let it crash in mainstream languages
Conclusion
Add-on: Simple concepts for a complex world

A boring language

Erlang looks the way it does, because Joe liked Prolog when he was starting to solve the problem of fault-tolerant computing in the presence of errors. Initially he implemented actors for Prolog, but soon hit barriers in the language and only then decided to create one himself. On page 30 of his thesis there is a passage that nicely characterizes the sequential part of the language, the part where, so to speak, most of the characters go:

The bulk of the language, and possible the least interesting part of the language is the sequential subset of the language. This sequential subset can be characterized as a dynamically typed, strict functional programming language, which is largely free from side-effects. In the sequential subset there are a few operations with side-effects, but they are virtually never needed.

After reading the examples in the thesis I can concede: Erlang as a language is very simple and thus easy to learn.

Building a fault-tolerant system in the presence of errors

This question has been answered in 1985, in this paper by TANDEM Computers: Why do computers stop and what can be done about it?

It is “an analysis of failure statistics of commercially available fault-tolerant systems” showing that “administration and software” are the major contributors to failure and most faults are transient, i.e. disappear after a system restart. The basic conclusion is that, while hardware failures are pretty well understood and can be mitigated, software and humans are far more likely to crash a system. The authors pessimistically state that hardware is likely to get more reliable, but software and humans are not. Applying the solutions that work for hardware to software yields the following principles (p. 15) for building fault-tolerant systems in the presence of errors:

Software modularity through processes and messages (Actor model)
Transaction mechanisms to provide data and message integrity (Immutability)
Process-Pairs to tolerate hardware and transient software faults (Supervision)
Fault containment through fail-fast software modules (Let-it-crash philosophy)

In brackets you can see which design decisions of Erlang the principles lead to. Let’s look at each of them in detail.

Modeling a concurrent world

When you write something that interacts with the real world, you have to somehow use a model to represent the real world in the programming language. One model to that is prevalent in most programming languages is that things happen sequentially, one after another. This is because obviously it is the way CPUs work, processing one instruction at a time. It does not really fit the real world though: While sitting here and typing this blog post, my body simultaneously breathes, circulates blood, types and processes visual and audio input from my surroundings. In the real world, things very rarely happen one after another and most of the time, very many different things happen at the same time. Because of the assumption of a sequential world, most programming languages provide means to make it easy to program sequentially, and programming concurrently is only an additional functionality that is not the primary goal. This is why Joe saw the need for a concurrency oriented programming language (page 20), that makes it easy to model things that happen concurrently.

The thesis contains a short design guide how to model a part of the real world using this assumption (page 21):

Identify all truly concurrent activities
Identify all message channels between activities
Write down all the messages that can flow on the channels
Implement the program using a 1:1 mapping between concurrent activities and independent processes

This approach leads to modeling the world using Actors that act independently and concurrently and communicate by passing message with no shared state. I admit that in Java, this would be rather difficult and I would rather model my domain around classes and state, with the interactions happening sequentially avoiding concurrency altogether. Erlang has several features to make this easier:

processes are run by the Erlang VM (not the operating system) in complete isolation from each other
one failing process can never interfere with others
processes can only communicate via message passing, shared state is not possible
processes are cheap: creating hundreds of thousands (to represent real-world concurrency) is possible even on modest machines

The Actor model makes no assumptions whether state is immutable or not (like in Akka). Why did Erlang choose immutability?

Ensuring a valid state to recover from

If a computer that runs a process in our system fails catastrophically, there is no way for inside the computer to recover itself or any of its state. The only way that works (and was tested thoroughly in hardware development) is to have another machine supervising the process, to notice the failure and recover the machine. The rebooted process will need a known valid state to restart from, and this is trivially easy with immutable state: simply copy it over. With mutable state the same is very hard to achieve, because of possible dangling pointers, data structures that have been modified by different parts of the system etc.

The next question is, how do I enable my system to be recovered from the outside?

Ensuring recoverability from outside

In a way Erlang lives in the days of Egyptian pharaohs building the pyramids: Supervisor oversee the work, but do not do any themselves. Workers do the actual work, and if one of them dies (e.g. because a huge stone fell on them), the supervisor just starts a new one (from a presumably endless supply of slaves).

The design rationale is the same as for immutability: If the machine a process is on fails, there is no way to recover from inside the machine, only outside. If the worker fails, the supervisor running on a different machine is notified and can act on the error. Since most faults are transient, restarting the worker from a known good state, solving the problem in the vast majority of cases.

This requires a very specific design philosophy when structuring the application (page 118): All work is structured hierarchically, with a supervisor at the root and hierarchies of supervisors supervising part of the work. The farther down the tree you go, the simpler and more concrete the work becomes. If a worker is unable to perform its work, the supervisor may decide to try a simpler alternative, or restart. If all alternatives are exhausted and the lowest level task can not be performed, the system fails (together with an informative error message).

For this supervision hierarchy to be work, the processes must not work defensively and avoid failing, but instead should either do what they intend to do, or fail. Wait what?

Why my process should crash so my program is more reliable

For me this was the weirdest part about Erlang: Why should it help with fault-tolerance, if my code throws its feet up in the air when encountering an error, instead of trying to handle it? This has at its root the problems how programmers react to unspecified behavior and interference between software parts.

You can look at the problem this way: What makes processes crash in the first place? Undefined behavior. Undefined behavior can be a) because the run-time system does not know how to handle a situation, e.g. a file it is supposed to open does not exist. Or b) because the programmer does not know how to handle a situation, because the specification does not tell her how to, because it is incomplete. In case a) you want the programmer to be able to catch the exception and act on it, because she anticipated it.

In case b), it is very dangerous for the programmer to handle the error where it occurs. In most cases, there is no way to decide which behavior should be the correct one on the spot, and programmers work from (often wrong) assumptions. From page 126

“Unfortunately many programmers take this as an opportunity for creative guess-work, and try to guess what the designer ought to have said.”

In Erlang, if a programmer encounters a situation where an error is expected to happen as per specification, this can of course be handled and the program continues its job ^[1]. If a programmer encounters an unspecified situation, she should not employ creative guess-work how it could be handled, but instead should make the program signal the error to its supervisor and terminate. Thanks to supervision, this termination does not crash the system, but can be corrected by the supervisor, e.g. by restarting the process, resolving to a simpler strategy, or failing as well. As was already noted in the TANDEM paper, most software faults are transient, so a simple restart of the process fixes most errors.

This runs completely contrary to what I would normally do in Java, where every uncaught exception risks to bring down the whole system. In Erlang this is no issue, because all work is managed by supervisors and terminating one process does not affect any of the others.

Armstrong gives an example of this philosophy on page 151, where a message buffer gets a too short expected length of a message and the buffering logic does not handle the case where the message is longer than the buffer:

collecting(Buff0, {Need, Len, Buff1}) ->
    L = length(Buff0),
    if
        L + Len < Need ->  #continue collecting if we expect more
            {next_state, collecting,
                {Need, Len+L, Buff1++Buff0}};
        L + Len == Need -> # all collected, finish
            Buff = Buff1 ++ Buff0,
            io:format("Got data:~s~n", [Buff]),
           {next_state, waiting, nil}
        # see, no error handling for too-long message 
end.

This highlights the conceptual problem: How is the program supposed to handle this case? Should we silently cut off the extraneous characters? Or extend the buffer? Or throw an error message, but which one? Instead of making exactly this ad-hoc decision, we just do not specify the behavior, and when the unspecified behavior is encountered the following happens:

Packet assembler starting
=ERROR REPORT==== 3-Jun-2003::12:38:07 ===
** State machine my_simple_packet_assembler terminating
** Last event in was "oops"
** When State == collecting
**
Data == {3,0,[]}
** Reason for termination =
** {if_clause,[{packet_assembler,collecting,2},
        {gen_fsm,handle_msg,7},
        {proc_lib,init_p,5}]}

Compare this to Java’s Stacktrace hell! Without any effort on the programmer’s part, the language (environment) gracefully handled the terminating worker by restarting it, gave us very insightful information on the state and the incoming data. You can imagine how even in a large production system this already narrows the scope for error location a lot.

Not programming defensively has the added advantage that the code need only cover the “happy path”. The code above covers only the “business logic”. Compare this to Java where sometimes 1 line of useful code is buried beneath 10 lines of null-checking and data-sanity asserting guards to prevent system crashes.

This example is from a toy project Joe uses to explain the concept of fail-fast processes, but does this also work in production? To find evidence that it does, he looked into production data from the AXD301 software and found an error report (page 177) of a process for performance measurement, that receives an unknown hardware code and subsequently crashes:

...
error_info: {function_clause,[{etcpPrm,get_hwm_base,[ne_cv_l_ga,et2]}, ... }
...

The error report contains precisely the location (module etcpPrm, function get_hwm_base) and input ([ne_cv_l_ga, et2]`) that led to the crash. Joe trivially verifies it is because a pattern matching clause does not handle the hardware code. So this thing is crashing in a production system and I really love how the engineers react to it:

Priority: C:3 M, Minor fault or opinion: no traffic disturbance … The fault has been detected, but it will not be released in R7D unless it is important for a customer.

Joe also looked at the code and was delighted to see it looks exactly like recommended in the example above: Not defensive, not containing any special handling for unknown hardware codes or other unspecified inputs. Without any error handling logic, Erlang not only survives the crash, but also produces an informative error message that helps to easily handle failure.

Let it crash in mainstream languages

The TANDEM paper already points out the biggest advantage hardware designers have over software developers: Hardware designers have made good experience with letting components fail-fast, signaling an error and simply stopping to work. It is a culturally accepted and expected technique in hardware design. In his thesis Armstrong gives a few slogans for this philosophy for software development (page 104):

Let some other process do the error recovery.
If you can’t do what you want to do, die.
Let it crash.
Do not program defensively.

The TANDEM paper noted in 1985 that fail-fast programming is not very common in software systems, because “conventional languages do not permit different software modules to co-exist in such a way that there is no interference between modules.” This is true on the JVM.

Conclusion

Reading the thesis and learning about the design decisions and rationales for creating Erlang the way it is, was very insightful. I do not know if I will ever use Erlang (or its modern cousin Elixir) in production, but I take away that

Fault-tolerant systems need distribution
Do not try to handle faults at low levels where no appropriate action can be taken
Most faults are transient and if I design my system to expect and handle failing parts overall reliability increases

Now have fun coding!

Add-on: Simple concepts for a complex world

Because Erlang is supposed to run systems that never have downtime, it has a feature to replace code in running systems and migrate state. Before reading the thesis I thought this would be implemented by some very complicated mechanisms bordering on magic. But like most other parts of Erlang, it turns out to be surprisingly simple. The resulting code can be found on page 98 and basically receives a function object to execute and calls the next iteration of the event loop with this function instead of the old one. Simple as that, implemented in 3 lines (with the complete server code filling just one page).