Designing Elixir Supervisor Trees

Over the past year, I’ve been getting more and more into Elixir. The language and paradigm has shifted how I think about coding in Ruby and has been incredibly fun to work with. One of the core concepts of an Elixir application is the supervision tree. Let’s dive into some patterns that I typically use, along with some real code.

Supervisors and Processes

A process in Elixir is an isolated unit of execution. Processes have their own stack, “run queue”, and garbage collection. Because of this level of isolation, it helps to think of them as small programs that can talk to other small programs in an application. Processes are crucial for storing and accessing dynamic data, as well as for implementing transformations around data. A process executes all operations in its run queue sequentially, but multiple run queues can execute concurrently (leading to true concurrency). If you are curious about this specifically, I find Hamidreza Soleimani’s blog post fascinating.

A supervisor is a process that manages the lifecycles of child processes. This most often comes through as a fault tolerance mechanism. If a process crashes, the parent supervisor can choose to restart it, let it stay dead, restart it and all peers, etc. A supervisor can supervise other supervisors, which leads to a supervision tree.

Modeling a Problem

The problem that I’m going to discuss today is the model of a game server for the game generals.io. This is a turn-based “real-time” strategy game that involves capturing towns with armies and overtaking the enemy generals. Users can queue up moves to execute sequentially over turns. This allows complex moves to be planned out and executed without losing time.

In this problem, there are a few concepts that stick out to me:

Board: A board is the representation of what the current state of the game is. This would include the cell types, the state of the armies, the fog of war, etc
Command queue: This stores the entire set of commands for the game. It is write-ahead and knows what commands have been executed and which are in the future (and on which turn)
Player list: Not as obvious as the others, but we need to know what players are in a game, and what the mapping is to their login
Game: The game contains all of the above things and ticks every so often, progressing the gameplay along
Game list: Building an online game requires supporting multiple games at once

I’m going to walk through each of these concepts (in order) and explain the approach I took in implementing the supervision tree. First, let’s see what that supervision tree might look like.

Supervision Tree for Our Game

What I’m about to present doesn’t come immediately (well, maybe with more practice), and is something that came about through iterations and testing the waters with different code. This is the final product, which I’ll go into more detail on. The [] syntax indicates what the children are.

Application -> [GamesSupervisor, GameRegistry, WebEndpoint]
GamesSupervisor (1 child type) -> [GameSupervisor (dynamic)]
GameSupervisor (4 child type) -> [BoardServer, CommandQueueServer, TickServer, PlayerServer]

BoardServer [code]

When building the game, I started with the board object and naturally progressed into the BoardServer. I feel that starting on the inner nodes of the supervisor can make it more obvious what the next steps are, rather than trying to plan out everything at once.

Notes on the code:

I’ve found myself trying to keep servers as simple as possible, and executing complex logic in plain old modules which don’t have any GenServer capability. When looking at the BoardServer, I’m happy to see that there is no real logic in it, other than turn + 1.

Defining a simple interface is crucial for communication between GenServers. I’ve found myself getting stuck in between more verbose interfaces in the past and always regret whatever brought me there.

You’ll notice that I don’t use any of the GenServer helpers to remove boiler plate. I’ve found that keeping it to pure Elixir makes it easier to read in the long run, and increases my consistency across projects.

CommandQueueServer [code]

Following the theme of keeping GenServers simple, this one is even more simple. When I’m dealing with a GenServer that really only touches one thing, like this one dealing with a Queue implementation, I do like to reach for Agent. Using an Agent removes the need to build a real GenServer implementation, and increases the readability significantly.

I don’t find myself reaching for an Agent often, and will pull it out to a regular GenServer if it gets too complex.

PlayerServer [code]

This is another simple Agent implementation that I feel like got a little bit sloppy. This is due to the use of a Map rather than a custom data structure. To clean up this code, I would look at extracting the Map behavior into a specific PlayerList module.

TickServer [code]

I love this TickServer. One of my favorite lines of code in Elixir has been:

Process.send_after(self(), :start_tick, timeout)

This code is going to send a message from one process to itself, but set for the future. This allows creation of non-blocking (while the waiting is happening) GenServer implementations that are implemented entirely using native Elixir concepts.

Another interesting point here is that the TickServer accepts a ticker_fn in the server initialization. We’ll see this pop up in our GameServer later. I thought this technique was really interesting (and was sort of surprised it worked). It works due to the fact that a fn in Elixir can be passed around like any other variable, and Elixir is stateless which means that the fn doesn’t get any context from where it is defined. Be warned, however, that executing a fn in the TickServer is going to block the TickServer until that function completes; it is not going to execute in the server that defined the fn.

GameSupervisor [code]

There’s a lot going on in this supervisor, so let’s take it slow and just look at a few specific concepts.

The first thing to point out is the use of Registry:

def start_link(opts = %{game_id: id}) do
  Supervisor.start_link(__MODULE__, opts, name: {:via, Registry, {get_registry_name(), id}})
end

Using the Registry allows for a named dynamic lookup / uniqueness of a GenServer. The supervisor is defined by the id passed in, which would prevent two games from ever sharing the same id at the same time. As you can imagine, that would be bad. We can take advantage of the uniqueness as well, because another GenServer that is started with this id will return the pid in the start_link call. We’ll see this through in the GamesSupervisor below.

Another useful code snippet is for looking up a process by type under a supervisor:

defp find_child_type(sup_pid, type) do
  Enum.find(Supervisor.which_children(sup_pid), {nil, nil, nil, nil}, fn({mod, _pid, _type, _}) ->
    mod == type
  end) |> elem(1)
end

I’ve begun putting this in a Utils namespace in my new projects, but the gist is simple. By using the supervisor’s pid, we can list out all of its children and find the one with a given type (or base it on other things). GameSupervisor only has 1 child of each type, which makes this work well.

Our trust ticker_fn is defined in this GameSupervisor. The reason I took this approach is that the knowledge of how to tick should belong to something separate from what the ticking behavior is. Looking at this now, I would probably pull this code into another module since it’s so long.

GamesSupervisor [code]

We’ve finally made it to the top of the supervision tree with the GamesSupervisor. This supervisor takes advantage of the GameRegistry to provide game lookups:

def get_game(id, opts \\ []) do
  options = Keyword.merge([name: __MODULE__], opts)
  Supervisor.start_child(options[:name], [%{game_id: id}]) |> case do
    {:error, {:already_started, pid}} -> pid
    {:error, _} -> nil
  end
end

In this case, the GameSupervisor is identified by the id, which will return when {:error, {:already_started, pid}} is returned from the start_child call. I use this pattern in nearly every dynamic supervisor that I’ve written. I do typically implement start/get in the same method (starting if it doesn’t exist, getting if it does), but I found that to not work well as creating a new game involves several complex operations such as board generation.

Registry

If you use a Registry, you do need to start it as a process in your application. I generally just include it at the top level of my Application module.

Final Thoughts

One good sign that your supervision tree has come together well is a very simple mounting of it onto your final Application. Writing supervisor(Generals.GamesSupervisor, []) seems like a very clean interface for interacting with my game supervision tree, and I feel good about that.

Another key thing to think of when writing your supervision tree is what can / will execute concurrently. For instance, let’s say that your PlayerList is implemented in the same GenServer as the Board. If you want to access the PlayerList to see who is in a game, it would not be able to run concurrently with Board operations. By splitting them up into separate GenServer implementations, they can execute concurrently (keep in mind that Elixir automatically distributes across cores).

I hope this helped or solidified GenServer concepts for you. Please feel free to reach out with any cool tips or tricks.

View other posts tagged: engineering elixir