28 Days - Elixir Node Networking Basics

One of my favorite things about Elixir, some form of magic perhaps, is how simple distributed networking can be. The language itself seems to make establishing networked nodes and sending messages a pain-free exercise. Breaking down networking techniques would take more than a single post, so here I am going to look at some of the basics of networking in Elixir. Specifically, what happens when we connect nodes together?

Nodes in Elixir

In Elixir, a node can be defined as a single running instance of the Erlang VM that has been given a name. There can be multiple nodes running on a single machine. Let’s use this to look at some basics of node communication before diving into what is going on underneath.

# In shell 1
iex --name [email protected]

# In shell 2
iex --name [email protected]

# In shell 1
Node.self() # :"[email protected]"
Node.list() # []
Node.connect(:"[email protected]") # true
Node.list() # [:"[email protected]"]

# In shell 2
Node.list() # [:"[email protected]"]

# Leave open

From the interactive example above, we can see that two nodes are started, each with the name passed to --name. The nodes start out disconnected, but can be connected manually. Once connected, each node is aware of the other’s existence.
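The mutual awareness described above can also be observed as it happens. Here is a minimal sketch, assuming a second node is running: :net_kernel.monitor_nodes/1 asks for {:nodeup, node} and {:nodedown, node} messages. The NodeWatch module and the peer name are made up for illustration, so swap in your own node name.

```elixir
defmodule NodeWatch do
  # Hypothetical helper: wait for the next node up/down notification.
  def await_event(timeout \\ 1_000) do
    receive do
      {:nodeup, node} -> {:up, node}
      {:nodedown, node} -> {:down, node}
    after
      timeout -> :timeout
    end
  end
end

# Ask to receive {:nodeup, node} / {:nodedown, node} messages in this process.
:net_kernel.monitor_nodes(true)

# Placeholder peer name (an assumption); replace it with your second node.
peer = :"test2@peer-host"

IO.inspect(Node.connect(peer), label: "connect")
IO.inspect(NodeWatch.await_event(), label: "event")
```

If the local shell was started without --name or --sname, Node.connect/1 returns :ignored and no event arrives, so the helper simply times out.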

Finally, we can try running code on the other node from the shells above:

# In shell 1
Node.spawn(:"[email protected]", fn -> IO.inspect(Node.self()) end)
#PID<13084.113.0>
:"[email protected]"

In the above example, a function is scheduled to run on :"[email protected]". We can see that the result of the call is a remote pid (its first number is not 0), as well as the output containing the node’s name.

The Node module contains lots of interesting tidbits that are implemented as a pretty thin layer on top of erlang modules. This post won’t go into the ins and outs of node communication, although suffice it to say that communication is generally done in better ways than spawning functions between the nodes.
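One of those better ways is sending a message to a process registered under a name on the target node. The sketch below is only an illustration: the Echo module and the :echo name are hypothetical, and it targets Node.self() so it runs in a single shell. With two connected nodes, you would start Echo on one and pass its node name to Echo.ping/1 from the other.

```elixir
defmodule Echo do
  # Hypothetical helper: start a process registered as :echo that answers pings.
  def start do
    pid = spawn(fn -> loop() end)
    Process.register(pid, :echo)
    pid
  end

  defp loop do
    receive do
      {:ping, from} ->
        send(from, {:pong, Node.self()})
        loop()
    end
  end

  # Address the process as {registered_name, node}; no pid is needed up front.
  def ping(node, timeout \\ 1_000) do
    send({:echo, node}, {:ping, self()})

    receive do
      {:pong, from_node} -> {:ok, from_node}
    after
      timeout -> :timeout
    end
  end
end

Echo.start()

# Uses the local node so the sketch is runnable on its own.
IO.inspect(Echo.ping(Node.self()))
```

The reply carries the name of the node that handled the ping, which makes it easy to confirm where the work actually happened.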

Diving into a connection

Node.connect/1 is a really simple one-liner over :net_kernel.connect_node/1. Let’s dig into this function and some of the ramifications of using it.
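Calling the erlang function directly makes its three possible results easier to see. This is a small sketch rather than anything from the Elixir source: ConnectSketch and the peer name are invented for illustration.

```elixir
defmodule ConnectSketch do
  # Hypothetical wrapper that names the three results of connect_node/1.
  def connect(node) when is_atom(node) do
    case :net_kernel.connect_node(node) do
      true -> {:ok, node}
      false -> {:error, :unreachable}
      # Returned when the local node was started without --name or --sname.
      :ignored -> {:error, :not_distributed}
    end
  end
end

# Placeholder peer name (an assumption); replace it with a real node name.
IO.inspect(ConnectSketch.connect(:"test2@peer-host"))
```

Node.connect/1 returns the same true, false, or :ignored values, so the wrapper only adds more descriptive names.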

Deep inside the erlang OTP libraries, we start to see some interesting code around connections. Specifically, there is code to handle automatic “magical” connections between nodes, versus more explicit connection dependencies.

Digging even further, we discover that net_kernel is actually the registered name of a process on the system. On a node started with a name, running Process.whereis(:net_kernel) will return the pid of this net_kernel process.

The first time that a node is connected to, that connection is not present in the ets lookup table. This leads to setup being called to initialize that connection.

By digging into Process.whereis(:net_kernel) |> :sys.get_state(), it’s possible to see that there is a structure like:

{:listen, #Port<0.609>, #PID<0.49.0>,
    {:net_address, {{0, 0, 0, 0}, 56479}, 'Steves-MBP', :tcp, :inet},
    :inet_tcp_dist}

This state is documented and helps us trace the module that will actually connect our nodes together. From there, we are able to track down the setup code that creates the TCP socket between the nodes.
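For anyone who wants to poke at that state themselves, here is a cautious sketch. The NetKernelPeek module is hypothetical, and the shape of the returned state is an internal detail that may differ between OTP versions, so the code only prints it.

```elixir
defmodule NetKernelPeek do
  # Hypothetical helper: fetch the net_kernel state if the process exists.
  def state do
    case Process.whereis(:net_kernel) do
      # No registered process found; treated here as "not distributed".
      nil -> :not_distributed
      pid -> {:ok, :sys.get_state(pid)}
    end
  end
end

case NetKernelPeek.state() do
  {:ok, state} -> IO.inspect(state, pretty: true, limit: :infinity)
  :not_distributed -> IO.puts("net_kernel is not running; try iex --name or --sname")
end
```

Checking for nil first avoids passing a missing pid to :sys.get_state/1 when the shell was started without a node name.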

One small caveat of connecting nodes is that their “cookies” have to match. A cookie is essentially an atom that is initialized on boot and, by default, read from ~/.erlang.cookie. Finding this was pretty challenging in erlang, but I traced it back to the dist_util module. The code there (even if you can’t read erlang) performs a challenge/reply protocol to ensure that the nodes are allowed to talk to each other.
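To make the challenge/reply idea concrete, here is a simplified model of it. This is not the dist_util code: the CookieSketch module and the cookie values are invented, and the digest shape (MD5 over the cookie text followed by the challenge text) is my reading of the distribution handshake documentation, so treat it as an assumption. The real handshake also has each side challenge the other.

```elixir
defmodule CookieSketch do
  # Assumed digest shape: MD5 of the cookie text followed by the challenge text.
  def digest(cookie, challenge) when is_atom(cookie) and is_integer(challenge) do
    :erlang.md5([Atom.to_string(cookie), Integer.to_string(challenge)])
  end

  # The challenger recomputes the digest with its own cookie and compares.
  def verify(cookie, challenge, reply_digest) do
    digest(cookie, challenge) == reply_digest
  end
end

# The challenger picks a random number and sends it to the peer.
challenge = :rand.uniform(0xFFFFFFFF)

# The peer answers with a digest built from its own cookie.
reply = CookieSketch.digest(:shared_secret, challenge)

IO.inspect(CookieSketch.verify(:shared_secret, challenge, reply)) # true
IO.inspect(CookieSketch.verify(:other_secret, challenge, reply)) # false
```

The point of the exchange is that the cookie itself never crosses the wire; only a digest tied to a one-off challenge does.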

After the connection

Of course, connecting via TCP here is just the beginning. There is significantly more at play: cookies to authenticate nodes and heartbeats to ensure connectivity between nodes, for starters. The distributed erlang guide from learnyousomeerlang.com goes over some of these concepts in detail. I’ll be tackling them in a future post as well.
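As a small taste of the heartbeat side, the tick interval used between connected nodes can be read at runtime. A minimal sketch, assuming a named node: TickSketch is a hypothetical helper around :net_kernel.get_net_ticktime/0, which reports the value in seconds (60 unless it has been configured otherwise).

```elixir
defmodule TickSketch do
  # Hypothetical helper that normalises the possible return values.
  def ticktime do
    if Node.alive?() do
      case :net_kernel.get_net_ticktime() do
        :ignored -> :not_distributed
        {:ongoing_change_to, seconds} -> {:changing_to, seconds}
        seconds when is_integer(seconds) -> {:ok, seconds}
      end
    else
      # Skip the call entirely when the shell has no node name.
      :not_distributed
    end
  end
end

IO.inspect(TickSketch.ticktime(), label: "net_ticktime")
```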

Addendum

It was asked in the Elixir Slack group whether it’s possible to change the distribution mechanism from TCP/IP to something else. It is! It’s fairly involved C code, but there is an example walking through an OS socket level distribution.

In addition, some other drivers are provided out of the box, such as an SSL driver for communicating over SSL.

I don’t really read erlang code, so writing this post was very interesting today. A lot of great things can be learned from digging into the erlang code and module documentation rather than relying solely on the Elixir docs. For instance, the documentation for node networking brings up great points around security and TLS node communication. Dive into the docs and see where you go from there; it might be useful one day when there’s a problem that you just can’t figure out.

Thanks for reading the 5th post in my 28 days of Elixir. Keep up through the month of February to see if I can stand subjecting myself to 28 days of straight writing. I anticipate a few more posts around networking, such as cookie gotchas, distillery release networking, and pg2.
