28 Days - Demo - Single process in a distributed system
I’m really excited about this post, I’ve been thinking about what the demo will look like for several days, but it turned out differently than I had in my head. I’m writing this from 35k feet on my way to vacation; there’s nothing better than being on a plane to hash out technical problems that have been put off. It would be great if the wifi worked though…
Today’s post is going to be a demo + walkthrough of creating a single-process chokepoint in a distributed system. That might not be the best way to explain it, but I’m struggling to find a better one. The end goal is that in a cluster of 3 nodes, I can say “request id 1” and have that go to node 1 every time. I can say “request id 3” and have that go to node 3.
Code / Demo
The code is found at https://www.github.com/sb8244/distributed_process_demo. The repo has instructions for the actual demo and how it should be run.
This code is in the intermediate category, probably. However, it’s short and the tools used are core for building Elixir apps. Don’t be shy to dig in and break it locally to really learn what’s going on!
How it works
Elixir has the ability to connect to other nodes that are accessible on the network. We can use this to spawn several local nodes that simulate a networked environment. The DistributedProcess.connect() function iterates over up to 5 known node names and connects to them. This is a quick hack to get them connected.
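For reference, here is a minimal sketch of what such a connect/0 could look like; the node-naming scheme (n1@127.0.0.1 through n5@127.0.0.1) is my assumption, and the real names live in the repo:

```elixir
defmodule DistributedProcess do
  # Try up to 5 conventional node names; Node.connect/1 simply
  # returns false for any node that isn't up yet.
  def connect do
    for i <- 1..5 do
      Node.connect(:"n#{i}@127.0.0.1")
    end

    Node.list()
  end
end
```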
The application creates two top-level supervisors: a Registry, which allows unique registration of processes (unique by ID here), and a DistributedProcess.Supervisor.
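A supervision tree along those lines might be wired up like this; the module and registry names are illustrative, not necessarily what the repo uses:

```elixir
defmodule DistributedProcess.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Unique keys mean at most one process can register per ID.
      {Registry, keys: :unique, name: DistributedProcess.Registry},
      DistributedProcess.Supervisor
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```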
The DistributedProcess.Supervisor does the majority of the work of distributing our process across the cluster. It accepts an id into the get_worker/1 function, and uses a modulo operation on the number of nodes to determine which node will be the lucky receiver of the request. The size of the cluster is fairly consistent in practice, so this seems acceptable for starters.
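The node selection can be as small as this sketch; sorting the node names gives every node the same stable order, so the modulo math agrees across the cluster (choose_node/1 is my name for it, not necessarily the repo’s):

```elixir
# Pick a node deterministically from the id.
defp choose_node(id) do
  nodes = Enum.sort([Node.self() | Node.list()])
  Enum.at(nodes, rem(id, length(nodes)))
end
```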
When the right node is chosen, the DynamicSupervisor.start_child call creates an instance of a DistributedProcess.Worker on the chosen node, whether local or remote. This Worker is set up to be unique based on the ID, which is really helpful as it allows future calls with the same ID to return the same process.
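A hedged sketch of that step, assuming :rpc is used to reach the remote supervisor (the repo may do this differently); the :already_started clause is what makes repeat calls for the same ID land on the same process:

```elixir
def get_worker(id) do
  node = choose_node(id)
  spec = {DistributedProcess.Worker, id}

  # Start the child under the DynamicSupervisor on the chosen node,
  # or reuse the pid if a worker for this id already exists there.
  case :rpc.call(node, DynamicSupervisor, :start_child, [DistributedProcess.Supervisor, spec]) do
    {:ok, pid} -> pid
    {:error, {:already_started, pid}} -> pid
  end
end
```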
Once a local or remote pid is returned from the DynamicSupervisor, a :request call is executed against that pid. GenServer.call returns the answer synchronously, which is great for the purposes of this demo.
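In sketch form, the public entry point is then roughly:

```elixir
# GenServer.call/2 behaves the same whether the pid is local or
# lives on another node in the cluster.
def request(id) do
  id
  |> get_worker()
  |> GenServer.call(:request)
end
```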
In the DistributedProcess.Worker, the first thing it does is actually tell itself to be destroyed in 5 seconds. This is to make the demo interesting, but it also simulates the use case of a short-lived cache.
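In sketch form, assuming a :die message (the actual message name may differ):

```elixir
def init(id) do
  # Schedule our own shutdown 5 seconds from now.
  Process.send_after(self(), :die, 5_000)
  {:ok, %{id: id, value: nil}}
end

def handle_info(:die, state) do
  {:stop, :normal, state}
end
```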
The handle_call(:request) function in the worker does 2 different things, in 2 different function heads. The first matches when there is a value in the local state; it is simply returned as-is. The second matches when there is no value in the local state; a random integer from 1 to 1000 is selected and placed in the state, along with the node name. This allows us to see that the data is in fact changing every 5 seconds, and which node it came from.
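Those two heads might look like this, with the %{value: ...} state shape assumed from the description:

```elixir
# A value is already cached: return it untouched.
def handle_call(:request, _from, %{value: value} = state) when not is_nil(value) do
  {:reply, value, state}
end

# No value yet: generate one, tagged with the node it came from.
def handle_call(:request, _from, state) do
  value = {Node.self(), :rand.uniform(1000)}
  {:reply, value, %{state | value: value}}
end
```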
All of this is packaged up into the 2 top-level functions that get called: DistributedProcess.connect() and DistributedProcess.request(id).
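An illustrative session tying it together (the output values are made up):

```elixir
iex> DistributedProcess.connect()
iex> DistributedProcess.request(1)
{:"n1@127.0.0.1", 522}
iex> DistributedProcess.request(1) # same id within 5s: cached value
{:"n1@127.0.0.1", 522}
```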
Use Case
It may be desirable to have a single choke point across a cluster to handle a single type of request. For instance, maybe a certain tenant should only execute on a single server. This ensures that the requests for that tenant are serial (non-parallel).
My use case is to cache requests to a certain resource/id pair for 30-60 seconds. I want to make a single request, then return that value for the lifetime of the cache entry. I do not want to introduce a DB layer just for this caching, as maintaining one invites logic errors around cache expiration and checking.
Final Thoughts
I’m most excited about how this code doesn’t really feel remote. It can run on a local system just as well as in a clustered system, and it scales with cluster size with no additional work. That is pretty fun!
I don’t know how I would easily test this code; not having internet kept me from testing it today. However, I will probably look into how to test it and write about that in the future. I have had several requests for testing posts, especially ones involving complex setups.
I do not know, yet, how I would handle a situation of fetching an array of ids. I would want the cache to be distributed, but there is a challenge in not having the same id set each time. This sort of breaks the technique, but it doesn’t mean it’s not valuable for single / deterministic ids.
Thanks to my co-worker Ben for bouncing this idea around at SalesLoft. We haven’t placed the technique in production yet, but he and I were both really excited to talk about it. I’m curious to see how he would approach this type of problem, as I know he had more elegant solutions in mind.
Thanks for reading the 8th post in my 28 days of Elixir. Keep up through the month of February to see if I can stand subjecting myself to 28 days of straight writing. I am looking for new topics to write about, so please reach out if there’s anything you really want to see!