28 Days - Demo - Single process in a distributed system
I’m really excited about this post, I’ve been thinking about what the demo will look like for several days, but it turned out differently than I had in my head. I’m writing this from 35k feet on my way to vacation; there’s nothing better than being on a plane to hash out technical problems that have been put off. It would be great if the wifi worked though…
Today’s post is going to be a demo + walkthrough of creating a single-process chokepoint in a distributed system. That might not be the best way to explain it, but I’m struggling to find a better one. The end goal is that in a cluster of 3 nodes, I can say “request id 1” and have that go to node 1 every time. I can say “request id 3” and have that go to node 3.
Code / Demo
The code is found at https://www.github.com/sb8244/distributed_process_demo. The repo has instructions for the actual demo and how it should be run.
This code is in the intermediate category, probably. However, it’s short and the tools used are core for building Elixir apps. Don’t be shy to dig in and break it locally to really learn what’s going on!
How it works
Elixir has the ability to connect to other nodes that are accessible on the network. We can use this to spawn several local nodes that simulate a networked environment. The DistributedProcess.connect() function iterates over up to 5 known node names and connects to them. This is a quick hack to get them connected.
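For reference, here is a minimal sketch of what such a connect/0 could look like; the node-naming scheme (n1@127.0.0.1 through n5@127.0.0.1) is my assumption, and the real names live in the repo:

```elixir
defmodule DistributedProcess do
  # Try up to 5 conventional node names; Node.connect/1 simply
  # returns false for any node that isn't up yet.
  def connect do
    for i <- 1..5 do
      Node.connect(:"n#{i}@127.0.0.1")
    end

    Node.list()
  end
end
```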
The application creates two top-level supervisors: a Registry, which allows unique registration of processes (unique by ID here), and a DistributedProcess.Supervisor.
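A supervision tree along those lines might be wired up like this; the module and registry names are illustrative, not necessarily what the repo uses:

```elixir
defmodule DistributedProcess.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Unique keys mean at most one process can register per ID.
      {Registry, keys: :unique, name: DistributedProcess.Registry},
      DistributedProcess.Supervisor
    ]

    Supervisor.start_link(children, strategy: :one_for_one)
  end
end
```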
The DistributedProcess.Supervisor does the majority of the work of distributing our process across the cluster. It accepts an id into the get_worker/1 function, and uses a modulo operation on the number of nodes to determine which node will be the lucky receiver of the request. The size of the cluster is fairly consistent in practice, so this seems acceptable for starters.
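The node selection can be as small as this sketch; sorting the node names gives every node the same stable order, so the modulo math agrees across the cluster (choose_node/1 is my name for it, not necessarily the repo’s):

```elixir
# Pick a node deterministically from the id.
defp choose_node(id) do
  nodes = Enum.sort([Node.self() | Node.list()])
  Enum.at(nodes, rem(id, length(nodes)))
end
```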
When the right node is chosen, the DynamicSupervisor.start_child call creates an instance of a DistributedProcess.Worker on the chosen node, whether local or remote. This Worker is set up to be unique based on the ID, which is really helpful as it allows future calls with the same ID to return the same process.
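A hedged sketch of that step, assuming :rpc is used to reach the remote supervisor (the repo may do this differently); the :already_started clause is what makes repeat calls for the same ID land on the same process:

```elixir
def get_worker(id) do
  node = choose_node(id)
  spec = {DistributedProcess.Worker, id}

  # Start the child under the DynamicSupervisor on the chosen node,
  # or reuse the pid if a worker for this id already exists there.
  case :rpc.call(node, DynamicSupervisor, :start_child, [DistributedProcess.Supervisor, spec]) do
    {:ok, pid} -> pid
    {:error, {:already_started, pid}} -> pid
  end
end
```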
Once a local or remote pid is returned from the DynamicSupervisor, a :request call is executed against that pid. GenServer.call returns the answer synchronously, which is great for the purposes of this demo.
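In sketch form, the public entry point is then roughly:

```elixir
# GenServer.call/2 behaves the same whether the pid is local or
# lives on another node in the cluster.
def request(id) do
  id
  |> get_worker()
  |> GenServer.call(:request)
end
```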
In the DistributedProcess.Worker, the first thing it does is actually tell itself to be destroyed in 5 seconds. This is to make the demo interesting, but it also simulates the use case of a short-lived cache.
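In sketch form, assuming a :die message (the actual message name may differ):

```elixir
def init(id) do
  # Schedule our own shutdown 5 seconds from now.
  Process.send_after(self(), :die, 5_000)
  {:ok, %{id: id, value: nil}}
end

def handle_info(:die, state) do
  {:stop, :normal, state}
end
```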
The handle_call(:request) function in the worker does 2 different things, in 2 different function heads. The first matches when there is a value in the local state; it is simply returned as-is. The second matches when there is no value in the local state; a random integer from 1 to 1000 is selected and placed in the state, along with the node name. This allows us to see that the data is in fact changing every 5 seconds, and which node it came from.
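Those two heads might look like this, with the %{value: ...} state shape assumed from the description:

```elixir
# A value is already cached: return it untouched.
def handle_call(:request, _from, %{value: value} = state) when not is_nil(value) do
  {:reply, value, state}
end

# No value yet: generate one, tagged with the node it came from.
def handle_call(:request, _from, state) do
  value = {Node.self(), :rand.uniform(1000)}
  {:reply, value, %{state | value: value}}
end
```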
All of this is packaged up into the 2 top-level functions that get called: DistributedProcess.connect() and DistributedProcess.request(id).
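An illustrative session tying it together (the output values are made up):

```elixir
iex> DistributedProcess.connect()
iex> DistributedProcess.request(1)
{:"n1@127.0.0.1", 522}
iex> DistributedProcess.request(1) # same id within 5s: cached value
{:"n1@127.0.0.1", 522}
```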
Use Case
It may be desirable to have a single choke point across a cluster to handle a single type of request. For instance, maybe a certain tenant should only execute on a single server. This ensures that the requests for that tenant are serial (non-parallel).
My use case is to cache requests to a certain resource/id pair for 30-60 seconds. I want to make a single request, then return that value for the lifetime of the cache entry. I do not want to introduce a DB layer just for this caching, as maintaining one invites logic errors around cache expiration and checking.
Final Thoughts
I’m most excited about how this code doesn’t really feel remote. It can run on a local system just as well as in a clustered system, and it scales with cluster size with no additional work. That is pretty fun!
I don’t know how I would easily test this code; not having internet kept me from testing it today. However, I will probably look into how to test it and write about that in the future. I have had several requests for testing posts, especially ones involving complex setups.
I do not know, yet, how I would handle a situation of fetching an array of ids. I would want the cache to be distributed, but there is a challenge in not having the same id set each time. This sort of breaks the technique, but it doesn’t mean it’s not valuable for single / deterministic ids.
Thanks to my co-worker Ben for bouncing this idea around at SalesLoft. We haven’t placed the technique in production yet, but he and I were both really excited to talk about it. I’m curious to see how he would approach this type of problem, as I know he had more elegant solutions in mind.
Thanks for reading the 8th post in my 28 days of Elixir. Keep up through the month of February to see if I can stand subjecting myself to 28 days of straight writing. I am looking for new topics to write about, so please reach out if there’s anything you really want to see!