Distributed In-Memory Caching in Elixir

I’ve been working on implementing our PushEx server at SalesLoft (a bit later than I thought I’d have time to), and one of the challenges has been mapping the user identity in our authentication token to a legacy identity that we use in our push channels. This is as simple as an HTTP call, but it is also something that could potentially burst very hard and cause a large number of downstream HTTP calls. In order to help alleviate this, the identity -> secondary identity mapping will be cached.

I would typically just throw a cache in Redis or Memcached, but I really wanted to reach for something simpler in this situation. My goals for this project are:

  • No external databases introduced (there are none currently on our PushEx implementation)
  • Caching can be persistent between deploys (we roll our deploys so some pods are always online)
  • Caching is conceptually simple, in case it needs to be debugged by another engineer in the future
  • Identity mappings will be stable between services, so handling time variation between nodes is not required. However:
  • Caching will use a TTL to prevent really stale values from being used

I’m going to walk through the different options I looked at and what I ended up with. The options I passed on aren’t necessarily bad; they just ended up being operationally or conceptually more difficult than I needed.

Swarm

There was a great talk in the Atlanta Elixir Meetup recently about using Swarm to manage processes across a cluster. I really liked a few things that the presentation demonstrated:

  • Processes are stable such that restarting a node with the same identity will put the process back on that node
  • Processes are capable of handing themselves off to another node

My initial plan was to put Cachex or another store in sharded processes that are distributed across the cluster. However, I soon realized that passing off this ETS state could be fairly complex, so I ended up treating the cache as a simple in-process map. That was actually fine with me, but some other things didn’t work out.

The biggest issue I ran into was that process hand-off became fairly complex to implement; I never actually got it working properly after ~4 hours of work. Another big issue is that there is no process state replication. This means that my shard processes would have to hand their state off fully to another node between the time the node is told to shut down and the time it actually terminates. If the node were forcibly killed before then, the data would be lost and the cache would have to be re-populated.
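For context, this is roughly the hand-off protocol Swarm expects a process to implement, per its README (a hypothetical sketch; CacheShard and its state handling are illustrative, not code I shipped):

defmodule CacheShard do
  use GenServer

  # Started across the cluster via Swarm.register_name/5, pointing at a
  # function that starts this process and returns {:ok, pid}.
  def start_link(shard), do: GenServer.start_link(__MODULE__, shard)

  def init(_shard), do: {:ok, %{}}

  # Swarm asks for the state before relocating the process to another node.
  def handle_call({:swarm, :begin_handoff}, _from, state) do
    {:reply, {:resume, state}, state}
  end

  # The freshly started process on the new node receives the handed-off state.
  def handle_cast({:swarm, :end_handoff, handed_off}, _state) do
    {:noreply, handed_off}
  end

  # Two nodes ended up owning the same shard (e.g. after a netsplit); merge.
  def handle_cast({:swarm, :resolve_conflict, other}, state) do
    {:noreply, Map.merge(state, other)}
  end

  # Swarm tells the old process to exit once hand-off completes.
  def handle_info({:swarm, :die}, state), do: {:stop, :shutdown, state}
end

Each of these clauses has to behave correctly under shutdown, netsplit, and conflict scenarios, which is where the complexity comes from.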

I ended up moving on to the next option because it seemed like I was building a solution I didn’t need. That will be a theme in this blog post: there is nothing particularly wrong with technology X, but the trade-offs it brings may outweigh the benefits for the particular use case I’m working with.

Delta CRDT

After I talked with my colleague Dan about this topic (he will be presenting on distributed state at Lonestar ElixirConf and it will be awesome), he suggested looking at DeltaCRDT as a potential solution. I really liked this library for a few reasons:

  • State is replicated across the cluster, so shutdowns are not frantic
  • CRDT gives a lot of benefit around time variation in the cluster (node A has a different value than node B at a point in time)

We ended up getting a working solution that used the DeltaCRDT library. Our code looked something like this (don’t use this code):

defmodule SalesloftPusher.AccountLookupCache.Monitor do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  def init(_) do
    # Subscribe to :nodeup / :nodedown messages for the cluster
    :net_kernel.monitor_nodes(true)

    # Find the CRDT process on every currently connected node and add them as
    # neighbours (this match assumes no unreachable nodes; nodes where the
    # cache isn't registered yet would return nil)
    {neighbors, []} = :rpc.multicall(Node.list(), Process, :whereis, [AccountLookupCache])
    DeltaCrdt.add_neighbours(AccountLookupCache, neighbors)

    {:ok, []}
  end

  # Callbacks

  # A node joined the cluster; try to connect to its cache process,
  # retrying until its application has booted
  def handle_info({:nodeup, node}, state) do
    handle_info({:retrynodeup, node, 0}, state)
  end

  def handle_info({:retrynodeup, node, count}, state) do
    pid = :rpc.call(node, Process, :whereis, [AccountLookupCache])

    if pid == nil do
      IO.puts "Node is up, but app not booted, retry = #{count}"
      Process.send_after(self(), {:retrynodeup, node, count + 1}, 500)
    else
      IO.puts "Node is now up #{node} #{inspect(pid)}"
      DeltaCrdt.add_neighbours(AccountLookupCache, [pid])
    end

    {:noreply, state}
  end

  # Nothing to do when a node goes down
  def handle_info({:nodedown, _node}, state) do
    {:noreply, state}
  end
end

Usage:

DeltaCrdt.read(AccountLookupCache)
DeltaCrdt.mutate(AccountLookupCache, :add, ["1111/2", "4"])
DeltaCrdt.read(AccountLookupCache)
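For completeness, the AccountLookupCache CRDT itself would be started elsewhere in the supervision tree, roughly like this (a sketch based on the DeltaCRDT README; option names have changed across versions):

{:ok, _pid} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap, name: AccountLookupCache)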

This is a pretty slick library and the setup was fairly simple. This code may have some edge cases in it, but we ran into performance issues with larger data sets and moved on to another solution. While we did end up moving on from it, the author has been working hard on a refactor to improve the speed of the data structure. What he is doing is way beyond my understanding of CRDTs and is pretty inspiring open-source work.

This issue got me thinking, and I realized that I didn’t need a lot of the benefits of the CRDT. I wanted replication across the cluster, but my values are stable, so time variations won’t be a factor. I would most likely have stuck with DeltaCRDT if that were a factor, but I ended up moving on to my current solution.

Cachex + pg2 Replication

My final solution involves a tried-and-true combination of Cachex + pg2. I’ve written about pg2 in the past and have used it successfully in production on several projects. It essentially lets us place our cache processes in a named group across the cluster and reference the remote cache processes as a list of pids.

The solution presented below uses Cachex for all local set/get/stats, and passes messages containing sets to the rest of the cluster using send. When the cache process starts, it asks all of its pg2 neighbors for a dump of their state and then writes that into Cachex; a rough sketch follows below. Here are some strengths and weaknesses of the solution:

  • + Boot-based (rather than shutdown-based) replication, so nodes do not lose data when they go down
  • + Cachex for local cache management (so we get all of that library’s benefits)
  • + Efficient writing and loading of an export; in testing it took less than 3 seconds for a 1,000,000-key cache locally (longer across the network)
  • - The entire cluster being down will cause cache data loss and an increase in misses
  • - There is no consensus on what the right cache state is; it’s best-effort
  • - Possible flood of binary across network on boot with many nodes

The biggest disadvantage is the last one, and I think it will be fixed before I take this code into production. It is purely an effort-based blocker (I need to write the code) and conceptually the fix will work just fine.
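To make the shape of this concrete, here is a minimal hypothetical sketch (not the code from the gist below; the module, group, and cache names are illustrative, and it assumes Cachex’s export/import functions for the whole-cache dump):

defmodule ReplicatedCache do
  use GenServer

  # The Cachex cache itself (@cache) is started elsewhere in the supervision tree.
  @group :lookup_cache_group
  @cache :lookup_cache
  @ttl :timer.hours(24)

  def start_link(_) do
    GenServer.start_link(__MODULE__, [], name: __MODULE__)
  end

  # Write locally, then send the set to every other cache process in the group.
  def put(key, value) do
    Cachex.put(@cache, key, value, ttl: @ttl)

    for pid <- :pg2.get_members(@group), pid != Process.whereis(__MODULE__) do
      send(pid, {:replicate_put, key, value})
    end

    :ok
  end

  def get(key), do: Cachex.get(@cache, key)

  def init(_) do
    :ok = :pg2.create(@group)
    :ok = :pg2.join(@group, self())

    # Ask existing members for a dump of their cache so this node boots warm.
    for pid <- :pg2.get_members(@group), pid != self() do
      send(pid, {:send_dump, self()})
    end

    {:ok, []}
  end

  def handle_info({:replicate_put, key, value}, state) do
    Cachex.put(@cache, key, value, ttl: @ttl)
    {:noreply, state}
  end

  def handle_info({:send_dump, requester}, state) do
    # Assumes Cachex.export/1 returns {:ok, entries} for the whole cache.
    {:ok, entries} = Cachex.export(@cache)
    send(requester, {:cache_dump, entries})
    {:noreply, state}
  end

  def handle_info({:cache_dump, entries}, state) do
    # Assumes Cachex.import/2 writes an exported entry list back into the cache.
    Cachex.import(@cache, entries)
    {:noreply, state}
  end
end

The key design point is that replication happens at boot (pull) and at write time (push), which is what makes node shutdowns non-events.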

The Code

The code is on a gist due to being fairly long: The Code

The end result is a straightforward set/get interface which handles all of the distribution and caching. I did a few basic performance tests of the system by throwing 10k, 50k, 500k, and 1,000k k/v pairs into the cache and seeing how it performed. Writes and distribution were incredibly fast, and rebooting the application made the cache available within a few seconds, well before the app would finish booting for Kubernetes health checks. There was one caveat I noticed: memory usage spiked while loading the dumps from remote servers. I believe the best solution here will involve changing to an approach that collects the size of each remote cache and requests a dump only from the top 1 or 2 largest caches. That will prevent N servers from sending full cache dumps, limiting it to only 1 or 2.
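As a sketch of that improvement (hypothetical again, living in the same module as the earlier sketch and assuming Cachex.size/1 plus a :cache_size call handled by each member):

# Ask every member for its cache size and request a dump only from the
# largest one or two, instead of from every node in the group.
def request_warmup(members, take \\ 2) do
  members
  |> Enum.map(fn pid -> {pid, GenServer.call(pid, :cache_size)} end)
  |> Enum.sort_by(fn {_pid, size} -> size end, &>=/2)
  |> Enum.take(take)
  |> Enum.each(fn {pid, _size} -> send(pid, {:send_dump, self()}) end)
end

# Each member answers with its local Cachex size.
def handle_call(:cache_size, _from, state) do
  {:ok, size} = Cachex.size(@cache)
  {:reply, size, state}
end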

Summary

In summary, I evaluated three different potential solutions for this distributed caching problem. While the first two options use great libraries and would have been possible to build on, the trade-offs were too much for the simplicity of my needs.

When working on a software project like this, it is important to consider what your exact needs are; doing so may actually lead you away from the typical libraries and toward a different solution. It may seem obvious to many, but it is very easy to get caught up in the libraries we’re hearing about rather than what is best for our particular use case.


Thanks for reading! I’ll be speaking at Lonestar ElixirConf about bringing Elixir to production, looking at both human and tech challenges in doing so.
