Lessons Learned from Shipping PushEx
It’s been a while since I announced the initial open-source release of PushEx. We had a few small challenges to solve in order to roll it out, and then I let it bake for several months to ensure that it was stable in production. It has now been running at production scale for several months, and I cut the first official release on hex.pm today. These are some of the challenges and lessons learned from running it in production.
The project was a big success
The rollout went very well, thanks to careful planning and testing on our end. We ran both systems in parallel for some time and slowly ramped the number of pushes from 0 to 100%. Throughout this process, we pushed data to clients but did not consume it there. Once we became confident, we began consuming the data on the clients through a slow rollout process. At that point, we dropped the connections to our old provider.
We were able to avoid downtime or problems throughout this process by very carefully monitoring the application. We stopped the rollout at the first sign of trouble, and then resumed once we understood the root cause.
The order of process shutdown matters
We encountered a large number of errors during deployments of the application. These errors never occurred during normal operation, which seemed a bit odd. The root cause was that the application would still receive requests while it was shutting down. Messages would try to push to clients, but the processes responsible for delivery had already shut down. This can happen with a Supervisor structure like:
[ PushExWeb.Endpoint, PushEx.Pipeline ]
If the pipeline shuts down while the Endpoint is still online, web requests will encounter errors. A better layout might look like:
[ PushEx.Pipeline, PushExWeb.Endpoint, PushExWeb.ConnectionDrainer ]
In order to gain control over the shutdown process of PushEx, I had to make changes to the application supervision tree. I separated the starting of PushEx’s core system from the web portion. These changes allow the entire web stack to gracefully go offline (complete with connection draining) before the data pipeline does.
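As a sketch of that supervision layout (the application module name is an assumption; the children are ordered as in the list above): Elixir Supervisors stop children in the reverse of their start order, so putting the pipeline first means it starts before the web stack and shuts down after it.

```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # Started first, stopped last: the pipeline outlives the web stack,
      # so in-flight pushes still have somewhere to go during shutdown
      PushEx.Pipeline,
      # The web endpoint that accepts HTTP/WebSocket traffic
      PushExWeb.Endpoint,
      # Started last, stopped first: drains open connections before the
      # endpoint and pipeline are taken down
      PushExWeb.ConnectionDrainer
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```

The key design point is simply ordering: anything that must keep working while connections drain belongs earlier in the child list.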
This leads into the next problem—how to prevent data loss during application shutdown.
Connection draining is critical
Connection draining allows a web server to gracefully wait for open connections to close before it proceeds with shutting down the rest of the application. For HTTP, draining shuts down the listener process that would otherwise accept new connections.
Several layers of draining are required for PushEx to gracefully exit. They are:
- Socket draining (sockets should gracefully disconnect)
- Ranch connection draining (new web requests shouldn’t be handled)
- Push pipeline draining (in-flight data should be given a chance to be delivered)
You’ll notice that socket draining is implemented separately from connection draining. This is because the connection draining API only stops the listener from accepting new connections; connections that are already established have to be shut down manually.
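The Ranch side of this can be sketched as a small process that drains on shutdown (this is an illustrative sketch, not PushEx’s actual drainer; it assumes Ranch 1.6+, and the listener ref passed in is whatever ref the endpoint registered):

```elixir
defmodule PushExWeb.ConnectionDrainer do
  # Sketch of an HTTP connection drainer. Placed last in the supervision
  # tree so it is the first child to be stopped on shutdown.
  use GenServer

  def start_link(ranch_ref) do
    GenServer.start_link(__MODULE__, ranch_ref)
  end

  def init(ranch_ref) do
    # Trap exits so terminate/2 runs when the supervisor shuts us down
    Process.flag(:trap_exit, true)
    {:ok, ranch_ref}
  end

  def terminate(_reason, ranch_ref) do
    # Stop the listener from accepting new connections...
    :ok = :ranch.suspend_listener(ranch_ref)
    # ...then block until existing connections finish. The supervisor's
    # :shutdown timeout for this child bounds how long we will wait.
    :ok = :ranch.wait_for_connections(ranch_ref, :==, 0)
  end
end
```

Note that `wait_for_connections/3` waits until the connection count reaches zero; the effective time limit comes from the `:shutdown` value in this child’s spec, after which the supervisor kills it.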
This process was complex to get right, but now we can reboot servers without errors and without losing data. All of our data drains out within the 30-second limit; your mileage may vary if that’s not the case for your workload.
Big topics affect performance
We have a few large push topics in our application. It can be costly for Phoenix Tracker to have a large number of joins in a very short period of time.
The solution for PushEx is to allow certain topics to bypass Tracker. When a client joins an ignored topic, the join is not tracked, and presence checks for that topic always respond as if clients are connected. The likelihood that a large topic has 0 connected clients is close to 0, so it is acceptable to treat it as always having clients. We don’t want to treat small topics the same way, or we’d do more work than necessary.
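PushEx’s actual configuration API may differ; as an illustrative sketch, a presence check might consult a configured set of ignored topics before falling back to Tracker (the module name, topic names, and ignore list here are assumptions):

```elixir
defmodule PresenceCheck do
  # Sketch: large, always-occupied topics are listed up front and bypass
  # Tracker entirely; everything else asks Tracker as usual.
  @ignored_topics MapSet.new(["global:announcements"])

  @doc """
  Returns true if the topic should be treated as having connected clients.
  `tracker_fun` stands in for the real Tracker lookup.
  """
  def connected?(topic, tracker_fun) do
    if MapSet.member?(@ignored_topics, topic) do
      # Big topics are assumed occupied; no join tracking, no lookup cost
      true
    else
      tracker_fun.(topic)
    end
  end
end
```

So `PresenceCheck.connected?("global:announcements", fn _ -> false end)` returns `true` without consulting the tracker, while small topics still pay for (and benefit from) an accurate presence check.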
Keep-Alive dramatically increases API throughput
I was a bit shocked when I implemented the server API calls in our Ruby app. We were used to 20ms calls to our old provider, but the new one was hitting 200ms! We traced the root cause to DNS slowness when establishing the connection. The solution was to utilize Keep-Alive connection pooling.
The Keep-Alive header tells the server not to close the connection after a response is sent. The connection stays open to accept more requests, without the overhead of establishing a new connection each time. This dramatically increased the throughput of our Ruby servers to the PushEx API endpoint. We didn’t quite hit the 20ms goal, but it was close enough to call it a win.
I did hit a snag here. Keep-Alive connections suffer from the same draining problem as WebSockets: they don’t close when the listener is closed. I had to do a bit of hackery to set a global value indicating that the Keep-Alive connection should be closed on the next request. It’s a hack I wish I didn’t have to do, but it had the desired effect.
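That hack can be sketched as a plug that, once a drain flag is set, asks clients to stop reusing their Keep-Alive connections (this is my reconstruction, not PushEx’s actual code; the module name and the use of `:persistent_term` as the “global value” are assumptions):

```elixir
defmodule KeepAliveShutdownPlug do
  # Sketch: once draining starts, every response carries "Connection: close",
  # so well-behaved HTTP clients open no further requests on the connection.
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    if :persistent_term.get({__MODULE__, :draining?}, false) do
      # Ask the client to close this Keep-Alive connection after the response
      put_resp_header(conn, "connection", "close")
    else
      conn
    end
  end

  # Called from the shutdown path, before the listener is drained
  def begin_drain do
    :persistent_term.put({__MODULE__, :draining?}, true)
  end
end
```

`:persistent_term` (OTP 21.2+) suits a write-once flag like this because reads are essentially free on the hot request path.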
There were lots of challenges in rolling out the first major PushEx release, but the end result is solid. We’re running at fairly high throughput on a small number of small servers (2GB RAM + 2 vCPU for this app) without issue.
The snags we hit are not unique to our application. When building an Elixir application, you should consider both the startup and shutdown order of your process tree. Use connection draining to avoid new connections being made to a server that is in the process of shutting down. Leverage Keep-Alive headers for server-to-server API requests, especially if the throughput is high. These things do come with tradeoffs, however, so your mileage may vary.
The Book Plug
My book “Real-Time Phoenix: Build Highly Scalable Systems with Channels” is now in beta through The Pragmatic Bookshelf. This book captures the victories and struggles that we face when building real-time applications in Elixir.