By Ryan Spore
Kohort is an opinionated remote meeting platform we built and use daily at Mechanical Orchard. It lets distributed teams have meetings with more structure, transparency, and connection, with an integrated meeting agenda, live transcription, AI summarization, and much more.
Whenever you join a meeting via Kohort, we spin up an Elixir process that lives for the length of the meeting. It does a variety of things, from sampling the transcript for keywords to orchestrating API calls as the meeting starts and ends. We need this process to have continuity throughout the meeting (and meetings can last a long time, amirite?). But machines die, deploys happen, and without care, our Meeting process's continuity is lost.
If you also have long-running processes, you will face the same problem: what happens to a long-lived process when the machine it's running on dies?
We really wanted the answer to that question to be “A new process starts immediately on another machine, picks up from where the dead process stopped, and everyone else starts talking to the new process without even noticing the difference.”
We got pretty close! I’ll tell you how.
We started with Horde. Horde provides distributed DynamicSupervisor and Registry implementations, which we'll use to supervise our long-running processes and maintain their uniqueness.
Let’s start with a skeleton for a long-running GenServer, something like this:
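A sketch of what that skeleton might look like (the `Kohort.Meeting` and `Kohort.DistributedRegistry` names are illustrative, not the actual module names):

```elixir
defmodule Kohort.Meeting do
  use GenServer

  def start_link(opts) do
    meeting_id = Keyword.fetch!(opts, :meeting_id)
    GenServer.start_link(__MODULE__, opts, name: via_tuple(meeting_id))
  end

  # Register through Horde.Registry so the process is unique cluster-wide.
  def via_tuple(meeting_id) do
    {:via, Horde.Registry, {Kohort.DistributedRegistry, {__MODULE__, meeting_id}}}
  end

  @impl true
  def init(opts) do
    # Defer expensive startup work to handle_continue so we don't block the supervisor.
    {:ok, %{meeting_id: Keyword.fetch!(opts, :meeting_id)}, {:continue, :load_state}}
  end

  @impl true
  def handle_continue(:load_state, state) do
    # Expensive startup work (API calls, transcript setup, state recovery) goes here.
    {:noreply, state}
  end
end
```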
It’s mostly a basic structure for a GenServer, but we’re using a :continue-d startup, and we’re registering the GenServer with our DistributedRegistry.
You can start it with our Horde supervisor:
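Starting one might look like this (the supervisor name and meeting id are illustrative):

```elixir
# Horde.DynamicSupervisor mirrors the standard DynamicSupervisor API.
{:ok, _pid} =
  Horde.DynamicSupervisor.start_child(
    Kohort.DistributedSupervisor,
    {Kohort.Meeting, meeting_id: "standup-123"}
  )
```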
To have meaningful handoff, you need your application to operate as a cluster. We use libcluster's kubernetes strategies to get our BEAM nodes talking together (see this blog post for an excellent guide for how to get those to play nicely).
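A minimal libcluster topology for Kubernetes might look like the following sketch (the app name, selector, and namespace are placeholders for your own deployment):

```elixir
# config/runtime.exs — assumes the libcluster Kubernetes strategy.
config :libcluster,
  topologies: [
    kohort: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        mode: :ip,
        kubernetes_node_basename: "kohort",
        kubernetes_selector: "app=kohort",
        kubernetes_namespace: "default"
      ]
    ]
  ]
```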
Once you’ve got the application clustering, you can set up Horde to supervise your long-lived processes.
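That means adding a Horde.Registry and a Horde.DynamicSupervisor to the application's supervision tree, something like this (names are illustrative):

```elixir
# In your Application.start/2 callback.
children = [
  {Horde.Registry, name: Kohort.DistributedRegistry, keys: :unique, members: :auto},
  {Horde.DynamicSupervisor,
   name: Kohort.DistributedSupervisor, strategy: :one_for_one, members: :auto}
]

Supervisor.start_link(children, strategy: :one_for_one, name: Kohort.Supervisor)
```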
In this state, if the node your process is running on dies, Horde will restart it on a different node, but with the same starting state that was provided in the start_child call.
Graceful State Handoff
In order to transfer state on the new node, we need to handle the exit signals Horde uses when a node dies. First, we need to trap exits:
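One way that could look, assuming the skeleton above (the meta key shape and sleep duration are illustrative):

```elixir
@impl true
def init(opts) do
  # Trap exits so terminate/2 runs when the supervisor shuts us down.
  Process.flag(:trap_exit, true)
  {:ok, %{meeting_id: Keyword.fetch!(opts, :meeting_id)}, {:continue, :load_state}}
end

@impl true
def terminate(:shutdown, state) do
  # Stash handoff state in the CRDT-backed registry meta, keyed by meeting.
  Horde.Registry.put_meta(
    Kohort.DistributedRegistry,
    {:handoff, state.meeting_id},
    state
  )

  # Give the registry CRDT a moment to sync to the surviving nodes.
  Process.sleep(1_000)
end

def terminate(_reason, _state), do: :ok
```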
Here we intercept the normal shutdown signal triggered by the node shutting down, write some handoff state to the Horde.Registry meta, and then sleep to give the Registry a chance to sync with other nodes.
Your application needs some shared datastore that both dying and spawning nodes can communicate with. This could be a database or other external datastore, but we chose to use the Horde.Registry meta, which is kept synchronized across the cluster using a CRDT. To use this effectively, we needed our nodes to stay clustered together until they shut down completely (to enable the handoff), while being removed from the load balancer so users didn't connect to the dying nodes. Using the pods lookup mode, we were able to keep pods clustered during their shutdown, after they've already been removed from the endpoint managing load balancing.
If your long-running process isn’t eternal, you also need a way to shut the process down that clears the handoff state, like this:
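A sketch of a deliberate-shutdown path, continuing the module above (the `{:shutdown, :meeting_ended}` reason is an illustrative convention):

```elixir
# Public API to end a meeting for good.
def stop(meeting_id) do
  GenServer.stop(via_tuple(meeting_id), {:shutdown, :meeting_ended})
end

@impl true
def terminate({:shutdown, :meeting_ended}, state) do
  # Deliberate shutdown: clear the handoff meta so no new process resumes this meeting.
  Horde.Registry.put_meta(Kohort.DistributedRegistry, {:handoff, state.meeting_id}, nil)
  :ok
end
```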
Clumsy State Handoff
Graceful shutdown isn’t guaranteed, so to hedge against a clumsier handoff, we eagerly write as much state as is available, and add a more complex state recovery step. You can try reconstituting state from a database or event log.
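That recovery might slot into the `handle_continue` from the skeleton above; `rebuild_from_datastore/1` is a hypothetical helper standing in for whatever reconstruction your domain allows:

```elixir
@impl true
def handle_continue(:load_state, state) do
  recovered =
    case Horde.Registry.meta(Kohort.DistributedRegistry, {:handoff, state.meeting_id}) do
      {:ok, handoff_state} when not is_nil(handoff_state) ->
        # Graceful handoff: pick up exactly where the old process stopped.
        handoff_state

      _ ->
        # Clumsy handoff: rebuild best-effort state from a database or event log.
        rebuild_from_datastore(state.meeting_id)
    end

  {:noreply, recovered}
end
```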
With all this in place, you have a system that moves long running processes off of dying nodes, and uses the best available information to keep them running from where they left off.
Horde's distribution is eventually consistent, so sometimes duplicate processes will get started. You can look into merge conflict resolution to handle the messages Horde uses to resolve these conflicts.
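Since we're already trapping exits, one place to handle this is a `handle_info` clause matching the conflict exit signal — something along these lines (check Horde's documentation for the exact message shape in your version):

```elixir
# When Horde resolves a duplicate, the losing process receives an exit
# signal with a :name_conflict reason; we stop it cleanly without
# clearing the handoff state the winner may still need.
def handle_info({:EXIT, _from, {:name_conflict, {_key, _value}, _registry, _winning_pid}}, state) do
  {:stop, :normal, state}
end
```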
You can also enhance your clumsy handoff recovery by adding periodic writes to the clumsy handoff state or improving the state recovery logic when a new node attempts a clumsy recovery.
Elixir has proved to be an excellent (and enjoyable!) tool for building a modern, resilient, scalable RTC application. Phoenix and LiveView allow us to build responsive apps quickly, and OTP allows us to build a scalable and fault-tolerant video call experience. It’s been a pleasure working in this ecosystem to deliver an innovative product that’s always evolving and improving.