Paxos Dynamic Cluster Membership

by simbo1905

One practical concern not covered so far in this blog series is how to add or remove nodes from a cluster. This is straightforward and can be handled by the Paxos algorithm itself. The process is outlined in the last paragraph of the paper Paxos Made Simple.

Update: In late 2016 the “stop the world” event described in this article was fixed by UPaxos, described in the paper Unbounded Pipelining in Dynamically Reconfigurable Paxos Clusters. I cover that in a later post.

Recall that the Paxos algorithm proposes and then fixes values. To replicate state in a consistent manner using operations which don’t commute, we sequentially number the operations so that they are applied in the same order at each cluster node. The idea of dynamic cluster membership with Paxos is simple: cluster membership change commands define what counts as a legal majority for subsequent Paxos slots.

To understand this we need a quick refresher on what a slot is. The paper Paxos Systems Demystified defines it as:

In all Paxos protocols, every chosen value (i.e., proposed client operation) is a log entry, and each entry identifier […] has two components denoted as slot and ballot number in Paxos, epoch and counter […] in Zab, and as term and index in Raft.

The protocol may run one or more rounds of Paxos to attempt to fix a value at a given slot (aka counter, aka index) but only one value will ever be fixed at each slot. Why? By design; this invariant of the Paxos algorithm is why it solves the problem of distributed consensus. This means that to change the cluster membership we simply have the protocol fix a special reconfiguration command into the next slot. This reconfiguration command defines the cluster membership for the subsequent slots. The algorithm itself ensures that all nodes know the cluster membership in effect at each slot. All nodes must know the correct cluster membership at each slot to be sure that they use overlapping quorums.
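
To make that concrete, here is a minimal sketch in Java of how a node might track which membership is in effect at each slot. The class and method names such as `MembershipLog` are illustrative, not from any particular library: a reconfiguration command fixed at slot `s` records a new node set for the slots that follow, and quorum checks always look up the membership in effect at the slot being voted on.

```java
import java.util.*;

/** Minimal sketch: tracks which cluster membership is in effect at each slot.
    Names (MembershipLog, reconfigurationFixed) are illustrative only. */
final class MembershipLog {
    // maps "first slot where this membership applies" -> node set
    private final NavigableMap<Long, Set<String>> byFirstSlot = new TreeMap<>();

    MembershipLog(Set<String> initialNodes) {
        byFirstSlot.put(0L, Set.copyOf(initialNodes));
    }

    /** Called when a reconfiguration command is fixed at fixedSlot (alpha = 1):
        the new membership governs every slot after fixedSlot. */
    void reconfigurationFixed(long fixedSlot, Set<String> newNodes) {
        byFirstSlot.put(fixedSlot + 1, Set.copyOf(newNodes));
    }

    /** Membership that a node must use when counting a quorum for the given slot. */
    Set<String> membershipAt(long slot) {
        return byFirstSlot.floorEntry(slot).getValue();
    }

    /** Simple majority quorum over the membership in effect at the slot. */
    boolean isQuorum(long slot, Set<String> responders) {
        Set<String> nodes = membershipAt(slot);
        long votes = responders.stream().filter(nodes::contains).count();
        return votes > nodes.size() / 2;
    }
}
```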

The paper Paxos Made Simple has an additional generality expressed by `α`. This is the distance, in slots, between a membership change being fixed and it coming into effect. With Paxos using primary ordering for transaction log replication, as described in a previous post, it is natural to set `α` to 1. The change comes into effect for the next Paxos slot. This is easy to achieve when streaming accept commands: the responses from other nodes for a given slot are only evaluated, just-in-time, once the slot immediately preceding it has been fixed.
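
Here is a hedged sketch of that just-in-time evaluation, building on the `MembershipLog` sketch above (again, the names are illustrative): accept responses for a slot are buffered and only counted against a quorum once the preceding slot has been fixed, at which point the membership for the slot is known for certain.

```java
import java.util.*;

/** Sketch of just-in-time response evaluation when streaming accept commands,
    reusing the MembershipLog sketch above. Responses for slot s are buffered
    and only judged once slot s - 1 is fixed, because only then is the
    membership, and hence the quorum, for slot s known. */
final class StreamingQuorumTracker {
    private final MembershipLog members;
    private final Map<Long, Set<String>> responsesBySlot = new HashMap<>();

    StreamingQuorumTracker(MembershipLog members) { this.members = members; }

    /** Buffer a positive accept response without judging it yet. */
    void onAcceptResponse(long slot, String fromNode) {
        responsesBySlot.computeIfAbsent(slot, s -> new HashSet<>()).add(fromNode);
    }

    /** Called when fixedSlot is fixed: the next slot's responses can now be judged. */
    boolean nextSlotHasQuorum(long fixedSlot) {
        long nextSlot = fixedSlot + 1;
        Set<String> responders = responsesBySlot.getOrDefault(nextSlot, Set.of());
        return members.isQuorum(nextSlot, responders);
    }
}
```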

Defining `α` to be 1 is natural when applying Paxos to transaction log replication which uses strict commit ordering. So when might we define it to be higher, and why is that useful? The answer is that it allows some concurrency in the processing of values in applications where that is desirable.

To see how that might be useful, consider a game server where there are many hundreds of concurrent game arenas running for many thousands of players. Commands for each game arena must be applied in order across the Paxos cluster. Commands for different game arenas can run in a different order on each cluster node. We can hash the game arena id and bucket the commands at the leader into, say, 32 queues. The leader can fix the ordering of events at a global level but can have 32 threads committing values in parallel in 32 partial orderings. Such a “pipelined” design allows for a high degree of concurrency across all games whilst maintaining strict ordering within each game.
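
As a rough illustration of that bucketing, assuming a hypothetical `GameCommand` carrying an arena id, the leader could hash the arena id to pick one of 32 single-threaded workers, so commands for the same arena commit in order while different arenas proceed in parallel:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.IntStream;

/** Sketch of bucketing fixed commands at the leader by game arena id.
    GameCommand and applyToState are hypothetical names for illustration. */
final class ArenaPipelines {
    record GameCommand(String arenaId, byte[] payload) {}

    private static final int BUCKETS = 32;
    private final List<ExecutorService> workers = IntStream.range(0, BUCKETS)
            .mapToObj(i -> Executors.newSingleThreadExecutor())
            .toList();

    /** The same arena id always hashes to the same single-threaded worker,
        so commands within one arena are applied in strict order. */
    void commit(GameCommand cmd) {
        int bucket = Math.floorMod(cmd.arenaId().hashCode(), BUCKETS);
        workers.get(bucket).submit(() -> applyToState(cmd));
    }

    private void applyToState(GameCommand cmd) {
        // apply the fixed value to the in-memory state for that arena
    }
}
```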

With such a parallel consensus design, setting `α` to 1 would hurt concurrency. When a cluster membership admin command arrived we would have to pause the 32 parallel instances whilst we ran consensus on the cluster change command. If the network dropped that special command we would have a “stop-the-world” event until we could retransmit it and get back positive responses.

To improve on this we only need a weaker partial ordering: the admin command must be fixed by consensus some arbitrary but pre-agreed number of commands before it comes into effect. We can uniquely number every command as it arrives and define `α` to be some multiple of the maximum queue depth, say, 128. When a cluster membership change arrives we can put it into a special admin command queue. The subsequent 128 game commands on the 32 queues are able to run as normal. If, however, the admin command has not been fixed by the time the 129th subsequent game command is reached, the leader will pause the 32 pipelines. We are free to define `α` to be as large or as small as we desire. We can make it high enough that under normal operations there is a low probability of a stop-the-world event, whilst making it low enough that admin commands take effect in a timely manner.
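
The following sketch shows one possible shape for that rule, with `α` fixed at 128 and hypothetical names throughout: admin commands are numbered as they arrive, take effect `α` commands later, and the pipelines only stall if an admin command that should already be in effect has still not been fixed.

```java
import java.util.concurrent.ConcurrentSkipListMap;

/** Sketch of the alpha = 128 rule for admin (reconfiguration) commands,
    assuming every command is numbered as it arrives. AdminGate, adminProposed
    and mayCommit are hypothetical names for illustration. */
final class AdminGate {
    private static final long ALPHA = 128;
    // effect number of each outstanding admin command -> has it been fixed yet?
    private final ConcurrentSkipListMap<Long, Boolean> pendingAdmin =
            new ConcurrentSkipListMap<>();

    /** Admin command numbered n was sent into consensus; it governs commands >= n + ALPHA. */
    void adminProposed(long commandNumber) {
        pendingAdmin.put(commandNumber + ALPHA, Boolean.FALSE);
    }

    /** Consensus has fixed the admin command numbered n. */
    void adminFixed(long commandNumber) {
        pendingAdmin.put(commandNumber + ALPHA, Boolean.TRUE);
    }

    /** The pipelines call this before committing game command n: they stall only
        if an admin command should already be in effect but is still not fixed. */
    boolean mayCommit(long commandNumber) {
        var dueByNow = pendingAdmin.headMap(commandNumber, true);
        return dueByNow.values().stream().allMatch(fixed -> fixed);
    }
}
```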

Update: A future post revisits this topic, as the problem of stop-the-world stalls was solved by UPaxos in late 2016.