Paxos Voting Weights

by simbo1905

The last UPaxos post took a run through how the UPaxos paper avoids reconfiguration stalls. In this post, we will take a quick look at how the paper describes hot swapping of nodes using voting weights.

So what’s a voting weight? It’s an integer multiplier applied to a given node’s vote within a Paxos quorum. Traditional Paxos gives every node a weight of “1” so that all nodes are equal. So what can we achieve by varying the weights?

Well, straight away you get “learners” by configuring acceptors with a zero weight. The original Paxos papers decomposed Paxos into proposer, acceptor and learner roles. Learners are nodes that track the outcome of the consensus algorithm without participating in the consensus quorums. They are replicas which can be used to scale read load. Simply applying a zero voting weight turns an acceptor into a learner. Better yet, if an acceptor dies you can promote a learner by assigning it a voting weight of “1”. Happy days.
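To make that concrete, here is a minimal Python sketch of a weighted majority check. The node names and the `is_quorum` helper are made up for this post rather than taken from the paper: a node’s vote counts for its weight, and a quorum is any set of nodes holding a strict majority of the total weight.

```python
# Illustrative sketch only: a node's vote counts for its weight, and a quorum
# is any set of nodes holding a strict majority of the total voting weight.
weights = {"W": 1, "X": 1, "Y": 1, "L": 0}  # L is a learner: it replicates but never counts

def is_quorum(voted, weights):
    """True if the nodes in `voted` hold a strict majority of the total voting weight."""
    total = sum(weights.values())
    return 2 * sum(weights[n] for n in voted) > total

print(is_quorum({"W", "X"}, weights))       # True: weight 2 of a total of 3
print(is_quorum({"X", "L"}, weights))       # False: the learner contributes nothing
weights["L"] = 1                            # promote the learner to a full acceptor
print(is_quorum({"W", "X", "L"}, weights))  # True: weight 3 of a total of 4
```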

What else? Well, it allows us to set up the leader casting-vote scenarios that we need for UPaxos cluster reconfigurations. The paper gives a very practical scenario: upgrading a server. It points out that cloud providers today typically have three rather than four availability zones in each region. If we just replace a node by adding the new one and then deleting the old one, we pass through the following three states:

Zone A      Zone B      Zone C
Server W    Server X    Server Y    Server Z
1           1           1
1           1           1           1
1           1                       1

The problem is that we temporarily have four nodes with two of them in the same availability zone. If we lose Zone C in that state then we only have half the cluster. With the FPaxos even-nodes optimisation, a leader in Zone A or B will continue to process client commands. The problem is that you cannot elect a new leader if it dies.
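A rough sanity check of that transient state, assuming FPaxos-style quorums for an even-sized cluster (phase 2 can use N/2 acceptors, phase 1 needs N/2 + 1):

```python
# Transient state: four equal-weight nodes W, X, Y, Z; Zone C holds two of them.
n = 4          # total nodes, all with weight 1
survivors = 2  # only Zone A and Zone B remain after losing Zone C
print(survivors >= n // 2)      # True: a sitting leader in Zone A or B keeps committing
print(survivors >= n // 2 + 1)  # False: not enough nodes to elect a new leader
```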

To get around this we can use voting weights. In the following table, if we lose Zone C at any point, a leader in Zone A or B can still get a majority:

Zone A      Zone B      Zone C
Server W    Server X    Server Y    Server Z
1           1           1
2           2           2
2           2           2           1
2           2           1           1
2           2                       1
2           2                       2
1           1                       1

At each step in that process, we either doubled all weights, halved all weights, or changed one node’s voting weight by one. Why? To ensure that the majorities of consecutive configurations overlap.
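We can brute-force check both claims with a small sketch (again illustrative Python, not the paper’s machinery): that a leader in Zone A or B holds a majority at every step, and that every majority of one configuration intersects every majority of the next.

```python
from itertools import combinations

# The seven configurations from the table above, as weight maps.
steps = [
    {"W": 1, "X": 1, "Y": 1},
    {"W": 2, "X": 2, "Y": 2},
    {"W": 2, "X": 2, "Y": 2, "Z": 1},
    {"W": 2, "X": 2, "Y": 1, "Z": 1},
    {"W": 2, "X": 2, "Z": 1},
    {"W": 2, "X": 2, "Z": 2},
    {"W": 1, "X": 1, "Z": 1},
]

def is_quorum(voted, weights):
    """True if `voted` holds a strict majority of the total voting weight."""
    total = sum(weights.values())
    return 2 * sum(weights.get(n, 0) for n in voted) > total

def overlap(a, b):
    """True if every majority quorum of config `a` intersects every majority quorum of `b`."""
    nodes = set(a) | set(b)
    for r in range(len(nodes) + 1):
        for s in map(set, combinations(nodes, r)):
            # a disjoint pair of majorities would let the two configurations diverge
            if is_quorum(s, a) and is_quorum(nodes - s, b):
                return False
    return True

# A leader in Zone A or B survives the loss of Zone C at every step...
assert all(is_quorum({"W", "X"}, w) for w in steps)
# ...and the majorities of each pair of consecutive configurations overlap.
assert all(overlap(a, b) for a, b in zip(steps, steps[1:]))
print("all checks pass")
```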

It is fairly obvious that if you double or halve all weights between consecutive configurations, a simple majority in each will overlap; any integer scaling factor has that property. Adjusting a single node’s weight by one is a little more subtle. It is equivalent to adding or removing one node in a cluster where every node has a voting weight of 1. If we consider the sequence of cluster sizes {3, 4, 5, 6, 7, ...} we can see that the majorities of consecutive sizes do indeed overlap: a majority of n nodes plus a majority of n+1 nodes always adds up to more than n+1 nodes, so the two sets must share at least one node. If we were to skip a number, such as jumping from 5 straight to 7, the two majorities (3 and 4) could be disjoint and we would have violated the UPaxos safety constraints.
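Reusing the `overlap` helper from the sketch above makes the point:

```python
# Growing a unit-weight cluster from five to six nodes keeps majorities
# overlapping, but jumping straight from five to seven does not
# (3 + 4 = 7, so the two majorities can be disjoint).
five  = {n: 1 for n in "ABCDE"}
six   = {n: 1 for n in "ABCDEF"}
seven = {n: 1 for n in "ABCDEFG"}
print(overlap(five, six))    # True
print(overlap(five, seven))  # False
```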