Pre-voting in distributed consensus

by simbo1905

Another top-notch Paxos post from the inventor of UPaxos covers leader election in Paxos. The outlined approach is similar to the one used in TRex, which is based on a sloppy timeout mechanism. This post will get into why it is a must-read for consensus fans.

The great thing about the new post on Paxos Pre-voting is that it contains a proof sketch that the technique is sound. I believe that the approach coded into TRex is equivalent. Let’s take an in-depth look at the pre-vote approach.

First up, what are we trying to solve? It is that Paxos under steady state needs a stable leader that can fix an infinite number of commands using the minimum number of messages. In a three node cluster the minimum per value is one round trip between the leader and one other node: the leader counts its own accept, so a single response from a second node forms a majority of two out of three. The paper Paxos Made Simple mentions leader election but doesn’t get into how it occurs. Why? Because it focuses on the proof of correctness. Leader election can be totally randomised, with no coordination and many nodes attempting to lead simultaneously, and the protocol will still be safe. So the original paper leaves making things efficient as an implementation detail, with a hint about timeouts.

The lack of detail in the original paper makes it tempting to use an external leader election service. This is both inelegant and a bad idea. Why? Such a leader election service would have to be running its own consensus algorithm. Beyond the architectural complexity, it is a bit circular to write a consensus algorithm that outsources part of its functionality to another consensus algorithm. It also means you have more moving parts to reason about under complex failures.

Introducing more services and protocols to a distributed system simply multiplies the permutations of error states and widens the surface through which network partitions, message drops, crashes or stalls can drive our system into deadlock or livelock. If you are tempted to think it’s a quick win, you may be falling foul of the fallacies of distributed computing.

As I will sketch below, it’s almost trivial to implement a simple, though inefficient, in-band leader election mechanism. Given how easy it is to do something safe, it is clearly worth expending a modest amount of effort to make it efficient, as documented in the Paxos Pre-voting blog.

Let’s sketch the simplest thing that can work, assuming we are implementing Paxos Made Simple. The hint in the paper is to use a timeout. We also know that we should be using only local clocks, as Paxos makes no assumptions about a global clock. Any node can time out on the leader and issue a prepare message. Assuming your timeouts are randomised and large enough, and messages get through, the algorithm will just run and will quiesce into a healthy state. Ta-da!
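Here is a minimal sketch of that timeout-to-prepare step in Scala (TRex’s language). The type names, the timeout bounds and the ballot layout are my own illustrative assumptions, not TRex’s actual API:

```scala
import scala.util.Random

// Illustrative types: a ballot number ordered by counter then node id,
// and the prepare message a would-be leader issues.
final case class BallotNumber(counter: Int, nodeId: Int)
final case class Prepare(ballot: BallotNumber)

// Pick a fresh timeout somewhere between a floor and a ceiling so that
// nodes are unlikely to all time out at the same instant.
def randomTimeoutMs(min: Long = 1500L, max: Long = 3000L): Long =
  min + Random.nextLong(max - min)

// On timeout, a node simply bids to lead with a ballot number higher
// than any it has previously seen or promised.
def onTimeout(highestSeen: BallotNumber, myNodeId: Int): Prepare =
  Prepare(BallotNumber(highestSeen.counter + 1, myNodeId))
```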

Perhaps I am being a bit too much like the Paxos Made Simple paper in skipping some details. We have to time out on something. The simplest thing is to time out on not seeing any new values being fixed by a stable leader. If our system isn’t always busy, we can run some synthetic client code which sends a no-op value when no client values have run through the algorithm recently. All nodes will learn about the chosen values (genuine or synthetic) and can time out on not seeing anything for a while, which may indicate that the leader has crashed.
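A hedged sketch of that “time out on silence” pattern follows; the names (`Noop`, `FollowerTimeout`, `heartbeatMs`) are invented for illustration:

```scala
// Values are either real client commands or synthetic no-op heartbeats.
sealed trait Value
case object Noop extends Value
final case class ClientCommand(bytes: Array[Byte]) extends Value

final class FollowerTimeout(timeoutMs: Long) {
  private var lastFixedAt: Long = System.currentTimeMillis()

  // Called whenever this node learns that any value (real or no-op) was fixed.
  def onValueFixed(): Unit = lastFixedAt = System.currentTimeMillis()

  // If we have heard nothing for a full timeout the leader may be dead.
  def leaderSuspect(): Boolean =
    System.currentTimeMillis() - lastFixedAt > timeoutMs
}

// Leader side: if no client value went through recently, push a no-op
// so that followers keep seeing progress.
def nextValue(pending: Option[ClientCommand], idleMs: Long, heartbeatMs: Long): Option[Value] =
  pending.orElse(if (idleMs > heartbeatMs) Some(Noop) else None)
```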

Okay, so if the leader node crashes, some nodes will time out and attempt to lead. They simply do this by issuing a prepare message and then running the normal algorithm. Paxos won’t be incorrect under such circumstances. The challenge is that we can enter a form of livelock where potential leaders duel. The fix is a randomised exponential back-off: all potential leaders will eventually back off enough to allow one of them to exchange sufficient messages to lead. Happy days.
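Something like the following randomised exponential back-off would do. The base, cap and jitter policy are illustrative choices, not anything prescribed by the protocol:

```scala
import scala.util.Random

// Each failed leadership bid doubles the back-off window up to a cap;
// the random jitter keeps duelling candidates from retrying in lock-step.
final class Backoff(baseMs: Long = 500L, capMs: Long = 30000L) {
  private var attempt: Int = 0

  def nextDelayMs(): Long = {
    val window = math.min(capMs, baseMs << math.min(attempt, 16))
    attempt += 1
    window + Random.nextLong(window) // full window plus random jitter
  }

  // Reset once a leader (ourselves or another node) is stable again.
  def reset(): Unit = attempt = 0
}
```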

So why does the new post have a pre-vote phase? The problem is that if a timed-out node issues a regular prepare message, using a number higher than any it has seen before, it can disrupt a stable leader. Imagine a three node cluster where the link goes down between the leader and one other node. We then have a partial network partition where one node believes the leader is dead. If it times out and issues a high prepare message it will unnecessarily disrupt the stable leader: the third node will make a promise to the higher number and then reject further accept messages from the leader. The third node ends up watching two leaders slug it out, with itself as the middle man.
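The disruption falls directly out of the standard acceptor rules, which a sketch makes plain. Once the third node promises to the higher ballot it must reject the stable leader’s accepts (the types here are illustrative, not TRex’s):

```scala
final case class BallotNumber(counter: Int, nodeId: Int)
object BallotNumber {
  // Order by counter, breaking ties by node id, as is conventional.
  implicit val ordering: Ordering[BallotNumber] =
    Ordering.by(b => (b.counter, b.nodeId))
}

final class Acceptor {
  import Ordering.Implicits._
  private var promised: BallotNumber = BallotNumber(0, 0)

  // A prepare with a strictly higher ballot steals the promise...
  def onPrepare(ballot: BallotNumber): Boolean =
    if (ballot > promised) { promised = ballot; true } else false

  // ...after which the old leader's lower-ballot accepts are rejected,
  // stalling it until it re-runs the prepare phase.
  def onAccept(ballot: BallotNumber): Boolean = ballot >= promised
}
```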

The Paxos Pre-voting post avoids the unnecessary duel by using a seek-votes message, which won’t disrupt the stable leader. In the scenario above, the third node can see the leader and simply responds to the new message with enough information to avoid starting a leader duel. The respondent can also eagerly retransmit the latest fixed values to the node that has lost connectivity to the leader. The in-depth post gets into the finer details. Great stuff.
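As a rough sketch of how I read that exchange, with the message and field names entirely my own guesses rather than anything taken from the post:

```scala
// A guess at the shape of the pre-vote exchange.
final case class SeekVotes(from: Int)
final case class SeekVotesResponse(
  from: Int,
  leaderAlive: Boolean,   // responder has recently heard from a stable leader
  highestFixedSlot: Long  // lets the responder offer retransmission of fixed values
)

// Crucially, answering a seek-votes message makes no promise, so the
// stable leader is never disrupted by the probe.
def onSeekVotes(msg: SeekVotes, myId: Int, leaderAlive: Boolean, highestFixedSlot: Long): SeekVotesResponse =
  SeekVotesResponse(myId, leaderAlive, highestFixedSlot)
```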

TRex takes a slightly more minimalistic approach. Rather than a timed-out node issuing a regular prepare message with a high number, it issues a prepare with the zero ballot number. This is not going to disrupt a stable leader, as it will be rejected by all nodes: each node negatively acknowledges (“nack”) any prepare message where it has already made a higher promise. The nack message carries additional information about the possible existence of a stable leader. The node which cannot see the leader can then send retransmission requests to a node that can. Only if the timed-out node gets nacks from enough nodes to form a majority, with no evidence of a stable leader, will it issue a high prepare message and follow the normal algorithm.
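A sketch of that decision, where the `leaderHeartbeat` field standing in for the “evidence of a stable leader” is my own guess at a concrete encoding:

```scala
final case class BallotNumber(counter: Int, nodeId: Int)
final case class Prepare(ballot: BallotNumber)

// The probe: a prepare at the zero ballot, which every node will nack,
// so it can never disrupt a stable leader.
val probe: Prepare = Prepare(BallotNumber(0, 0))

// A nack reveals the responder's promise plus some evidence (modelled here
// as a heartbeat counter) of whether it can still see a live leader.
final case class Nack(from: Int, promised: BallotNumber, leaderHeartbeat: Long)

// Escalate to a real high prepare only on a majority of nacks carrying
// no evidence of a leader fresher than what we already knew about.
def shouldAttemptLeadership(nacks: Set[Nack], clusterSize: Int, lastKnownHeartbeat: Long): Boolean = {
  val majority = clusterSize / 2 + 1
  val leaderVisible = nacks.exists(_.leaderHeartbeat > lastKnownHeartbeat)
  nacks.size >= majority && !leaderVisible
}
```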