Pre-voting in distributed consensus
by simbo1905
Another top-notch Paxos post from the inventor of UPaxos covers leader election in Paxos. The approach it outlines is similar to the one used in TRex, which is based on a sloppy timeout mechanism. This post gets into why it is a must-read for consensus fans.
The great thing about the new post on Paxos Pre-voting is that it contains a sketch of why the technique is sound. I believe that the approach coded into TRex is equivalent. Let’s take an in-depth look at the pre-vote approach.
First up, what are we trying to solve? It is that Paxos under steady state needs a stable leader that can fix an infinite number of commands using the minimum number of messages. The minimum per value in a three-node cluster is one round trip from a leader to a second node. The paper Paxos Made Simple mentions leader election but doesn’t get into how that occurs. Why? Because it focuses on the proof of correctness. The leader election can be totally randomised, with no coordination and many nodes attempting to lead simultaneously, and the protocol will still be safe. So the original paper leaves making things efficient as an implementation detail, with a hint about timeouts.
The lack of detail in the original paper makes it tempting to use an external leader election service. This is both inelegant and a bad idea. Why? Such a leader election service would have to be running its own consensus algorithm. As well as adding architectural complexity, it’s a bit circular to write a consensus algorithm that outsources part of its own functionality to another consensus algorithm. It also means you have more moving parts to reason about under complex failures.
Introducing more services and protocols to a distributed system simply increases the permutations of error states and the surface across which network partitions, message drops, crashes or stalls can tip our system into deadlock or livelock. If you are tempted to think it’s a quick win, you may be falling foul of the fallacies of distributed computing.
As I will sketch below, it’s almost trivial to implement a simple, though inefficient, in-band leader election mechanism. Given how simple it is to do something safe, it is worth expending a modest amount of effort to make it efficient, as documented in the Paxos Pre-voting blog.
Let’s sketch the simplest thing that can work assuming we are implementing Paxos Made Simple. The hint in the paper is to use a timeout. We also know that we should use only local clocks, as Paxos makes no assumptions about a global clock. Any node can time out on the leader and issue a `prepare` message. Assuming your timeouts are randomised and large enough, and messages get through, the algorithm will just run and will quiesce into a healthy state. Ta da!
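To make the timeout bookkeeping concrete, here is a minimal Python sketch; the class and method names are my own invention, not TRex code. Each node uses only its local clock and a randomised deadline that is pushed out whenever it sees evidence of a live leader.

```python
import random

class TimeoutFollower:
    """Sketch of timeout-driven leader election (hypothetical names).

    Uses only the local clock, as Paxos assumes no global clock, and a
    randomised deadline so that nodes rarely attempt to lead at once.
    """

    def __init__(self, now, lo=1.0, hi=2.0):
        self.lo, self.hi = lo, hi
        self.deadline = now + random.uniform(lo, hi)

    def heard_from_leader(self, now):
        # Any evidence of a live leader pushes the deadline out again.
        self.deadline = now + random.uniform(self.lo, self.hi)

    def should_attempt_to_lead(self, now):
        # On expiry the node would issue a `prepare` with a ballot number
        # higher than any it has seen, then run the normal algorithm.
        return now >= self.deadline
```

Note that safety never depends on these timeouts; they affect only liveness and efficiency.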
Perhaps I am a bit too much like the Paxos Made Simple paper in skipping some details. We have to time out on something. The simplest thing is to time out on not seeing any new values being fixed by a stable leader. If our system isn’t always busy, we can run some synthetic client code which sends a no-op value when no other client values have been run through the algorithm recently. (We don’t need to flush no-op values to the Paxos log, as we don’t mind losing them during a failover. These heartbeats therefore have very low performance overhead and can run continuously.) All nodes will learn about the chosen values (genuine or synthetic) and can time out on not seeing anything for a while, which may indicate that the leader has crashed.
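The leader side of that heartbeat might be sketched like this; the names are hypothetical and this is a simplification of the idea above, not real code. When no client value has gone through the algorithm recently, the leader feeds a synthetic no-op through it instead.

```python
class HeartbeatingLeader:
    """Sketch (hypothetical names): the stable leader fixes a no-op value
    whenever no client value has gone through the algorithm recently, so
    followers keep seeing fresh fixed values while the leader is alive."""

    def __init__(self, idle_limit=0.5):
        self.idle_limit = idle_limit
        self.last_fixed_at = 0.0

    def value_fixed(self, now):
        self.last_fixed_at = now

    def next_value(self, now, pending_client_values):
        if pending_client_values:
            return pending_client_values.pop(0)
        if now - self.last_fixed_at >= self.idle_limit:
            # No-ops need not be flushed to the durable log: losing
            # them during a failover is harmless.
            return "no-op"
        return None
```

Followers then time out on not learning any chosen value, genuine or synthetic, for a while.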
Okay, so if the leader node crashes, some nodes will time out and attempt to lead. They simply do this by issuing a `prepare` message and then running the normal algorithm. Paxos won’t be incorrect under such circumstances. The challenge is that we can enter a form of livelock where potential leaders duel. The fix is to use an exponential back-off. All potential leaders will eventually back off enough to allow one of them to exchange sufficient messages to lead. Happy days.
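The back-off itself is only a few lines; this is a hypothetical helper, not lifted from any implementation. Each failed attempt to lead doubles the window from which the next randomised timeout is drawn, up to a cap, so eventually one candidate’s window is long enough for another to complete its round trips unopposed.

```python
import random

def backoff_timeout(attempt, base=1.0, cap=60.0):
    """Sketch of randomised exponential back-off for duelling candidates.

    Doubling the window per failed attempt (up to a cap) guarantees the
    duel eventually stops: one candidate's timeout grows long enough for
    a rival to finish the prepare/accept exchanges and lead."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(upper / 2, upper)
```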
So why does the new post have a pre-vote phase? The problem is that if a timed-out node issues a regular `prepare` message using a number higher than any it has seen before, it can disrupt a stable leader. Imagine a three-node cluster where the link goes down between the leader and one other node. We then have a partial network partition where one node believes the leader is dead. If it times out and issues a high `prepare` message it will unnecessarily disrupt the stable leader. The third node will make a promise and then reject further accept messages from the leader. The third node then sees two leaders slugging it out, with itself stuck as the middle man.
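The disruption falls straight out of the standard acceptor rule, which a toy sketch makes plain (simplified and with invented names): once the third node promises the gate-crasher’s higher ballot, it must reject the stable leader’s accepts.

```python
class Acceptor:
    """Toy sketch of the acceptor rule that makes a high `prepare`
    disruptive (a simplification with hypothetical names)."""

    def __init__(self):
        self.promised = 0

    def on_prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return "promise"
        return "reject"

    def on_accept(self, ballot):
        # Once a higher ballot has been promised, the old leader's
        # accepts fail until it re-prepares even higher: the duel begins.
        return "accepted" if ballot >= self.promised else "reject"
```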
The Paxos Pre-voting post avoids the unnecessary duel by using a `seek-votes` message, which won’t disrupt the stable leader. In the scenario above, the third node will learn about the active leader from any node in the active majority. The respondent can also eagerly transmit the latest fixed values to the node that has lost connectivity to the leader. The in-depth post gets into the fine details. Great stuff.
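The respondent’s side might be sketched like this; the message shapes and function name here are my own illustration, not taken from the pre-voting post. A node that can still see the leader reports it rather than offering a vote, and eagerly sends along any fixed values the sender is missing.

```python
def on_seek_votes(my_leader_ballot, my_fixed_log, sender_fixed_upto):
    """Sketch (hypothetical message shapes) of answering a `seek-votes`
    from a node that has lost sight of the leader: if we can see a live
    leader, decline to vote and retransmit the fixed values the sender
    is missing; otherwise offer our vote."""
    if my_leader_ballot is not None:
        catch_up = my_fixed_log[sender_fixed_upto:]
        return ("offer-catch-up", my_leader_ballot, catch_up)
    return ("offer-vote", None, [])
```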
TRex takes a slightly more minimalistic approach. Rather than a timed-out node issuing a regular `prepare` message with a high number, it issues a `prepare` with a zero ballot number. This will not disrupt a stable leader, as it will be rejected by all nodes. Each node will negatively acknowledge (“nack”) any `prepare` message where it has made a higher promise. The nack message carries additional information about the possible existence of a stable leader. The node which is not seeing the leader can then send retransmission requests to a node that can see the leader. Only if the timed-out node gets nacks from enough nodes to form a majority, with no evidence of a stable leader, will it issue a high `prepare` message and follow the normal algorithm.
I agree that TRex’s approach looks very similar indeed. The lowball `prepare` and subsequent `nack` fulfill the same roles as the `seek-votes` and `offer-vote` messages.
The one advantage of having them be separate types of message, rather than re-using existing messages, is that it’s somewhat clearer that two nodes never get into an infinite loop of repeatedly sending `prepare` and `nack` messages at each other. It’s probably true of TRex too, just not as clear.
That might be a slightly subtle point. Paxos treats nacks as an optimisation (I should dig out the citation for that, but I am confident of the statement). It would be a clear defect for TRex to promote from Follower to Recoverer, or from Recoverer to Leader, on a nack. I am highly confident that there is no such bug. The behaviour with acks is to follow the published algorithm, which is mechanically simple and well covered by tests. That does admit an infinite loop: the leader duel.
The only bit of “novel” logic is the lowball prepare. Effectively that’s an additional state not explicitly named in the PaxosRoles in the TRex code. I have raised issue #28 to add an explicit Candidate state and refactor the code so that it is self-documenting: a Follower that sees no evidence of a leader transitions to Candidate, issues a lowball `prepare`, and waits for a majority of nacks before issuing a high `prepare`. At that point the equivalence of my approach to yours would be more obvious. Repeating your analysis would then be a good idea.
Agreed that explicitly modelling things as you have makes them clearer. For example, anything involving a leader election should probably be clearly logged, so having separate leader election states makes it easier to see that the logging is correct. On the flip side, TRex has a social agenda to avoid significant deviation from the paper Paxos Made Simple, other than the “obvious” optimisations such as using nacks to propagate information and improve efficiency.
I also recall reading somewhere that `nacks` are optional and suggested as an optimisation, so it must be true.
FWIW I don’t use `nacks` at all: a stale `prepare` can simply be dropped if you’re using pre-voting, because the `offer-vote` messages guarantee the resulting `prepare` won’t be stale unless there was another active node at the time.
Perhaps a thing I’ve omitted to mention is that the `seek-votes` -> `offer-vote` -> `prepare` -> `promise` -> `propose` -> `accept` conversation happens very quickly (approximately 3x median RTT) which makes the probability of two simultaneously active nodes very low indeed once the timeouts grow big enough. 10x RTT seems amply big enough, which ties in with the numbers in the Raft paper IIRC.
TRex is doing the same things as `lowball` -> `nack` -> `prepare` -> `promise` -> `propose` -> `accept`, so it is equivalent. That suggests that if I implement issue #28 and simply refactor the lowball to be named `seek-votes` and its response as `offer-vote` then we have identical solutions. When you say you don’t use nacks do you mean you have no nacks whatsoever? TRex has nacks for normal prepares and accepts. These can tell a leader behind a partition that another leader has taken over, and it will then back down.
Yep, no NACKs at all. Low prepares and proposals are just dropped.
A leader behind a partition won’t remain a leader for long – it’ll time out to an incumbent but won’t get its next value accepted (since there’s a quorum following the new leader), so it times out again and becomes a candidate, sends a `seek-votes`, gets an `offer-catch-up`, and ultimately learns about the new leader. NACKs would make that process a little quicker, but it doesn’t seem to make any difference to clients as the new leader is already chugging away by this point.
Put differently, if the leader finds itself in a minority partition then it won’t get any NACKs until the partition heals anyway; NACKs help a bit in a partial partition but everything tends to go slowly when the network has gone to pot anyway, so it doesn’t seem that useful to optimise for this case.