Paxos Uses Leaders (Multi-Paxos is the Parliament Protocol and Basic-Paxos isn’t)
A common misconception about the Paxos Algorithm is that it doesn’t use a leader. With this world view the Paxos algorithm is an enimic peer-to-peer algorithm which is impractical and it has to be extended with a separate flavour called Multi-Paxos to do anything useful. This is a back-to-front world-view which is often put up as a strawman by advocates of aggressively marketed alternatives.
Lets go to the sources to see what’s really going on here. One of the unfortunate historical accidents of the Paxos Algorithm is that the original paper The Part Time Parliament was written jokingly in the style of a fictional story. Lets skip over that for the minute and use the paper Paxos Made Simple which the same author wrote to demystify the original as the definition of what is and what is not Paxos. On page 6 in the section entitled “The Implementation” we find:
The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner. [Paxos Made Simple]
Then on pages 9 and 10 under the section “Implementing a State Machine” we have:
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. [Paxos Made Simple]
Aha! That’s clearly two algorithms and the second one is “Paxos For State Machines”! Well how does to it describe these complex state machines?
A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some se- quence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state. [Paxos Made Simple]
Er, it’s just state you update in a deterministic manner. Like, say, an in-memory list you update over the network with message passing. The kv-store demo of Trex is a replicated map which fits into this model which is about as “Hello World” as you can get for distributed systems. The sorts of things that are excluded from this definition are things like CRDTs. Apologies for labouring the point here but I another misconception I have come across is “Multi-Paxos is for complex finite state machines and most applications like a kv-store don’t need that complexity”. Amazingly the person arguing that point was actually citing the same paper so clearly they didn’t actually read it. They have it back-to-front; anything which isn’t covered by this FSM model is a level of complexity that most applications can, and probably should, avoid.
What is really going on here is that Lamport proved the correctness of a class of applications known as Paxos by stripping it down to a mathematical model that can be reasoned about. He called this “The Single-Decree Synod” in the original bad joke paper:
Paxon religious leaders asked mathematicians to formulate a protocol for choosing the Synod’s decree. The protocol’s requirements and assumptions were essentially the same as those of the later Parliament except that instead of containing a sequence of decrees, a ledger would have at most one decree. The resulting Synod protocol is described here; the Parliamentary protocol is described in Section 3. [The Part-Time Parliament]
If you find that statement confusing don’t worry about it its a bad joke; literally. A translation of this in my own words would be:
“In order to prove the correctness of the consensus algorithm for choosing a stream of commands we can first demonstrate the correctness of a mathematical model which chooses a single command. The mathematical model for selecting a single command can then be extended to the practical algorithm for selecting a stream of commands (Section 3) as long as the assumptions of the single command mathematical model are not violated.” – simbo1905
In order to justify my interpretation we can look at Section 3 entitled “The Multi-Decree Parliament” which says:
Instead of passing just one decree, the Paxon Parliament had to pass a series of numbered decrees. As in the Synod protocol, a president was elected. Anyone who wanted a decree passed would inform the president, who would assign a number to the decree and attempt to pass it. Logically, the parliamentary protocol used a separate instance of the complete Synod protocol for each decree number. However, a single president was selected for all these instances, and he performed the first two steps of the protocol just once. [The Part-Time Parliament]
To labour the point here folks both the original “The Part-Time Parliment” paper introducing Paxos as interesting to computer scientists because of its multi-degree algorithm; the parliament protocol. That and the clarification paper “Paxos Made Simple” both define Paxos as having a distinguished leader assigning sequence numbers to a stream of commands. Furthermore the distinguished leader only sends “prepare” messages when it assumes leadership; after that in steady state the distinguished leader streams only “accept” messages. He also says else where in the paper to collapse the roles and have all servers run all three roles of the algorithm. For a full explanation of that checkout the post on transaction log replication. You can also try out Trex which implements those techniques.
Note: later Leslie Lamport uses the term “Voting Algorithm” as a replacement for the phrase “Single Degree Synod”.
So what exactly is my beef here? Well it seems the practical concepts of the Paxos paper are notoriously hard to grasp. Yet the useful parliament protocol is actually mechanically simple. Yes its got some subtleties to the proof that it works and if you actually set out to write a distributed system with it you find a lot of complexity in the failure modes. Thats a problem with the problem; not the mechanically simple solution to the problem. Trex has well past a hundred unit tests around the core paxos library code. Some of those covers the areas that Lamport doesn’t define but says that you would need to make a working system; such as state synchronisation. Off the back of that the demo app gives you an embeddable strong consistency engine you can layer into your own application. In my experience Paxos made implementing distributed strong consensus straight forward; at no point was I confused as to what the code should do in any situation. This is because the part-time parliament algorithm is easy to reason about once you comprehend it’s relatively simple invariants.
I think a big part of the problem is that Paxos is badly taught. Lamport discovered the Single Synod protocol trying to prove the impossibility of distributed consensus then immediately used it to do something useful; state replication. So it was natural for him to write it up in chronological order. Clearly Paxos is a work of genius but given the confusion over it perhaps we shouldn’t be teaching the topic the way Lamport introduced it; I believe the world would be less confused if he had written the paper the other way around and discussed the parliament protocol in the first part of the paper then introduced the Single Synod protocol as an abstract model to prove the correctness of the useful algorithm which is the subject of the title of the paper.
If your a mathematician, academic computer scientist, or someone with a hard science education that focused on abstract proofs, then perhaps the way Lamport introduced the topic isn’t confusing. You must be in the minority though given all the confusion about the subject. If you are an educator then I implore you not to teach the topic that way around. First introduce “what we are trying to achieve” such as replicating the state of a table and introduce the multi-decree parliament protocol. Then teach “how do we prove that it is correct” with the Single Synod protocol. That is then the bridge into all the derivations which try to optimise the protocols which share the same foundations. Then you may give engineers going out to work in the brave new world of distributed computing the practical technique and theoretical toolkit that the original paper was trying to gift to the world. Its a shame it was all Greek to most people.