The Trial of the Paxos Algorithm
The Paxos consensus algorithm gets a bad rap. I think there is something quite beautiful, simple and intuitive hiding within Paxos: collaboration. Raft, which was designed as a replacement, restores the master-slave concept to supremacy. That may be a pragmatic thing to do, yet I do not consider the master-slave model elegant (note: this is a subjective statement of my aesthetic taste, not a statement of utility). Given all the bad press Paxos gets, I worry that people will skip over it and miss out on enjoying its elegance before moving on to alternatives. They may then reject Paxos for the wrong reasons, or avoid it in scenarios where it may be better suited than the alternatives.
I am probably going to regret this, but I am going to make the case for the defence of Paxos. I will do this in three parts. First I will have a go at explaining why I think Paxos is cute, elegant and exactly the sort of algorithm you want to truly get your head around; I will be a character witness in the court of public opinion. Second I will argue that the accused, the Paxos Algorithm, was framed and put on trial for a crime it did not commit. Then we will cross-examine some star witnesses for the prosecution and bring in our own celebrity expert witness. Finally I will say that although Paxos should be acquitted of all charges, that doesn’t imply you should be using it for every problem you have at hand.
Before we begin, if you want a good introduction to Paxos I would recommend the Wikipedia Paxos entry, then The Paper Trail description of Paxos. The authoritative source of what is or isn’t Paxos is Paxos Made Simple, which explains on page 10 that multi-Paxos is Paxos, a fact that some folks seem to overlook. A good round-up paper on what Paxos is used for is Consensus in the Cloud: Paxos Systems Demystified. Warning: there is an ambiguity in Paxos Made Simple that Lamport discusses here, in a question and answer on Stack Overflow that you can read here. Lamport has also given lecture videos walking through Paxos as a formal spec; if you are thinking of implementing Paxos you should follow the formal TLA+ specification walked through in those lectures.
First up: why do I think Paxos is cool? Well, I think it can be intuitively comprehended as long as you are clear about the problem it solves. That problem is coming to a consensus about a sequence of committed events, across a collection of computers, via asynchronous messages, without reliable clocks or a perfect network.
This is useful as you can do things like replicate a finite state machine (e.g. a database) across a cluster by making the committed events the operations applied to the machine. This is a hard problem to solve when the network does strange things. The messages about proposed or committed events may fail, be reordered, be indefinitely delayed, or be delivered multiple times, yet the logical Paxos algorithm will not produce inconsistencies; only bugs in your implementation can.
Your cluster nodes may crash (fail-stop) or pause indefinitely (fail-recover) due to VM or disk stalls or temporary connectivity loss. You don’t have to figure out everything which could go wrong, such as random network foobars, firewall configuration errors only letting through traffic in one direction, network hardware failures, or combinations of all of the above. In the worst case you will fail to make any progress until the interruptions stabilise; but unless you have bugs you will not get inconsistencies of committed work across the cluster. Paxos can then self-heal the cluster as long as a majority of servers remain able to communicate.
Here is what I believe are the cool and intuitively comprehensible bits of how Paxos does its magic (the details and subtleties are discussed at the links above):
- Nodes collaborate on agreeing an ordered sequence of events using ballot numbers. Ballot numbers must be unique to each node but don’t need to be contiguous. The ordering ensures that every node can fix the same value even when messages get lost, delayed or reordered. This approach of distributed counters acting like a global clock removes the need for an authoritative leader, synchronised hardware clocks or any network guarantees.
- A node which is acting as the leader, and which proposes the next event number and payload to a majority, learns whether it is safe to commit, and which event payload it is safe to commit, from affirmative responses from a majority of nodes. Rather than holding explicit elections, any node which thinks the leader has failed or stalled, and which steps up to be a leader, is implicitly acknowledged as the leader by hearing that a majority accept its proposal to set the next highest event number and payload. This means that leader election is baked into the algorithm, something many people simply miss even though it is explicitly covered in the original papers as described here.
- A node acknowledged as the leader may discover the partially complete work of another node which thought it needed to lead and which may currently be unreachable. If it attempted to clobber that work in its commit messages, and the other leader suddenly reappeared on the network issuing racing commit messages, there could be race conditions and inconsistencies. Rather than attempt to clobber the possibly delayed commit work of a temporarily unreachable leader, the newly acknowledged leader opts to collaborate, and commits the event payload proposed by the previously acknowledged leader.
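Those three bullets can be sketched in a few lines of code. Below is a minimal, illustrative Python sketch of the acceptor rules and the new leader’s “collaborate, don’t clobber” rule; the class and function names are my own invention and this is nothing like a production implementation:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Ballot = Tuple[int, int]  # (counter, node_id): unique per node, totally ordered

@dataclass
class Acceptor:
    promised: Ballot = (-1, -1)                    # highest ballot promised so far
    accepted: Optional[Tuple[Ballot, str]] = None  # last (ballot, value) accepted

    def prepare(self, ballot: Ballot):
        """Phase 1b: promise to ignore lower ballots and report any prior work."""
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)      # may reveal another leader's value
        return ("nack", None)

    def accept(self, ballot: Ballot, value: str):
        """Phase 2b: accept unless a higher ballot has since been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def choose_value(promised_work, my_value):
    """A new leader must adopt the highest-ballot value reported in the
    promises it collects; only if no acceptor reports prior work may it
    propose its own value."""
    prior = [w for w in promised_work if w is not None]
    if prior:
        return max(prior, key=lambda bv: bv[0])[1]
    return my_value
```

The essential “magic” lives in `choose_value`: a leader that gathers promises from a majority and sees another leader’s partially completed value finishes that work rather than clobbering it.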
Re-read that last point, as it is the “magic”. IMHO once you understand that last simple concept you comprehend the essence of the algorithm. I think the collaboration at the heart of the algorithm is awesome. High fives and big props to its inventor Leslie Lamport. The actual mechanical steps of the algorithm, or the mathematical proof that it all works, may or may not be beautiful in your eyes depending on your aesthetic sensibilities. Yet the fact that it was thought up a quarter of a century ago, has been given a proof, and has been studied extensively is, we can all agree, jolly helpful.
Second up: what’s wrong with Paxos, and what doesn’t it do, that may be a turn-off or a hassle to implementers? What crimes can it be accused of?
- Most flavours of Paxos won’t work if nodes have bugs such that they tell lies about what they had previously seen or stated (rather cryptically referred to as Byzantine failures). Bugs are, well, bugs, and invalidate your warranty.
- It won’t work if nodes crash then recover without having made durable to disk which proposals were seen and which events were committed. There may be other algorithms optimised for in-memory-only operation which are more competitive for your application. Only periodically using fsync to force the disk write of the commit log is a strategy used by NoSQL databases which offer eventual consistency, not strong consistency. Paxos can only guarantee strong consistency under crashes if you force the disk buffer to flush. Flushing the disk is expensive, so you may need to batch commands into groups with one accept per group to achieve high-volume throughput. In effect you cannot have both low latency and high throughput beyond what the disk offers you. Of course these days you don’t need to use an old-fashioned disk; you can use durable memory.
- Paxos cannot make progress if no majority of nodes can reliably communicate with one another, as it relies on consensus. Paxos is designed to give consistency under network partitions, and the CAP theorem says you cannot have availability (make progress) during certain network disruptions. As Paxos is based on collaboration, not an authoritative master, one particular failure to make progress is duelling leaders, which resembles a form of live-lock. Since this is a well-known failure mode, the literature has plenty of recommendations about how to break a leadership fight, such as randomised timeouts or backing off. Those are simple techniques that are easy to implement. With random timeouts and backoffs you can be sure your cluster will fail over to a new leader in a matter of seconds, yet you cannot guarantee it will happen with low latency. If you need low-latency recovery you can of course add a custom leader-election mechanism that runs before any node attempts to issue a prepare message; for example see this paper on pre-voting with Paxos.
- The original specification of the algorithm doesn’t cover operational matters like changes of cluster membership or how to catch up a node which has been unreachable for a long while. Dynamic cluster membership is actually not that hard as it can be simply managed through the consensus algorithm itself so you “layer it over the top”. This is mentioned at the end of Paxos Made Simple.
- It doesn’t have a dominant master, so when the network goes haywire in production and lots of exceptions are kicked back to clients, you are going to have your work cut out convincing yourself everything behaved as expected. If you are looking for bugs during crash testing you are going to have to aggregate different logs, and you will need tooling to check that no clients were given stale reads, false-positive write acks, or false-negative nacks.
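The randomised-timeout fix for duelling leaders really is as simple as the list above suggests. A sketch in Python of exponential backoff with full jitter before retrying a prepare message; the function name and constants are illustrative, not taken from any real system:

```python
import random

def next_backoff(attempt, base=0.05, cap=2.0, rng=random):
    """Return a random delay (seconds) in [0, min(cap, base * 2**attempt)]
    for a would-be leader to wait before retrying its prepare message.
    The jitter makes it unlikely that two duelling leaders wake up and
    issue competing prepares at the same moment again."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Passing an explicit seeded `random.Random` instance keeps the delays reproducible in tests, while production code can simply use the module-level generator.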
There end the opening arguments of the prosecution against the accused Paxos Algorithm. Now let’s hear from the prosecution’s celebrity witness: Google. This upstanding and community-oriented witness is well respected globally as a home for technocratic gurus. They, specifically the Google Chubby team, assert as testimony against the defendant that building and testing a reliable fault-tolerant database is hard. The judge stares. One of the jurors coughs. The press gallery hangs on every word, desperate to hear the dirt on the defendant to splash across the news headlines.
Specifically they mention writing a state machine compiler, which helped them expand the messages and states the servers go through as they added features to operationalise their system. They mention that they had to test the heck out of it using randomised yet repeatable tests to find bugs. These days you can use open-source tools to do the randomised testing. Those modern tools have proven helpful at finding bugs in distributed databases which use many alternatives to Paxos to achieve different levels of performance, availability or consistency across nodes. What no one is going to say is that the mechanical process of the Paxos algorithm is hard. Subtle, yes. Devious to prove, yes. Subject to duelling leaders, yes. Yet it is very well studied and mechanically very simple. You simply have to implement the algorithm as defined by Lamport’s TLA+ spec.
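“Randomised yet repeatable” deserves unpacking: drive everything from a seeded pseudo-random generator, so any failing run can be replayed exactly. A hypothetical sketch (the fault names are invented for illustration, not from the Chubby paper):

```python
import random

# Illustrative fault types a test harness might inject between nodes.
FAULTS = ["drop", "delay", "duplicate", "partition", "heal"]

def fault_schedule(seed, steps=100):
    """Generate a deterministic sequence of fault injections from a seed.
    A failure first seen when running with seed 42 can be reproduced and
    debugged by simply re-running the harness with seed 42."""
    rng = random.Random(seed)
    return [rng.choice(FAULTS) for _ in range(steps)]
```

The point is not the fault list but the seed: randomness explores the state space, while determinism makes every discovered bug a repeatable test case.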
The defence attorney begins her cross-examination. “So are you saying Paxos deceived you into thinking it should be straightforward, then made it too hard?” “Well…”, says the witness, “I wouldn’t say it deceived us. We are really smart guys with lots of experience, not some bunch of students.” “Yet you now think you would have been better off using 2- or 3-phase commit?”, asks the attorney. “No, definitely not, we know all about that and that’s what traditional databases use,” replies the witness. “But isn’t it standard practice to use only proven algorithms when writing critical software infrastructure? You were amongst the first to publish experience of Paxos, so you were an early adopter and were taking a risk; isn’t that simply reckless?”, asks the attorney. “No, certainly not! We had an existing third-party database and it couldn’t cope with the network partitions. We needed state of the art, and that was Paxos.”
“So what exactly is your complaint?” asks the defence attorney. “Well…”, says the witness, “we are really smart guys, and it took us a long time, lots more iterations and releases than we had planned for. Our boss got angry and shouted in meetings, but we had to do all this extra work to write a finite state machine compiler, and this really cool distributed, randomised, repeatable testing tool to run on multiple servers concurrently. That’s a lot of work. When you use a new language you get new tools like a compiler, but we had to write our own stuff to use the new algorithm. We think folks doing fault-tolerance research should be doing that sort of thing so we don’t have to. In the end though we nailed it, and we only had problems when ops messed up deployments. We replaced them with a small shell script, so now we have no problems,” concludes the prosecution witness. “Thank you, no further questions,” says the defence attorney. The prosecution attorney frowns and shuffles his papers.
It seems the real villain is that the state space a distributed database can go through during network partitions is extremely complex, and it is hard to reason about no matter how simple your design. Experts assert that distributed systems tend to need actual, not simulated, distribution to flush out their bugs. Some database products which Jepsen has shown to be broken don’t use an algorithm which can be proven to work under network partitions; others do use proven algorithms, but had implementation bugs. You might think the Google Chubby team lucky to have had Paxos, sparing them a cycle of redesigns due to design flaws while inventing and evolving an equivalent algorithm. Such invention would have involved just as much debugging and testing, with the additional cost of the invention itself, and would have risked a flawed design, as it would not have had a peer-reviewed proof of correctness. If Jepsen had existed they could have used that; database development teams now test with it rather than release software with bugs.
After the alleged victim’s testimony, which was so disastrous for the prosecution’s case, they call their expert witness: Stanford University. They, specifically the Stanford Raft team, assert as testimony against the defendant that students taught lectures about both Paxos and their alternative, Raft, scored higher on tests about Raft than about Paxos. A loud murmur rises from the crowd jammed into the public gallery. “Order! Order!” shouts the judge, banging his gavel on his desk.
The Raft paper has some great ideas about managing cluster membership using “joint consensus”. It has some great information about the use of randomised timeouts when trying to elect a new leader. Where it drops the ball is in trying to make comprehensibility of the core algorithm a key discriminator when selecting a design for a fault-tolerant distributed database, without presenting any convincing evidence to back up that position.
They do not demonstrate that the measure they used, students taking tests, is at all correlated with making a sound fault-tolerant database. Maybe every student who scored higher on the Raft test is incapable of making a robust system, and only the ones who scored higher on the Paxos test can cut the mustard and finish the job properly? That sounds a very unlikely idea; but my point is that the Raft paper implies the opposite, which is just as unlikely and unproven.
What is disingenuous about the Raft paper is the work they could have done, but didn’t, to prove a point about the utility of the algorithm as opposed to perceptions about its subtlety. They could have written the chassis of a simple system (with, say, a binary log API and a message-passing API), had two sets of students implement a simple distributed database using the two algorithms, then measured whether one group produced statistically more or fewer lines of code, or statistically more or fewer bugs, then run randomised network-partition simulations to see if the students’ solutions had protocol bugs. Is the omission because Paxos is a compact and mechanically simple algorithm, so it would score well on such objective measures? So the attack is that it is “subtle”, and somehow that’s the big problem? Isn’t that irrelevant unless you provide evidence that this affects utility?
Had they done like-for-like objective comparisons then perhaps they could have performance tested two like-for-like systems and demonstrated that theirs was faster as it was optimised for the use case? Wouldn’t that have been a better way to get people to use your new design than throw FUD at the established algorithm? Is it just poor research or is it also dubious competitive sales tactics?
The sloppiest statement in the Raft paper is that the “Paxos architecture is a poor one for building practical systems”. Huh? Paxos is a consensus algorithm, not an architecture. Actually it’s a family of consensus algorithms: Basic Paxos, Multi-Paxos, Cheap Paxos, Fast Paxos, Collapsed Multi-Paxos, cousins, and close friends, all packed into court this day. What factual evidence do they cite for this claim which puts old Grandpa Paxos on trial? The paper by the Google Chubby team, who just gave witness evidence!
The Raft paper claims that the typical experience of Paxos is poor, yet doesn’t cite many other examples to demonstrate that the Google Chubby experience was typical. It is like a salesman telling you that people really struggle with the incumbent product because it lacks the features his new product has, yet not giving you many references so you can check for yourself whether that struggle is indeed typical; he quotes only a single publicised case where his statement seems to fit in with some bad press.
The defence attorney paces in front of the jurors as she cross-examines the prosecution’s expert witness: “I smell a rat here. You seem to have heard the so-called ‘victim’s’ testimony, and then fitted up thin evidence and a weak theory to blame poor Paxos, because you want to promote your work as the heir to the throne of best distributed database architecture. Do you really confuse an algorithm for a software architecture? Or is it just a convenient dishonesty?” “Objection!” yells the prosecution attorney. “Sustained,” says the judge. “Withdrawn,” says the defence attorney, but she knows the truth of it cannot be easily washed from the jurors’ minds. She continues: “Have you applied for a research grant or a job at Google?” “Objection! Objection!” yells the prosecution, but the defence attorney is already walking back to her seat with a look of grim satisfaction on her face.
When picking any design it is a very good idea to use the KISS principle. Yet the core algorithm becomes a tiny factor when building something as complex as a fault-tolerant distributed database. You are going to face extreme complexity, as it is inherent in the problem space. The smart money would recommend an established algorithm with a proven track record over a new one which appears to score better on the KISS scale. Perhaps that is why the Raft paper takes a wild swing at Paxos? Yet when it comes to keeping it simple, the mechanics of the Paxos algorithm are very simple. So just what is it that the Raft team is accusing Paxos of? The accusation seems to be that it is “subtle”. To me that would be a bad thing if it meant it were easy to have a logical bug in your implementation which is hard to pick up in a code review. Yet Paxos is mechanically a very simple algorithm to code. It comes with a TLA+ specification that you can check your code against. It also has a proof that it is robust against everything real life can throw at your distributed system. So is it that the mathematical proof is subtle? Raft comes with a mathematical proof also. I would go for something which takes genius to invent yet is mechanically very simple over anything else, every time, period.
Next the defence calls their expert witness: Jepsen Knossos. The court is all a-murmur as the living legend takes the stand, wearing mirrored shades and looking like a superstar DJ. “Just how do you go about debunking the data safety of so many distributed databases?” asks the defence attorney. “It’s no secret. It all goes onto the rap sheet when I send them down. I read their documentation to see what algorithms and CAP theorem characteristics they claim, devise a test client which can run against many nodes in a cluster, and run a randomised test applying some nemesis routines. The nemesis routines isolate nodes, or groups of nodes, or create other random fail-recover scenarios. Then I run the numbers on the responses seen at the client to see if the results are linearisable,” replies the detective.
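That “run the numbers” step can be illustrated with a toy checker. Below is a brute-force linearisability check for a single register, sketched in Python; the real Knossos checker is vastly more sophisticated and efficient, so treat this purely as an illustration of the idea:

```python
from itertools import permutations

def linearisable(history, initial=None):
    """history: list of (invoke_time, return_time, kind, value) operations.
    Search for a sequential order that (a) respects real time -- an op
    that returned before another was invoked must come first -- and
    (b) is legal for a single register where every read sees the latest write."""
    for order in permutations(history):
        # (a) real-time precedence: a later-placed op must not have
        # returned before an earlier-placed op was invoked
        if any(order[j][1] < order[i][0]
               for i in range(len(order)) for j in range(i + 1, len(order))):
            continue
        # (b) sequential legality against a single register
        reg, legal = initial, True
        for _inv, _ret, kind, val in order:
            if kind == "write":
                reg = val
            elif reg != val:          # a read must return the current value
                legal = False
                break
        if legal:
            return True
    return False
```

A stale read, i.e. a read of an old value that starts strictly after a write has returned, is exactly the kind of client-visible history this rejects.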
“And what types of problems do you find?” asks the defence attorney. “Well, design flaws are the worst kind, made by bozos who think the CAP theorem doesn’t apply to them, like students who failed the final exam. Then there are speed freaks who cannot put down the glass pipe and have to tweak a safe algorithm until they break it. Sometimes you see liars whose marketing enthusiasm doesn’t match the code they got around to writing. Often it’s naïfs who think a network error or timeout means an operation didn’t happen, when it’s an indeterminate result. Then finally just the regular schmucks who think their unit tests are going to save their hide in a cage fight with a nemesis,” is the calm reply in a deathly quiet courtroom.
“What are your thoughts about the Google testimony?” asks the defence. “Well,” says the expert witness, “any landing you can walk away from is a good landing when you are trying to build a highly available, fault-tolerant locking service which can survive losing multiple nodes, stay consistent, and make progress. That’s hard work to build and hard work to test. Of course these days you can just hire me to find the bugs for you. My tools are open source but the talent has bills to pay. My number is in the phone directory, so Call Me Maybe.”
Jepsen has proven repeatedly that simple designs with bugs show massive data loss during temporary network partitions. You need rigorous testing across the failure space of your cluster. You need expensive verification as described in the Chubby paper or as coded into Knossos. Anyone investing that much effort should do R&D on a couple of alternative designs and test both robustly, not pick one based on a beauty pageant judged by the students in the Raft paper.
Finally: the Paxos Algorithm is free to leave the court, cleared of all charges. Does that make it a magic solution to all distributed computing problems? Is it superior to algorithms and approaches invented decades after it? Hardly likely. All it is is a cute peer-to-peer, collaborative algorithm for distributed consensus on an ordering of events, made before the world knew that was possible on commodity hardware. Good luck to Raft, as it seems really promising and pragmatic, but it’s never, ever going to be as cool, elegant or original as Paxos is in my eyes.
Update: In 2019 Leslie Lamport gave a lecture, in this video, where he walks through the TLA+ spec for Paxos. Watch that first if you are ever considering implementing the algorithm. Sadly there is some negative debate in the two SPTDC 2019 lecture videos that Paxos “uses magic” or should be explained “round-by-round in English”. In that lecture and in the Raft paper, people are expressing that they don’t like not having a simple intuitive explanation of Paxos. Yet to me there is one: any new leader must collaborate by completing the work of the dead leader. Overlapping quorums ensure that the new leader sees the incomplete work of the old leader. Ballot numbers act as a “distributed clock” to order propose messages at every node. This ensures the invariant that once a value is chosen it cannot be changed. It took genius to invent it and prove it correct, yet it is a mechanically simple and very elegant little algorithm that can be comprehended by the average engineer.