Just say NO to custom hardware for Paxos 

by simbo1905

Today’s Morning Paper on “Just say NO to Paxos overhead: replacing consensus with network ordering” was a thrilling disappointment. It’s always with both excitement and trepidation I read about new developments in distributed consensus; is today the day that I learn that Paxos is obsolete? The NOPaxos paper (“network ordered Paxos”) reviewed at the link above has a title which suggests a breakthrough. Unfortunately, it has far less general applicability, and far higher economic cost to implement, than “vanilla” Paxos.

The big selling point for NOPaxas comes from the performance charts near the end of the blog post below this statement:

NOPaxos itself achieves the theoretical minimum latency and maximum throughput: it can execute operations in one round trip from client to replicas, and does not require replicas to coordinate on requests.

The charts show that NOPaxos operates at within 2% of a none replicated solution. That totally blows away other Paxos results. On face value that looks to be a game changer. The problem is that to get those blistering results they had to nitro their not-so-secret sauce of “ordered unreliable multicast (OUM)” using a network appliance. This isn’t the only drawback when you read between the lines as it optimises for speed at the expensive of other generally desirable properties; but to me modifying the properties of the network is a deal breaker in a vast majority of corporate environments.

First a quick outline of OUM. Basically, you pass all messages through a single device and have it apply a global ordering to all messages. Why? So that all nodes can trivially process messages in the same order by respecting the global sequence. Okay that doesn’t sound too hard to comprehend, is easy to implement, and easy to test, so it sounds like a reasonable technique. It is even easy to intuitively understand why consensus with global ordering makes strong consistency faster. So what’s my objection to this apparently down to earth approach?

Well, it gives us a bottleneck and a single point of failure. “No problem”, say the authors, “you can do it in hardware”. They make the observation that you typically pass huge volumes of messages through a hierarchy of switches and the apex switches are not considered a bottleneck. Core switches can fail but rarely do and have well-established redundancy and failover. Switches are a bottleneck and potential point of failure today yet are trusted technology. New switch technology is coming on the market to do custom packet labelling within them. With a little imagination, one can see that a FPGA could easily write a 64bit counter into packets. Ergo put the global sequencing into the apex switch. Sorted? Good luck with that.

Update: Another concern I have is that state of the art for cloud networks are CLOS none-blocking networks were traffic in a data centre goes via a constellation of core network switches chosen at random for each TCP connection for fault tolerance and increased bandwidth. To route traffic to a single or chosen set of core switches that do specific ordering in a consistent manner may run totally against how cloud networks are currently optimised.

Why is this a problem? It isn’t in the slightest bit convenient in any big IT shop to get a setup where you can put custom logic into shared networking gear. Try asking to turn on this feature within a switch used by other, possibly many, existing applications in production. The owners of all those applications will be asked to sign-off on your request. Dream on.

You can expect to have to explain and rule out absolutely every other possible alternative, and every other possible design of your system before you get approval. Oh, but we can avoid that by buying, installing and running our own new kit? Dream on.

Buying your own hardware switches, or buying a dedicated network appliance, is counter to strategic IT policy in many large IT shops these days. I have worked in global firms that have multiple private data centres that are full of hardware. Clusters of ten thousand cores use to crunch data are normal in such organisations. Yet the amount of data and growth in the demand for computing power means they simply cannot afford to expand data centre capacity. Demand for space and power will always outstrip supply. They are trying to move to software-defined everything and Linux containers to increase density. New hardware requests required senior executive management sign-off years ago. These days they won’t even have an official mechanism for you to request buying hardware. Every chassis has been full for years and when any hardware is being replaced the policy is to buy generic high-density blades hosting software-defined platforms which in turn host software defined distributed applications.

I have worked with a different type of well-moneyed IT savvy organisation which has more prosaic technology needs. A fundamental requirement of getting any spend approval was that you would be deploying into the public cloud in a way that didn’t have any vendor lock-in. They already have large systems deployed into each major cloud and the state of the art for them is cross-cloud. Use of stock IaaS is not negotiable to get approval to build a solution. If someone suggested any spend on custom hardware the response would be very fast, very short, very negative and likely to contain an expletive.

I simply don’t know of any large organisations that have distributed systems needs where using a hardware solution isn’t counter to IT strategy to drive down costs. If small and medium-size firms don’t have such an IT strategy today I think it’s very likely they will in the coming years. Yes, low latency trading boutiques use custom hardware. Yes, Google installs atomic clocks on servers to solve some of its global database replication problems; yes I am sure they could take a hardware approach to build a better faster Chubby. But those exceptions prove the rule; only “special” firms that have a team of Wookies building a private Millennium Falcon see custom hardware as a viable solution.

Okay so if not hardware then software? A nice little Go app running on a stock container? Come on be serious. You are now a million miles away from “network switch” levels of reliability and performance. Running the sequencer in software on stock IaaS is likely to have an extremely long tail of latency under load and other activity happening on the same underlying hardware and shared network. You will also need a HA pair and automatic failover. You can use the Heartbeat project and other stock tools to do a fail-over. Remember though that this isn’t a stateless setup. You need to ensure that if the active sequencer dies and the standby sequencer takes over it doesn’t repeat the sequence numbers. Okay, you can use a timestamp of take-over and a local counter. Then you need to worry about clock skew. Google deploys atomic clocks of that but remember, you are not a Wookie and so you don’t install custom hardware. So you need to use a leader lease that is large enough to cover for clock skew which extends the length of an outage.

Oh and don’t forget to worry about split brains. The FLP impossibility result means that the slave node doesn’t know the whether the master is dead or that messages are being dropped. Traditionally you deploy a STONITH approach. Yesteryear that would have been remote controlled power switch for the node that wants to become the leader to power cycle the other node to force it to switch to being a slave. Look in the mirror and say three times “I am not a Wookie”.

All of the HA stuff in the previous paragraph are the type of things that network kit does in hardware. You should avoid trying to mimic it in software, on software-defined servers, in a software-defined network, where you need to carefully consider what happens when messages get delayed or dropped. Today it is simply good IT strategy to have cattle, not pets. A stateful HA pair as part of any solution is a first-class example of a pet solution. No-one should welcome a design which needs such a thing if you can possibly re-engineer the system not to need it. The future is sheep, not cattle, such as Kubernetes orchestrated containers.

Finally consider that Paxos was such an innovation exactly because it didn’t need custom hardware and is proven to be safe to run over cheap, redundant, asynchronous and unreliable networks. Prior to Paxos, it was believed that you needed a very reliable networking solution to achieve consensus. Back then everything deployed was on dedicated hardware. Almost 20 years later we are proficiently building very large distributed systems on virtualized IaaS using virtualised networks on generic hardware. From both the historical perspective and current good IT practice it simply isn’t a step forward to suggest using network-based sequencing to boost Paxos. It is really taking two steps backwards to introduce a global sequencer which is another point of failure and which needs a load of effort to make highly available. Thanks, but no thanks.

Edit: Corfu uses a global sequencer to avoid contention. Yet it can run on vanilla hardware and uses very simple techniques to achieve extreme performance of 200k 4K append per second. Yet at its core, it uses vanilla Paxos to distribute a strong configuration as a “black box” component. Corfu’s blistering throughput as a globally ordered distributed log makes the idea of Network Ordered Paxos obsolete for most of the applications where you might be looking to “make Paxos run faster”.