TRex: A Paxos Replication Engine (Part 1)

by simbo1905

A previous post laid out how the Paxos parliament algorithm as described in the 2001 paper Paxos Made Simple can be applied to state replication across a cluster of servers. This post is the first part of a series that will give an overview of an embeddable Paxos state replication engine called TRex. TRex is a replication engine, not a dinosaur, implemented in Scala. It can be easily layered over the top of an existing service to provide strongly consistent replicas with automatic failover to achieve fault tolerance. Alternatively, you can bake the low-level library code into your application and use your own customised IO logic to give peak performance. You can fork the accompanying code over on GitHub.

Overview

A full description of the techniques used in TRex is given in the post Cluster Replication With Paxos. The main features of TRex are:

  • Provides Paxos-based replication as an embeddable messaging layer surviving F failed nodes with automatic primary server failover when used in a cluster of 2F+1 nodes
  • Strongly consistent majority writes with automatic node catch-up. In the future, TRex will support both scalable weak consistency replica reads and strong consistency replica hand-off reads
  • A library module containing only the algorithm modeled as immutable message value objects, algebraic data types for node state, and pure functions, all covered by extensive unit tests (see the sketch below this list)
  • A “turn-key” embeddable server module using the library with pluggable durability and pluggable network capabilities. This may be re-used wholesale to create a quick cluster, or customised to meet the specific needs of the host application’s protocols and stack
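
To make the library module concrete, here is a minimal sketch of how a pure Paxos core can be modelled with immutable message value objects, an algebraic data type for node roles and a pure transition function. Every name below (`PaxosSketch`, `BallotNumber`, `NodeState`, `handle` and so on) is an assumption made for this post rather than TRex's actual API.

```scala
// Illustrative sketch only: these names are assumptions for this post, not TRex's actual API.
object PaxosSketch {

  final case class BallotNumber(counter: Long, nodeId: Int)

  sealed trait PaxosMessage
  final case class Prepare(ballot: BallotNumber) extends PaxosMessage
  final case class Promise(ballot: BallotNumber, highestAccepted: Option[Accept]) extends PaxosMessage
  final case class Accept(ballot: BallotNumber, slot: Long, value: Vector[Byte]) extends PaxosMessage
  final case class Accepted(ballot: BallotNumber, slot: Long) extends PaxosMessage

  sealed trait Role
  case object Follower extends Role
  case object Recoverer extends Role
  case object Leader extends Role

  // The durable state of one node: its role, the highest ballot it has promised,
  // and the values it has accepted per log slot.
  final case class NodeState(role: Role, promised: BallotNumber, accepted: Map[Long, Accept])

  // A pure transition function: given the current state and an inbound message it returns
  // the next state plus any outbound messages. It performs no IO.
  def handle(state: NodeState, msg: PaxosMessage): (NodeState, Seq[PaxosMessage]) = msg match {
    case Prepare(b) if b.counter > state.promised.counter =>
      val highest = state.accepted.values.toSeq.sortBy(_.slot).lastOption
      (state.copy(promised = b), Seq(Promise(b, highest)))
    case Prepare(_) =>
      (state, Seq.empty) // stale prepare: ignore (tie-breaking on nodeId is elided in this sketch)
    case _ =>
      (state, Seq.empty) // accept and commit handling is elided in this sketch
  }
}
```

Because the transition function performs no IO it can be driven directly by plain unit tests, which is what makes the extensive test coverage of the library module practical.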

The following two diagrams show how TRex may be layered into an existing application architecture. The first diagram sketches an arbitrary pre-existing generic service. It is deliberately vague as clients could be fat applications, single page web applications or dependent micro-services, or mixtures of them using any network protocols or frameworks. The salient point is that there is a single server without fault tolerance and without any automatic service failover to a replica should the single server lock up, crash or lose network connectivity.

[Diagram: trex1a1]

The second diagram below shows how TRex can be layered into the pre-existing application to create a replicated cluster running the service on several machines. Physically TRex is library code that runs in the same JVM as each replicated copy of the service. Logically TRex intermediates between the clients and the pre-existing service to replicate the client traffic across the cluster over UDP.

TRex transparently elects a distinguished leader and the TRex client driver routes traffic to that node. Only when a piece of client traffic has been acknowledged by a majority of nodes is it delivered in-memory to each replicated copy of the service. If the leader goes dark the cluster automatically fails over to a replica and the client driver automatically routes traffic to the new leader. If the failed leader then comes back online it automatically rejoins the cluster as a replica and catches up with the new leader.

[Diagram: trex1a2]
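
As a rough illustration of the delivery rule above, a log slot is only handed to the host service once a majority of the 2F+1 nodes have acknowledged it; F+1 acknowledgements are enough. The helper below is a sketch with invented names, not TRex's code:

```scala
// Sketch of the majority rule: in a cluster of 2F+1 nodes a slot is delivered to the
// host service only after F+1 acknowledgements. Names are invented for this post.
object MajoritySketch {
  def majority(clusterSize: Int): Int = clusterSize / 2 + 1

  def readyToDeliver(ackedByNodes: Set[Int], clusterSize: Int): Boolean =
    ackedByNodes.size >= majority(clusterSize)

  // e.g. in a 5 node cluster (F = 2): readyToDeliver(Set(1, 2, 3), 5) == true
}
```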

As TRex is an implementation of the Paxos Parliament protocol it is safe in the face of messages being lost, repeated or re-ordered, and cluster consistency is guaranteed even under arbitrary network partitions. This makes low overhead UDP an acceptable protocol for intra-cluster replication. As long as a majority of nodes are able to establish connectivity long enough to elect a leader the service will be available.
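
Because the algorithm tolerates lost, repeated and re-ordered messages, a fire-and-forget datagram send is all the transport needs to do. The fragment below sketches that idea with the JDK's `DatagramSocket`; it is not TRex's actual networking code and it leaves message serialisation abstract.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}

// Minimal sketch of fire-and-forget UDP for intra-cluster traffic. Not TRex's actual
// transport: serialising a Paxos message to bytes is assumed to happen elsewhere.
object UdpSketch {
  def send(socket: DatagramSocket, host: String, port: Int, bytes: Array[Byte]): Unit = {
    val packet = new DatagramPacket(bytes, bytes.length, InetAddress.getByName(host), port)
    socket.send(packet) // no retries or acks here: the algorithm itself copes with loss and duplication
  }
}
```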

Will TRex Paxos Work For You?

There is no known algorithm which performs faster than Paxos whilst giving its safety guarantees. Yet there has been a perception that Paxos may be too slow for replication workloads. Many people may be surprised to learn that the additional consistency guarantees of Paxos come with relatively little overhead. The Gaios paper presents evidence that any performance concerns about the algorithm are unfounded. Gaios uses Paxos to create a replicated filesystem with strongly ordered replica reads. The Spinnaker paper also dispels the performance myth by forking Cassandra and swapping in Paxos for replication; it measured about a 5%-10% increase in overhead for writes but faster strong reads. High-performance applications can choose to use the core algorithm library of TRex, which only implements Paxos. The application logic itself can provide custom IO code to get optimal performance.
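
The shape of that pairing is roughly: apply the pure transition function, force the new state to a journal, then send the outbound messages. The traits below reuse the hypothetical names from the earlier sketch and are illustrative assumptions, not TRex's actual interfaces.

```scala
// Illustrative only: reuses the hypothetical PaxosSketch names from the earlier sketch
// and invents Journal / Transport / PaxosDriver for this post; not TRex's interfaces.
trait Journal {
  def save(state: PaxosSketch.NodeState): Unit // must force the disk before acknowledging
}

trait Transport {
  def send(msg: PaxosSketch.PaxosMessage): Unit
}

class PaxosDriver(journal: Journal, transport: Transport, initial: PaxosSketch.NodeState) {
  private var state = initial

  def onMessage(msg: PaxosSketch.PaxosMessage): Unit = {
    val (next, outbound) = PaxosSketch.handle(state, msg) // pure: no IO inside the algorithm
    journal.save(next)                                    // durability before visibility
    state = next
    outbound.foreach(transport.send)                      // e.g. over UDP as sketched above
  }
}
```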

When the host application uses optimistic locking or compare-and-swap semantics there are no safety concerns with client reads bypassing Paxos. You can trivially read committed data directly from the replicas. If the host service semantics are not safe under stale reads then the read work can be offloaded to replicas via the master. This can be done whilst maintaining strict linearizability with respect to writes, even under network partitions, by using hand-off strong reads. These two techniques let you offload your read load across the replicas, improving the scalability of an existing read-dominated service.
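
To illustrate those two read paths, and only as a sketch with invented names, reads that are safe under staleness can go straight to any replica while linearizable reads are routed via the leader, which orders them against in-flight writes before handing them off:

```scala
import scala.util.Random

// Sketch of the two read paths described above; the types and routing are invented for this post.
sealed trait ReadConsistency
case object WeakRead extends ReadConsistency      // fine when the service uses CAS / optimistic locking
case object StrongHandOff extends ReadConsistency // the leader orders the read against in-flight writes

object ReadRoutingSketch {
  def route(consistency: ReadConsistency, leader: String, replicas: Seq[String]): String =
    consistency match {
      case WeakRead      => replicas(Random.nextInt(replicas.size)) // committed data from any replica
      case StrongHandOff => leader // the subsequent hand-off from leader to a replica is elided here
    }
}
```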

Of course running all your data through a Paxos engine for bullet-proof safety may not be necessary. Many applications only need some data, in particular meta-data such as shared configuration within a cluster, to be durable on a majority of nodes. When such data has low write volumes and is read often even an unoptimized Paxos implementation should work well. The typical pattern is to “outsource” strong consistency of such meta-data to an external coordinator service. That isn’t a silver bullet as the extra network hops and opportunities for partitions, along with bugs in correctly observing the external state over a failing network, can lead to split-brains, lock-ups and other production outages. Insourcing the strong consistency by embedding a Paxos engine for your meta-data may be a good option for you. Applications that want to go beyond low volume writes can scale up and scale out with techniques like fast dedicated journal disks, key sharding, and write batching.
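
Write batching, the last technique mentioned above, simply packs many small commands into one consensus value so that a single round of messaging and a single disk force covers the whole batch. A rough sketch, where the size limit and framing are assumptions:

```scala
// Sketch of write batching: pack small commands into batches so one consensus round (and
// one disk force) covers many client writes. The 64KB limit and framing are assumptions.
object BatchingSketch {
  def batch(commands: Seq[Array[Byte]], maxBytes: Int = 64 * 1024): Seq[Seq[Array[Byte]]] =
    commands.foldLeft((Vector(Vector.empty[Array[Byte]]), 0)) {
      case ((batches, size), cmd) if size + cmd.length > maxBytes =>
        (batches :+ Vector(cmd), cmd.length)                        // start a new batch
      case ((batches, size), cmd) =>
        (batches.init :+ (batches.last :+ cmd), size + cmd.length)  // append to the current batch
    }._1.filter(_.nonEmpty)
}
```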

Use Cases

An immediate use case satisfied by TRex is to take a single service and replicate it across a cluster of three or five nodes so that the service can remain up during the failure of one or two nodes respectively. If you have a micro-service which is used by many other services such that it is critical to your platform, say a user authorisation micro-service, you can wrap it with TRex and get instant “turn key” high availability. When you want to update your service simply do a rolling restart and bounce the leader last. This will induce a failover onto a replica running the upgraded code. Doing deployments as an induced failover gives you higher confidence that the cluster failover mechanism works. A near-term target of TRex is to be able to distribute read-only load across a cluster to give scale-out reads in addition to crash-safe majority writes.

A more advanced use case is to use TRex to remove a dependency on an external cluster configuration service, locking service or leader election service. The problem with any “consistency-as-a-service” architectural pattern is that lock-ups and split-brains can easily be introduced by bugs on the application code paths observing the external high consistency service. Having encountered such production outages this author considers consistency-as-a-service to be an anti-pattern. By embedding TRex into your application you can “in-source” rather than “out-source” consistency to provide embedded strongly consistent meta-data and distributed locks. Strong consistency is a primitive upon which you can build advanced capabilities such as leader election. Removing the external dependency both simplifies your deployment topology and reduces the number of external moving parts where bugs and network glitches can trash your system.
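
To make embedded locks concrete, a distributed lock can be expressed as commands run through the replicated log so that every node applies them in the same order and agrees on the holder. The command names and lease field below are invented for this post, not TRex's API:

```scala
// Illustrative sketch: locks as deterministic commands applied in log order on every replica.
// The command names and the lease field are invented for this post, not TRex's API.
sealed trait LockCommand
final case class AcquireLock(name: String, owner: String, leaseMs: Long) extends LockCommand
final case class ReleaseLock(name: String, owner: String) extends LockCommand

object LockSketch {
  // Pure and deterministic, so every replica that applies the same log agrees on the holders.
  def applyCommand(holders: Map[String, String], cmd: LockCommand): Map[String, String] = cmd match {
    case AcquireLock(name, owner, _) if !holders.contains(name)        => holders + (name -> owner)
    case ReleaseLock(name, owner) if holders.get(name).contains(owner) => holders - name
    case _                                                             => holders // contested acquire or foreign release: no-op
  }
}
```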

An even more advanced use case is to use TRex as a strongly consistent core to bootstrap eventual consistency engines. TRex can handle atomic broadcast of ring membership and other low volume cluster-private meta-data, and off the back of that your cluster can use weaker eventual consistency algorithms to handle client work. There is no reason why only five or seven nodes of a larger cluster cannot run Paxos to maintain strongly consistent meta-data over UDP broadcast whilst an arbitrary number of additional nodes listen to the meta-data traffic. In this manner you can scale out to dozens or hundreds of nodes that share strongly consistent meta-data yet handle client workload using an eventually consistent algorithm.

Caveat Emptor

As of late 2015 it is typical for leading NoSQL servers not to force the disk for majority acknowledged writes (I am looking at you Mongo and Cassandra) so they will lose data under correlated crashes. Only when you carefully configure them not to lose data, with a forced disk majority write, are they actually safe; at which point they are doing the same IO work as a Paxos implementation and performance will be comparable. Once you have configured your server to force the disk, which is required for true safety on majority writes, you should test extensively to see whether it can still perform as advertised even on unsafe single node writes.

As per the TRex license the software is supplied ‘as-is’ without any warranty, either express or implied. The functional library is considered complete. At the time of this blog post some obvious features are missing from the embeddable server, such as dynamic cluster membership. You can check the roadmap statuses on the README.md or feel free to reach out on gitter to enquire about the latest status.