Corfu: Paxos Configurator
This is the seventh post in a series about the techniques used in Corfu. The techniques described so far are the basic mechanics. Corfu uses these to build a linearizable log with one final ingredient: Paxos. To quote the paper:
At its base, CORFU derives its success by scaling Paxos to data center settings and providing a durable engine for totally ordering client requests at tens of Gigabits per second throughput.
It does this by using Paxos to control the cluster configuration whilst using the simpler techniques outlined in this series of posts to scale the main transactional load.
There are quite a few different types of configuration in Corfu. The location of the global sequencer may change if its host crashes. Storage units may be swapped out due to failures. Copy compaction may be running in the background, occasionally moving ranges of the log between storage units. The majority of the meta-data will be projections, which name the network location of slabs of transactional data. The consistency of Corfu is then bootstrapped and scaled out by making all of this core meta-data consistent with Paxos.
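A projection can be pictured as a versioned map from ranges of log positions to the replica sets of storage units that hold them. The sketch below is illustrative, not from the paper; the names `Projection`, `Segment` and `lookup` are my own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    start: int       # first log position covered (inclusive)
    end: int         # last log position covered (exclusive)
    replicas: tuple  # network addresses of the storage units in the replica set

@dataclass(frozen=True)
class Projection:
    epoch: int       # configuration version, agreed via Paxos
    segments: tuple  # non-overlapping Segments, sorted by start

    def lookup(self, position):
        """Resolve a log position to the replica set that stores it."""
        for seg in self.segments:
            if seg.start <= position < seg.end:
                return seg.replicas
        raise KeyError(f"position {position} not mapped in epoch {self.epoch}")

proj = Projection(epoch=7, segments=(
    Segment(0, 1000, ("unit-a:9000", "unit-b:9000")),
    Segment(1000, 2000, ("unit-c:9000", "unit-d:9000")),
))
proj.lookup(1500)  # -> ("unit-c:9000", "unit-d:9000")
```

The adversarial workload the paper describes would fragment this map into many small segments, which is where the 25 MB per TB estimate comes from.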
Although Corfu's configuration meta-data is relatively rich, it is low volume. The paper estimates that under normal workloads it needs only tens of KB of projection meta-data per TB of transactional data, rising to 25 MB per TB under adversarial workloads. To get a flavour of that, imagine deleting every other log position between copy compactions to force a very fine-grained mapping of which log position is on which disk. Yet even then the volume of meta-data is so small that we don’t have to worry about the throughput of a Paxos leader controlling it. Any reasonable implementation of Paxos is likely to suffice:
We do not discuss the details of the actual consensus protocol here, as there is abundant literature on the topic. Our current implementation incorporates a Paxos-like consensus protocol using storage units in place of Paxos-acceptors. [Page 12]
In fact, the pseudo-code description of the algorithm in the paper sets a new projection with the line `using any black box consensus engine: propose newP and receive decision`.
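That line can be sketched as a reconfiguration step: first seal the affected storage units at the new epoch, then hand the new projection to whatever consensus engine is plugged in. Everything here is an assumed illustration; `Unit`, `SingleDecreeConsensus` and `reconfigure` are my names, and the consensus stand-in is deliberately trivial:

```python
from dataclasses import dataclass

class Unit:
    """Just enough of a storage unit to demonstrate sealing."""
    def __init__(self):
        self.epoch = 0

    def seal(self, new_epoch):
        # After sealing, the unit rejects requests tagged with a lower epoch.
        self.epoch = max(self.epoch, new_epoch)

class SingleDecreeConsensus:
    """Stand-in for 'any black box consensus engine': first proposal wins."""
    def __init__(self):
        self._decision = None

    def propose(self, value):
        if self._decision is None:
            self._decision = value
        return self._decision

@dataclass(frozen=True)
class Projection:
    epoch: int  # a real projection would also carry the segment layout

def reconfigure(units, new_projection, consensus):
    # 1. Fence the old configuration on every affected storage unit.
    for u in units:
        u.seal(new_projection.epoch)
    # 2. "propose newP and receive decision" against the black box.
    return consensus.propose(new_projection)

units = [Unit(), Unit()]
decided = reconfigure(units, Projection(epoch=1), SingleDecreeConsensus())
```

The point of the black-box framing is that the ordering guarantee comes entirely from the consensus engine; Corfu only needs the decision, not any particular Paxos variant.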
The key to the performance of Corfu is that clients (application servers) interact directly with storage units in a largely uncoordinated manner. Each client obtains a token from the global sequencer, then uses its local copy of the cluster configuration to resolve the storage units to write to. This means that clients are Paxos learners. All we need to do is have storage units reject messages from clients holding an out-of-date configuration. Corfu achieves this by tagging each client message with a configuration version number, which storage units check against their own current version. We simply need to ensure that every storage unit in a given replica set is rejecting the old configuration before the new configuration is made available to clients.
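The rejection rule above can be sketched as follows. This is a minimal illustration with assumed names (`StorageUnit`, `StaleEpoch`), not the paper's interface: every request carries the client's configuration epoch, and a sealed unit refuses anything older than its current epoch.

```python
class StaleEpoch(Exception):
    """Raised to tell a client its configuration is out of date."""
    pass

class StorageUnit:
    def __init__(self):
        self.epoch = 0   # configuration version this unit is serving
        self.pages = {}  # log position -> data

    def seal(self, new_epoch):
        """Called during reconfiguration: stop serving the old epoch
        before the new projection is handed to any client."""
        self.epoch = max(self.epoch, new_epoch)

    def write(self, client_epoch, position, data):
        if client_epoch < self.epoch:
            # Client holds an out-of-date projection; it must refresh.
            raise StaleEpoch(self.epoch)
        self.pages[position] = data

unit = StorageUnit()
unit.write(0, 42, b"hello")      # accepted under epoch 0
unit.seal(1)                     # reconfiguration bumps the epoch
try:
    unit.write(0, 43, b"late")   # rejected: stale configuration
except StaleEpoch:
    pass                         # client would now fetch the new projection
```

Because the check is a single integer comparison on the request path, fencing stale clients costs essentially nothing at the transactional data rate.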
In the final post in this series, we will slot all the pieces together to describe the Corfu protocol.