Corfu: Stripe & Mirror

by simbo1905

This the fourth post in a series about Corfu. The last post introduced a global sequencer to reduce contention on writes to the end of a log file. The idea is to be able to distribute the log file across many machines. In this post we will scale up or fictional example using by striping and mirroring to achieve high performance:

[…] we can append data to the log at the aggregate bandwidth of the cluster […] Moreover, we can support reads at the aggregate cluster bandwidth. Essentially, CORFU’s design decouples ordering from I/O, extracting parallelism from the cluster for all IO while providing single-copy semantics for the shared log.

Lets image that we are still using expensive network disks but that we can afford four of them and each one can support four servers. We can attach four servers with full interconnections to four disks. This allows us to use RAID 10 write semantics for both performance and resilience:

RAID 10 is a stripe across a number of mirrored sets. [HDD Tool]

Each server can run a key-value store process which simply applies `mod(2)` to each position to know which pair of mirrored disks to write to. Data is written to two disks to protect against disk crashes. All we need to ensure is that every process attempting to write to a given logical position writes to the same two disks in the same order Why? Because the disks use a single bit to do a fast ”write once” check. We cannot have the same logical position in the global log mapped onto two different disks. That could lead to data corruption and a violation of the write-once safety semantics.

A really important thing we gain with a “RAID 10”-like stripe and mirror approach is that writes at adjacent positions are mapped onto a different pair or mirrored disks. Two clients with adjacent tickets taken from the global sequencer can concurrently write without contention. Happy days!

Why stop there? Our fictional solution is limited by our expense disks. If we can use commodity servers with local disks as storage nodes then instead of two replica sets of two disks we can use, say, eight replica sets of two disks. Indeed the Corfu paper says that they scaled to sixteen storage nodes. This is because Corfu can use commodity hardware for storage nodes exposing a simple log API over the network. The reference implementation for Corfu is a Java project on Github. This means that the number of mirrored disk pairs that we can concurrently write stripes of the distributed log into is arbitrary. We can stripe each page to another independent replica set to scale up IO.

This approach is a competitive advantage over a shared database design. Apache Cassanda shards by key onto replica sets. That means that under normal load consecutive writes can hit the same replica set. The birthday paradox means that this will bite you more often than you would expect. In contrast Corfu can round-robin writes to consecutive log positions perfectly across all replica sets. This means lower contention so higher throughput.