Corfu: Linearizable Consistency

by simbo1905

This is the second post in a series about Corfu. To quote the paper:

The setting for CORFU is a data center with a large number of application servers (which we call clients) and a cluster of storage units. Our goal is to provide applications running on the clients with a shared log abstraction implemented over the storage cluster.

Why is a shared log abstraction exposed to clients useful? This post will discuss how a shared log can be used to achieve linearizable consistency. To quote the paper:

The semantics provided for read/write operations is linearizability [Herlihy and Wing 1990], which for a log means that once a log entry is fully written by a client or has been read, any future read will see the same content (unless it is reclaimed via a [delete] operation).

We will then look at a simple adaptation that, as we will see later, can be used to scale the system, though it introduces an additional challenge to solve.

Consider multiple processes implementing a key-value store. Let’s imagine that each process is connected to a shared global disk and that the disk implements an atomic append-only file API. Given these circumstances, achieving linearizable consistency is very easy. Every process can append its reads and writes to the end of a shared global log file. Each process then simply follows the shared global log file. All processes will then see all writes linearly sequenced in a single ordering. To get strongly consistent reads we can simply sequence read commands into the log; each read is executed only by the process that appended it, once that process has replayed the log up to the read’s entry. This is basically a home run when it comes to strong consistency.
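To make this concrete, here is a minimal sketch in Java of a key-value store layered over such a log. The SharedLog interface, the command encoding, and all the names here are hypothetical stand-ins for the disk’s atomic append-only file API, not anything from the paper:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the shared global disk's atomic append-only file.
interface SharedLog {
    long append(byte[] entry); // atomically appends; returns the position written
    byte[] read(long position); // reads a previously written entry
}

// One of these runs in every process. All state is derived by replaying the log.
final class LogBackedKvStore {
    private final SharedLog log;
    private final String processId;
    private final Map<String, String> state = new HashMap<>();
    private long nextToApply = 0;

    LogBackedKvStore(SharedLog log, String processId) {
        this.log = log;
        this.processId = processId;
    }

    // Writes go through the log: append first, apply later in log order.
    void put(String key, String value) {
        log.append(encode("PUT", processId, key, value));
    }

    // A strongly consistent read: append a GET marker, then replay the log up
    // to and including that entry. Only the process that appended the GET
    // answers it; every other process skips GET entries while replaying.
    String get(String key) {
        long myPosition = log.append(encode("GET", processId, key, ""));
        replayUpTo(myPosition);
        return state.get(key);
    }

    private void replayUpTo(long position) {
        while (nextToApply <= position) {
            String[] cmd = decode(log.read(nextToApply));
            if (cmd[0].equals("PUT")) {
                state.put(cmd[2], cmd[3]); // fields: op, processId, key, value
            }
            nextToApply++;
        }
    }

    // NUL-separated encoding; good enough for a sketch.
    private static byte[] encode(String op, String pid, String key, String value) {
        return String.join("\u0000", op, pid, key, value).getBytes();
    }

    private static String[] decode(byte[] entry) {
        return new String(entry).split("\u0000", -1);
    }
}
```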

Note that this basic approach achieves strong consistency without any complex logic. It also doesn’t need a leader process. The only thing we need is an expensive, fast network drive. All processes can directly use a simple append-only file API in an uncoordinated manner. The shared global disk simply needs to serialize all operations and append them to the file. The problem with this simple approach is that a global shared disk is an IO bottleneck. From a scalability perspective this isn’t going to fly.

In order to scale this approach and move to commodity server hardware, we first need to make a small modification. Rather than an append-only file API, we switch to writing to a specific position just beyond the current end of the shared log file. Why? Well, in a later post we will stripe the logical file across the local disks of many independent commodity servers. An appending API would require us to collect and order writes within a global buffer spanning many machines. That would introduce network IO and a single buffer bottleneck that we want to avoid.
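The difference between the two API shapes is small but important. Here is a sketch of both; the interface and method names are mine, chosen for illustration:

```java
// Append-style API: the disk chooses the position, which forces it to
// serialize every writer through a single point.
interface AppendOnlyLog {
    long append(byte[] entry); // returns the position the disk assigned
}

// Position-style API: the client names the position up front. Distinct
// positions can live on distinct servers, so writes need no shared buffer.
interface PositionedLog {
    boolean write(long position, byte[] entry); // false if position is taken
    byte[] read(long position);
}
```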

Instead, imagine that each process is reading a shared log file sequentially up to the last write made. When a process wants to write to the end of the log, it writes data at the next position beyond the last value it read. In this scenario we need to prevent two processes from writing to the same position. This isn’t hard. We simply accept the first write that targets a given position in the file and reject any subsequent writes to that position. We only need one extra bit of storage per position to track whether it has been written to.
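Here is a sketch of a storage unit enforcing that first-write-wins rule. Map membership stands in for the per-position written bit; a real device would keep a bitmap over fixed-size pages:

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of one storage unit. Map membership stands in for the one-bit
// "written" flag per position: the first write to a position wins and
// every later write to that position is rejected.
final class WriteOnceStore {
    private final Map<Long, byte[]> entries = new HashMap<>();

    // Returns true if this write claimed the position, false if the
    // position was already written and the write is rejected.
    synchronized boolean write(long position, byte[] data) {
        if (entries.containsKey(position)) {
            return false; // already written: first write wins
        }
        entries.put(position, data.clone());
        return true;
    }

    // Returns the entry at a position, or null if nothing was written there.
    synchronized byte[] read(long position) {
        byte[] data = entries.get(position);
        return data == null ? null : data.clone();
    }
}
```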

The challenge with what has been described so far is contention. A client whose write is rejected can simply reread until it finds the new end of the file, then retry its pending write at the next higher position. If the system sees very few writes this might be acceptable, but in a high-throughput system with many concurrent writers the performance would be terrible. The next post will discuss the simple technique that Corfu uses to solve this problem.
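To make the problem concrete before moving on, here is a sketch of the retry loop a client would run against the hypothetical WriteOnceStore above. Probing forward one position at a time stands in for rereading the log to find the new tail:

```java
// A client's write path under this scheme, sketched against the
// WriteOnceStore from the previous snippet.
final class ContendingWriter {
    private final WriteOnceStore store;
    private long lastKnownTail; // highest position this client has observed

    ContendingWriter(WriteOnceStore store, long lastKnownTail) {
        this.store = store;
        this.lastKnownTail = lastKnownTail;
    }

    // Every rejection means another client won that position, so we step
    // past it and retry. With many concurrent writers, every client races
    // for the same next slot and most attempts are wasted work.
    long append(byte[] data) {
        long position = lastKnownTail + 1;
        while (!store.write(position, data)) {
            position++; // lost the race for this slot; try the next one
        }
        lastKnownTail = position;
        return position;
    }
}
```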