Corfu: Safety Techniques

by simbo1905

This the sixth post in a series about Corfu. It will cover some of the basic safety techniques which I call “Write Once”, “Write-Ordering” and “Collaborating Threads”. What is remarkable about Corfu is how it uses such well-established techniques to build something extremely scalable.

Corfu lets multiple clients attempt to write to the same position in a log file with minimal coordination. Clients are application servers and we may have hundreds of them servicing thousands of end users. If we only have minimal coordination of concurrent writes how is this made safe? Corfu enforces “write-once” semantics.

The place where this comes into its own is when a Corfu client has a sequencer token to write at to the end of the log but either crashes, stalls or looses connectivity. The FLP impossibility result tells us that other clients cannot know what has gone wrong. Any other client needs to assume that the holder of the sequencer token has permanently crashed. To fix this they attempt to write a no-operation value into the log at that position. This then gives us a race condition between multiple processes. If the holder of the token only stalled it may resume and be racing multiple correcting processes attempting to write a no-operation value.

Corfu handles the race condition by making each log position “write-once”. Strictly it makes them strictly “write-twice” as a position starts out empty, is written, and then may optionally be deleted. We only need a couple of bits for each log position to model those states. A storage unit simply needs to check the bits to respond with err_unwritten, err_written or err_deleted  as appropriate for any out of sequence modifications.

We should note that a number of good NoSql systems cannot easily enforce write-once semantics. Instead under a network partition you can get multiple conflicting writes under the same key. When the split brain heals typically the conficting writes are resolved using a timestamp. Whilst such systems have special data structures or more expensive write modes to try to avoid that situation they tend not to be the default. So you don’t get the advetised performance when using the safe write modes to achieve strong consistency. Corfu aims for strong consistency by default whilst still obtaining high performance. The last post in this series will discuss how.

The next safety technique used in Corfu is “write ordering”. When writing a value the clients write to mirrored disks in a replica set for resilience. If clients could write to the disks in any order we risk a corruption. Two clients could write different values to the same position on different disks in the replica set. The resolution to this is to write to the disks in a consistent order. The first disk in the chain will enforce “write once” semantics and we avoid the problem. The cost is that we must write sequentially rather than in parallel.

When writing to multiple disks in a replica set we have to consider what happens if the writer stalls or crashes before it has written to all disks. Corfu handles this by having readers collaborate with a possibly crashed writer. They copy any value written to the first disk to all disks in the replica set. If the original writer resumes we have a harmless race condition. One of the writers will fix up the consistency of the system.

We only have one corner case which to address in the next post. It may be the case that the disk in question is no-longer part of the cluster. That can lead to a false positive err_unwritten. We will get to that in the next post where we get back to Paxos.