slash dev slash null

simbo1905’s ramblings about computers

Category: Paxos

Bolt-on Causal Consistency

Eventual consistency forces complexity onto an application. Consider a comments system on a blog site where users are discussing with each other. What every user would like to see is “causal consistency” whereby they don’t see a comment until they can also see all the comments that it is “in reply-to”.

In the general case an eventually consistent data store (ECDS) like Cassandra won’t give causual consistency: you can see a comment before you see what the user is replying to. The Morning Paper has an excellent discussion of a paper that shows how 2k lines of code can layer Causual Consistency over the top of Cassandra using a separate local data store at each node and a vector clocks to track ordering.

the morning paper

Bolt-on Causal Consistency – Bailis et al. 2013

“It’ll probably be OK” seems to reflect the prevailing application developer’s attitude to working with eventually consistent stores. Thanks to the work of Bailis et al. on PBS, we can now quantify that ‘probably.’ And it looks pretty good at first glance, with 99+% probabilities achievable after a pretty short window. The greater the volume of transactions you’re processing though, the more this bites you: (a) the window between transactions is shorter, increasing the probability of staleness, and (b) you’re taking a percentage of a bigger absolute number. Let’s say 1M transactions per day, and 99.99% probability of recency under normal conditions – that’s 100 stale reads a day. Is that ok? It depends on your application semantics of course. After a year of operation you’ll have 36,500 stale reads – it’ll probably be ok?!

Presumably you’re using an eventually consistent…

View original post 2,320 more words

Understanding Paxos

The excellent paper Understanding Paxos has a very detailed explanation of the mechanics of the algorithm. Here is a diagram from the paper showing a node which bridges a network partition from that paper which then goes and works through the possible outcomes with diagrams showing every message. Excellent work!

Source: Understanding Paxos

TRex: A Paxos Replication Engine (Part 2)

The last post gave a high level overview of TRex an embeddable Paxos engine for the JVM. In this second post we will take a high level walk through a Java demo that wraps a trivial stack to deploy it replicated across a cluster of nodes.  Read the rest of this entry »

TRex: A Paxos Replication Engine (Part 1)

A previous post laid out how the Paxos parliament algorithm as described in the 2001 paper Paxos Made Simple can be applied to state replication across a cluster of servers.  This post is the first part of a series that will give an overview of an embeddable Paxos state replication engine called TRex. TRex is a replication engine, not a dinosaur, implemented in Scala. It can be easily layered over the top of an existing service to provide strongly consistent replicas with automatic failover to achieve fault tolerance. Alternatively, you can bake the low-level library code into your application and use your own customised IO logic to give peak performance. You can fork the accompanying code over on GitHub.  Read the rest of this entry »

Paxos For Master Leases

A topic not yet covered on this blog series on Paxos are leader/master leases. A quick search through The Part-Time Parliament paper for ‘leases’ won’t find anything; you need to search for “cheese inspector”. Read the rest of this entry »

Paxos Uses Leaders (Multi-Paxos is the Parliament Protocol and Basic-Paxos isn’t)

A common misconception about the Paxos Algorithm is that it doesn’t use a leader. With this world view the Paxos algorithm is an enimic peer-to-peer algorithm which is impractical and it has to be extended with a separate flavour called Multi-Paxos to do anything useful. This is a back-to-front world-view which is often put up as a strawman by advocates of aggressively marketed alternatives.  Read the rest of this entry »

TRex: A Paxos Consensus Engine For The JVM

An implementation of multi-Paxos for state replication as outlined in this blog series is now on github. Read the rest of this entry »

Paxos Dynamic Cluster Membership

One practical concern not covered so far in this blog series is how to add or remove nodes from a cluster. This is actually very straightforward and can be handled by the Paxos algorithm itself. The process is outlined in the last paragraph of the paper Paxos Made SimpleRead the rest of this entry »

Paxos And Read Consistency

In the last post we did a deep-dive into gaining consensus in the ordering of writes using Paxos. We explored how this could be used to replicate a stream of commands to create a set of datastore backups with automated failover. We can upgrade that model with compare-and-swap semantics, and add weakly consistent reads from cluster replicas. We can also choose to route reads via the leader yet actually run them on replicas whilst retaining strongly consistent semantics. Read the rest of this entry »

Cluster Replication With Paxos

In my last post I made the case for the defence of the Paxos Algorithm. In this post we will delve into a tough problem which Paxos solves; state replication across a cluster of replicas using atomic broadcast. It turns out that maintaining primary ordering on secondary servers is straightforward to achieve using Paxo without affecting correctness or efficiency. This may be a surprising result to any reader who has looked at ZAB and Raft. You don’t need an external leader election service as used in the Spinnaker or PaxStore papers. You send the highest committed, and highest proposed, log stream sequence counter during leader failover, and introduce optional, ignorable, retransmission requests. The techniques described in this article are used by the Trex Paxos library.

Read the rest of this entry »