Agile and Domain Driven Design (Part 4): Microservice Sagas

The last post discussed breaking up a digital service into bounded contexts that can be enhanced by small autonomous teams. A microservice architecture allows teams to focus on domain driven design by creating aggregate entities that enforce the invariants of a business. The catch to such autonomy is the price of needing to run a distributed system. Technologies such as Kubernetes and a Service Mesh help. Yet there remains an additional complexity that you have to abandon global transactions and the “I” of ACID updates. You also need to use asynchronous messaging and deal with eventual consistency. Why?

The first issue is the laws of probability don’t allow you to use synchronous updates across multiple services. If you aim for 99.9% uptime for each service that is 1m 26s down time a day. If a screen needs to make a dozen interservice calls to gather data then the synchronous success rate will be (99.9)^12 which is only 98.8% uptime. That corresponds to 17m 16s downtime a day. Yet that is idealistic maths and you won’t achieve that without a lot of hard work. The reality is that if you are under high load and start to have lost requests users are going to hammer the refresh button and you are going to have cascading failures. Your success rate will drop to 0% until the load goes away. The book ”The Art Of Scalability” has some real world horror stories of live sites crashing for hours or days under load. You might be lucky such that this happens rarely and over the long period it averages out to high uptime. Defensive coding like load shedding and circuit breakers can help claw you back up to the theoretical limit. Yet every failed synchronous unit of work is lost revenue if the user closes their app due to errors and timeouts.

The get out of jail card is asynchronous messaging between microservice. Your success rate can then be the uptime of a cluster of sharded message brokers. The price you pay is four-fold:

  1. You will need to do multiple asynchronous reads and writes to action client commands. Debugging a set of asynchronous reads that led to a bad write is hard. You should deploy a distributed tracing solution which needs its own infrastructure.
  2. You cannot have an atomic global transaction across services. Writes that span services will use a local transaction per service. When the last write fails, you will need to back out each prior write. You will need to write roll-back logic that sends compensation commands to undo partially completed work.
  3. You cannot have the ACID properties that monoliths achieve when using relational databases. When writes span multiple services any interleaved reads will see ”partial work”. Partial work may lead to users experiencing transient “anomalies”. The best you can achieve is eventual consistency where all the local transactions succeeded or failed. If your UI crashes when it encounters partially completed work you are going to suffer.
  4. You need to use message broker infrastructure which adds operational complexity.

You need to use defensive coding that gracefully deals with seeing incomplete work. A user who pushes the “buy now” button expects to then see that they have bought the product. If you have the UI query the last of a dozen asynchronous operations needed to action the “buy now” command then under load users wont see their own pending asynchronous writes. You need to be assertive at the UI and confirm to the user that their order has been received on the first synchronous call. You must have robust coordination code so that you can be confident that the happy path will work. The business logic might take several seconds, or even several minutes, to weed out the few bad orders. The relatively few unhappy path orders can be dealt with as asynchronous failure notifications back to the client.

Initially this may not seem so bad. Consider the case of a takeaway food ordering application. This appears to only need to make a single atomic write to the food order service to make revenue. Unfortunately, things are not going to be that simple. In an API and “mobile first” world we need to write security conscious code that revalidates the whole scenario when it gets a command from a client. A monolith can use a single database transaction to read the current active state of many disparate rows spanning many business domains to confirm the validity of a write within the same transaction. To action an order you might beed to validate the order, and the customer, and the state of payment, and the availability of fulfilment, and only then write an update. When you are running microservices each read and write is a separate async call running a separate transaction. Then factor in that you might also need to made an additional write to the a reporting or search service. A criticism of ORM with monoliths is that developers are unaware of all the round trips made to the database in typical business processing leading to poor performance. It is “too easy” to write poorly performing code that functions correctly. When you are reading and writing to many microservices the complexity will be very apparent. It is very hard to write correct code and it is also going to be a lot slower than monolithic code. It will also be a lot harder to debug. With microservices it is too easy to write poorly preforming code that is less reliable than monolithic code that reveals anomalies to users that are very hard to debug.

With all those warnings aside how do we perform validating reads and then one of more writes across multiple microservices as a series of local transactions? Well there is no magic pixie dust you are going to have to either write a lot of code else use a sophisticated framework. With a monolith you can start a single transaction, make all your reads and writes, and either commit or throw an exception to rollback. With a microservices you need to create a saga which tracks the series of asynchronous reads and writes to it’s conclusion. More importantly if you need to make multiple writes then you also need to action compensations to reverse out any earlier writes that succeeded when a later write fails. You are likely to experience “dirty reads” of partially completed sagas under load that you don’t experience with monolithic database transactions.

The Microservices Patterns book covers two patterns for sagas. The first is called choreography where there is no central coordinator rather you have a series of messages where the outcome will be a satisfactory result. This seems like a recipe for spending many hours digging through the logs and code of many services to try to debug an unsatisfactory outcome. The second pattern is to have some orchestration logic to coordinate the forward motion and the compensations during exceptional processing.

A saga orchestrator is simply a finite state machine that issues and responds to messages. People seem to find the phrase “finite state machine” intimidating. This is a bit odd as most business UIs are finite state machines: they responds to input by showing outputs and guide a user through a journey. A traditional websapp basket check-out is the canonical finite state machine. It responds to http requests by updating its database and showing http responses. The shopping basket logic is a finite state machine running the saga of a check-out. If you give it the right sequence of requests you complete your purchase. A microservice saga is very similar. It issues messages and updates private state based on the responses. It guides the business process to competition or runs compensations to reverse out writes when later errors are encountered. The differences to a basket check-out webapp is that a saga is initiated by a single user request then it runs robotically. It can also be more active by issuing messages in parallel. It can also automatically retry idempotent messages when it times out on replies.

What asynchronous messaging technology should we use? It is fashionable to run Kafka as a distributed log. Indeed keeping a good chunk of events and commands in Kafka that you can reply on the “read side” without having to resend messages is a phenomenal advantage over traditional messaging brokers. Traditional message brokers have the disadvantages of being a bottleneck and of making it hard to replay messages when you have updated one microservice to fix a bug or to add new functionality you want to retroactively apply to historic data.

The patterns book recommends writing to an outbox database table then then transferring from there to a message broker. This can be done using a service internal process that polls the outbox and pushes to the broker. This avoids having to use a global translation to write to both the DB and topic. Instead you have a dedicated mini-saga that is a database write that includes writing an outbox row followed by an idempotent write to the message broker. Alternatively you can avoid pooling the outbox using SQL by lifting the outbox writes from the database commit log. There is one pattern for asynchronous messaging that I have seen used to good effect but that I have not seen written down anywhere. This is to expose the outbox table via an API that lets you page though messages since the last message you last reead. You can then have a brokerless architecture. If you don’t mind some additional latency then this has all the advantages of consumers being able to independently replay messages. The reader grabs the next batch of messages and can process them under a single database transaction that also records the new high watermark of read messages. While this can add a small amount of latency it can have good throughout. It is also incredibly easy to code and debug and can be implemented very easily as normal API. It is also perfect for disaster recovery. If you corrupt the database of a backend service and restore from a day old backup it will reread and replay the last days messages automatically to catch up: