Agile and Domain Driven Design (Part 4): Microservice Sagas

by simbo1905

The last post discussed breaking up a digital service into bounded contexts that can be enhanced by small autonomous teams. A microservice architecture allows teams to focus on domain driven design by creating aggregate entities that enforce the invariants of a business. The catch to such autonomy is the price of the operational complexity of running a distributed system. Technologies such as Kubernetes and a Service Mesh help. Yet there remains an additional complexity that you have to abandon global transactions and the “I” of ACID updates. You also need to use asynchronous messaging and deal with eventual consistency. Why?

The first issue is the laws of probability don’t allow you to use synchronous updates across multiple services. If you aim for 99.9% uptime for each service that is 1m 26s down time a day. If a user interface screen need data that comes from a dozen interservice calls then the synchronous success rate will be (99.9)^12 which is only 98.8% uptime. That corresponds to 17m 16s downtime a day. Yet that is idealistic maths and you won’t achieve that without a lot of hard work. The reality is that if you are under high load and start to have lost requests users are going to hammer the refresh button and you are going to have cascading failures. Your success rate will drop to 0% until the load goes away. The book ”The Art Of Scalability” has some real world horror stories of live sites crashing for hours or days under load. You might be lucky such that this happens rarely and over the long period it averages out to high uptime. Defensive coding like load shedding and circuit breakers can help claw you back up to the theoretical limit. Yet every failed synchronous unit of work is lost revenue if the user closes their app due to errors and timeouts.

The get out of jail card is asynchronous messaging between microservice. Your success rate can then be the uptime of a cluster of sharded message brokers. The price you pay is four-fold:

  1. You will need to do multiple asynchronous reads and writes to action client commands. Debugging a set of asynchronous reads that led to a bad database update is hard. You should deploy a distributed tracing solution which needs its own infrastructure.
  2. You cannot have an atomic global transaction across services. Writes that span services will use a local transaction per service. When the last write fails, you will need to back out each prior write. You will need to write roll-back logic that sends compensation commands to undo partially completed work.
  3. You cannot have the ACID properties that monoliths achieve when using relational databases. When writes span multiple services any interleaved reads will see ”partial work”. Partial work may lead to users experiencing transient “anomalies”. The best you can achieve is eventual consistency where all the local transactions succeeded or failed. If your UI crashes when it encounters partially completed work you are going to suffer.
  4. You need to use message broker infrastructure which adds operational complexity.

You need to use defensive coding that gracefully deals with seeing incomplete work.

Under high load and partial failures the business logic might take several seconds, or even several minutes, to weed out the few bad orders. A user who pushes the “buy now” button expects to then see that they have bought the product. If you have eventual consistency at the backend and make synchronous writes that take a long time users will become frustrated and hammer you with retries. It is better to accept the user command in a first request and confirm to them that their order has been received. Then run all the validations and work across all services to validate and action the work asynchronously. The relatively few unhappy path orders can be dealt with as asynchronous failure notifications back to the client.

You must have robust coordination code so that you can be confident that the happy path will work. You must be confident in abandoning the old monolithic ways of doing a single database transaction and two-phase commits against messaging brokers supplying ”at most once” semantics. This requires that the business accept that strong technical guarantees cannot be made and that accepting small error rate is a good trade-off. If you cannot clearly articulate the benefits versus costs of this approach then you should probably not attempt to build such a complex distributed system.

Initially this may not seem so bad. Consider the case of a takeaway food ordering application. This appears to only need to make a single atomic write to the food order service to make revenue. Unfortunately, things are not going to be that simple. In an API and “mobile first” world we need to write security conscious code that revalidates the whole scenario when it gets a command from a client. A monolith can use a single database transaction to read the current active state of many disparate rows spanning many business domains to confirm the validity of a write within the same transaction. To action an order you might beed to validate the order, and the customer, and the state of payment, and the availability of fulfilment, and only then write an update. When you are running microservices each read and write is a separate async call running a separate transaction. Then factor in that you might also need to made an additional write to the a reporting or search service. A criticism of ORM with monoliths is that developers are unaware of all the round trips made to the database in typical business processing leading to poor performance. It is “too easy” to write poorly performing code that functions correctly. When you are reading and writing to many micro-services both the complexity and overheads will be very apparent. It is very hard to write correct code and it is also going to be a lot slower than monolithic code. It will also be a lot harder to debug.

With all those warnings aside how do we perform validating reads and then one of more writes across multiple microservices as a series of local transactions? Well there is no magic pixie dust you are going to have to write code. With a monolith you can start a single transaction, make all your reads and writes, and either commit or throw an exception to rollback. With a microservices you need to create a saga which tracks the series of asynchronous reads and writes through to a business outcome. More importantly if you need to make multiple writes then you also need to code compensations to reverse out any earlier writes that succeeded when a later write fails. You are likely to experience “dirty reads” of partially completed sagas under load that you don’t experience with monolithic database transactions.

The Microservices Patterns book covers two patterns for sagas. The first is called choreography where there is no central coordinator rather you have a series of messages where the outcome will be a satisfactory result. This seems like a recipe for spending many hours digging through the logs and code of many services to try to debug an unsatisfactory outcome. The second pattern is to have some orchestration logic to coordinate the forward motion and the compensations during exceptional processing.

A saga orchestrator is simply a finite state machine that issues and responds to messages. People seem to find the phrase “finite state machine” intimidating. This is a bit odd as most business UIs are finite state machines: they responds to input by showing outputs and guide a user through a journey. A traditional websapp basket check-out is the canonical finite state machine. It responds to http requests by updating its database and showing http responses. The shopping basket logic is a finite state machine running the saga of a check-out. If you give it the right sequence of requests you complete your purchase. A microservice saga is very similar. It issues messages and updates private state based on the responses. It guides the business process to competition or runs compensations to reverse out writes when later errors are encountered. The differences to a basket check-out webapp is that a saga can be more active by issuing messages in parallel. It can also retry idempotent messages when it times out on replies.

What asynchronous messaging technology should we use? It is fashionable to run Kafka as a distributed log. Indeed keeping a good chunk of events and commands in Kafka that you can replay without having to resend messages is a phenomenal advantage over traditional messaging brokers. Traditional message brokers have the disadvantages of being a bottleneck and of making it hard to replay messages when you have updated one microservice to fix a bug or to add new functionality you wish to retroactively apply to historic data.

The patterns book recommends writing to an outbox table then transferring from there to a message broker. This can be done using a background thread in each microservice that polls the outbox and pushes to the message broker. Alternatively you can read the outbox writes from the database commit log using change database capture. There is one pattern for asynchronous messaging that I have seen used to good effect but that I have not seen written down anywhere. This is to expose the outbox table via an API that lets you page backwards through the history of messages. You can then have a brokerless architecture. If you don’t mind some additional latency then this has all the advantages of consumers being able to independently replay messages.