slash dev slash null

simbo1905’s ramblings about computers

Category: Domain Driven Design

Agile and Domain Driven Design (Part 4): Microservice Sagas

The last post discussed breaking up a digital service into bounded contexts that can be enhanced by small autonomous teams. A microservice architecture allows teams to focus on domain driven design by creating aggregate entities that enforce the invariants of the business. The catch is that such autonomy comes at the price of running a distributed system. Technologies such as Kubernetes and a Service Mesh help. Yet there remains an additional complexity: you have to abandon global transactions and the “I” (isolation) of ACID updates. You also need to use asynchronous messaging and deal with eventual consistency. Why?

The first issue is that the laws of probability don’t allow you to use synchronous updates across multiple services. If you aim for 99.9% uptime for each service that is 1m 26s of downtime a day. If a screen needs to make a dozen inter-service calls to gather data then the synchronous success rate will be (99.9%)^12, which is only 98.8% uptime. That corresponds to roughly 17 minutes of downtime a day. Yet that is idealistic maths and you won’t achieve it without a lot of hard work. The reality is that if you are under high load and start to lose requests, users are going to hammer the refresh button and you are going to have cascading failures. Your success rate will drop to 0% until the load goes away. The book “The Art Of Scalability” has some real-world horror stories of live sites crashing for hours or days under load. You might be lucky such that this happens rarely and over a long period it averages out to high uptime. Defensive coding like load shedding and circuit breakers can help claw you back up to the theoretical limit. Yet every failed synchronous unit of work is lost revenue if the user closes their app due to errors and timeouts.
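As a quick sanity check on those numbers, here is a minimal sketch that reproduces the arithmetic above (the per-service availability and the dozen calls are just the illustrative figures from this post):

```java
public class CompoundAvailability {
    public static void main(String[] args) {
        double perServiceUptime = 0.999; // 99.9% per service, as in the example above
        int callsPerScreen = 12;         // a dozen synchronous inter-service calls

        // If every call must succeed, the availabilities multiply.
        double compoundUptime = Math.pow(perServiceUptime, callsPerScreen);

        double secondsPerDay = 24 * 60 * 60;
        double downtimeSeconds = (1.0 - compoundUptime) * secondsPerDay;

        System.out.printf("Compound uptime: %.2f%%%n", compoundUptime * 100);          // ~98.81%
        System.out.printf("Downtime per day: ~%.0f minutes%n", downtimeSeconds / 60);  // ~17 minutes
    }
}
```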

The get out of jail card is asynchronous messaging between microservices. Your success rate can then be the uptime of a cluster of sharded message brokers. The price you pay is four-fold:

  1. You will need to do multiple asynchronous reads and writes to action client commands. Debugging a set of asynchronous reads that led to a bad write is hard. You should deploy a distributed tracing solution which needs its own infrastructure.
  2. You cannot have an atomic global transaction across services. Writes that span services will use a local transaction per service. When a later write fails, you will need to back out each prior write that succeeded. You will need to write roll-back logic that sends compensation commands to undo partially completed work.
  3. You cannot have the ACID properties that monoliths achieve when using relational databases. When writes span multiple services, any interleaved reads will see “partial work”. Partial work may lead to users experiencing transient “anomalies”. The best you can achieve is eventual consistency, where eventually all the local transactions have either succeeded or been compensated. If your UI crashes when it encounters partially completed work you are going to suffer.
  4. You need to use message broker infrastructure which adds operational complexity.

You need to use defensive coding that gracefully deals with seeing incomplete work. A user who pushes the “buy now” button expects to then see that they have bought the product. If you have the UI query the last of a dozen asynchronous operations needed to action the “buy now” command then, under load, users won’t see their own pending asynchronous writes. You need to be assertive at the UI and confirm to the user that their order has been received on the first synchronous call. You must have robust coordination code so that you can be confident that the happy path will work. The business logic might take several seconds, or even several minutes, to weed out the few bad orders. The relatively few unhappy path orders can be dealt with as asynchronous failure notifications back to the client.
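A minimal Spring-flavoured sketch of that “acknowledge first, process asynchronously” idea might look like the following. The endpoint path, the command payload and the `OrderSagaStarter` port are all hypothetical names for illustration, not anything from a particular framework:

```java
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

import java.util.UUID;

// Hypothetical command payload sent by the client.
record BuyNowCommand(String customerId, String productId) {}

// Hypothetical port that durably records the command and kicks off the asynchronous saga.
interface OrderSagaStarter {
    void start(UUID orderId, BuyNowCommand command);
}

@RestController
class BuyNowController {

    private final OrderSagaStarter sagaStarter;

    BuyNowController(OrderSagaStarter sagaStarter) {
        this.sagaStarter = sagaStarter;
    }

    @PostMapping("/orders")
    ResponseEntity<String> buyNow(@RequestBody BuyNowCommand command) {
        UUID orderId = UUID.randomUUID();

        // Persist the command and start the saga; the validation and the writes
        // to other services happen asynchronously after this call returns.
        sagaStarter.start(orderId, command);

        // Confirm receipt on the first synchronous call rather than waiting on
        // downstream services. Unhappy-path failures arrive later as notifications.
        return ResponseEntity.accepted().body("Order received: " + orderId);
    }
}
```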

Initially this may not seem so bad. Consider the case of a takeaway food ordering application. This appears to only need to make a single atomic write to the food order service to make revenue. Unfortunately, things are not going to be that simple. In an API and “mobile first” world we need to write security conscious code that revalidates the whole scenario when it gets a command from a client. A monolith can use a single database transaction to read the current active state of many disparate rows spanning many business domains to confirm the validity of a write within the same transaction. To action an order you might need to validate the order, and the customer, and the state of payment, and the availability of fulfilment, and only then write an update. When you are running microservices each read and write is a separate async call running a separate transaction. Then factor in that you might also need to make an additional write to a reporting or search service. A criticism of ORM with monoliths is that developers are unaware of all the round trips made to the database in typical business processing, leading to poor performance. It is “too easy” to write poorly performing code that functions correctly. When you are reading and writing to many microservices the complexity will be very apparent. It is very hard to write correct code and it is also going to be a lot slower than monolithic code. It will also be a lot harder to debug. With microservices it is too easy to write poorly performing code that is less reliable than monolithic code and that reveals anomalies to users which are very hard to debug.
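To make that fan-out concrete, here is a minimal sketch of the validating reads that a “place order” command might need before its write. The client interfaces are hypothetical; the point is that each call is a separate remote request backed by its own local transaction in the far service:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical async clients for the other services.
interface CustomerClient   { CompletableFuture<Boolean> isActive(String customerId); }
interface PaymentClient    { CompletableFuture<Boolean> isPaymentAuthorised(String orderId); }
interface FulfilmentClient { CompletableFuture<Boolean> canFulfil(String orderId); }
interface OrderClient      { CompletableFuture<Void> confirmOrder(String orderId); }

class PlaceOrderHandler {
    private final CustomerClient customers;
    private final PaymentClient payments;
    private final FulfilmentClient fulfilment;
    private final OrderClient orders;

    PlaceOrderHandler(CustomerClient c, PaymentClient p, FulfilmentClient f, OrderClient o) {
        this.customers = c; this.payments = p; this.fulfilment = f; this.orders = o;
    }

    CompletableFuture<Void> handle(String customerId, String orderId) {
        // Fan out the validating reads in parallel.
        CompletableFuture<Boolean> customerOk = customers.isActive(customerId);
        CompletableFuture<Boolean> paymentOk  = payments.isPaymentAuthorised(orderId);
        CompletableFuture<Boolean> fulfilOk   = fulfilment.canFulfil(orderId);

        return CompletableFuture.allOf(customerOk, paymentOk, fulfilOk)
            .thenCompose(ignored -> {
                if (customerOk.join() && paymentOk.join() && fulfilOk.join()) {
                    // Only now do we perform the write, as its own local transaction.
                    return orders.confirmOrder(orderId);
                }
                return CompletableFuture.<Void>failedFuture(
                        new IllegalStateException("Order " + orderId + " failed validation"));
            });
    }
}
```

A monolith would do all of that inside one database transaction; here every arrow is a network hop that can time out, be retried, or observe stale state.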

With all those warnings aside, how do we perform validating reads and then one or more writes across multiple microservices as a series of local transactions? Well, there is no magic pixie dust: you are going to have to either write a lot of code or else use a sophisticated framework. With a monolith you can start a single transaction, make all your reads and writes, and either commit or throw an exception to roll back. With microservices you need to create a saga which tracks the series of asynchronous reads and writes to its conclusion. More importantly, if you need to make multiple writes then you also need to action compensations to reverse out any earlier writes that succeeded when a later write fails. You are likely to experience “dirty reads” of partially completed sagas under load that you don’t experience with monolithic database transactions.
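As an illustration only, here is a minimal sketch of the compensation idea with hypothetical step names. The saga remembers each successful write so that, when a later write fails, it can undo the earlier ones in reverse order. In a real saga each “execute” would be an asynchronous command with a reply, not a blocking call:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// A hypothetical saga step: a forward command plus the compensation that undoes it.
interface SagaStep {
    void execute();    // e.g. send "reserve stock" and await the reply
    void compensate(); // e.g. send "release stock"
}

class Saga {
    private final List<SagaStep> steps;

    Saga(List<SagaStep> steps) {
        this.steps = steps;
    }

    void run() {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step); // remember it so we can undo it later
            } catch (RuntimeException failure) {
                // A later write failed: back out every earlier write in reverse order.
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                throw failure; // surface the failure to the caller / notification channel
            }
        }
    }
}
```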

The Microservices Patterns book covers two patterns for sagas. The first is called choreography, where there is no central coordinator; rather, you have a series of messages whose combined effect should be a satisfactory result. This seems like a recipe for spending many hours digging through the logs and code of many services to try to debug an unsatisfactory outcome. The second pattern is to have some orchestration logic to coordinate the forward motion and the compensations during exceptional processing.

A saga orchestrator is simply a finite state machine that issues and responds to messages. People seem to find the phrase “finite state machine” intimidating. This is a bit odd as most business UIs are finite state machines: they respond to input by showing outputs and guide a user through a journey. A traditional webapp basket check-out is the canonical finite state machine. It responds to HTTP requests by updating its database and showing HTTP responses. The shopping basket logic is a finite state machine running the saga of a check-out. If you give it the right sequence of requests you complete your purchase. A microservice saga is very similar. It issues messages and updates private state based on the responses. It guides the business process to completion or runs compensations to reverse out writes when later errors are encountered. The difference from a basket check-out webapp is that a saga is initiated by a single user request and then runs robotically. It can also be more active by issuing messages in parallel. It can also automatically retry idempotent messages when it times out waiting for replies.
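To make “finite state machine” less intimidating, here is a minimal sketch of a check-out saga orchestrator, with hypothetical states and reply messages: an enum of states plus one transition function that reacts to each incoming reply (in a real orchestrator each transition would also send the next command and persist the new state):

```java
// Hypothetical states of a check-out saga.
enum SagaState { STARTED, STOCK_RESERVED, COMPLETED, COMPENSATING, FAILED }

// Hypothetical reply messages from the downstream services.
enum SagaEvent { STOCK_RESERVED_OK, STOCK_RESERVED_FAILED, PAYMENT_OK, PAYMENT_FAILED, COMPENSATION_DONE }

class CheckoutSaga {
    private SagaState state = SagaState.STARTED;

    // Given the current state and an incoming message, move to the next state.
    SagaState on(SagaEvent event) {
        state = switch (state) {
            case STARTED -> switch (event) {
                case STOCK_RESERVED_OK     -> SagaState.STOCK_RESERVED; // next: send "take payment"
                case STOCK_RESERVED_FAILED -> SagaState.FAILED;         // nothing to compensate yet
                default -> state;
            };
            case STOCK_RESERVED -> switch (event) {
                case PAYMENT_OK     -> SagaState.COMPLETED;             // happy path done
                case PAYMENT_FAILED -> SagaState.COMPENSATING;          // next: send "release stock"
                default -> state;
            };
            case COMPENSATING -> (event == SagaEvent.COMPENSATION_DONE) ? SagaState.FAILED : state;
            default -> state; // terminal states ignore further events
        };
        return state;
    }
}
```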

What asynchronous messaging technology should we use? It is fashionable to run Kafka as a distributed log. Indeed, keeping a good chunk of events and commands in Kafka that you can replay on the “read side” without having to resend messages is a phenomenal advantage over traditional message brokers. Traditional message brokers have the disadvantages of being a bottleneck and of making it hard to replay messages when you have updated one microservice to fix a bug or to add new functionality you want to retroactively apply to historic data.

The patterns book recommends writing to an outbox database table and then transferring from there to a message broker. This can be done using a service internal process that polls the outbox and pushes to the broker. This avoids having to use a global transaction to write to both the DB and the topic. Instead you have a dedicated mini-saga: a database write that includes writing an outbox row, followed by an idempotent write to the message broker. Alternatively you can avoid polling the outbox using SQL by lifting the outbox writes from the database commit log. There is one pattern for asynchronous messaging that I have seen used to good effect but that I have not seen written down anywhere. This is to expose the outbox table via an API that lets you page through messages since the last message you read. You can then have a brokerless architecture. If you don’t mind some additional latency then this has all the advantages of consumers being able to independently replay messages. The reader grabs the next batch of messages and can process them under a single database transaction that also records the new high watermark of read messages. While this can add a small amount of latency it can have good throughput. It is also incredibly easy to code and debug and can be implemented very easily as a normal API. It is also perfect for disaster recovery. If you corrupt the database of a backend service and restore from a day-old backup it will reread and replay the last day’s messages automatically to catch up.
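A minimal sketch of that brokerless consumer loop, assuming a hypothetical paging API on the producing service and a local high watermark recorded in the consumer’s own database, might look like this:

```java
import java.util.List;

// Hypothetical message shape returned by the producer's outbox paging API.
record OutboxMessage(long offset, String type, String payload) {}

// Hypothetical client for something like "GET /outbox?after={offset}&limit={n}".
interface OutboxClient {
    List<OutboxMessage> fetchAfter(long offset, int limit);
}

// Hypothetical local persistence: apply the batch and advance the high watermark
// in the same database transaction, so a crash never loses or double-applies a batch.
interface ConsumerStore {
    long readHighWatermark();
    void applyBatchAndAdvance(List<OutboxMessage> batch, long newHighWatermark);
}

class OutboxPoller {
    private final OutboxClient client;
    private final ConsumerStore store;
    private final int batchSize;

    OutboxPoller(OutboxClient client, ConsumerStore store, int batchSize) {
        this.client = client;
        this.store = store;
        this.batchSize = batchSize;
    }

    /** One polling cycle; run this on a schedule. */
    void pollOnce() {
        long highWatermark = store.readHighWatermark();
        List<OutboxMessage> batch = client.fetchAfter(highWatermark, batchSize);
        if (batch.isEmpty()) {
            return; // caught up
        }
        long newHighWatermark = batch.get(batch.size() - 1).offset();
        // Processing the batch and moving the watermark commit together, which is
        // also what lets a service restored from an old backup replay and catch up.
        store.applyBatchAndAdvance(batch, newHighWatermark);
    }
}
```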

Agile and domain driven design (part 3): Microservices

In the last post we imagined we are a programmer on an agile digital service team who had gotten as far as writing some DDD code. It is a rich OO domain model that knows how to obey the business rules (aka enforce the invariants of the contract aggregate). The next question is where should this code run? What provides the surrounding workflow? Which screens, running where, drive this “thing”? Talking it through with the other developers in the team you decided to build and deploy a contracts microservice. Read the rest of this entry »

Agile and domain driven design (part 2): Event Storming

The last post set the scene for how an agile digital services team gets to the point where it is ready to cut some DDD code. Imagine that you just joined the team as a programmer as the programme is ramping up its private beta build out. To align ourselves to the demo code of the last blog series, you need to build out some stories about how customers and your internal users create and agree a contract to deliver products. You are an agile analyst programmer who wants to build a ubiquitous language with the users of the system. So you attend an event storming workshop with the users. Read the rest of this entry »

Agile and domain driven design (part 1): Digital Services

In the last mini-series of posts I sketched out how to use DDD to build an explicit rich domain library that models the business domain and enforces the invariants within the narrow scope of an aggregate of entities. Catching up with my friend we got into a discussion about how we get to the point where we are ready to implement things. How big can the model be? How do we scale to many two-pizza teams? How do user needs, business processes, and screens relate to the domain focused OO design? We ended up talking about microservices. So in this series of posts I am going to sketch how a large scale agile digital transformation programme gets to the point of cutting DDD code. I will then get into how a large project comes to be a platform with a microservices architecture. Read the rest of this entry »

Domain Driven Design: Entities, Value Objects, Aggregates and Roots with JPA (Part 5)

This is the last article in the series which discusses a sample app that does DDD using JPA. I would hesitate to recommend using JPA for DDD unless you are very familiar with JPA. Why? Some of the issues you can hit using JPA are written up on the Scabl blog on Advanced Enterprise DDD. In my own code and this blog, I explained how to dodge some of the bullets, but you need to be quite aware of the pitfalls of JPA to dodge them all. So why did I write the code? Read the rest of this entry »

Domain Driven Design: Entities, Value Objects, Aggregates and Roots with JPA (Part 4)

Don’t abuse the `public` keyword in Java. The demo source code has very few public classes or methods. This is unusual for Java projects. Typically Java projects have package layouts that model the solution; “this package has all the entities, that package has all the database related code, that package is all the services code.” That approach forces you to make almost everything public. There is no way the compiler can enforce boundaries that align to the business domain. In the long term, on a big project, many brittle connections will be made across business responsibility boundaries. Bit rot creeps into the design and your application becomes a big ball of mud. Read the rest of this entry »

Domain Driven Design: Entities, Value Objects, Aggregates and Roots with JPA (Part 3)

Where’s the application in the demo code? There isn’t one.

If you look at the source code there is no front-end, no web servlets, no screens, and no Java main class, and so no way to run it as an application. All that you can do is run the test class. So it is a library project. It is a rich “back-end” that can talk to a database. In this post, I will recommend that you don’t share such a library project between multiple teams.  Read the rest of this entry »

Domain Driven Design: Entities, Value Objects, Aggregates and Roots with JPA (Part 2)

Detour: Why use JPA in this demo?

For the purposes of this demo, JPA is an officially supported part of the Java ecosystem and is a mature and well documented Java-to-relational mapping tool. Yes, it has quite a few quirks. If you fight it you will probably lose (your mind). If you learn how to do the basics and don’t deviate from that, it can be used as a rapid application tool to support an agile TDD build in Java against a relational database. Read the rest of this entry »

Domain Driven Design: Entities, Value Objects, Aggregates and Roots with JPA (Part 1)

A friend with a relational database background was working on an OO domain modeling problem. I started talking about “aggregates” and “roots” and things like “make the contract entity an aggregate controlling the other entities” and that “external logic should speak to the object model via a few root entities.” So I wrote a demo project in Spring and JPA code in Java to demonstrate those concepts. This blog series will be some discussion around the design and implementation techniques. Read the rest of this entry »