Preserving the Egg on the Web

2018-03-31 @ 12:32

When I plan out an implementation for the Web, one of the things I think about is the problem of "breaking eggs." One great example of this is the old adage, "You can't make an omelette without breaking some eggs." That's cute. It reminds us that there are times in our lives when we need to commit. When we need to forge ahead, even if some people might disagree, even if there seems to be "no turning back."

However, this "omelette" adage is not what I mean when I think about Web implementations and eggs.

Instead, I think about entropy and how you cannot 'unscramble' an egg. I won't go into the physics or philosophical nuances of this POV except to say, when I am working on a web implementation I work very hard to avoid 'breaking any eggs' since it will be quite unlikely that I'll ever be able to put those eggs back together again.

I don't want my Web solution to end up like Humpty Dumpty!

Web Interactions as Eggs

The web is a virtual world. It is highly distributed and non-deterministic -- much like our physical world. We can't know all the influences and their effects on us. We can only know our immediate surroundings and surmise the influences based on what we observe locally. The world is a random place.

So each time we fill out a form and press "send", each time we click on a link, we're taking a risk and stepping into the unknown. For example:

  • Is there really a page at the other end of this link, or is a dreaded 404 waiting for me?
  • Have I filled out the form correctly or will I get a 400 error instead?
  • Or, have I filled out the form correctly, only to encounter a 500-level server error?
  • Finally, what if I've filled out the form, pressed "send" and never get a response back at all? What do I do now?

But What Can Be Done?

When I set out to implement a solution on the Web, I want to make sure to take these types of outcomes into account. I say "take them into account" because the truth is that I cannot prevent them. Most of the time these kinds of failures are outside my control. However, using the notion of Safety-I and Safety-II from Erik Hollnagel, I can adopt a different strategy: While I can't prevent system failures, I can work to survive them.

So how can I survive unanticipated and unpreventable errors in a system? I can do this by making sure each interaction is not an "egg-breaking" event. An "egg-breaker" is an action that cannot be undone or reversed. In the web world, this is an interaction that has only two outcomes: "success or a mess."

A great example of the sad end of the "success-or-a-mess" moment is an action like "Delete All Data." We've probably all experienced a moment like this. Most likely we've answered "yes" or "OK" to a confirmation dialog and the moment we did, we realized (too late) that we "chose poorly." There was no easy way to fix our mistake. We had a mess on our hands.

The obvious answer to this kind of mess is to support an "undo" action to reverse the "do." This turns an "egg-breaking" event into an "egg-preserving" event. And that's what I try to do with as many of my Web implementations as possible -- preserve the egg.
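One common way to pair a "do" with an "undo" is a soft delete: the record is hidden rather than destroyed, so it can be brought back. The sketch below is only an illustration. The SoftDeleteStore class and its method names are hypothetical, and a real system would also need persistence and a policy for eventually purging hidden records.

```python
class SoftDeleteStore:
    """Hypothetical in-memory store where delete only marks records."""

    def __init__(self):
        self._records = {}     # key -> value
        self._deleted = set()  # keys currently marked as deleted

    def put(self, key, value):
        self._records[key] = value
        self._deleted.discard(key)

    def delete(self, key):
        # The "do": hide the record instead of destroying it.
        if key in self._records:
            self._deleted.add(key)

    def undelete(self, key):
        # The "undo": make the record visible again.
        self._deleted.discard(key)

    def get(self, key):
        if key in self._deleted:
            return None
        return self._records.get(key)
```

Because delete never discards the underlying value, a mistaken "yes" on a confirmation dialog can be followed by undelete instead of a restore from backup.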

Let's look at some other ways to prevent breaking eggs when implementing solutions in a non-deterministic world...

Network-Level Idempotency

One of the ways you can avoid "a mess" is to make sure your actions are idempotent at the network level. That means they are repeatable and you get the same results every time. Think of an SQL UPDATE statement. You can update the firstName field with the value "Mike" over and over and the firstName field will always have the same value: "Mike".
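The difference is easy to see side by side. In this small sketch the record is a plain dictionary, and loginCount and both function names are made up for the example: setting a field is idempotent, while incrementing a counter is not.

```python
def set_first_name(record, value):
    """Idempotent: repeating the call leaves the record in the same state."""
    record["firstName"] = value
    return record


def increment_login_count(record):
    """Not idempotent: every repeat changes the state again."""
    record["loginCount"] = record.get("loginCount", 0) + 1
    return record


record = {}
for _ in range(3):
    set_first_name(record, "Mike")   # same result after 1 call or 3
    increment_login_count(record)    # result depends on how many calls ran
```

After the loop, firstName is still "Mike" but loginCount is 3. If you could not tell how many of those calls actually reached the server, only the first field would be trustworthy.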

In the HTTP world, both the PUT and DELETE methods are designed as idempotent actions. This means, in cases where you commit a PUT and never receive a response, you can repeat that action without worry of "breaking the egg."
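Here is a rough sketch of what "repeat the PUT" can look like. To keep it runnable without a network, it uses a hypothetical FlakyServer class that applies every write but loses its first few responses. The class, the put_with_retry helper, and the /users/1 path are all assumptions made for the example, not a real client library.

```python
class FlakyServer:
    """Hypothetical stand-in for a remote service: it applies each PUT
    but loses the first `lost_responses` replies."""

    def __init__(self, lost_responses):
        self.resources = {}
        self._lost = lost_responses

    def put(self, path, body):
        self.resources[path] = body  # the write is applied ...
        if self._lost > 0:
            self._lost -= 1
            raise TimeoutError("no response received")  # ... but the reply is lost
        return 200


def put_with_retry(server, path, body, attempts=3):
    """Repeat the PUT until a response arrives or attempts run out."""
    for _ in range(attempts):
        try:
            return server.put(path, body)
        except TimeoutError:
            continue  # safe to repeat only because PUT is idempotent
    raise TimeoutError(f"no response after {attempts} attempts")


server = FlakyServer(lost_responses=2)
status = put_with_retry(server, "/users/1", {"firstName": "Mike"})
```

The server saw the same PUT three times, yet it ends up holding exactly one copy of the resource. The same loop wrapped around a non-idempotent action could create three records.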

Relying on network-level idempotency is very important when you are creating autonomous services that interact with each other without direct human intervention. Robots have a hard time dealing with non-idempotent failures.

Service-Level Event Sourcing

At the individual service level, a good way to "preserve the egg" is to make all writes (actions that change the state of things) reversible. Martin Fowler shows how this can be done using Event Sourcing. Event Sourcing was explained to me by Capital One's Irakli Nadareishvili as a kind of "debit-and-credit" approach to data updates. You arrange writes as actions that can be reversed by another write. Essentially, you're not "un-doing" something, you're "re-doing" it.
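The "debit-and-credit" idea can be sketched in a few lines. This is a toy ledger, not Fowler's implementation: the Entry type and the reverse and balance helpers are hypothetical names chosen for the example. A mistaken write is corrected by appending its opposite, and nothing in the log is ever edited or removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    kind: str    # "credit" or "debit"
    amount: int


def reverse(entry):
    """Return the entry that cancels `entry` out."""
    opposite = "debit" if entry.kind == "credit" else "credit"
    return Entry(opposite, entry.amount)


def balance(log):
    """Derive the current state by folding over every entry in the log."""
    return sum(e.amount if e.kind == "credit" else -e.amount for e in log)


log = [Entry("credit", 100), Entry("debit", 30)]
mistake = Entry("debit", 50)
log.append(mistake)           # the balance is now 20
log.append(reverse(mistake))  # appended correction brings it back to 70
```

The log now holds four entries, including the mistake and its correction, so the history of what happened is preserved along with the corrected state.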

Fowler shows that, by implementing state changes using Event-Sourcing, you get several benefits including:

  • detailed change logs
  • the ability to run a complete rebuild
  • the ability to run a "temporal query" (based on a set date/time)
  • the power to replay past transactions (to fix or analyze system state)

I like to say that, with Event-Sourcing, you can't reverse the arrow of time, but you can move the cursor.
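"Moving the cursor" can be pictured as replaying the event log up to a chosen moment. The sketch below assumes a hypothetical log of (timestamp, field, new value) tuples with invented sample data. Replaying everything gives a complete rebuild, and stopping early gives a temporal query.

```python
from datetime import datetime

# Hypothetical event log: (timestamp, field, new_value), oldest first.
EVENTS = [
    (datetime(2018, 3, 1), "firstName", "Mike"),
    (datetime(2018, 3, 10), "status", "active"),
    (datetime(2018, 3, 20), "firstName", "Michael"),
]


def state_as_of(events, moment):
    """Rebuild state by replaying every event recorded at or before `moment`."""
    state = {}
    for timestamp, field, value in events:
        if timestamp <= moment:
            state[field] = value
    return state


mid_march = state_as_of(EVENTS, datetime(2018, 3, 15))  # temporal query
current = state_as_of(EVENTS, datetime.max)             # complete rebuild
```

Here mid_march still shows firstName as "Mike", while current shows "Michael". The events themselves never change; only the point where the replay stops does.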

Solution-Level Sagas

In 1987, Garcia-Molina and Salem published a paper simply titled "Sagas." This paper describes how to handle long-lived transactions in a large-scale system where the typical Two-Phase Commit pattern results in a high degree of latency. Sagas are another great way to keep from "breaking the egg."

Chris Richardson has done some excellent work on how to implement Sagas. I like to think of Sagas as a way to bring the service-level event-sourcing pattern up to the solution level, where multiple interoperable services work together. Richardson points out there is more than one way to implement Sagas for distributed systems, including:

  • Choreography-based (each service publishes its own saga events)
  • Orchestration-based (each saga is managed by a central saga orchestrator)

Sagas are a great way to "preserve the egg" when working with multiple services to solve a single problem.
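To make the orchestration style a little more concrete, here is a minimal sketch of a saga runner. It is not Richardson's implementation. The run_saga function and the stock and payment steps are hypothetical, and a production orchestrator would also persist saga state and deal with compensations that fail. Each step is an (action, compensation) pair, and when an action fails the completed steps are compensated in reverse order.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, run the compensations of the completed steps
    in reverse order and report failure.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []

def reserve_stock():
    log.append("stock reserved")

def release_stock():
    log.append("stock released")

def charge_card():
    raise RuntimeError("payment service unavailable")

def refund_card():
    log.append("card refunded")

ok = run_saga([(reserve_stock, release_stock), (charge_card, refund_card)])
```

In this run the payment step fails, so ok is False and the log reads "stock reserved" followed by "stock released". The refund never runs because the charge never completed.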

And so...

When putting together your Web implementations, it is important to think about "preserving the egg" -- making sure that you can reverse any action in case of an unexpected system failure. Working to avoid "breaking the egg" adds a valuable level of resilience to your implementations. This can protect your services, your data, and your users from possibly catastrophic events that lead to "a mess" that is difficult and costly to fix.

In this post, I shared three possible ways to do this at the network, service, and solution level. There are probably more. The most important thing to remember is that the Web is a highly distributed, non-deterministic world. You can't prevent bad things from happening, but with enough planning and attention to detail, you can survive them.

Now that this post is done, anyone hungry for an omelette?
