
Failure in Distributed Systems

When you work on distributed systems you learn to think about problems a little differently. There are many views on what exactly a distributed system is. Generally, it involves multiple parties that are communicating and coordinating with each other but acting independently. In these kinds of systems perhaps the most critical questions are around the failure modes: what are the failures you anticipate and how does a system handle these and other failures when they happen?

One of the best-known examples of this is the CAP Theorem. Originally conjectured by Eric Brewer and later expressed as a provable theorem by Seth Gilbert and Nancy Lynch, CAP is an observation about trade-offs. In a distributed system it’s hard to maintain full Consistency of data and Availability to clients in the face of a network Partition. If you’re reading this blog then chances are that you know all about this.

There are dozens of great discussions already about CAP, including a follow-up by Dr. Brewer. The Internet probably doesn’t need another rant on what he was trying to state and where the theory fits or falls apart. What really interests me, however, is how well the original conjecture captured a truth about distributed computing. When you choose to take a local system and distribute it you’re doing so to solve a problem. The key is understanding what that problem is and what trade-offs are required.

Run any search for CAP and you’ll find at least two common themes. One is the explanation that CAP means “choose two of three.” For instance, systems often choose to be Consistent in the face of network Partition, or choose to maintain Availability and leave it to the application to sort out Consistency at some point in the future. The other common theme is that this first theme misinterprets what CAP is really about. Generally speaking, I fall into the latter group.

The thing is, distributed computing is all about understanding these kinds of trade-offs. For instance, when you’re running on a single system you have a single point of failure but also a single, clear view of time. When you distribute the same work across multiple systems you get resiliency in the face of failure, but coordinating the wall clock is harder.

There are several ways to deal with this. One is to throw time out the window (you know the one about the boy who wanted to see time fly?). You can build a system that has no reliance on timestamp ordering, but to make it scale you either need to partition where the work is done, assume a pretty loose consistency model, or come up with an ordering based on some other constraint. If you want to coordinate based on ordering then you need to assume a very loosely synchronized clock or some other mechanism for relating messages and events. You could also blow your budget on atomic clocks, but most of us would rather put that kind of investment elsewhere.
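One well-known “other mechanism for relating messages and events” is a logical clock, which orders events by causality rather than wall time. Here’s a minimal sketch of a Lamport-style clock; the class and method names are my own illustration, not from any particular library:

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """A logical clock: orders events by causality, not wall time."""

    time: int = 0

    def tick(self) -> int:
        # A local event happened: advance our own counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Stamp an outgoing message with the current logical time.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Merge: take the max of our clock and the sender's, then tick,
        # so the receive event is ordered after the send event.
        self.time = max(self.time, msg_time)
        return self.tick()
```

If node A sends at logical time 1 and node B receives that stamp, B’s clock jumps to 2, so every event B records afterward is ordered after A’s send, with no synchronized wall clock anywhere.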

What does this have to do with CAP? While Consistency may sound like a pretty black-or-white thing it is itself a series of tradeoffs. There are dozens of different views on what it means to maintain data consistency, but each of them needs to start by defining some notion of how tasks relate to each other. Simply saying that a system is Consistent in the face of network Partition doesn’t in itself tell you all that much. In any given distributed system you have to think about what the consistency model is, how failures can affect what you’re supposed to get and therefore what it means to provide or fail to provide consistency.

Availability is a similar story. When you distribute a system to provide redundancy you’re getting higher availability in the face of any single node failing. In other words, if you have two equal peers and one goes away, you’re still available, just at reduced capacity. The proven theorem (linked above) frames the problem so that all clients must always get a response to all queries or availability hasn’t been met. I would argue that in practice, as long as the service as a whole is available for clients to access, requiring an occasional re-try or re-connect is an acceptable trade-off.
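That “occasional re-try” usually lives in a small client-side wrapper. A sketch of one, assuming the operation raises `ConnectionError` on a transient failure (the function name and parameters are illustrative, not from a specific library):

```python
import random
import time


def call_with_retry(request, attempts=3, base_delay=0.1):
    """Retry a flaky zero-argument callable with exponential backoff.

    From the client's point of view the service stays "available" as
    long as one of the attempts eventually succeeds.
    """
    for attempt in range(attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Back off exponentially, with jitter so a crowd of clients
            # doesn't retry in lockstep against a recovering service.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

By the letter of the theorem, the failed first attempt means availability was violated; in practice, the caller got an answer and never noticed.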

Actually, I’ll say something stronger. It’s easy to get hung up on Partition, but in practice systems fail in all kinds of crazy ways. In some sense a “clean” break between two data centers is easier to deal with than a confused switch giving everyone a different story about network topology. A system’s availability should be measured in the face of any failure. A good distributed system should give you at least some availability in the face of the unexpected. Availability isn’t an either/or; it’s about maintaining access to a service while guaranteeing that the requirements are still being met. If a system can react to loss at one point by bringing up additional support somewhere else (a good feature for any redundant system) then in practice you’re getting high levels of availability even if this breaks with the theorem.

Start With Requirements

Some people I talk with have the tendency to hear “distributed system” and jump right to CAP. My advice is to back up and start by thinking about what problems you’re trying to solve. Do you want some form of consistency? If the answer is yes, then you chose it because of your deployment requirements and the programming model that best serves your application. In that case, you probably don’t want that consistency model violated at any point.

My bias? I like consistency. That doesn’t mean every application needs to be written against a service with consistency guarantees, but it simplifies building an application. You have to choose the best tool for the job. In my experience, when you work with data you need some rules in place. If the model isn’t in the underlying service then it’s codified in the application logic, often implicitly, which puts a burden on developers to stay focused on that model as the application evolves. Yes, that’s another trade-off I’m talking about, so it’s just my bias, not a truth about building software.

Next, what are your availability requirements? Don’t stress about the theory; think about what a system gives you in practice. When I run some service on a single machine I expect there will be down-time. When I distribute a service across multiple machines, now I’m expecting that the system as a whole will be resilient. It’s not just about failure, either: I want that system to be able to scale or migrate on demand, so I expect it to be agile. Personally, I care less about absolute availability than about a system’s ability to recognize failure in one place and react somewhere else. Failures, including partitions, are a reality of distributed systems. How quickly the system can react tells you as much as anything about whether your requirements are being met.

Finally, think about the problem that you’re actually trying to solve in your application. Chances are that it does have some implicit assumptions about the data (e.g., customers only look at their own shopping carts or users only access their own session state). In this case, as long as the rules are clear you can maintain consistency even when the network is partitioned. My favorite example of this is Google Wave, but the point is that consistency isn’t an absolute concept. The constructs required to prove a formal theorem may be completely valid but still not represent the nuance of actually deploying running code.
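When the implicit rule is “each customer only touches their own cart,” one way to exploit it is to route every operation on a given key to the same node. Then per-customer consistency holds even when nodes can’t talk to each other, because no two nodes ever own the same cart. A sketch of that routing, with a hypothetical node list of my own invention:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(customer_id: str, nodes=NODES) -> str:
    """Route all operations on one customer's data to one node.

    Hashing the key keeps the mapping deterministic, so every client
    independently agrees on the owner without any coordination.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

A real system also has to handle the owning node dying and the node list changing (which is where schemes like consistent hashing come in), but the core idea is just this: if the data never needs to agree across partitions, a partition can’t make it inconsistent.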

Any distributed system is an exercise in understanding a problem, weighing the trade-offs and being clear about how the system reacts to the unexpected. When you approach a new system don’t ask where it falls in a CAP trade-off; ask what goals it’s trying to meet and how it will help you address the problems that you have. Understand how it reacts to failure in general, and whether it will help you keep your system running smoothly over time. Think about the programming model, and keep in mind that consistency will have to be defined and enforced somewhere in the stack. CAP is an excellent tool, but take it in the spirit it was intended. Seriously, this is a great time to be building systems, given how many new technologies are approaching problems from new points of view. Have fun, stay curious and understand how any given system will help you solve not only the problems you know about but the challenges that are lurking on the horizon.