Three Thoughts on Cloud for 2014

Right around the end of December people like to post their predictions for the coming year. Me, I’m kinda swamped until about a week into January. By then I’ve had a chance to catch up a bit, both on sleep and on reading that fell to the side during the holidays. I’ve also had a chance to think about what I saw in 2013 and how that might play out in the next twelve months.

Seriously, 2013 was all kinds of exciting. For those of us who have been doing “cloud” long before it was called cloud (scale-out, horizontal scale, distributed computing, grid, utility, etc.) what we’ve seen happen recently is pretty awesome. There are a lot of great environments to work in today, both as “public” and “on-prem” deployments. There are also too many cool services to count, from monitoring to provisioning to automation. I’m a big believer that what defines big-C Cloud is that it’s all about being agile and easy. From that point of view, 2013 was the year that I really saw everything come together.

So what’s next? I’m not going to try for any “bold” predictions. These are mostly what I’ve been observing lately and a way to give me some entertainment 11 months from now. As I look at the landscape I’ve got (at least) three thoughts about what I expect is going to happen this year, and I’ll be interested to know what others think.

Nothing will be single data-center

Maybe this one seems obvious to folks reading my blog, but it’s an idea I see met with surprising skepticism on a daily basis. The way I see it, almost no one runs a mission-critical system in a single data-center today. At the very least there’s some kind of Disaster Recovery option with hot-spare application infrastructure and replicated data available at a moment’s notice.

Five years ago this was something exotic. High-end enterprises had the resources and experience, but it hadn’t trickled down. More to the point, getting that kind of hardware online was expensive and unrealistic. Today, however, I can go into Amazon and spin up instances in two availability zones, or better still two separate regions, without much effort. We’re living in the future.

The thing is, increasingly it’s about a lot more than Disaster Recovery. Applications need to be active in multiple places at once. Since there are so many data-centers available around the world, it only makes sense to distribute application logic to give users lower-latency access. We did this a long time ago with CDNs and other caches; now the application and data tiers are catching up.

At re:Invent a few months ago this was a theme that I heard time and time again, both from developers and from the teams working on AWS. The thing is that while the infrastructure is there, the software is still growing up. True, there are ways to get active-active deployments today, but these kinds of Geo-Distributed architectures are still challenging. Even Amazon’s Web Services are still largely focused on operation within a single region (I’m looking at you, CloudFormation).

So what’s changing in 2014? The infrastructure for Geo-Distribution is available, from Amazon to Google to SoftLayer and others, and that’s pushing software to make it usable. Is this a shameless plug for my company? Maybe, but the way I think about it, I work on a distributed system because it’s what I’m passionate about and it’s where I think we’re headed. I expect this year to be the year when everyone gets behind that vision and when anyone building a serious application is going to demand Geo-Distribution as a baseline.

Google Compute Engine will hit its stride

I first started playing with Google’s Compute Engine back in November 2012, in its early stages. The tools were nascent (at best) but the system as a whole showed a lot of promise. I was lucky enough to get that sneak peek thanks to great people over at Google who wanted early developer feedback. They gave us access to a large number of servers. The result (including high-larious demo hijinks caused by the building wifi failing) is available as a Google Developer Video.

Most of us use software and services from Google in one capacity or another on a daily basis. Most of us also have experienced heartbreak at least once when a Google project that was “in beta” for some extended period of time suddenly got cancelled. Remember Google Wave? If not, stop reading this and go read about Wave’s formal underpinnings. Seriously, the system behind the app was really something. So when GCE started allowing access as a “beta” I was curious to see where it was heading.

Broadly speaking, there are at least three camps in terms of cloud infrastructure. First, there are infrastructure providers focused on service-oriented offerings like Amazon. For the record, I am fiercely in love with how easy they’ve made it to get up & running whether I just want to test out an idea or crank on something at scale. Second, there are providers like Rackspace or SoftLayer who I see more as hardware providers with good distributed access. Third are the interface providers like Pivotal (or the open source projects like OpenStack or Eucalyptus) who are more focused on the service & API layers than specific infrastructure.

Where does GCE fall? From what I’ve seen so far, Google is still trying to sort that one out. They have a really solid infrastructure: fast, stable systems with good networking. They are layering management tools starting with flexible (if slightly verbose) command-line interfaces that seem focused squarely on infrastructure automation. They’re also exploring services that bump up the stack a few levels but are tied to fairly specific use cases (e.g., the schema and access requirements to make an F1 application scale). If pressed, I’d say that Google is starting with something in between raw infrastructure and bootstrap services, and seeing where the development community takes it.

In any case, that seems to be the point. Google is doing one of the things that they do best: they’re creating a developer tool and seeing what developers will do with it. About a month ago I was psyched to see the “beta” label removed from GCE. Are developers going to drop familiar platforms? No, of course not. What I expect, however, is that in 2014 developers are going to run GCE through its paces, and that’s going to help Google give GCE the clear identity it needs to emerge as a differentiated, developer-focused platform.

Consistency will matter

This one may be less obvious, but bear with me a minute. Historically, distributed data was the realm of the high-end enterprise (see thought one, above), where relational databases are de rigueur. Solutions like Tandem NonStop or Oracle RAC were designed to provide parallelism and scale-out behavior for transactionally consistent data management. That’s all well and good, but starting about eight or nine years ago requirements for distribution started pushing down into more general use.

At this point two things happened. First, there were an increasing number of developers unwilling (or unable) to buy & deploy the high-end solutions. Second, there was a realization that even if you bought the licenses, these might not be the right solutions for Cloud-style deployments. This led to rapid iteration using the tools already available (hello BerkeleyDB) and basic insight into the problems that had to be solved on a very short timeline. The result was a new way of approaching data management that traded off global consistency for scale-out behavior.

Eventually the term “NoSQL” stuck, but the point here isn’t really about the language. It’s about an assumption that the right way (possibly the only way) to make a data management system scale out is to punt on consistency and let the application sort things out as-needed. For some applications this is a perfectly reasonable trade-off. No doubt several years ago this was a pragmatic response by developers in need of solutions to real problems. I mean, that’s what we developers do best.
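Letting the application “sort things out as-needed” usually means the application owns a merge rule. Here’s a toy sketch of one common choice, a last-write-wins merge; the field names and integer timestamps are invented for illustration, not taken from any particular datastore:

```python
def lww_merge(replica_a, replica_b):
    """Toy last-write-wins merge: for each key, keep the value carrying the
    higher timestamp. This is the kind of reconciliation logic an application
    inherits when the datastore punts on global consistency."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two replicas that accepted writes independently (say, during a partition).
# Each value is a (timestamp, data) pair.
a = {"cart": (10, ["book"]), "name": (5, "Alice")}
b = {"cart": (12, ["book", "pen"])}
merged = lww_merge(a, b)
```

For some workloads this is fine; for others (counters, inventory), last-write-wins silently drops updates, which is exactly the burden being traded away from the database and onto the developer.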

The thing is, there’s nothing about SQL or Transactions or ACID that can’t scale. The tools and architectural assumptions in place at the time didn’t scale, true. As I’ve written about elsewhere, however, the programming models scale just fine if you’re willing to re-think the architecture. Every day I talk with developers or operators who are frustrated with having to choose between strong consistency models and scale. It’s not that every application needs strong consistency, but a lot of them do. Even when you might be able to get away with limited consistency, most developers (in my experience) would prefer to work with a consistent system as long as it doesn’t limit what they can do.

This is the trend I expect to see in 2014. Developers now have good options for building out systems at scale that don’t force them to re-invent consistency at the application layer, and increasingly they will demand it. Over the next year we’ll see the growth of new systems in this mold. We’ll also see increased effort in the NoSQL community to provide forms of consistency out of the box for developers. I’m not going to go all the way and say that these different world-views are going to merge into some common ground, but I do think that’s the direction we’re heading, and I think it’s a good thing all around.

Happy 2014

Ok, yeah, there’s my short-list. Like I said at the start, I’m curious to know what anyone reading this thinks. Obviously this list isn’t everything on my mind, but it’s definitely what’s on the top of my stack. The high-order bit is that it’s really cool what we have at our disposal as developers, and we’re using those things to push requirements in crazy new ways. I’m wicked excited to see what the next year brings, and if anything I’ve said here pans out it’s going to be a fun year in the Cloud. Cheers.

Failure in Distributed Systems

When you work on distributed systems you learn to think about problems a little differently. There are many views on what exactly a distributed system is. Generally, it involves multiple parties that are communicating and coordinating with each other but acting independently. In these kinds of systems perhaps the most critical questions are around the failure modes: what are the failures you anticipate and how does a system handle these and other failures when they happen?

One of the best-known examples of this is the CAP Theorem. Originally conjectured by Eric Brewer and later expressed as a provable theorem by Seth Gilbert and Nancy Lynch, CAP is an observation about trade-offs. In a distributed system it’s hard to maintain full Consistency of data and Availability to clients in the face of network Partition. If you’re reading this blog then chances are that you know all about this.

There are dozens of great discussions already about CAP, including a follow-up by Dr. Brewer. The Internet probably doesn’t need another rant on what he was trying to state and where the theory fits or falls apart. What really interests me, however, is how well the original conjecture captured a truth about distributed computing. When you choose to take a local system and distribute it you’re doing so to solve a problem. The key is understanding what that problem is and what trade-offs are required.

Do a search for CAP and you’ll find at least two common themes. One is the explanation that CAP means “choose two of three.” For instance, systems often choose to be Consistent in the face of network Partition, or choose to maintain Availability and leave it to the application to sort out Consistency at some point in the future. The other common theme is that this first theme is misinterpreting what CAP is really about. Generally speaking, I fall into the latter group.

The thing is, distributed computing is all about understanding these kinds of trade-offs. For instance, when you’re running on a single system you have a single point of failure but also a single, clear view of time. When you distribute the same work across multiple systems you get resiliency in the face of failure but coordinating the wall clock is harder.

There are several ways to deal with this. One is to throw time out the window (you know the one about the boy who wanted to see time fly?). You can build a system that has no reliance on timestamp ordering, but to make it scale you either need to partition where the work is done, assume a pretty loose consistency model or come up with ordering based on some other constraint. If you want to coordinate based on ordering then you need to assume a very loosely synchronized clock or other mechanism for relating messages and events. You could also blow your budget on atomic clocks but most of us would rather put that kind of investment elsewhere.
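One classic way to relate messages and events without any wall clock is a logical clock. Here’s a minimal sketch of a Lamport-style logical clock; the class and method names are mine for illustration:

```python
# Minimal sketch of a Lamport logical clock: events are ordered by a counter
# that each node increments on local events and reconciles on every message
# it receives, so no wall-clock synchronization is needed.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with a fresh timestamp."""
        return self.tick()

    def receive(self, msg_time):
        """Merge a received timestamp: take the max of the two clocks, then tick."""
        self.time = max(self.time, msg_time)
        return self.tick()

# Two nodes exchanging one message agree on a causal order even though
# neither ever consults the wall clock.
a, b = LamportClock(), LamportClock()
a.tick()          # a does some local work; a.time is now 1
stamp = a.send()  # a sends a message stamped 2
b.receive(stamp)  # b's clock jumps past the stamp, to 3
```

The point isn’t the twenty lines of code; it’s that ordering becomes something you define and pay for, rather than something a single machine’s clock gives you for free.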

What does this have to do with CAP? While Consistency may sound like a pretty black-or-white thing it is itself a series of tradeoffs. There are dozens of different views on what it means to maintain data consistency, but each of them needs to start by defining some notion of how tasks relate to each other. Simply saying that a system is Consistent in the face of network Partition doesn’t in itself tell you all that much. In any given distributed system you have to think about what the consistency model is, how failures can affect what you’re supposed to get and therefore what it means to provide or fail to provide consistency.

Availability is a similar story. When you distribute a system to provide redundancy you’re getting higher availability in the face of any node failing. In other words, if you have two equal peers and one goes away, you’re still available, just at reduced capacity. The proven theorem (linked above) framed the problem so that all clients must always get a response to all queries or availability hasn’t been met. I would argue that in practice, as long as the service as a whole is available for clients to access, an occasional re-try or re-connect is an acceptable trade-off.

Actually, I’ll say something stronger. It’s easy to get hung up on Partition, but in practice systems fail in all kinds of crazy ways. In some sense a “clean” break between two data centers is easier to deal with than a confused switch giving everyone a different story about network topology. A system’s availability should be measured in the face of any failure. A good distributed system should give you at least some availability in the face of the unexpected. Availability isn’t an either/or, it’s about maintaining access to a service while guaranteeing that the requirements are still being met. If a system can react to loss at one point by bringing up additional support somewhere else (a good feature for any redundant system) then in practice you’re getting high levels of availability even if this breaks with the theorem.

Start With Requirements

Some people I talk with have the tendency to hear “distributed system” and jump right to CAP. My advice is to back up and start by thinking about what problems you’re trying to solve. Do you want some form of consistency? If the answer is yes, then you chose it because of your deployment requirements and the programming model that best serves your application. In that case, you probably don’t want that consistency model violated at any point.

My bias? I like consistency. That doesn’t mean every application needs to be written against a service with consistency guarantees, but it simplifies building an application. You have to choose the best tool for the job. In my experience, when you work with data you need some rules in place. If the model isn’t in the underlying service then it’s codified in the application logic, often implicitly, which puts the burden on developers to stay focused on that model as the application evolves. Yes, that’s another trade-off I’m talking about, so it’s just my bias, not a truth about building software.

Next, what are your availability requirements? Don’t stress about the theory; think about what a system gives you in practice. When I run some service on a single machine I expect there will be down-time. When I distribute a service across multiple machines, I expect that the system as a whole will be resilient. It’s not just about failure, either: I want that system to be able to scale or migrate on demand, so I expect it to be agile. Personally, I care less about absolute availability than about a system’s ability to recognize failure in one place and react somewhere else. Failures, including partition, are a reality of distributed systems. How quickly the system reacts is as much what tells you whether your requirements are being met.

Finally, think about the problem that you’re actually trying to solve in your application. Chances are that it does have some implicit assumptions about the data (e.g., customers only look at their own shopping carts or users only access their own session state). In this case, as long as the rules are clear you can maintain consistency even when the network is partitioned. My favorite example of this is Google Wave, but the point is that consistency isn’t an absolute concept. The constructs required to prove a formal theorem may be completely valid but still not represent the nuance of actually deploying running code.
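The shopping-cart assumption above can be sketched as partitioning data by its owner: if every user’s reads and writes land on a single node, that user’s view stays consistent even when the nodes can’t see each other. This is a toy illustration; the node count and hashing scheme are arbitrary choices, not a real system’s design:

```python
import hashlib

class PartitionedCarts:
    """Sketch: route each user's cart to exactly one node by hashing the
    user id. Because a given user's operations always hit the same node,
    that user's view of their own cart stays consistent even while the
    nodes are partitioned from one another."""

    def __init__(self, node_count=3):
        # Each "node" is just a dict here, standing in for a server.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, user):
        digest = hashlib.sha256(user.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def add_item(self, user, item):
        self._node_for(user).setdefault(user, []).append(item)

    def cart(self, user):
        return self._node_for(user).get(user, [])

carts = PartitionedCarts()
carts.add_item("alice", "book")
carts.add_item("alice", "pen")
carts.add_item("bob", "mug")
```

The trick, of course, is that the guarantee only holds for the rules you stated up front: alice’s view of alice’s cart is consistent, but any operation spanning users is back in trade-off territory.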

Any distributed system is an exercise in understanding a problem, weighing the trade-offs and being clear about how the system reacts to the unexpected. When you approach a new system don’t ask how it falls in a CAP trade-off, ask what the goals are that it’s trying to meet and how it will help you address the problems that you have. Understand how it reacts to failure in general, and whether it will help you keep your system running smoothly over time. Think about the programming model, and keep in mind that consistency will have to be defined and enforced somewhere in the stack. CAP is an excellent tool, but take it in the spirit it was intended. Seriously, this is a great time to be building systems given so many new technologies that are approaching problems from new points of view. Have fun, stay curious and understand how any given system will help you solve not only the problems you know about but the challenges that are lurking on the horizon.