Failure is the New Normal

by Jeff Hughes – August 4, 2015

Have you noticed how the approach to failure in computing has changed radically over the past few years? Good thing it has! Software has officially eaten the world, and nearly everything we interact with involves computing. Imagine if all those interactions failed daily. Fortunately, that doesn’t happen, but let’s look at how we got here.

The Age of System Crashes 

Early on, computing was a wildly unreliable endeavor. Systems were vulnerable to everything – heat, humidity, shock, even insects. As a result, computing was relegated to research and engineering. If a system crashed, you either had manual ways of accomplishing the same thing or just waited. There certainly wasn’t much real-time or mission critical computing.

The power of computing was clear. All of these failures needed to be addressed if computing was going to expand. Designers put an intense focus on reliability and that ushered in …

The Age of High Reliability

First came improved reliability at the component level, producing robust hardware with very low failure rates. Later, more reliability was built in at the system level with innovative techniques like RAID, Active-Active, and Active-Standby processors.

It was a noble goal – build systems that almost never fail. Businesses started depending on computing applications. A great example is travel reservations. Mainframes took the manual booking process and revolutionized it. If the mainframe was down for a day, you could just call back tomorrow.

Unfortunately, while designing for near-perfect reliability to eliminate that day of downtime sounds great, it turned out to be a fool’s errand. As systems scale larger and larger, the math eventually catches up.

Take hard disk drives, for example. According to this study by Backblaze, annual failure rates (AFR) range from less than 1 percent to more than 12 percent, with an ‘average’ AFR of about 3 to 4 percent. If you have 50 drives in your datacenter, you can expect one or two failures each year. That’s manageable with the “call back tomorrow” strategy. At 10,000 drives, you can expect a drive failure every day. Now we have a problem.
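As a rough back-of-the-envelope sketch (assuming the 3 to 4 percent average AFR cited above and treating drive failures as independent; the numbers are illustrative, not pulled from the Backblaze data itself), a few lines of Python show how expected failures scale with fleet size:

    # Rough sketch: expected annual drive failures for a fleet,
    # assuming an average AFR of ~3.5% and independent failures.

    def expected_failures_per_year(num_drives: int, afr: float = 0.035) -> float:
        """Expected number of drive failures in one year for a fleet."""
        return num_drives * afr

    for fleet in (50, 10_000):
        per_year = expected_failures_per_year(fleet)
        print(f"{fleet:>6} drives: ~{per_year:.1f} failures/year "
              f"(roughly one every {365 / per_year:.1f} days)")

    # Output:
    #     50 drives: ~1.8 failures/year (roughly one every 208.6 days)
    #  10000 drives: ~350.0 failures/year (roughly one every 1.0 days)

The expected failure count grows linearly with the number of drives, so a fleet 200 times larger sees failures 200 times more often.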

Admittedly, few folks have 10,000 drives. However, the same additive failure math applies to the servers, networking devices, and racks and racks of other things that can fail. When you factor in the AFR of every device, failures are far more common. The ‘daily occurrence’ of failure is easy to see even at today’s modest scales.

If there were some way to increase the reliability of each component by 10x (hint: a similar study a decade ago showed roughly the same AFRs), the continued trend of scaling out would quickly eat up those gains. In the race between reliability and scale, scale always wins, and the quixotic quest for perfect system reliability is destined to fail.

The hyper-scale guys saw this a while back. Something in Google’s or AWS’s infrastructure is failing virtually all the time. They realized the answer was not to eliminate faults, but to design in tolerance of them.

The Age of Fault-Tolerance

It is a radical, yet simple idea. Design your systems or processes to tolerate infrastructure failures. The phrase ‘Pets Versus Cattle’, which has taken off in the past year, embodies that idea: “… in the new world of hyper-scale computing, engineers treat servers like cattle. It doesn’t matter if a cow dies as long as the herd survives.” (Wall Street Journal)

The old mindset was to treat infrastructure (for example, servers) like pets. We name them, nurse them back to health when they get sick, and celebrate when they are healthy again.

The new mindset, required when things are failing all the time, is to treat infrastructure like cattle. No names; if they get sick, shoot them in the head. This approach has a lot of advantages. First, it is FAR less labor intensive. There is no need to treat failures as emergencies.

There is a second, more nuanced benefit, though. If you design your system to tolerate failure, you no longer need top-of-the-line, ultra-reliable devices. You gain the flexibility to choose lower-reliability, commodity components at significant cost savings.

The age of fault tolerance, with massive scale at a new price point, has gone beyond just mission-critical systems and brought us the ubiquitous real-time applications we experience daily. Of course, it also brought a few pieces of yesterday’s science fiction along with it.

What about Enterprise Infrastructure?

Hyper-scale companies like Google, AWS, Facebook, and Microsoft Azure have fully moved to this ‘cattle’-style, fault-tolerant model. The average enterprise isn’t there yet. Why not? Several reasons:

  • Enterprises haven’t scaled as big/fast as the hyper-scale guys, so the pain is not as great (yet).
  • It takes time for mindsets to change.
  • The big infrastructure vendors – upon whom enterprises rely – aren’t really there yet either.

On that last point: the big infrastructure vendors ALL talk about the ‘Pets Versus Cattle’ mindset, but let’s be honest. Their solutions, be they storage, compute, or networking, are hardly at a cattle price point. I suppose we could say ‘Pets Versus Kobe Beef’, but then who is going to shoot their Wagyu cow in the head?

This whole concept is very important to us here at Igneous. We firmly believe that building truly ‘zero-touch’ infrastructure requires us to treat infrastructure like cattle. We don’t even think you should be the one to shoot your cattle – the sick cow should shoot itself and disappear. You should not even notice its absence.

That’s my view, what’s yours? Do you have pets or cattle in your datacenters?