Cold Data

by Christian Smith – May 29, 2015

I just ran across an IDC forecast that predicted the human race would accumulate 8 zettabytes of data by 2015.  That got me thinking about storage.  First, how much is 8 ZB?  It is not an easy quantity to conceptualize, but let me try.

I have a small 8TB NAS under my desk.  To store 8 zettabytes, I would need a BILLION of those.  That’s more than 5 billion pounds of hard disk, and it would take the entire output of Grand Coulee Dam, the largest electric power plant in the U.S., to run them all.
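The back-of-envelope math is easy to check in a few lines (the 5 lb per NAS unit is my rough assumption, chosen only to illustrate how the weight estimate falls out):

```python
# Back-of-envelope math for the 8 ZB figure, using decimal (SI) units.
ZB = 10**21  # bytes in a zettabyte
TB = 10**12  # bytes in a terabyte

total_data = 8 * ZB        # IDC's 2015 forecast
nas_capacity = 8 * TB      # one small desktop NAS

nas_units = total_data // nas_capacity
print(f"NAS units needed: {nas_units:,}")    # 1,000,000,000

# Assume roughly 5 lb per NAS unit (a rough, illustrative figure)
pounds = nas_units * 5
print(f"Approximate weight: {pounds:,} lb")  # 5,000,000,000
```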

Okay, 8 ZB is a huge amount of data, but that’s not my real point.  What I got to thinking about is how much of this data do we really need?  I am not saying we should just delete it, but how (and where) do we store data knowing we will rarely access it again?

Cold Data, Defined

Since this blog post is about cold data, I should define that term.  A common definition is ‘inactive data that is accessed only infrequently.’  Many people interpret this as crap they no longer care about and would really prefer to delete.

I would contend this misses the most important piece:  Most cold data is stuff you really don’t want to get rid of.  You need it, if infrequently.  Maybe it is CAD drawings for an office building already built, DNA sequencing data from samples years past, or the assets for Toy Story 2. 

You may not access it today or tomorrow, but you likely will in the future.  You don’t want to delete it.  You are stuck with it. 

Your Storage Responsibilities

Your directive, as it pertains to cold data, is simple.  Keep it safe, secure and accessible.  The safe part is often called ‘durability.’  It means the data will be there when you need it, no matter what.  The secure part means only the right people can get at it.  And accessibility?  Well, while we don’t access the data frequently, when we do we want it quickly, and we don’t want to feel like we’re mounting a mission to Mars to get it.

The Cold Data Dilemma

This brings me to my point.  We have a cold data problem.  Most of that 8 ZB IDC says we have is cold data.  The problem is we aren’t proactive about how we store it.  It is hard to classify data as ‘hot’ or ‘cold,’ so we just leave it laying around where we created it.  If we run out of space, we just buy more storage.

“Ninety percent of the data on my systems could be considered cold” is a frequent theme we hear from storage teams.

Leaving cold data on primary storage is a bad idea.  First, primary storage is too expensive for stuff we rarely look at.  Second, leaving cold data on primary storage vastly inflates the volume of information, which bogs down the storage system and increases the time it takes to manage your primary storage infrastructure.  Not to mention how hard it is to find anything amid so much extra information.

Tape is not the answer.  Tape is cumbersome to operate and notorious for not being very durable.  Besides, it’s not actually all that inexpensive!  (More on that in a future post.)

What about cloud?  Remember that mission to Mars thing?  It is cumbersome, slow and expensive to bring cloud data back to your applications when you do need it.

It turns out that storing cold data on on-premises hard disks is our best option, but we need a new architecture to smooth things out.

How Facebook Addressed Its Cold Data Problem

No matter how bad your cold data problem is, Facebook’s is even worse.  That picture you posted yesterday from Hawaii?  The one with the funny straw hat and the hula skirt?  It may have generated 55 likes today, but by next month very few people will look at it, and by next year it is as cold as cold data gets.

And, since Facebook receives 350 million new photos every day, solving the cold data problem was worth some deep thinking.  Here is what they ultimately did:

First, they built a 62,000-square-foot data center dedicated to cold data.  Actually, they built several of these cold data centers around the world, next to all of their main data centers.

In each facility, they put 500 racks with 2 petabytes of storage per rack.  That’s a total of 1 exabyte of data per facility.  Since the facilities were dedicated to cold data, they optimized the performance of these storage racks accordingly.  The details are beyond the scope of this post, but at a high level they created a cold data center with remarkably low costs and high durability.  Being located next to their main data centers makes the data easy to reach if anyone ever does want it.
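The capacity math checks out, and it also puts the earlier IDC number in perspective: even at a full exabyte per facility, 8 ZB would take thousands of such buildings.

```python
# Sanity-check the per-facility capacity quoted above (SI units).
PB = 10**15
EB = 10**18
ZB = 10**21

racks = 500
per_rack = 2 * PB

facility_capacity = racks * per_rack
print(facility_capacity == 1 * EB)    # True: 500 racks x 2 PB = 1 EB

# For perspective: facilities needed to hold IDC's 8 ZB forecast
print((8 * ZB) // facility_capacity)  # 8000
```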

Lessons for Enterprise

What can enterprise IT learn from Facebook’s cold data initiative?  I see five basic lessons we should all keep in mind:

  1. Keep cold data on hard disks.  Commodity drives are super cheap now, and anything else will cause issues in terms of durability or ease of access.
  2. Keep the data close to where it was produced and will be consumed.  Moving data is expensive and cumbersome.  You are better off keeping it close.
  3. Avoid solutions that require you to re-architect your applications.  You may not need to get to the data very often, but when you do you want it to be simple and easy.
  4. Trade performance for durability.  High performance with reasonable durability is a formula for disaster.  You want high durability with reasonable performance.  At the lowest price possible.
  5. Aim for zero-touch management.  You are busy enough without having to babysit data nobody ever looks at.

Let me know: how big an issue is cold data for you?  What is your view on the cold data occupying your systems today?  Where do you think it should reside, and why?