Finding My Data

by Christian Smith – October 21, 2015

As anyone who knows me knows, I hate traffic. Which is why the Waze app caught my eye. The moving map is amazingly accurate regarding traffic congestion, accidents, and police locations. It’s as close to real time as I have ever seen.

Curious, I later researched how this was happening and realized the solution the Waze guys had stumbled upon maps directly onto a storage problem I had been noodling on.

Let me start with how the traffic problem was solved. Previous traffic reporting solutions depended on out-of-band methods to collect the data (driver reports, traffic cameras, government systems). These were slow to respond to changes and available only in select urban areas.

Recent solutions, such as Waze, have switched to using aggregated mobile phone location data. Your carrier knows precisely where you are at every moment. Mapping that data to highways yields the speed and congestion of traffic in any area, providing an extremely accurate, real-time view of traffic.

The key difference is that instead of sitting on the outside and ‘watching’ for traffic, the system now sits on the inside and ‘reports’ traffic back to the main systems. The system is inherently scalable because there is a one-to-one relationship between cars and monitoring devices. More cars mean more devices, so the monitoring never falls behind.

This whole scenario applies directly to a problem petabyte-scale storage systems have: Search. Let’s do a quick review of the history of search.

In simple storage systems, searching for data is done in a brute-force manner. You start by examining the first item and continue until you reach the last item. The brute-force method breaks down with even moderately large storage systems because it takes too long. You have no doubt witnessed this with primitive email searches, where it can take 20 minutes or longer to complete a simple search.
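To make the cost concrete, here is a minimal Python sketch of the brute-force approach. The function name and the simple filename match are illustrative, not taken from any particular file system:

```python
import os

def brute_force_search(root, term):
    """Walk the entire directory tree and return paths whose names contain the term."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if term.lower() in name.lower():
                matches.append(os.path.join(dirpath, name))
    return matches

# Every query pays the full cost of the walk, so time grows with the number
# of items -- tolerable for thousands of files, hopeless at petabyte scale.
```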

So, system folks improved this process by pre-indexing. The system creates an initial index of search terms by crawling the data item-by-item. This takes an enormous amount of time, of course, but when a user searches for something, the file system refers to the index and returns the results virtually instantaneously.
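Here is an equally minimal sketch of the pre-indexing approach, again with illustrative names rather than any real product’s API: one expensive crawl up front, then cheap lookups afterward.

```python
import os
from collections import defaultdict

def build_index(root):
    """One expensive crawl that maps each term to the paths containing it."""
    index = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            for term in name.lower().split('.'):
                if term:
                    index[term].append(path)
    return index

def search(index, term):
    """Lookups are near-instant because the crawl already paid the cost."""
    return index.get(term.lower(), [])
```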

Great, but when applied to petabyte-scale storage systems this pre-indexing is of little value. It takes so long that by the time it is finished, the underlying storage system has changed and the pre-index is no longer valid.

So, users are faced with two bad options for finding information on really big storage systems: the brute-force method, which is way too slow to be usable, and the pre-index method, which is out of date before it is even ready.

The way the traffic guys fixed their problem is precisely how storage guys should fix the search problem. The key is to architect a ‘watch’ service that is integrated into the storage tier and can maintain an index in essentially real time.

Kiran talked about AWS Lambda in a previous post and postulated that while Lambda was awesome for AWS S3 storage, a similar capability is needed for on-premise storage. This search problem is a perfect use case for why this is true. Here is how search indexing would work if you had an on-premise equivalent to Lambda:

Instead of sitting on the outside and watching, the indexing service would be triggered automatically every time the storage changed. If something was added, changed, deleted or moved, the indexing service would immediately (within seconds) kick in and revise the master index.
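As a rough sketch of what such a handler might look like (the event shape and index structure are my own assumptions, not the API of any actual on-premise Lambda equivalent):

```python
from collections import defaultdict

def on_storage_event(event, index):
    """Apply one change notification to a term -> paths index, incrementally."""
    path = event["path"]
    # Drop any stale entries for this path first.
    for paths in index.values():
        if path in paths:
            paths.remove(path)
    # Re-add terms for anything that still exists after the change.
    if event["type"] in ("created", "changed", "moved"):
        for term in path.lower().replace("/", ".").split("."):
            if term:
                index[term].append(path)

# Example: the storage tier would invoke the handler within seconds of each change.
index = defaultdict(list)
on_storage_event({"type": "created", "path": "/projects/q3/report.pdf"}, index)
on_storage_event({"type": "deleted", "path": "/projects/q3/old-draft.pdf"}, index)
```

Because each event touches only the item that changed, the work per event stays constant no matter how large the overall system grows.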

The first advantage is that the indexing system could no longer get out of sync with the underlying information. The second advantage is that the indexing system would scale automatically to meet demand. As with Lambda, the service would automatically fire up as many independent instances of the indexing service as needed. In other words, it would scale out as needed.

This is a perfect example of why on-premise storage needs an AWS Lambda-like service architecture. It vastly simplifies tasks like indexing while providing the scale-out capacity that large-scale storage systems require.

At least, that’s my opinion. What’s your take? Oh, and by the way, I still hate traffic!