Extensible Data Path Architecture — A Bright Future for Your Data

by Jeff Hughes – November 1, 2016

Previously, I presented the details of our underlying RatioPerfect™ architecture and how it allows us to deliver a platform for your data that is free of I/O bottlenecks and offers a true fail-in-place model. Today, I’d like to talk about our software architecture, in particular how it allows for an extensible data path.

At most businesses with even moderate amounts of active data, maybe a few hundred terabytes, we found a scalable compute tier deployed and managed separately in front of the data storage tier. This compute tier required custom software to accomplish even simple things like figuring out what data was created and when. Complex actions, like transforming the data or extracting embedded metadata, required even more customization and scheduling. As data sizes and ingest rates grew, not only did the storage layer grow, but so did the compute tier, bringing with it the complexity and fragility of the custom software stack. At data rates of terabytes per day, an application that continuously scans incoming data to perform even these simple operations quickly falls behind.

It struck us that when data is being written to durable storage, we know a whole lot about it: where it is coming from, how big it is, what type of data it is, and so on. Those facts give us the entire context and information required to satisfy a customer’s need to manipulate data while it is in transit. A simple example might be “Tag each incoming picture by its origin, index its embedded EXIF data for search, create a thumbnail for visual search, and compress the primary image.” But no storage system was able to do this!

[Figure: traditional storage architecture]

Traditional storage tiers are great at being a dumb bit bucket. They store bits and return them to you, as long as you ask for them in the one right way. If an enterprise data application wanted access to Google Drive-like functionality, it had to be custom developed. That’s easy to do as a demo or proof-of-concept, but hard to productize and operationalize at scale. We knew it was time for an enterprise-class, intelligent data layer that would be more than a dumb bit bucket.

Having previously built more than one traditional storage system, we knew why things are the way they are. In the storage (and networking) world, there is the concept of a data path that is fast, highly optimized, and fragile. Adding any new functionality like that described above requires modifying the data path. This is fraught with risk, which is why release cycles from traditional vendors are measured in years. We knew we could do better, and in doing so deliver to our customers a data platform that could not only reliably store data, but also eliminate an entire layer of compute and network infrastructure, along with its management.

The key is a data path architected to be extensible in a way that provides advanced data services (such as search indexing, replication, and data classification) in-line but out of the latency path. We do this by creating and persisting an event stream of all activity (reads, writes, deletes) along with related and relevant metadata (time, requestor, etc.). By building data services around this “in-line but out of the latency path” event stream, we ensure that primary application latency does not increase. Additionally, this decoupled architecture improves reliability and reduces the risk involved in developing and introducing new data services. Think of our extensible data path architecture as akin to a network tap, except that it’s in software and persisted for replay.
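To make the tap concrete, here is a minimal sketch of the pattern in Python. Every name in it is hypothetical, and it is not our implementation; it only illustrates how appending event records to a persisted log off the latency path keeps the primary operation fast:

```python
import json
import queue
import threading
import time

# The primary data path hands each completed operation to an in-memory
# queue and returns immediately; a background thread drains the queue
# into a persisted, replayable event log.
event_queue = queue.Queue()

def record_event(op, key, size, content_type, requestor):
    """Called from the data path after an operation is acknowledged.
    Enqueuing is cheap, so client-visible latency is unaffected."""
    event_queue.put({
        "op": op,                      # "read" | "write" | "delete"
        "key": key,                    # the object acted on
        "size": size,                  # bytes
        "content_type": content_type,  # e.g. "image/jpeg"
        "requestor": requestor,        # who issued the request
        "ts": time.time(),             # when it happened
    })

def persist_events(log_path="events.log"):
    """Background writer: appends events durably, one JSON record per
    line, where pub-sub delivery and replay can later read them."""
    with open(log_path, "a") as log:
        while True:
            log.write(json.dumps(event_queue.get()) + "\n")
            log.flush()

threading.Thread(target=persist_events, daemon=True).start()

# The data path calls record_event() as a side effect of each operation:
record_event("write", "photos/2016/dsc_0042.jpg", 4194304,
             "image/jpeg", "camera-uploader")
time.sleep(0.1)  # demo only: give the background writer a moment to flush
```

Because the data path only enqueues a record, a slow or failed consumer never stalls a client, and the persisted log is what makes replay possible later.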

[Figure: event log driving asynchronous actions]

Under the covers we use a log-structured merge-tree to efficiently store this historical event stream. Its contents are accessible via REST APIs, both in a pub-sub model and as a raw time-based event stream (think Apache Kafka). Coupled with code delivered in containers, the event stream makes it possible to quickly implement just the unique, differentiated functionality, without re-architecting the high-performance data path for every feature. We are not inventing a whole new way of doing things. Our goal is to operationalize patterns like AWS Lambda and Google Cloud / Azure Functions (serverless computing), plus Apache Storm, Kafka, and Spark, in a manner that accelerates our own development.
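As a sketch of what consuming that stream might look like, here is a hypothetical poller against the raw time-based REST feed. The endpoint, field names, and payload shape are assumptions for illustration, not our published API:

```python
import time
import requests  # third-party HTTP client: pip install requests

# Hypothetical endpoint and payload shape for illustration only.
EVENTS_URL = "https://storage.example.com/v1/events"

def poll_events(cursor):
    """Replay the raw, time-ordered event stream from a cursor
    (Kafka-style): returns the events after `cursor` and a new cursor."""
    resp = requests.get(EVENTS_URL, params={"since": cursor}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    return body["events"], body["next_cursor"]

def handle(event):
    """Container-delivered code implements only the differentiated logic,
    here reacting to the picture workflow described earlier."""
    if event["op"] == "write" and event["content_type"].startswith("image/"):
        print("new image to tag, index, thumbnail, compress:", event["key"])

cursor = 0
while True:
    events, cursor = poll_events(cursor)
    for event in events:
        handle(event)
    if not events:
        time.sleep(1)  # back off briefly when the stream is idle
```

A pub-sub subscription would push the same records instead of requiring a polling loop, and the handler is exactly the kind of code that ships in a container.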

[Figure: extensible data path architecture]

So what's in it for our customers? Using the same event-stream APIs, they can eliminate the undifferentiated heavy lifting of infrastructure built to curate and decorate data, and focus on their unique differentiation. They now have a platform for data and workflows that they can’t or won’t move to the public cloud, but with the flexibility and agility of the public cloud development paradigms this was modeled on. The architecture enables delivery of new data services, like auto-classification of incoming data streams, in mere weeks, compared to months or years with traditional storage servers and virtual machines.
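As one final hedged sketch, here is how the classification step of such a customer-built service might start out, again with all names invented for illustration: each write event’s metadata is mapped to a coarse data class, with no separate compute tier to stand up or schedule:

```python
# Hypothetical sketch of a customer-built data service; the event
# records would arrive via the same stream shown in the earlier sketch.
CLASSES = {
    "image/": "media",
    "text/csv": "dataset",
    "application/pdf": "document",
}

def classify(event):
    """Return a coarse data class for an incoming write event."""
    for prefix, label in CLASSES.items():
        if event["content_type"].startswith(prefix):
            return label
    return "unclassified"

print(classify({"op": "write",
                "key": "scans/report.pdf",
                "content_type": "application/pdf"}))  # -> document
```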