A Fail-in-Place Platform—RatioPerfect™ Every Time

by Jeff Hughes – October 25, 2016

We have previously described our vision around Zero-Touch Infrastructure™, the first of two key architectural components that enable us to deliver on the promise of a True Cloud for Local Data. In this article, I will expand on the second key architectural component: on-premises appliances that are truly a fail-in-place platform.

Multiple conversations with customers running “at-scale” storage and server infrastructure confirmed our own experience:

  1. At scale, components routinely fail (a rough back-of-the-envelope estimate follows this list).
  2. Servicing failures manually is laborious and cumbersome, and it creates friction. The cost of the labor to service failed parts exceeds the actual cost of the parts themselves.
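
To put “routinely” into rough numbers, here is a back-of-the-envelope estimate. The fleet size and annualized failure rates (AFRs) below are illustrative assumptions, not measurements from any particular customer:

    # Back-of-the-envelope: expected component failures per year
    # in a moderately sized fleet. All figures are assumed, not measured.
    drives = 1_000            # drives under management (assumed)
    drive_afr = 0.02          # ~2% annualized failure rate (assumed)
    power_supplies = 200      # power supplies in the fleet (assumed)
    psu_afr = 0.03            # ~3% annualized failure rate (assumed)

    failures_per_year = drives * drive_afr + power_supplies * psu_afr
    print(f"Expected failures per year: {failures_per_year:.0f}")           # 26
    print(f"Roughly one failure every {365 / failures_per_year:.0f} days")  # 14

At that rate, a human is servicing a failed part every couple of weeks, which is exactly the friction described above.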

Our goal was that no matter which component (disk, memory, power supply, fan, etc.) of our on-premises appliances failed, the failure should (a) not impact customer workflows and (b) not require human intervention to mitigate.

As a platform, traditional storage servers (dual or quad Intel Xeon processors, 32–128GB of DRAM, and 60 or more disks) were complete anathema to us. To begin with, a great deal of compute power is forced to access hundreds of terabytes of data through a thin straw (a 6Gbps SAS bus!).

[Diagram: traditional storage server architecture]
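
To make the “thin straw” concrete, here is a rough bandwidth-per-capacity comparison. The drive count, drive capacity, and link speeds are illustrative assumptions rather than the specs of any particular server:

    # Rough comparison: one shared SAS uplink vs. a dedicated gigabit
    # link per drive. All figures are illustrative assumptions.
    drives = 60               # disks behind one storage server (assumed)
    drive_tb = 10             # capacity per drive in TB (assumed)
    sas_gbps = 6              # shared SAS bus bandwidth (per the article)
    gige_gbps = 1             # dedicated Ethernet link per drive (assumed)

    print(f"Capacity behind the SAS straw: {drives * drive_tb} TB")        # 600 TB
    print(f"Shared bandwidth per drive:    {sas_gbps / drives:.2f} Gbps")  # 0.10 Gbps
    print(f"Aggregate with per-drive GigE: {drives * gige_gbps} Gbps")     # 60 Gbps

Split sixty ways, that shared link leaves each drive with roughly 12.5 MB/s, a small fraction of what a single modern drive can stream on its own.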

Even more worrisome, though, was the fact that a storage server failure instantly made hundreds of terabytes of data unavailable and required human intervention to resolve by replacing the server. Sure, software erasure-coding techniques could be used to “recover” the unavailable data, but rebuilding hundreds of terabytes takes multiple weeks, during which the customer’s workflow would be significantly impacted.
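
The “multiple weeks” figure is easy to sanity-check. Assuming, for illustration, 600 TB behind the failed server and a sustained end-to-end rebuild rate of 2 Gbps (both assumptions, and real-world rebuilds are often throttled further to protect foreground workloads):

    # Rough rebuild-time estimate after losing a whole storage server.
    # Capacity and sustained rebuild rate are illustrative assumptions.
    failed_capacity_tb = 600      # data behind the failed server (assumed)
    rebuild_rate_gbps = 2         # sustained rebuild throughput (assumed)

    bits_to_rebuild = failed_capacity_tb * 8 * 10**12
    seconds = bits_to_rebuild / (rebuild_rate_gbps * 10**9)
    print(f"Rebuild time: ~{seconds / 86_400:.0f} days")   # ~28 days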

Enter our patented RatioPerfect architecture. We leveraged off-the-shelf commodity ARM processors (low cost, low power) and built a dedicated “controller” for every drive.

[Diagram: fail-in-place architecture with a controller per drive]

Each drive has its own ARM-based controller that runs Linux and is dedicated to managing just that drive.  This converts a dumb disk drive into an intelligent nano-server that is (dual) Ethernet-connected and runs part of our software stack.  The problem of lots of compute power accessing hundreds of terabytes of data through a thin straw is completely resolved, as we now have unrestricted gigabit bandwidth to each and every drive.  The “blast radius” of a component failure is reduced from hundreds of terabytes to the capacity of a single drive.  Our software automatically detects any component failures, recovers data from the failed nano-server, and routes around the failed components — all without requiring any human intervention.
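
To illustrate the “route around the failed components” behavior, here is a minimal sketch of how a client could steer a read away from an unreachable nano-server when data is replicated across several drives. The replica map, addresses, and function names are hypothetical, and a simple TCP connect stands in for whatever health checks our software actually performs:

    import socket

    # Hypothetical replica map: object id -> nano-servers holding a copy.
    REPLICAS = {
        "object-42": [("10.0.0.11", 9000), ("10.0.1.27", 9000), ("10.0.2.63", 9000)],
    }

    def is_alive(addr, timeout=0.5):
        """Treat a successful TCP connect as 'this nano-server is up'."""
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def pick_replica(object_id):
        """Return the first healthy nano-server holding the object."""
        for addr in REPLICAS[object_id]:
            if is_alive(addr):
                return addr            # healthy copy found; read from it
        raise RuntimeError("all replicas unreachable")

    if __name__ == "__main__":
        try:
            print("reading from", pick_replica("object-42"))
        except RuntimeError as exc:
            print("read failed:", exc)

Because the unit of failure is a single drive and its controller, a dead nano-server simply drops out of the healthy set; the client keeps reading from the surviving copies while the software rebuilds only that one drive’s worth of data.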

The result is an appliance platform that is both commodity and truly fail-in-place, so we can bring our customers the True Cloud experience for their Local Data. In a future article, I will dig deeper into the technical reasons behind this architecture.