A Fail-in-Place Platform—RatioPerfect™ Every Time

by Jeff Hughes – October 25, 2016

We have previously described our vision around Zero-Touch Infrastructure™, the first of two key architectural components that enable us to deliver on the promise of a True Cloud for Local Data. In this article, I will expand on the second key architectural component: on-premises appliances that are truly a fail-in-place platform.

Multiple conversations with customers running storage and server infrastructure at scale confirmed our own experience:

  1. At scale, components routinely fail.
  2. Servicing failures manually is laborious, cumbersome, and causes friction. The cost of labor to service failed parts exceeds the actual cost of the parts themselves.

Our goal was that no matter which component (disk, memory, power supply, fans, etc.) of our on-premises appliances failed, it should (a) not impact customer workflows, and (b) not require human intervention to mitigate.

As a platform, traditional storage servers — with dual or quad Intel Xeon processors, 32–128GB of DRAM, and 60 or more disks — were anathema to us. To begin with, a great deal of compute power is accessing hundreds of terabytes of data through a thin straw (a 6Gbps SAS bus!).
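To put rough numbers on that straw (the bus speed and drive count are the figures quoted above; the arithmetic is only an illustration, not a measured benchmark), consider a quick sketch:

    # Back-of-the-envelope math for the shared SAS "straw" (illustrative only).
    SAS_BUS_GBPS = 6        # a single 6 Gb/s SAS link feeding the enclosure
    DRIVES_PER_SERVER = 60  # dense storage server, per the description above

    per_drive_mbps = SAS_BUS_GBPS * 1000 / DRIVES_PER_SERVER
    print(f"~{per_drive_mbps:.0f} Mb/s per drive when all drives are busy")
    # ~100 Mb/s per drive -- a fraction of what one spinning disk can stream.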

[Diagram: traditional storage server architecture]

Even more worrisome, though, was the fact that a storage server failure meant hundreds of terabytes of data were instantly unavailable, and resolving the failure required human intervention to replace the server. Sure, software erasure coding techniques could be used to “recover” the unavailable data, but the time to rebuild hundreds of terabytes is measured in weeks, during which the customer’s workflow would be significantly impacted.
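To make that timescale concrete (the capacity and rebuild throughput below are assumptions chosen only to show the order of magnitude, not measured figures):

    # Rough rebuild-time estimate after a storage server failure (illustrative only).
    FAILED_CAPACITY_TB = 400   # assumed data behind one failed storage server
    REBUILD_RATE_MB_S = 200    # assumed sustained rebuild throughput, MB/s

    seconds = FAILED_CAPACITY_TB * 1_000_000 / REBUILD_RATE_MB_S  # TB -> MB
    print(f"~{seconds / 86_400:.0f} days to re-protect {FAILED_CAPACITY_TB} TB")
    # ~23 days -- roughly three weeks of degraded protection and impacted workflows.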

Enter our patented RatioPerfect architecture. We leveraged off-the-shelf commodity ARM processors (low cost, low power) and built a dedicated “controller” for every drive.

[Diagram: fail-in-place architecture with a nano-server controller per drive]

Each drive has its own ARM-based controller that runs Linux and is dedicated to managing just that drive.  This converts a dumb disk drive into an intelligent nano-server that is (dual) Ethernet-connected and runs part of our software stack.  The problem of lots of compute power accessing hundreds of terabytes of data through a thin straw is completely resolved, as we now have unrestricted gigabit bandwidth to each and every drive.  The “blast radius” of a component failure is reduced from hundreds of terabytes to the capacity of a single drive.  Our software automatically detects any component failures, recovers data from the failed nano-server, and routes around the failed components — all without requiring any human intervention.
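A minimal sketch of that detect-recover-reroute loop is below. The class and function names are hypothetical and the logic is deliberately simplified; it illustrates the fail-in-place idea, not the production software stack.

    # Conceptual fail-in-place loop: detect a failed nano-server, re-protect its
    # data from surviving peers, and route I/O around it -- no human in the loop.
    # All names and behavior are illustrative assumptions, not the real stack.

    class NanoServer:
        def __init__(self, drive_id: str):
            self.drive_id = drive_id
            self.healthy = True

        def heartbeat(self) -> bool:
            # In practice this would be a health check over the drive's dual
            # Ethernet links; here it simply reports a stored flag.
            return self.healthy

    def monitor_and_heal(cluster: list, excluded: set) -> None:
        """One pass of automatic failure detection and recovery."""
        for node in cluster:
            if node.drive_id not in excluded and not node.heartbeat():
                # Blast radius is a single drive: rebuild only that drive's
                # data from erasure-coded fragments held by healthy peers.
                print(f"Rebuilding data from failed drive {node.drive_id}")
                excluded.add(node.drive_id)  # route future I/O around it

    # Example: one drive fails; the next monitoring pass handles it automatically.
    cluster = [NanoServer(f"drive-{i}") for i in range(4)]
    cluster[2].healthy = False
    excluded = set()
    monitor_and_heal(cluster, excluded)
    print("Excluded drives:", excluded)

In the real appliance, detection, reconstruction, and re-routing run continuously across every nano-server; the point is that each step is driven by software rather than by a technician with a replacement part.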

The result is an appliance platform that is both commodity and truly fail-in-place, so we can bring our customers the True Cloud experience for their Local Data. In a future article, I will dig deeper into the technical reasons behind this architecture.
