Enterprise storage systems are expected to keep data safe and to allow it to be accessed with excellent performance. This is a well-explored problem space, and today many large corporations build their business on selling hardware that stores data. Despite the relatively established nature of storage technology, remarkably few approaches have been explored in the design of storage systems.
Data storage in network environments has traditionally been designed in one of two ways: the dominant approach is to have a single, shared server (often called a target or array) that houses a bunch of persistent memory (disks or flash) and presents it over a network connection using a protocol such as NFS or iSCSI. A secondary, and less popular, approach is called “distributed storage” (or sometimes “clustered storage”) in which many network connected devices collaborate to provide storage functionality.
The centralized approach used in the first class is appealing because it is easier to build and reason about; however, it has difficulty achieving very high performance, because a single device must scale to handle a very high rate of requests.
The second approach has potential benefits in terms of both performance and cost: many lower-cost storage targets may federate to provide a higher aggregate level of performance than can be achieved on a single server. Unfortunately, distributed storage presents problems in multiple areas. A large class of problems in distributed storage is that system-wide state (such as where the current and correct version of a piece of data is located) and system-wide decisions (such as whether a device has failed and how to recover) end up being distributed and involve a great deal of complexity of design and implementation in order to match the functionality of a centralized solution.
By and large, the design of enterprise storage is treated much like the design of any other software server: A piece of software is written to handle read and write requests, and this software is deployed on one or more end hosts. In some cases, these end hosts are actually sold, as a package, including the storage server software. Three common approaches to this design are summarized as belonging to Monolithic Storage Devices, Clustered or Parallel File and Storage Systems, and Peer-to-Peer or Overlay Network-based Storage.
Monolithic storage devices, often known as “Filers” (in the case of file-level protocols such as CIFS and NFS), “Arrays” (in the case of block-level protocols such as iSCSI or Fibre Channel), or more generally as “Storage Targets”, are generally single physical devices that contain disks and computing capabilities, attach to an enterprise network, and store data. In this model a vendor tightly couples the storage server software with the specific hardware on which it will run and sells the entire unit as a package. Popular examples here include NFS servers from Network Appliance, or arrays from EMC, HP, or IBM.
In clustered or parallel file and storage systems, the storage software is spread across many physical devices. Systems typically divide responsibility between a small number of critical hosts that handle control messages and requests for highly contended data, and a second class of servers that just store data. The first tier of servers is often referred to, in the case of clustered file systems, as metadata servers. Clustered systems may be packaged completely as software, as is the case with systems such as Lustre, Gluster, CLVM, or the Google File System, or as hardware, such as Panasas, Isilon, or iBricks.
Some more recent systems have explored peer-to-peer style storage, or overlay network-based storage, in which a collection of individual devices achieves some degree of self-organization by dividing a large virtual storage address space among themselves. These systems often use Distributed Hash Tables (DHTs), applying hash functions to either data or data addresses to distribute data over a large collection of hosts and thereby achieve scalability. Examples of these systems include file systems such as Ceph, Corfu, and the Fast Array of Wimpy Nodes (FAWN) prototypes, which combine purpose-designed hardware and software.
These classifications are not meant to perfectly taxonomize storage systems, but rather to show that while a number of approaches have been taken to the construction of storage systems, they have all been implemented effectively as software server applications that may or may not include end server hardware. As such, these designs all hinge on the premise that logic in the end systems is where enterprise storage should be implemented. They are designed with the assumption that relatively simple and general-purpose underlying networks (even storage-specific networks such as Fibre Channel) are sufficient to build reliable, high-performance storage.
Although it is possible to construct a very high performance monolithic storage system with a great deal of bandwidth and fairly low latency, it is difficult for such a system to compete with the latency and bandwidth of local device buses on modern hardware, such as PCIe. In approaches described herein, resources may be provisioned on the host for the best possible performance, while still providing availability (location transparency, replication). Disclosed systems make efficient use of resources that are already present (fast storage, switching, host bandwidth, and CPU) to provide a high-performance storage target at much lower cost than a dedicated monolithic appliance. Further, monolithic storage systems invariably add an unnecessary bottleneck to the design of networked storage systems. Where a single end system (the storage target) is required to serve request traffic from multiple clients, it must scale in performance in order to satisfy the demands of that request load. Scaling a single end system in this manner is challenging for a number of reasons, including (as only a few simple examples) the bandwidth of network connections to the collection of clients, the bandwidth and latency of access to its local persistent storage devices, and the CPU and memory demands of processing and issuing individual requests.
Recent years have seen a fundamental set of changes to the technical capabilities of enterprise computing. In particular: (a) non-volatile memories, such as Flash-based technologies, have become fast and inexpensive, and are connected directly to individual computers over high-speed buses such as PCIe; (b) server CPUs have become increasingly parallel, often possessing additional cores that may be dedicated to the management of specific devices such as network interfaces or disks, and these cores may directly manage a subset of PCIe devices on a system; (c) network switching hardware has become faster, more capable, and more extensible.
Projects such as OpenFlow, and commercial products including Arista Networks' Extensible Operating System (EOS), allow new, software-based functionality to be pushed onto the network forwarding path. All three of these factors characterize commodity hardware, and reflect trends that will increase in the years to come.
It is no longer sensible to think of storage architectures as systems that are implemented on end hosts at the other end of the network from the applications that consume them. It is also no longer sensible to consider high-performance storage as an application server that is implemented above a general-purpose network. These assumptions are common in virtually all storage systems that are designed and sold today, and do not reflect the realities of emerging hardware.
In distributed storage systems, it is assumed that all participants of the system are effectively independent, and may communicate with each other in arbitrary ways. As a result, in the event of a loss of connection to a small number of nodes, it is hard to distinguish between the case where those nodes have all simultaneously failed, and the case where the network has become partitioned, leaving those nodes alive but unable to communicate with the rest of the system. Similarly, a decision to move a piece of data stored on one node to reside on another typically requires that all nodes “agree” and that there is no cached state that might result in a node reading or writing a stale copy of that piece of data.
Known distributed memory systems access data over networks, and maintain some relationship between data addresses and network addresses. In NFS and SMB, for instance, a file is located at “server_address:/mount/point/file_name.ext”. Block-level protocols such as iSCSI use similar techniques. Some research systems, for instance the Chord DHT, FAWN, and Microsoft's Flat Datacenter Storage (FDS), use a hash function to map a data address onto a specific network host address. For example, a list of n hosts might be stored in a table, and then when accessing a specific piece of data, the host performing the access would calculate:
destination table index = hash_function(data address) modulo n
This methodology results in the hash function evenly, but semi-randomly, distributing load over the hosts in the table. In these cases, requests are still sent specifically to end hosts, leading to considerable complexity in activities such as (a) adding or removing hosts from the cluster, (b) responding to the failure of individual hosts, (c) moving specific pieces of data, for instance to rebalance load in the face of hot spots.
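The hash-and-modulo lookup described above, and the cost of changing the host table, can be illustrated with a minimal sketch. The function names, the use of SHA-256 as the hash function, and the example addresses are assumptions made for illustration only and are not taken from any of the systems named above.

```python
import hashlib
from collections import Counter


def host_index(data_address: str, n: int) -> int:
    """Return the destination table index: hash_function(data address) modulo n.

    SHA-256 is an illustrative choice of hash function (an assumption).
    """
    digest = hashlib.sha256(data_address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def load_per_host(addresses, n: int) -> Counter:
    """Count how many addresses map to each of the n table entries."""
    return Counter(host_index(a, n) for a in addresses)


def fraction_moved(addresses, n_before: int, n_after: int) -> float:
    """Fraction of addresses whose destination changes when the table is resized."""
    moved = sum(
        1 for a in addresses
        if host_index(a, n_before) != host_index(a, n_after)
    )
    return moved / len(addresses)


if __name__ == "__main__":
    # Hypothetical data addresses, for illustration only.
    addresses = [f"volume0/block{i}" for i in range(10_000)]
    print("load with 4 hosts:", dict(load_per_host(addresses, 4)))
    print("fraction moved when growing 4 -> 5 hosts:",
          fraction_moved(addresses, 4, 5))
```

With a uniform hash, growing the table from four hosts to five would be expected to change the destination of roughly four fifths of all addresses, which is one concrete reason that adding or removing hosts is costly under this scheme.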
In known network switches, deciding where to send writes in order to distribute load in a distributed system has been challenging; techniques such as uniform hashing have been used to approximate load balancing. In all of these solutions, requests have to pass through a dumb switch which has no information relating to the distributed resources available to it and, moreover, complex logic to support routing, replication, and load-balancing becomes very difficult since the various memory resources must work in concert to some degree to understand where data is and how it has been treated by other memory resources in the distributed hosts.
Storage may be considered to be increasingly both expensive and underutilized. PCIe flash memories are available from numerous hardware vendors and range in random access performance from about 50K to about 1M Input/Output Operations per Second (“IOPS”). At 50K IOPS, a single flash device consumes 25 W and has comparable random access performance to an aggregate of 250 15K enterprise-class SAS hard disks that consume 10 W each. In enterprise environments, the hardware cost and performance characteristics of these “Storage-Class Memories” associated with distributed environments may be problematic. Few applications produce sufficient continuous load as to entirely utilize a single device, and multiple devices must be combined to achieve redundancy. Unfortunately, the performance of these memories defies traditional “array” form factors, because, unlike spinning disks, even a single card is capable of saturating a 10 Gb network interface, and may require significant CPU resources to operate at that speed. While promising results have been achieved in aggregating a distributed set of nonvolatile memories into distributed data structures, these systems have focused on specific workloads and interfaces, such as KV stores or shared logs, and assumed a single global domain of trust. Enterprise environments have multiple tenants and require support for legacy storage protocols such as iSCSI and NFS. The problem presented by aspects of storage class memory may be considered similar to that experienced with enterprise servers: Server hardware was often idle, and environments hosted large numbers of inflexible, unchangeable OS and application stacks. Hardware virtualization decoupled the entire software stack from the hardware that it ran on, allowing existing applications to more densely share physical resources, while also enabling entirely new software systems to be deployed alongside incumbent application stacks.
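The power comparison given near the start of the preceding paragraph can be worked through explicitly. The short calculation below uses only the figures stated there; it assumes that “comparable” means equal aggregate IOPS, so the per-disk figure it derives is implied rather than stated in the text.

```python
# Figures stated in the text.
FLASH_IOPS = 50_000   # random-access IOPS of one PCIe flash device
FLASH_WATTS = 25      # power drawn by that flash device
DISK_COUNT = 250      # number of 15K SAS disks with comparable aggregate IOPS
DISK_WATTS = 10       # power drawn by each disk

# Derived values (assumption: "comparable" is treated as equal aggregate IOPS).
disk_array_watts = DISK_COUNT * DISK_WATTS              # total power of the disks
implied_iops_per_disk = FLASH_IOPS / DISK_COUNT         # IOPS each disk must supply
flash_iops_per_watt = FLASH_IOPS / FLASH_WATTS
disk_iops_per_watt = FLASH_IOPS / disk_array_watts
efficiency_ratio = flash_iops_per_watt / disk_iops_per_watt

if __name__ == "__main__":
    print("disk array power (W):", disk_array_watts)
    print("implied IOPS per disk:", implied_iops_per_disk)
    print("flash IOPS per watt:", flash_iops_per_watt)
    print("disk IOPS per watt:", disk_iops_per_watt)
    print("flash advantage in IOPS per watt:", efficiency_ratio)
```

Under these assumptions, the disks together draw 2,500 W against 25 W for the flash device, a hundredfold difference in IOPS per watt.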
Therefore, a solution that achieves the cost and performance benefits of distributed storage, without incurring the associated complexity of existing distributed storage systems, is desirable.
The examples and objectives described above are included solely to advance the understanding of the subject matter described herein and are not intended in any way to limit the invention to aspects that are in accordance with the examples or improvements described above.