Distributed block storage systems provide block device functionality to applications by presenting logical block devices that are stored in segments scattered across a large pool of remote storage devices. To use these logical block devices, applications need to determine the location of every segment they access. Querying a directory service for a segment's location before each I/O request greatly increases access latency, while determining all locations in advance places unacceptable overhead on the system that keeps that location information up to date. Popular large-scale distributed storage systems like Ceph (available at www.ceph.com) and Gluster (available at docs.gluster.org) use consistent hashing to minimize the cost of determining logical block device segment locations on demand. Unfortunately, these hashing techniques can't be used throughout large-scale distributed storage in datacenters. Neither system uses a standard storage protocol, and both require specific software in the client device to enable access to the storage. The client device software has a significant runtime and operational cost. These techniques also require the client device using the storage to have access to the storage cluster. Some client devices are untrusted, so that form of access presents an unacceptable security risk.
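For illustration only, the general idea of locating a segment by consistent hashing, rather than by a directory lookup, can be sketched as follows. This is a generic hash ring with virtual nodes; it is not the placement algorithm actually used by Ceph or Gluster, and the names `HashRing`, `locate`, and the node and volume labels are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer with SHA-256 (stable across runs)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Generic consistent-hash ring (illustrative assumption, not Ceph or Gluster code)."""

    def __init__(self, nodes, vnodes=64):
        if not nodes:
            raise ValueError("at least one storage node is required")
        # Each node contributes `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def locate(self, volume: str, segment_index: int) -> str:
        """Return the node owning a segment, computed locally with no directory query."""
        point = _hash(f"{volume}/{segment_index}")
        # The owner is the first ring point clockwise from the segment's hash.
        idx = bisect.bisect_right(self._points, point) % len(self._points)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    for segment in range(4):
        print(segment, ring.locate("vol1", segment))
```

In this sketch, any client that knows the list of nodes can compute a segment's location on demand, and removing a node relocates only the segments that node owned. It also makes the drawback noted above concrete: the client must run this logic itself and must know, and be able to reach, the nodes of the storage cluster.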
The problem of performance overhead in the client device is exacerbated when the client device runs in a resource-limited location such as a smart network interface card (NIC) or offloaded device. Data centers may be required to deploy large numbers of gateway machines to enable applications running on client systems to use the distributed block storage service. This adds latency and uses network resources inefficiently because of the extra hops required to reach the actual data node.
One approach is to use native distributed storage client devices. Any storage node can run a block device client. Virtual machines (VMs) can be isolated from this via distributed block gateways integrated into the hypervisor. For “bare metal” computing systems or containers there are some kernel implementations, but often a user mode gateway is required. These local gateways are not lightweight and require the cluster administrator to trust the node that runs them.
Another approach is to use dedicated storage gateways. Isolation for untrusted and bare metal applications can be accomplished by using a large number of dedicated gateways (e.g., those using Internet Small Computer Systems Interface (iSCSI)). These appear to the application like a traditional storage array. They must collectively provide high availability, multipath I/O (MPIO), and load balancing just like a traditional storage array. However, this adds one network hop for all storage operations (i.e., initiator to gateway, and gateway to cluster), thereby decreasing system efficiency.
Yet another approach is to use distributed clients in a smart NIC. This involves running a storage client like Ceph Reliable Autonomic Distributed Object Store (RADOS) block device (RBD) in the smart NIC and presenting the RBD volume to the bare metal host, container, or VM as a standard hardware block device (e.g., a non-volatile memory express (NVMe) device supporting the Non-Volatile Memory Express (NVMe) Specification, revision 1.3c, published on May 24, 2018 (“NVMe specification”) or later revisions). This solves application connectivity issues by using NVMe as a common protocol but requires a complex NIC implementation. NICs with enough processing cores to perform this processing may consume an unacceptable share of the power and cooling budget of the compute host housing the NIC. The Ceph client code is fairly complex, and is best treated as a package that can be updated with the rest of the Ceph cluster. When the client is embedded as a NIC offload, such updates may become difficult, because cluster administrators or tenants can't necessarily be trusted to manage software embedded in the NIC that enforces isolation of tenants from each other and from the datacenter management network.