Storage protocols have been designed in the past to provide reliable delivery of data. Examples include Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), and iSCSI, including RDMA-capable transports (e.g., InfiniBand™). NVMe is a relatively recent storage protocol that is designed for a new class of storage media, such as NAND Flash™ and the like. As the name NVMe (Non-Volatile Memory Express) suggests, NVMe is a protocol highly optimized for media whose speed is close to that of DRAM, as opposed to that of Hard Disk Drives (HDDs). NVMe is typically accessed on a host system via a driver over the PCIe interface of the host. However, as noted above, methods and systems disclosed herein provide for accessing NVMe over a network. Since the latency of DRAM and similar media is orders of magnitude lower than that of HDDs, the approach for accessing NVMe over a network may preferably entail minimal overhead (in terms of latency). As such, there is a need for a lightweight protocol for accessing NVMe devices over the network.
Also, NVMe is designed to operate over a PCIe interface, where there are hardly any packet drops. So, the error recovery mechanisms built into conventional NVMe are based primarily on large I/O timeouts implemented in the host driver. To enable use of NVMe over a network, a need exists to account for errors that result from packet drops.
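The cost of relying on a large I/O timeout once packet drops become possible can be illustrated with a small back-of-the-envelope model. The following is a minimal sketch, not part of the disclosed system: the function name, the 100-microsecond base latency, the drop probability, and the 30-second timeout are all illustrative assumptions, and the model assumes a dropped request is noticed only when the timeout expires and that a single retry then succeeds.

```python
def expected_latency_us(base_us: float, drop_prob: float, timeout_us: float) -> float:
    """Mean completion time of one I/O under a timeout-only recovery model.

    Assumed model (hypothetical, for illustration only):
      - with probability (1 - drop_prob) the I/O completes in base_us;
      - with probability drop_prob it is lost, the host waits timeout_us,
        retries once, and the retry completes in base_us.
    The mean is therefore base_us + drop_prob * timeout_us.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be between 0 and 1")
    return base_us + drop_prob * timeout_us


# Illustrative numbers only: 100 us media latency, 1 drop per 10,000 I/Os,
# and a 30-second host-driver timeout expressed in microseconds.
lossless = expected_latency_us(100.0, 0.0, 30_000_000.0)
lossy = expected_latency_us(100.0, 1e-4, 30_000_000.0)
print(f"no drops: {lossless:.0f} us, with drops: {lossy:.0f} us")
```

Under these assumed numbers, even a rare drop raises the mean latency by roughly a factor of thirty, which is why a timeout sized for a nearly lossless PCIe link is a poor fit for a network.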
The proliferation of scale-out applications has led to very significant challenges for enterprises that use such applications. Enterprises typically choose between solutions like virtual machines (involving software components like hypervisors and premium hardware components) and so-called “bare metal” solutions (typically involving use of an operating system like Linux™ and commodity hardware). At large scale, virtual machine solutions typically have poor input-output (IO) performance, inadequate memory, inconsistent performance, and high infrastructure cost. Bare metal solutions typically have static resource allocation (making changes in resources difficult and resulting in inefficient use of the hardware), challenges in planning capacity, inconsistent performance, and operational complexity. In both cases, inconsistent performance characterizes the existing solutions. A need exists for solutions that provide high performance in multi-tenant deployments, that can handle dynamic resource allocation, and that can use commodity hardware with a high degree of utilization.
FIG. 1 depicts the general architecture of a computing system 102, such as a server, functions and modules of which may be involved in certain embodiments disclosed herein. Storage functions (such as access to local storage devices on the server 102, such as media 104 (e.g., rotating media or flash)) and network functions (such as forwarding) have traditionally been performed separately in either software stacks or hardware devices (e.g., involving a network interface controller 118 or a storage controller 112, for network functions or storage functions, respectively). Within an operating system stack 108 (which may include an operating system and a hypervisor, in some embodiments including all the software stacks associated with storage and networking functions for the computing system), the software storage stack typically includes modules enabling use of various protocols that can be used in storage, such as the small computer system interface (SCSI) protocol, the serial ATA (SATA) protocol, the non-volatile memory express (NVMe) protocol (a protocol for accessing direct-attached storage (DAS), like solid-state drives (SSDs), through the PCI Express (PCIe) bus 110 of a typical computing system 102) or the like. The PCIe bus 110 may provide an interconnection between a CPU 106 (with processor(s) and memory) and various IO cards. The storage stack also may include volume managers, etc. Operations within the storage software stack may also include data protection, such as mirroring or RAID, backup, snapshots, deduplication, compression and encryption. Some of the storage functions may be offloaded into a storage controller 112. The software network stack includes modules, functions and the like for enabling use of various networking protocols, such as Transmission Control Protocol/Internet Protocol (TCP/IP), the domain name system protocol (DNS), the address resolution protocol (ARP), forwarding protocols, and the like.
Some of the network functions may be offloaded into a network interface controller 118 (or NIC) or the network fabric switch, such as via an Ethernet connection 120, in turn leading to a network (with various switches, routers and the like). In virtualized environments, a NIC 118 may be virtualized into several virtual NICs as specified by SR-IOV under the PCI Express standard. Although not specified by the PCI Express standard and not as common, storage controllers can also be virtualized in a similar manner. This approach allows virtual entities, such as virtual machines, access to their own private resources.
Referring to FIG. 2, one major problem with hypervisors is the complexity of IO operations. For example, in order to deal with an operation involving data across two different computers (computer system 1 and computer system 2 in FIG. 2), data must be copied repeatedly as it moves among the different software stacks involved in local storage devices 104, storage controllers 112, the CPUs 106, network interface controller 118 and the hypervisor/operating systems 108 of the computers, resulting in large numbers of inefficient data copies for each IO operation whenever an activity is undertaken that involves moving data from one computer to another, changing the configuration of storage, or the like. The route 124 is one of many examples of the complex routes that data may take from one computer to another, moving up and down the software stacks of the two computers. Data that is sought by computing system 2 may be initially located in a local storage device 104, such as a disk, of computing system 1, then pulled by a storage controller card 112 (involving an IO operation and copying), sent over the PCIe bus 110 (another IO operation) to the CPU 106, where it is handled by a hypervisor or other software component of the OS stack 108 of computing system 1. Next, the data may be delivered (another IO operation) through the network controller 118 and over the network 122 (another set of IO operations) to computing system 2. The route continues on computing system 2, where data may travel through the network controller 118 and to the CPU 106 of computing system 2 (involving additional IO operations), then be sent over the PCIe bus 110 to the local storage controller 112 for storage, then back to the hypervisor/OS stack 108 for actual use.
These operations may occur across a multiplicity of pairs of computing systems, with each exchange involving this kind of proliferation of IO operations (and many other routes are possible, each involving significant numbers of operations). Many such complex data replication and transport activities among computing systems are required in scaleout situations, which are increasingly adopted by enterprises. For example, when implementing a scaleout application like MongoDB™, customers must repeatedly run real-time queries during rebalancing operations, and perform large-scale data loading. Such activities involve very large numbers of IO operations, which result in poor performance in hypervisor solutions. Users of those applications also frequently re-shard (change the shards on which data is deployed), resulting in big problems for bare metal solutions that have static storage resource allocations, as migration of data from one location to another also involves many copying and transport operations, with large numbers of IO operations. As the amount of data used in scaleout applications grows rapidly, and the connectedness among disparate systems increases (such as in cloud deployments involving many machines), these problems grow exponentially. A need exists for storage and networking solutions that reduce the number and complexity of IO operations and otherwise improve the performance and scalability of scaleout applications without requiring expensive, premium hardware.
Referring still to FIG. 2, for many applications and use cases, data (and in turn, storage) needs to be accessed across the network between computing systems 102. Three high-level steps of this operation include the transfer of data from the storage media of one computing system out of a box, movement across the network 122, and the transfer of data into a second box (second computing system 102) to the storage media 104 of that second computing system 102. First, the out-of-the-box transfer may involve intervention from the storage controller 112, the storage stack in the OS 108, the network stack in the OS 108, and the network interface controller 118. Many traversals and copies across internal busses (PCIe 110 and memory), as well as CPU 106 processing cycles, are spent. This not only degrades performance (creating latency and throughput issues) of the operation, but also adversely affects other applications that run on the CPU. Second, once the data leaves the box 102 and moves onto the network 122, it is treated like any other network traffic and needs to be forwarded/routed to its destination. Policies are executed and decisions are made. In environments where a large amount of traffic is moving, congestion can occur in the network 122, causing degradation in performance as well as problems with availability (e.g., dropped packets, lost connections, and unpredictable latencies). Networks have mechanisms and algorithms to avoid spreading of congestion, such as pause functions, backward congestion notification (BCN), explicit congestion notification (ECN), etc. However, these are reactive methods; that is, they detect formation of congestion points and push back on the source to reduce congestion, potentially resulting in delays and performance impacts.
Third, once the data arrives at its “destination” computing system 102, it needs to be processed, which involves intervention from the network interface controller 118, the network stack in the OS 108, the storage stack in the OS 108, and the storage controller 112. As with the out-of-the-box operations noted above, many traversals and copies across internal busses, as well as CPU 106 processing cycles, are spent. Further, the final destination of the data may well reside in still a different box. This can be the result of a need for more data protection (e.g., mirroring or across-box RAID) or the need for de-duplication. If so, then the entire sequence of out-of-the-box, across-the-network, and into-the-box data transfer needs to be repeated. As described, limitations of this approach include degradation in raw performance, unpredictable performance, impact on other tenants or operations, reduced availability and reliability, and inefficient use of resources. A need exists for data transfer systems that avoid the complexity and performance impacts of the current approaches.
As an alternative to hypervisors (which provide a separate operating system for each virtual machine that they manage), technologies such as Linux™ containers have been developed (which enable a single operating system to manage multiple application containers). Also, tools such as Docker have been developed, which provide provisioning for packaging applications with libraries. Among many other innovations described throughout this disclosure, an opportunity exists for leveraging the capabilities of these emerging technologies to provide improved methods and systems for scaleout applications.
Another area in which current approaches are problematic is the strategy used to write data to individual solid state drives (SSDs) and to groups of SSDs over time, where current “garbage collection” processes typically require moving significant amounts of data through a series of copying and pasting operations (entailing large numbers of I/O operations in conventional systems), such as to copy and paste all of the valid data from an old block that contains some invalid data into a new block, so that the old block can be erased in its entirety to make it available for writing of new data. For an application, this “garbage collection” period results in an unpredictable response time. A need exists for more efficient storage strategies that reduce the number of operations required to write data to collections of SSDs, and that also minimize the response time variation for the application.
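The copy-then-erase step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the firmware of any real SSD: the `Block` class, the four-page block size, and the `garbage_collect` function are hypothetical names introduced here, and details such as wear leveling and logical-to-physical address mapping are omitted.

```python
from dataclasses import dataclass, field
from typing import List

PAGES_PER_BLOCK = 4  # assumed tiny block size, for illustration only


@dataclass
class Block:
    """A hypothetical flash block: pages are appended, and only a whole block can be erased."""
    pages: List[str] = field(default_factory=list)
    valid: List[bool] = field(default_factory=list)

    def write(self, data: str) -> int:
        """Append one page and return its index within the block."""
        if len(self.pages) >= PAGES_PER_BLOCK:
            raise ValueError("block is full")
        self.pages.append(data)
        self.valid.append(True)
        return len(self.pages) - 1

    def invalidate(self, index: int) -> None:
        """Mark a page as stale, e.g. after its data was rewritten elsewhere."""
        self.valid[index] = False

    def erase(self) -> None:
        """Erase the entire block so that it can accept new writes."""
        self.pages.clear()
        self.valid.clear()


def garbage_collect(old: Block, new: Block) -> int:
    """Copy every valid page from `old` into `new`, erase `old`, and return the copy count."""
    copies = 0
    for data, is_valid in zip(old.pages, old.valid):
        if is_valid:
            new.write(data)
            copies += 1
    old.erase()
    return copies


# Usage: a full block with one stale page costs three page copies to reclaim.
old_block, new_block = Block(), Block()
for payload in ["a", "b", "c", "d"]:
    old_block.write(payload)
old_block.invalidate(1)
page_copies = garbage_collect(old_block, new_block)
print(page_copies, new_block.pages, old_block.pages)
```

In this sketch, reclaiming a single stale page requires copying the three remaining valid pages first. That extra copying work is the source of the additional I/O operations and the response time variation noted above.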