Non-volatile memory express (NVMe) defines a register-level interface for host software to communicate with a non-volatile memory subsystem (e.g., a solid-state drive (SSD)) over a peripheral component interconnect express (PCIe) bus. NVMe over fabrics (NVMeoF, or NVMf for short) defines a common architecture that supports the NVMe block storage protocol over a wide range of storage networking fabrics such as Ethernet, Fibre Channel, and InfiniBand. NVMeoF is compatible with different network stacks, including Transmission Control Protocol (TCP)/Internet Protocol (IP) and remote direct memory access (RDMA), over the underlying fabrics.
Many large-scale services (e.g., cloud-based services) that target a variety of applications can be hosted by a number of servers in a datacenter. Such services are often required to be interactive, which makes them sensitive to response time. Therefore, high-performance storage devices that offer low data access latency while providing high throughput have become prevalent in today's datacenters.
In particular, NVMe-based SSDs and NVMeoF devices are becoming the storage of choice for datacenters due to their high bandwidth, low latency, and excellent random input/output (I/O) performance. However, these high-performance storage devices can introduce periodic latency spikes due to background tasks such as garbage collection. In addition, multiple services co-located on the same server can increase latency unpredictability when the applications running the services compete for shared system resources such as central processing units (CPUs), memory, and the disk bandwidth of the storage devices over the underlying fabric. The latency spikes and unpredictability can lead to a long tail latency, in which a small fraction of requests take far longer than the typical request, and this can significantly degrade system performance.
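As a rough illustration of how periodic latency spikes lengthen the tail without moving the typical response time, the following minimal sketch computes the median and the 99th percentile of a synthetic latency trace. The trace values (100 microseconds per request, with every 50th request stalled to 5000 microseconds) and the helper name `percentile` are assumptions chosen for illustration and are not measurements from any device.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical trace: 1000 requests, every 50th one stalls
# (for example, behind a background task such as garbage collection).
latencies_us = [5000 if i % 50 == 0 else 100 for i in range(1000)]

p50 = percentile(latencies_us, 50)   # typical request
p99 = percentile(latencies_us, 99)   # tail request
mean = sum(latencies_us) / len(latencies_us)

print(f"p50={p50} us, p99={p99} us, mean={mean} us")
```

In this synthetic trace only 2% of the requests are slow, so the median stays at the fast value while the 99th percentile lands on a stalled request. An interactive service whose response depends on many such requests would be gated by the slow ones.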
Workload scheduling is a critical issue for a multi-tenant application server that distributes resources among tenant applications. In container-based virtualization, an application container controls an application instance that runs within a virtualized environment. The individual instances of the application can share the operating system (OS) of the host system while having separate code containers for libraries and other resources.
A typical large-scale datacenter system has a datacenter-level scheduler. The datacenter-level scheduler centralizes decision-making for workload migration by taking into account each application's quality of service (QoS) requirements and the underlying server-level resources, including CPU cores and memory resources. However, these server-level resources provide only limited support for storage system resources. In general, the datacenter-level scheduler attempts to minimize data movement from one storage device to another. For example, when migrating a workload, the datacenter-level scheduler selects a target node from among a plurality of candidate nodes based on the proximity of each candidate to the current node where the data is stored and/or the bandwidth available for moving the data from the current node to the target node.
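The target-node selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler's actual algorithm: the `CandidateNode` fields (`hops_from_data` as a proximity measure, `available_gbps` as the available bandwidth), the `select_target` function, and the tie-breaking order (proximity first, then bandwidth) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CandidateNode:
    name: str
    hops_from_data: int    # assumed proximity metric: network hops to the current node
    available_gbps: float  # assumed bandwidth available for moving the data


def select_target(candidates: Iterable[CandidateNode],
                  min_gbps: float = 0.0) -> Optional[CandidateNode]:
    """Pick the closest candidate that meets the bandwidth floor.

    Ties on proximity are broken in favor of higher available bandwidth.
    Returns None when no candidate meets the bandwidth floor.
    """
    eligible = [c for c in candidates if c.available_gbps >= min_gbps]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.hops_from_data, -c.available_gbps))


candidates = [
    CandidateNode("node-a", hops_from_data=3, available_gbps=40.0),
    CandidateNode("node-b", hops_from_data=1, available_gbps=10.0),
    CandidateNode("node-c", hops_from_data=1, available_gbps=25.0),
]

target = select_target(candidates)
print(target.name if target else "no eligible node")
```

Note that nothing in this selection reflects the state of the storage devices themselves, such as a device that is in the middle of garbage collection. A policy of this kind sees only distance and link capacity.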
While the datacenter-level scheduler can provide global-level resource visibility and complex scheduling algorithms, it suffers from several shortcomings. First, the datacenter-level scheduler fails to account for high-performance storage drivers that have lower latency. The high-performance storage drivers can support storage devices of high storage capacity and efficiently share server resources to manage and orchestrate various internal tasks of the storage devices such as garbage collection, wear leveling, bad block remapping, write amplification, overprovisioning, etc. However, the datacenter-level scheduler does not utilize the high-performance storage drivers to their full capability. Second, the datacenter-level scheduler introduces additional complexity when taking corrective actions in cases where it has incorrectly placed workloads across the rack systems in the datacenter. Although the datacenter-level scheduler can perform corrective actions at the datacenter level, it cannot efficiently utilize the data locality and remote execution capability that the latest storage device protocols support.