Typically, in multi-processing computing systems, non-volatile memory/storage (“NVM”), such as fast Flash and Storage Class Memory, is packaged and interfaced as a fast hard disk drive, i.e., a solid-state drive (SSD).
Further, some multi-processing and parallel computing systems currently implement networking standards or protocols to maintain connections between multiple computer nodes (e.g., high performance computing nodes) and I/O nodes such as storage devices, e.g., hard disk drive memory storage devices. The Internet protocol suite based on the TCP and IP protocols is just one example. Other examples of networked storage I/O connection standards for accessing hard disk storage devices and SSDs include the iSCSI protocol, an upper-layer protocol based on TCP/IP. Running these protocols to exchange data between computing systems, or between computing systems and storage systems, typically incurs overhead due to copying the data to be communicated between the involved application programs and the network protocol stack, and within the protocol stack itself.
RDMA (Remote Direct Memory Access) is a communication paradigm that overcomes this performance problem by transferring the content of local memory to a peer host's remote memory, or vice versa, without involving either host's operating system during the actual data transfer, thus avoiding the data copy operations otherwise needed. Several protocol suites exist to implement an RDMA communication stack. Infiniband® (Trademark of System I/O, Inc., Beaverton, OR), iWarp, and RoCEE (RDMA over Converged Enhanced Ethernet) are three example network technologies that can be deployed to implement an RDMA stack. These technologies use different network link technologies and different network packet formats to exchange RDMA messages between hosts.
Further, there currently exist switched fabric infrastructure technologies for server and storage connectivity, such as the OpenFabrics Enterprise Distribution (OFED™, a trademark of the OpenFabrics Alliance, Inc., California). OpenFabrics is an industry-standard framework for a host implementation of the RDMA communication paradigm, comprising the definition of an application programming interface (the RDMA ‘verbs’ API) and generic user-level and operating-system-level components to which network-technology-specific and vendor-specific components can be attached in a standardized way. OpenFabrics is open-source software for RDMA and kernel-bypass applications for use in high-performance, highly efficient networks, storage connectivity, and parallel computing. The OFED™ programming interface allows an application to access the memory of a remote machine via RDMA directives such as RDMA Read, RDMA Write, RDMA Send and RDMA Receive.
FIG. 1 particularly shows an example prior art model 10 implementing the OFED™ connectivity standard, used in the context of RDMA in a network 12 having nodes that execute tasks and may access other networked nodes. Particularly, in the embodiment depicted in FIG. 1, host devices, e.g., networked computing devices, i.e., “PeerA” device 16 and “PeerB” device 18, etc., on two different machines, together with the communications network 12 that interconnects them, create RDMA connections by first registering their respective virtual memory regions 26, 28. Then the peers 16, 18, according to OFED™ standard directives, may perform RDMA READ/WRITE operations directly into a remote peer's virtual memory address space. For example, peer 16 may access virtual memory region 28 of peer 18 via RDMA commands 27, and peer 18 may access virtual memory region 26 of peer 16 via RDMA commands 25. The respective Infiniband adapters 22, 24 each perform the RDMA operations responsive to commands from the remote peer CPU and control path elements. The work queues 17 in PeerA and work queues 19 in PeerB hold entries such as RDMA read or write directives, and respective control paths 13, 15 provide connectivity to trigger the respective Infiniband adapter to perform the RDMA.
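The register-then-access flow described above can be sketched as a minimal simulation. All class and method names below are hypothetical illustrations chosen for clarity; the actual OFED™ verbs interface is a C API (e.g., memory registration and work-request posting calls), not Python.

```python
# Minimal simulation of the RDMA register/read/write flow of FIG. 1.
# Names (Peer, MemoryRegion, rdma_write, ...) are illustrative only,
# not the real OFED verbs API.

class MemoryRegion:
    """A registered virtual memory region exposed for remote access."""
    def __init__(self, size):
        self.buf = bytearray(size)

class Peer:
    def __init__(self, name):
        self.name = name
        self.regions = {}

    def register_region(self, key, size):
        # Registration makes a local region remotely accessible and
        # hands out a key the remote peer presents with each request.
        self.regions[key] = MemoryRegion(size)
        return key

    def rdma_write(self, remote, key, offset, data):
        # Data lands directly in the remote region; the remote CPU and
        # operating system are not involved in the transfer.
        remote.regions[key].buf[offset:offset + len(data)] = data

    def rdma_read(self, remote, key, offset, length):
        return bytes(remote.regions[key].buf[offset:offset + length])

peer_a, peer_b = Peer("PeerA"), Peer("PeerB")
rkey = peer_b.register_region("region28", 64)
peer_a.rdma_write(peer_b, rkey, 0, b"hello")
assert peer_a.rdma_read(peer_b, rkey, 0, 5) == b"hello"
```

Note that the simulation captures only the data path: once region 28 is registered, PeerA reads and writes it without any cooperation from PeerB's software, which is the defining property of the RDMA paradigm.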
Further, there currently exists NVMe (Non-Volatile Memory Express) (www.nvmexpress.org/), a standard for accessing PCI-attached NVM SSDs. This standard is based on an asynchronous multi-queue model; however, access is still block-based (e.g., in multiple-byte units such as 512 bytes, 4096 bytes, 16 kilobytes, etc.). That is, access to the fast Flash and Storage Class Memory (NVM) as persistent memory/storage is slowed down by classic “block access” methods developed for mechanical media (e.g., hard disks) in currently existing systems. This is a problem in that implementing block access methods increases NVM memory access and storage times.
With more particularity, current host controller devices, such as device 35 of FIG. 2, provide interfaces such as AHCI (HBA), SCSIe, or NVMe for PCI Express-attached SSDs. In the case of NVMe, the controller 35 includes functionality for partitioning with multiple ports, parallel I/O, scatter-gather support, and up to 64,000 I/O command (send) queues, each with a maximum queue depth of 64,000 entries, as well as completion queues.
FIG. 2 shows an example of NVMe standard block device access for a multi-core (multiprocessing) system 30, which interprets memory as a hard disk; it does not provide byte addressability. In FIG. 2, a user application, via a controller interface 35, creates work request(s), e.g., write a block of data to a specified location such as a sector x, or fetch data for the user at an address “y”. The work request(s) is (are) placed in a respective command or send queue (SQ), such as SQ 31 associated with a first computing device (e.g., a first processor core) and SQ 34 associated with another computing device (e.g., processor Core n). As shown, each core of the multi-core (multiprocessing) system is in communication with the controller device 35, which is configured to process the queue entries for each host or core. The notification(s) are received at respective completion queues (CQ) 33, 36 of each respective core, and the application reads/processes the CQ to know when the data request is completed.
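The per-core submission/completion queue flow of FIG. 2 can be modeled with a short simulation. The names and command dictionaries below are illustrative stand-ins, not the actual NVMe command format or register interface.

```python
from collections import deque

# Simplified model of the NVMe multi-queue flow of FIG. 2: each core owns
# a submission queue (SQ) and a completion queue (CQ); the controller
# drains the SQs and posts completion entries. Illustrative only.

class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.sq = deque()   # command (send) queue
        self.cq = deque()   # completion queue

class Controller:
    def __init__(self, block_size=512):
        self.block_size = block_size
        self.blocks = {}    # logical block address -> block data

    def process(self, core):
        # Drain the core's SQ; post one completion entry per command.
        while core.sq:
            cmd = core.sq.popleft()
            if cmd["op"] == "write":
                self.blocks[cmd["lba"]] = cmd["data"]
                core.cq.append({"cid": cmd["cid"], "status": 0})
            elif cmd["op"] == "read":
                data = self.blocks.get(cmd["lba"], bytes(self.block_size))
                core.cq.append({"cid": cmd["cid"], "status": 0, "data": data})

ctrl = Controller()
core0 = Core(0)
core0.sq.append({"cid": 1, "op": "write", "lba": 7, "data": b"x" * 512})
core0.sq.append({"cid": 2, "op": "read", "lba": 7})
ctrl.process(core0)
completions = list(core0.cq)
assert completions[1]["data"] == b"x" * 512
```

Because every core has its own SQ/CQ pair, cores submit and reap I/O independently, which is what makes the model suitable for highly parallel systems; note, however, that every transfer is still a whole block.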
As shown in FIG. 3, an OFED™ software stack framework 50 includes kernel-level drivers 52, channel-oriented RDMA and send/receive operations, kernel bypass of the operating system 53, and both a kernel-level application programming interface (API) 55 and a user-level API 57, providing services for parallel message passing (MPI), sockets data exchange (e.g., RDS, SDP), NAS and SAN storage (e.g., iSER, NFS-RDMA, SRP), and file system/database systems. The network and fabric technologies that provide RDMA performance with OFED™ include: legacy 10 Gigabit Ethernet, iWARP, RoCE, and 10/20/40 Gigabit InfiniBand.
The OFED™ framework defines access to remote memory at byte granularity and thus avoids the drawbacks of block-based access protocols such as NVMe. Nevertheless, the OFED™ framework is currently defined only for accessing remote computer memory via a network link, and thus cannot be used to access local Non-Volatile Memory.
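The cost difference between block-granular and byte-granular access can be illustrated with a simple comparison. This is not driver code; the function names and byte-count accounting are hypothetical, and the example only counts bytes moved, ignoring real device latencies.

```python
BLOCK = 512

# Illustrative comparison: under block access, changing one byte forces
# a read-modify-write of an entire block, while byte-granular access
# (as RDMA provides for remote memory) touches exactly one byte.

storage = bytearray(4 * BLOCK)   # stand-in for an NVM device

def block_write_byte(addr, value):
    base = (addr // BLOCK) * BLOCK
    buf = bytearray(storage[base:base + BLOCK])   # read the full block
    buf[addr - base] = value                      # modify one byte
    storage[base:base + BLOCK] = buf              # write the full block back
    return 2 * BLOCK                              # bytes moved (read + write)

def byte_write(addr, value):
    storage[addr] = value
    return 1                                      # bytes moved

assert block_write_byte(700, 0xAB) == 1024
assert byte_write(700, 0xAB) == 1
assert storage[700] == 0xAB
```

In this sketch a one-byte update under block access moves 1024 bytes versus 1 byte under byte-granular access, which is precisely the overhead that motivates a byte-oriented method for local NVM.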
A new byte-oriented access method for local NVM is therefore necessary. This access method must support highly parallel or multicore systems.