Some network technologies (e.g., 1 Gb Ethernet, TCP, etc.) may provide the ability to move data between different memories in different systems. When network speeds began increasing beyond approximately 100 Mbps, network interface cards (NICs) were adapted to provide direct memory access (DMA) techniques to limit the system overhead of locally accessing data carried over the network. Virtual memory operating systems (e.g., Windows and Unix) provide for addressing memory in addition to the physical system memory. A unit of information may, for example, either be present in physical memory (i.e., “pinned down”) or be swapped out to disk. A DMA device typically accesses only physical memory, and therefore, the operating system should guarantee that the unit of information to be moved over the network is “pinned down” in physical memory before the NIC can DMA the information. That is, a particular block of memory may be configured such that the block of memory cannot be moved or swapped out to disk storage.
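The pinning requirement described above can be illustrated with a small sketch. This is a toy model only, not an operating system interface: the class name PagedMemory and its methods are hypothetical names introduced for illustration, and real systems pin memory through OS-specific calls.

```python
class PagedMemory:
    """Toy model of virtual memory: a page is either resident in RAM or swapped out."""

    def __init__(self, num_pages):
        # Assumption for illustration: every page starts out resident.
        self.resident = {page: True for page in range(num_pages)}
        self.pinned = set()

    def pin(self, page):
        # Pinning brings the page into physical memory and locks it there.
        self.resident[page] = True
        self.pinned.add(page)

    def unpin(self, page):
        self.pinned.discard(page)

    def swap_out(self, page):
        # The operating system may evict any page except a pinned one.
        if page in self.pinned:
            return False
        self.resident[page] = False
        return True

    def dma_allowed(self, page):
        # A DMA device sees only physical memory, so the page must be
        # resident and guaranteed to stay resident (pinned).
        return self.resident[page] and page in self.pinned


mem = PagedMemory(num_pages=4)
mem.pin(0)
evicted_pinned = mem.swap_out(0)    # refused: page 0 is pinned
evicted_unpinned = mem.swap_out(1)  # allowed: page 1 was never pinned
```

In this sketch only page 0 is a valid DMA target, because it is the only page that cannot be swapped out from under the NIC.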
FIG. 1 shows a block representation of a conventional system in which data is copied from a pinned buffer in a first host to a pinned buffer in a second host. The first host 10 includes a pinned buffer 20, a driver 30 and a NIC 40. The pinned buffer 20 and the driver 30 are each coupled to the NIC 40. The second host 50 includes a pinned buffer 60, a driver 70 and a NIC 80. The pinned buffer 60 and the driver 70 are each coupled to the NIC 80. The NIC 40 is coupled to the NIC 80 via a network 90. The driver in this example may take many forms, such as, for example, a stand-alone driver or a driver as part of a more comprehensive software package.
In operation, the driver 30 or other software in the host 10 writes a descriptor for a location of the pinned buffer 20 to the NIC 40. The driver 70 or other software in the host 50 writes a descriptor for a location of the pinned buffer 60 to the NIC 80. The driver 30 works with the operating system and other software and hardware in the system to guarantee that the pinned buffer 20 is locked into physical host memory (i.e., “pinned”). The NIC 40 reads data from the pinned buffer 20 and sends the read data on the network 90. The network 90 passes the data to the NIC 80 of the host 50. The NIC 80 writes data to the pinned buffer 60.
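The FIG. 1 data path described above can be sketched as follows. This is a minimal simulation under stated assumptions: the Descriptor and Nic classes, their fields and their methods are hypothetical, and the network 90 is reduced to a direct hand-off of bytes between the two NIC objects.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    """Hypothetical descriptor: locates data within a pinned buffer."""
    offset: int
    length: int


class Nic:
    def __init__(self, pinned_buffer):
        self.pinned_buffer = pinned_buffer  # assumed already pinned by the driver
        self.descriptor = None

    def post_descriptor(self, descriptor):
        # The driver writes a descriptor for a buffer location to the NIC.
        self.descriptor = descriptor

    def transmit(self):
        # The sending NIC reads data from its pinned buffer.
        d = self.descriptor
        return bytes(self.pinned_buffer[d.offset:d.offset + d.length])

    def receive(self, payload):
        # The receiving NIC writes data into its pinned buffer.
        d = self.descriptor
        if len(payload) > d.length:
            raise ValueError("payload does not fit the described buffer")
        self.pinned_buffer[d.offset:d.offset + len(payload)] = payload


buffer_20 = bytearray(b"data from host 10")
buffer_60 = bytearray(len(buffer_20))
nic_40, nic_80 = Nic(buffer_20), Nic(buffer_60)
nic_40.post_descriptor(Descriptor(offset=0, length=len(buffer_20)))
nic_80.post_descriptor(Descriptor(offset=0, length=len(buffer_60)))
nic_80.receive(nic_40.transmit())  # stands in for the network 90
```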
Conventionally, different and incompatible upper layer protocol (ULP) applications may be used to perform a particular data transfer. For example, a storage application defined according to a storage protocol such as Internet Small Computer System Interface (iSCSI) may provide a particular data transfer using an iSCSI network. In another example, a database application defined according to a remote direct memory access protocol (RDMAP) may provide a particular data transfer using an RDMA network. However, iSCSI was developed and optimized for general storage such as in storage networks. In contrast, RDMA was developed and optimized for different purposes such as, for example, interprocess communications (IPC) applications. Unfortunately, conventional systems have been unable to efficiently combine some of the advantageous features of iSCSI and RDMA into a single ULP application using a single network. For example, conventional iSCSI systems have proven to be inflexible when applied to non-storage applications and conventional RDMA systems have not been developed to efficiently provide data storage as already provided in conventional iSCSI systems.
FIG. 2 shows a flow diagram of a conventional storage network system using iSCSI. In operation, data is written from Host 1 to Host 2. This operation may be similar, but is not limited to, the functionality exhibited by disk Host Bus Adapter (HBA) devices. In path 100, a driver on Host 1 writes a command to a command queue (e.g., a ring) that requests that the contents of a set of pre-pinned buffers be written to a specific disk location in Host 2. In path 110, NIC 1 reads the command from the queue and processes it. NIC 1 builds a mapping table for the pinned buffers on Host 1. The mapping table is given a handle, for example, “Command X.” In path 120, NIC 1 sends a write command to Host 2 that requests that data be pulled from “Command X” of Host 1 into a location on the disk in Host 2. The write command also requests that Host 2 inform Host 1 when the write command has been completed. In path 130, NIC 2 of Host 2 receives the write command and passes the write command to a driver for processing through a completion queue. In path 140, the driver of Host 2 reads the command and allocates buffers into which data may temporarily be stored. In path 150, the driver writes a command to the NIC command queue that “the allocated buffers be filled with data from the ‘Command X’ of Host 1.” It is possible that paths 130-150 can be executed entirely by NIC 2 if the driver pre-posts a pool of buffers into which data may be written.
In path 160, NIC 2 processes the pull command. NIC 2 builds a mapping table for the pinned buffers on Host 2 and creates a handle, for example, “Command Y.” A command is sent to NIC 1 requesting “fill ‘Command Y’ of Host 2 with data from ‘Command X’ of Host 1.” The sent command can be broken up into a plurality of commands to throttle data transfer into Host 2. In path 170, as NIC 1 receives each command, NIC 1 uses its mapping table to read data from Host 1. In path 180, NIC 1 formats each piece of read data into packets and sends the packets to Host 2. In path 190, as NIC 2 receives each pull response, NIC 2 determines where to place the data of each pull response using its mapping table and writes the data to Host 2. In path 200, after all the data has been pulled, NIC 2 writes a completion command to the Host 2 driver that says that it has completed the pull command specified in path 150. In path 210, Host 2 reads the command response and processes the data in the buffers to disk (path 211).
In path 220, when the data has been processed and the buffers on Host 1 are no longer needed, Host 2 writes a status command to NIC 2. The command states that the command received in path 140 has been completed and that “Command X” of Host 1 can be released. In path 230, NIC 2 reads the status command. In path 240, the status command is sent to Host 1. In path 250, NIC 1 receives the status command that indicates that the buffers associated with “Command X” of Host 1 are no longer needed. NIC 1 frees the mapping table associated with “Command X” of Host 1. Once the internal resources have been recovered, the status is written to the completion queue on Host 1. In path 260, the driver of Host 1 reads the completion and is informed that the command requested in path 100 is complete.
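The handle-based pull of FIG. 2 can be sketched roughly as follows. This is an illustrative simulation only, not an iSCSI implementation: the PullNic class, its methods and the chunk_size parameter are hypothetical, and command queues, packets and the disk write are omitted. Only the mapping tables, the “Command X” and “Command Y” handles, and the throttled pull are modeled.

```python
class PullNic:
    """Toy NIC that keeps mapping tables from handles to pinned buffers."""

    def __init__(self):
        self.mapping_tables = {}

    def build_mapping(self, handle, buffer):
        # Paths 110 and 160: the NIC builds a mapping table and names it.
        self.mapping_tables[handle] = buffer

    def serve_pull(self, handle, offset, length):
        # Paths 170-180: read the requested piece through the mapping table.
        return bytes(self.mapping_tables[handle][offset:offset + length])

    def place(self, handle, offset, data):
        # Path 190: place each pull response through the mapping table.
        self.mapping_tables[handle][offset:offset + len(data)] = data

    def free_mapping(self, handle):
        # Path 250: release the mapping table once it is no longer needed.
        del self.mapping_tables[handle]


def write_host1_to_host2(source, chunk_size):
    nic1, nic2 = PullNic(), PullNic()
    nic1.build_mapping("Command X", bytearray(source))
    staging = bytearray(len(source))          # path 140: Host 2 allocates buffers
    nic2.build_mapping("Command Y", staging)
    pulls = 0
    # The pull is split into several commands to throttle the transfer.
    for offset in range(0, len(source), chunk_size):
        length = min(chunk_size, len(source) - offset)
        nic2.place("Command Y", offset, nic1.serve_pull("Command X", offset, length))
        pulls += 1
    nic1.free_mapping("Command X")            # after the status command arrives
    return bytes(staging), pulls, nic1.mapping_tables


received, pull_count, remaining_tables = write_host1_to_host2(b"0123456789", chunk_size=4)
```

With a 10-byte source and a 4-byte chunk size, the sketch issues three pull commands (4, 4 and 2 bytes) and leaves no mapping table on NIC 1 afterward.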
FIG. 3 shows a flow diagram of a conventional storage system implementation using a remote direct memory access protocol (RDMAP). In this exemplary operation, data is written from Host 1 to Host 2. In path 270, a driver requests the Operating System to pin memory and develop a table. The driver may also, in an additional path, request the NIC to register the memory. In path 280, NIC 1 responds with a region identification (RID) for the table in NIC 1. In path 290, the driver requests that a window be bound to a region. In one conventional example, the system may not employ a window concept and thus may not need to perform steps related to the window concept. In path 300, NIC 1 responds with the STag value that corresponds to the bound window/region pair. In path 310, the driver formats a write request packet and places the STag value somewhere within the packet as set forth by a particular storage protocol. In path 320, NIC 1 sends the write message to NIC 2, advertising the buffer available on Host 1.
In path 330, NIC 2 receives the send message and posts the message to the driver. A driver on Host 2 then processes the send message and determines that data must be pulled to satisfy the command represented inside the send message. The driver, on path 340, queues an RDMA read command to NIC 2 to pull data from the STag on NIC 1 into the pinned memory on Host 2. The Host 2 memory is pre-pinned. In path 350, NIC 2 processes the RDMA read command and sends an RDMA read request message to NIC 1. In path 360, NIC 1 receives and processes the RDMA read request message. NIC 1 responds by reading the data from the pinned memory on Host 1 as set forth by the internal pinned memory table. In path 370, RDMA read response data is transmitted to NIC 2. In path 380, NIC 2 writes data to the pinned memory of Host 2 for each RDMA read response it gets. Host 1 receives no indication of the progress of the writing of data into the Host 2 pinned memory. The operations indicated by paths 350-370 may be repeated as many times as Host 2/NIC 2 deem necessary. In path 390, on the last RDMA read response, NIC 2 indicates the RDMA read completion to the driver on Host 2.
In path 400, the driver on Host 2 formats and posts a send command that indicates that the command request sent on path 320 has been completed and that the STag value is no longer needed. In path 410, NIC 2 sends the message to NIC 1. In path 420, NIC 1 receives the send message and indicates the send message to the driver on Host 1. NIC 1 is not aware that STag information was passed within the send message. NIC 1 is not adapted to correlate the send message sent on path 410 with the write message sent on path 320. In path 430, the driver or the ULP on Host 1 knows the command is complete and releases the resources in NIC 1. Host 1 issues an unbind command to release the STag value. In path 440, NIC 1 responds that the STag is now free. In path 450, the driver on Host 1, informed that it is done with the region, requests that the resources be freed. In path 460, NIC 1 responds that the last resource has been freed.
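The FIG. 3 sequence on Host 1 can be sketched as a series of separate driver-to-NIC calls. This is a simplified simulation under stated assumptions, not an RDMAP implementation: the RdmaNic class and its method names are hypothetical, the RID and STag are modeled as plain integers, and the send message is a dictionary whose contents the NIC never inspects.

```python
import itertools


class RdmaNic:
    """Toy NIC tracking registered regions (RID) and bound windows (STag)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.regions = {}    # RID -> pinned memory
        self.windows = {}    # STag -> RID
        self.host_calls = 0  # driver/NIC setup and teardown interactions

    def register_region(self, memory):
        # Paths 270-280: register pinned memory, NIC answers with a RID.
        self.host_calls += 1
        rid = next(self._ids)
        self.regions[rid] = memory
        return rid

    def bind_window(self, rid):
        # Paths 290-300: bind a window to the region, NIC answers with an STag.
        self.host_calls += 1
        if rid not in self.regions:
            raise KeyError(rid)
        stag = next(self._ids)
        self.windows[stag] = rid
        return stag

    def rdma_read(self, stag, offset, length):
        # Paths 360-370: serve a remote read through the pinned memory table.
        memory = self.regions[self.windows[stag]]
        return bytes(memory[offset:offset + length])

    def unbind_window(self, stag):
        # Paths 430-440: release the STag.
        self.host_calls += 1
        del self.windows[stag]

    def deregister_region(self, rid):
        # Paths 450-460: free the region resources.
        self.host_calls += 1
        del self.regions[rid]


nic1 = RdmaNic()
host1_memory = bytearray(b"payload from host 1")
rid = nic1.register_region(host1_memory)
stag = nic1.bind_window(rid)
# Paths 310-320: the driver embeds the STag in a message the NIC treats as opaque.
send_message = {"op": "write", "stag": stag, "length": len(host1_memory)}
# Paths 340-380: Host 2 pulls the data into its own pre-pinned memory.
host2_memory = bytearray(send_message["length"])
host2_memory[:] = nic1.rdma_read(send_message["stag"], 0, send_message["length"])
# Paths 400-460: after the completion message, the driver tears down explicitly.
nic1.unbind_window(stag)
nic1.deregister_region(rid)
```

In this sketch the driver on Host 1 makes four separate calls to its NIC (two for setup, two for teardown) around a single transfer, which mirrors the multi-pass character of the sequence described above.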
Combining the functionality of RDMA technologies with iSCSI technologies into a single technology has proven difficult due to a number of incompatibilities. The iSCSI technologies are inflexible when used for other purposes, because iSCSI technologies are optimized for storage. For example, iSCSI technologies must pin memory for each data transfer. In contrast, RDMA technologies provide that a plurality of data transfers may reuse the pinned memory. The RDMA technologies suffer from a host/NIC interface that does not seamlessly approach the host/NIC interface of iSCSI technologies. For example, the initiation process according to RDMA requires a plurality of passes (e.g., paths 270-310), while the initiation process according to iSCSI is accomplished in a single pass (e.g., paths 100-110). Furthermore, RDMA suffers from a substantial delay during the initiation process (e.g., paths 270-310) and the completion processes (e.g., paths 420-460).
RDMA technologies (e.g., InfiniBand) may differ from iSCSI technologies in other ways. Additional steps may be needed to advertise buffers through a dereferenced STag window. Building a pinned memory table (e.g., STag creation) may be isolated from the line traffic processes (e.g., send processes). When a send message is posted, the NIC may not be aware of the STag value that is passing within the send message.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of ordinary skill in the art through comparison of such systems with some aspects of the present invention as set forth in the remainder of the present application with reference to the drawings.