Conventional computerized devices such as personal computer systems, workstations, or the like require the ability to transmit data between components within, and attached to, such computerized devices at very high rates of speed. As an example, consider a typical conventional workstation containing one or more processors, one or more memory systems and possibly a variety of peripheral input/output components such as storage devices (e.g., floppy disks, hard disks, CD-ROM drives, etc.), network interface connections (e.g., modems or Ethernet network interface cards), video display devices, audio input/output devices (e.g., soundcards), instrumentation adapters and so forth. A conventional data bus that interconnects such components within the computer system allows the components to exchange data with each other (e.g., read and/or write data) and also allows one component, such as a processor, to control operation of another component such as a memory system or a video display card. Generally, a conventional data bus or interconnection architecture includes a collection of communications hardware such as a network interface card or microprocessor, ports, adapters, physical data links and/or connections that couple various devices or components within the computer system. Such conventional interconnect architectures also include software or firmware processes (e.g., embedded programs) that operate one or more input/output data communications protocols or signaling mechanisms to control communications over the interconnected communications hardware and data links coupled via the data bus.
One type of conventional data bus that computer and device designers utilize to interconnect components within a computer system and allow them to communicate is called a Peripheral Component Interconnect (PCI) bus. A PCI bus implements a shared bus architecture that allows a processor such as a central processing unit (CPU) operating within the computer system to control or arbitrate access to the PCI bus by components that need to transmit data on the bus. The PCI bus architecture operates at a preset or predefined speed (e.g., 100 MHz) and forces a component on the PCI bus to share the total available bus bandwidth using various bus arbitration algorithms when communicating with another component. While the PCI bus approach is acceptable for use in many computing system environments, use of a PCI bus to exchange data between components in the computer system can encounter signal integrity and timing constraints that can limit the total speed available for communications between computerized device components. In addition, a conventional PCI bus is fixed in physical size (e.g., 32 bits, 64 bits, 128 bits) and does not scale well to allow for the addition of numerous other components or devices onto the bus beyond a number of available bus interface hardware connections or “slots” that a system designer initially provides in the computer system. Due to such limitations and to the increasing performance requirements of modern-day computer applications, computer engineers have developed another type of expandable data bus or interconnect architecture called Infiniband.
Infiniband is a conventional, industry standard, channel-based, switched fabric interconnect architecture designed for use in computer systems such as servers and peripheral devices such as storage devices, network interconnects, memory systems, and the like to allow high speed data access between such devices. A conventional Infiniband architecture operates much like a computer network in that each component, peripheral or device that operates in (i.e., that communicates over) the Infiniband architecture or network is equipped with an Infiniband channel adapter that operates as a network interface card to provide input/output (I/O) onto one or more Infiniband communications channels or data links (i.e., physical links). The data links can be coupled to Infiniband switches or can directly couple to other Infiniband adapters. There is no limit to the number or types of components that may be coupled to the Infiniband fabric. Each Infiniband equipped component is generally referred to as a “node” and Infiniband nodes communicate using “channel adapters” coupled via point-to-point serial connections through Infiniband switches or routers that collectively form the Infiniband fabric. Host channel adapters (HCAs) are capable of interfacing with data communications applications in an operating system to couple servers or workstations as nodes to the Infiniband fabric. Target channel adapters (TCAs) exist within input/output devices such as storage systems or other peripheral device nodes and can communicate with host channel adapters.
The Infiniband architecture supports multiple data paths between nodes, thus providing for redundancy, congestion control and high data transfer rates. Current conventional Infiniband supports a 2.5 Gbps wire-speed connection in each direction on each wire and allows three different performance levels (1×, 4× and 12×) that correspond to three different possible physical connectivity characteristics between the channel adapters. For the 1× performance level, which is the lowest performance available in Infiniband, there is one physical data link, wire or connection between adapters (for a total single-wire bandwidth of 2.5 Gbps in each direction), whereas the 4× performance level provides four physical parallel links between adapters (for a total bandwidth of 10 Gbps in each direction), and the 12× performance level provides twelve physical parallel links between adapters (for a total bandwidth of 30 Gbps in each direction).
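The totals for each performance level follow from multiplying the per-link rate by the number of parallel links. The short sketch below illustrates that arithmetic only; the constant and function names are hypothetical and chosen for this example.

```python
# Per-link rate taken from the text: 2.5 Gbps in each direction on each link.
PER_LINK_GBPS = 2.5

# Link counts for the three performance levels named in the text.
SUPPORTED_WIDTHS = (1, 4, 12)


def aggregate_gbps(link_count: int) -> float:
    """Return the total one-way bandwidth for the given number of parallel links."""
    if link_count not in SUPPORTED_WIDTHS:
        raise ValueError(f"unsupported link count: {link_count}")
    return PER_LINK_GBPS * link_count


for width in SUPPORTED_WIDTHS:
    print(f"{width}x: {aggregate_gbps(width)} Gbps in each direction")
```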
When transferring a block of data from one device to another using conventional communications protocols, latency arises in the form of overhead and delays that are added to the time needed to transfer the actual data. The major contributors to latency of a data transfer operation are the overhead of executing network protocol code within the operating system, context switches to move in and out of an operating system kernel mode to receive and send out the data, and excessive copying of data between user level buffers and memory within a network interface card that initially receives or transmits the data.
Infiniband uses packet communications to transfer data access commands between nodes and provides mechanisms that result in significant latency reduction as compared to other conventional data bus or interconnect architectures. Both host and target Infiniband channel adapters present an interface to layers of software and/or hardware above them that allows those upper layers to generate and consume packets directly. Since the Infiniband architecture is designed for use across high-bandwidth links that have very high reliability, Infiniband eliminates much of the processing, such as special case network protocol code, that introduces latency into communications. In addition, the Infiniband protocol is defined to avoid operating system kernel mode interaction and interrupts during data transfers, thus allowing for direct memory access (DMA) to the channel adapter memory from user mode applications. Because of the direct access to the adapter, Infiniband avoids unnecessary copying of the data into kernel buffers since the user is able to directly access data from user-space via the channel adapter. In addition to the standard send/receive operations that are typically available in a networking protocol, Infiniband provides Remote Direct Memory Access (RDMA) operations such as Read and Write where the initiator node of the operation specifies both the source and destination of a data transfer, resulting in zero-copy data transfers with minimum involvement of the main processors in a node.
Specifically, in order for an application to communicate with another application over InfiniBand, the application must first create a work or request queue that consists of a queue pair (QP) for sending and receiving data (i.e., a send queue and a receive queue). In order for the application to execute a data access operation such as an RDMA read or RDMA write operation to another node, it must place a work queue element (WQE) in the work queue. From there, the Infiniband channel adapter operates a scheduler that picks up the work queue element operation for execution. Therefore, the work queue forms the communications medium or interface between user applications and the channel adapter, relieving the operating system from having to deal with this responsibility.
Each application process may create one or more QPs for communications purposes with another application on other nodes. Instead of having to arbitrate for the use of a single queue for a conventional network interface card as in a typical operating system that uses a PCI bus, for example, Infiniband has multiple queues called queue pairs. To service the queue pairs in conventional Infiniband, one or more contexts may be used to process the work queue elements in those queue pairs. Generally, a context defines context resources (e.g., processing resources and other associated queue pair and work queue element state information) used to process work queue elements that appear in queue pairs. In conventional Infiniband, when a work queue element appears in a queue pair, a scheduler in the channel adapter assigns a context (i.e., a set of channel adapter resources) to process that work queue element to full completion of the data transfer task, and thereafter can reassign those context resources to process another work queue element of another queue pair for another data transfer task. Queue pairs and associated context resources can be implemented in hardware within a channel adapter, thereby off-loading most of the work required for data transfers from the CPU. Once the data transfer for a work queue element has completed, the context may place a completion queue element (CQE) in a completion queue to notify the user application that the data transfer operation is complete and that the application can now access memory to retrieve the results of the operation. Once the work queue element has been processed to completion, the context is free to be reassigned to another queue pair. The advantage of using the completion queue to notify the caller of completed work queue elements is that it reduces the number of interrupts that would otherwise be generated to the operating system.
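The flow described above (an application posts a work queue element to a queue pair, a scheduler assigns a context to process it to completion, and a completion queue element is then posted) can be sketched as a small software model. This is only an illustrative simulation under simplifying assumptions: all class, field, and opcode names are hypothetical, the transfer itself is not modeled, and a real channel adapter would implement these mechanisms in hardware.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkQueueElement:
    opcode: str      # illustrative opcode string, e.g. "RDMA_WRITE"
    payload: bytes


@dataclass
class CompletionQueueElement:
    qp_id: int
    opcode: str
    status: str


@dataclass
class QueuePair:
    qp_id: int
    send_queue: deque = field(default_factory=deque)
    receive_queue: deque = field(default_factory=deque)


class ChannelAdapter:
    """Toy model of a channel adapter with a fixed pool of contexts."""

    def __init__(self, context_count: int):
        self.free_contexts = deque(range(context_count))
        self.queue_pairs = {}
        self.completion_queue = deque()

    def create_queue_pair(self, qp_id: int) -> QueuePair:
        qp = QueuePair(qp_id)
        self.queue_pairs[qp_id] = qp
        return qp

    def schedule(self) -> None:
        """Process every pending send WQE, one context per WQE, to completion."""
        for qp in self.queue_pairs.values():
            while qp.send_queue and self.free_contexts:
                context = self.free_contexts.popleft()  # assign a context
                wqe = qp.send_queue.popleft()
                # The data transfer itself would take place here (not modeled).
                self.completion_queue.append(
                    CompletionQueueElement(qp.qp_id, wqe.opcode, "success")
                )
                self.free_contexts.append(context)  # context can be reassigned


# Usage: the application posts a WQE, then polls the completion queue
# rather than waiting for an interrupt.
adapter = ChannelAdapter(context_count=2)
qp = adapter.create_queue_pair(1)
qp.send_queue.append(WorkQueueElement("RDMA_WRITE", b"hello"))
adapter.schedule()
cqe = adapter.completion_queue.popleft()
print(cqe)
```

Because this model processes each element synchronously, a context is released as soon as its completion queue element is posted; it does not attempt to capture concurrent transfers.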
The list of remote access commands or operations supported by the conventional InfiniBand architecture at the transport level for Send Queues is as follows:
1. Send/Receive: supports a typical send/receive operation where one node submits a message and another node receives that message. One difference between the implementation of the send/receive operation under the InfiniBand architecture and more traditional networking protocols is that InfiniBand defines the send/receive operations as operating against queue pairs.
2. RDMA-Write: this operation permits one node to write data directly into a memory buffer on a remote node. The remote node must of course have given appropriate access privileges to the node ahead of time and must have memory buffers already registered for remote access.
3. RDMA-Read: this operation permits one node to read data directly from the memory buffer of a remote node. The remote node must of course have given appropriate access privileges to the read requesting node ahead of time.
4. RDMA Atomics: this operation name actually refers to two different operations that have the same effect but which operate differently from one another. The Compare & Swap operation allows a node to read a memory location and, if its value is equal to a specified value, to write a new value to that memory location. The Fetch Add atomic operation reads a value and returns it to the caller and then adds a specified number to that value and saves it back at the same address.
For the conventional Infiniband Receive Queue, the only type of operation currently supported is:
1. Post Receive Buffer: identifies a buffer into which a client may receive data from an incoming send operation.
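The semantics of the RDMA and atomic operations listed above can be illustrated with a toy model of a remote node's memory. This is a sketch under stated assumptions, not an InfiniBand implementation: the class, method, and permission names are hypothetical, memory is modeled as a list of integers, the Compare & Swap method is assumed to return the value it read, and true atomicity (which real hardware must guarantee) is trivially satisfied here because the model is single-threaded.

```python
class RemoteNodeMemory:
    """Toy model of memory that a remote node has registered for remote access."""

    def __init__(self, size: int):
        self.words = [0] * size
        self.permissions = {}  # initiator name -> set of granted rights

    def grant(self, initiator: str, *rights: str) -> None:
        """Give access privileges to an initiator ahead of time."""
        self.permissions.setdefault(initiator, set()).update(rights)

    def _check(self, initiator: str, right: str, address: int, length: int) -> None:
        if right not in self.permissions.get(initiator, set()):
            raise PermissionError(f"{initiator} lacks {right} access")
        if address < 0 or address + length > len(self.words):
            raise IndexError("access outside the registered buffer")

    def rdma_write(self, initiator: str, address: int, values: list) -> None:
        self._check(initiator, "write", address, len(values))
        self.words[address:address + len(values)] = values

    def rdma_read(self, initiator: str, address: int, length: int) -> list:
        self._check(initiator, "read", address, length)
        return self.words[address:address + length]

    def compare_and_swap(self, initiator: str, address: int,
                         expected: int, new_value: int) -> int:
        """Write new_value only if the current value equals expected."""
        self._check(initiator, "atomic", address, 1)
        original = self.words[address]
        if original == expected:
            self.words[address] = new_value
        return original

    def fetch_add(self, initiator: str, address: int, increment: int) -> int:
        """Return the current value, then store that value plus increment."""
        self._check(initiator, "atomic", address, 1)
        original = self.words[address]
        self.words[address] = original + increment
        return original


memory = RemoteNodeMemory(size=8)
memory.grant("node_a", "read", "write", "atomic")
memory.rdma_write("node_a", 0, [5, 6])
read_back = memory.rdma_read("node_a", 0, 2)
swapped_from = memory.compare_and_swap("node_a", 0, expected=5, new_value=9)
fetched = memory.fetch_add("node_a", 1, increment=4)
```

An initiator that has not been granted the relevant right (for example, a node calling `rdma_read` without `"read"`) gets a `PermissionError` in this model, mirroring the requirement that the remote node grant privileges ahead of time.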