1. Field of the Invention
This invention relates in general to the Infiniband high-speed serial link architecture, and more particularly to a method for performing remote direct memory access data transfers through the architecture.
2. Description of the Related Art
The need for speed in transferring data between computers and their peripheral devices, such as storage devices and network interface devices, and between computers themselves is ever increasing. The growth of the Internet is one significant cause of this need for increased data transfer rates.
The need for increased reliability in these data transfers is also ever growing. These needs have culminated in the development of the Infiniband™ Architecture (IBA), which is a high-speed, highly reliable, serial computer interconnect technology. The IBA specifies interconnection speeds of 2.5 Gbps (Gigabits per second), 10 Gbps and 30 Gbps between IB-capable computers and I/O units, referred to collectively as IB end nodes.
One feature of the IBA that facilitates high-speed data transfers is the Remote Direct Memory Access (RDMA) operation. The IBA specifies an RDMA Write and an RDMA Read operation for transferring large amounts of data between IB nodes. The RDMA Write operation is performed by a source IB node transmitting one or more RDMA Write packets including payload data to the destination IB node. The RDMA Read operation is performed by a requesting IB node transmitting an RDMA Read Request packet to a responding IB node and the responding IB node transmitting one or more RDMA Read Response packets including payload data.
One useful feature of RDMA Write/Read packets is that they include a virtual address identifying a location in the system memory of the destination/responding IB node to/from which the data is to be transferred. That is, an IB Channel Adapter in the destination/responding IB node performs the virtual to physical translation. This feature relieves the operating system in the destination/responding IB node of having to perform the virtual to physical translation. This enables, for example, application programs to directly specify virtual addresses of buffers in their system memory without having to involve the operating system in an address translation, or even more importantly, in a copy of the data from a system memory buffer to an application memory buffer.
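The RDMA Write operation described above can be sketched as a source node splitting a buffer into one or more packets aimed at a virtual address on the destination node. The following is a minimal illustration only: the `RdmaWritePacket` type, the `segment_rdma_write` function, the opcode labels, the 256-byte default packet size, and the choice to carry the virtual address only in the first packet are all assumptions of this sketch, not details taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RdmaWritePacket:
    """Hypothetical, simplified view of one RDMA Write packet."""
    opcode: str                     # "ONLY", "FIRST", "MIDDLE" or "LAST"
    virtual_address: Optional[int]  # set only on the first packet in this sketch
    payload: bytes


def segment_rdma_write(data: bytes, remote_va: int, mtu: int = 256) -> List[RdmaWritePacket]:
    """Split `data` into one or more packets of at most `mtu` payload bytes."""
    if mtu <= 0:
        raise ValueError("mtu must be positive")
    # An empty buffer still yields a single packet with an empty payload.
    chunks = [data[i:i + mtu] for i in range(0, len(data), mtu)] or [b""]
    packets = []
    for index, chunk in enumerate(chunks):
        if len(chunks) == 1:
            opcode = "ONLY"
        elif index == 0:
            opcode = "FIRST"
        elif index == len(chunks) - 1:
            opcode = "LAST"
        else:
            opcode = "MIDDLE"
        # The destination's virtual address travels with the first packet only.
        va = remote_va if index == 0 else None
        packets.append(RdmaWritePacket(opcode, va, chunk))
    return packets


packets = segment_rdma_write(bytes(600), remote_va=0x1000, mtu=256)
print([p.opcode for p in packets])
```

The destination channel adapter, not the destination operating system, would then translate the virtual address carried in the first packet into a physical memory location.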
An IB Channel Adapter (CA) is a component in IB nodes that generates and consumes IB packets, such as RDMA packets. A Channel Adapter connects a bus within the IB node that is capable of accessing the IB node memory, such as a PCI bus, processor bus or memory bus, with the IB network. In the case of an IB I/O node, the CA also connects I/O devices such as disk drives or network interface devices, or the I/O controllers connected to the I/O devices, with the IB network. A CA on an IB I/O node is commonly referred to as a Target Channel Adapter (TCA) and a CA on an IB processor node is commonly referred to as a Host Channel Adapter (HCA).
A common example of an IB I/O node is a RAID (Redundant Array of Inexpensive Disks) controller or an Ethernet controller. An IB I/O node such as this typically includes a local processor and local memory coupled together with a TCA, and I/O controllers connected to I/O devices. The conventional method of satisfying an RDMA operation in such an IB I/O node is to buffer the data in the local memory when transferring data between the I/O controllers and the IB network.
For example, in performing a disk read operation, the local processor on the IB I/O node would program the I/O controller to fetch data from the disk drive. The I/O controller would transfer the data from the disk into the local memory. Then the processor would program the TCA to transfer the data from the local memory to the IB network.
For a disk write, the TCA would receive the data from the IB network and transfer the data into the local memory. Then the processor would program the I/O controller to transfer the data from the local memory to the disk drive. This conventional approach is referred to as "double-buffering" the data, since there is one transfer across the local bus into memory and another transfer across the local bus out of memory.
The double-buffering solution has at least two drawbacks. First, the data transfers into and out of memory consume twice as much of the local memory and local bus bandwidth as a direct transfer from the I/O controller to the TCA. This may prove detrimental in achieving the high-speed data transfers boasted by the IBA.
To illustrate, assume the local bus is a 64-bit wide 66 MHz PCI bus capable of sustaining a maximum theoretical bandwidth of 4 Gbps. With the double buffering solution, the effective bandwidth of the PCI bus is cut in half to 2 Gbps. Assuming a realistic efficiency on the bus of 80%, the effective bandwidth is now 1.6 Gbps. This is already less than the slowest transfer rate specified by IB, which is 2.5 Gbps.
To illustrate again, assume the local memory controller is a 64-bit wide, 100 MHz SDRAM controller capable of sustaining a maximum theoretical bandwidth of 6 Gbps. Again, assuming the conventional double buffering solution and an 80% efficiency yields an effective bandwidth of 2.4 Gbps. Clearly, this leaves no room in such an I/O node architecture for expansion to the higher IB transfer speeds.
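The arithmetic in the two illustrations above can be reproduced with a short calculation. This sketch uses the rounded theoretical figures quoted in the text (4 Gbps for the PCI bus, 6 Gbps for the SDRAM controller) together with the 80% efficiency assumption; the function and variable names are illustrative only.

```python
IB_LINK_RATES_GBPS = (2.5, 10.0, 30.0)  # link speeds quoted in the text


def effective_gbps(theoretical_gbps, crossings=2, efficiency=0.8):
    """Usable bandwidth when every byte crosses the resource `crossings` times."""
    return theoretical_gbps / crossings * efficiency


def sustainable_rates(available_gbps):
    """IB link rates that the available bandwidth could keep up with."""
    return [rate for rate in IB_LINK_RATES_GBPS if rate <= available_gbps]


pci_double = effective_gbps(4.0)               # double-buffered PCI bus
sdram_double = effective_gbps(6.0)             # double-buffered SDRAM controller
pci_direct = effective_gbps(4.0, crossings=1)  # single pass, no local memory

for label, value in (("PCI, double-buffered", pci_double),
                     ("SDRAM, double-buffered", sdram_double),
                     ("PCI, direct", pci_direct)):
    print(f"{label}: {value:.1f} Gbps, sustains {sustainable_rates(value)}")
```

Under these assumptions both double-buffered figures fall below the slowest 2.5 Gbps link rate, whereas a single crossing of the same PCI bus would stay above it.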
The second drawback of the double buffering solution is latency. The total time to perform an I/O operation is the sum of the actual data transfer time and the latency period. The latency is the time involved in setting up the data transfer. No data is being transferred during the latency period. The double buffering solution requires more time for the local processor to set up the data transfer. The local processor not only sets up the initial transfer into local memory, but also sets up the transfer out of memory in response to an interrupt signifying completion of the transfer into local memory.
As data transfer rates increase, the data transfer component of the overall I/O operation time decreases. Consequently, the local processor execution latency time becomes a proportionately larger component of the overall I/O operation time, since the processor latency does not typically decrease proportionately to the data transfer time. The negative impact of latency is particularly detrimental for I/O devices with relatively small units of data transfer such as network interface devices transferring IP packets. Thus, the need for reducing or eliminating latency is evident.
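To make the proportionality argument concrete, the sketch below computes the fraction of total I/O time spent in setup latency for a fixed payload at each IB link rate. The 1500-byte payload and the 10 microsecond setup latency are hypothetical values chosen only for illustration; they do not come from the text.

```python
def latency_share(payload_bytes, link_gbps, setup_latency_us):
    """Fraction of total I/O time spent in setup rather than in data transfer."""
    # 1 Gbps moves 1000 bits per microsecond.
    transfer_us = payload_bytes * 8 / (link_gbps * 1000.0)
    return setup_latency_us / (setup_latency_us + transfer_us)


# Hypothetical example: a 1500-byte packet with 10 us of processor setup time.
shares = [latency_share(1500, rate, setup_latency_us=10.0)
          for rate in (2.5, 10.0, 30.0)]

for rate, share in zip((2.5, 10.0, 30.0), shares):
    print(f"{rate:>4} Gbps: latency is {share:.0%} of the operation")
```

With the latency held constant, its share of the operation grows as the link gets faster, and it grows further for smaller payloads, which is the effect the paragraph above describes.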
Therefore, what is needed is an IB CA capable of transferring data directly between a local bus, such as a PCI bus, and an IB link without double buffering the data in local memory.
To address the above-detailed deficiencies, it is an object of the present invention to provide an Infiniband channel adapter that transfers data directly between a local bus and an Infiniband link without double buffering the data in system memory. Accordingly, in attainment of the aforementioned object, it is a feature of the present invention to provide an Infiniband channel adapter that includes a local bus interface for coupling the channel adapter to an I/O controller by a local bus. The local bus interface receives data from the I/O controller if a local bus address of the data is within a predetermined address range of the local bus address space. The channel adapter also includes a bus router, in communication with the local bus interface, that creates an Infiniband RDMA Write packet including the data in response to the local bus interface receiving the data from the I/O controller. The channel adapter then transmits the created packet to a remote Infiniband node that previously requested the data.
An advantage of the present invention is that it avoids the reduction in useable bandwidth of the local bus and of a system memory by not double-buffering the data, but instead transferring the data directly from the I/O controller to the channel adapter for transmission on the Infiniband wire. Another advantage of the present invention is that it reduces local processor latency by not involving the local processor in setting up a double-buffered transfer.
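The behavior of the local bus interface and bus router described above can be sketched as an address-window check followed by packet creation. This is a simplified model under stated assumptions: the class `ChannelAdapterSketch`, the `PendingRequest` record, the dictionary used as a packet, and the way a local address is associated with an outstanding remote request are all hypothetical and are not defined by the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PendingRequest:
    """Hypothetical record of a remote node that asked for data."""
    remote_node: str
    remote_va: int


class ChannelAdapterSketch:
    def __init__(self, window_base: int, window_size: int):
        # Predetermined local bus address range reserved for direct transfers.
        self.window_base = window_base
        self.window_size = window_size
        self.pending: Dict[int, PendingRequest] = {}

    def in_window(self, addr: int) -> bool:
        return self.window_base <= addr < self.window_base + self.window_size

    def expect(self, local_addr: int, request: PendingRequest) -> None:
        """Associate a window address with the remote request it will satisfy."""
        if not self.in_window(local_addr):
            raise ValueError("address outside the direct-transfer window")
        self.pending[local_addr] = request

    def on_local_bus_write(self, local_addr: int, data: bytes) -> Optional[dict]:
        """Model the I/O controller writing `data` at `local_addr` on the local bus."""
        if not self.in_window(local_addr):
            return None  # not a direct transfer; the adapter does not claim it
        request = self.pending.get(local_addr)
        if request is None:
            return None  # no remote node is waiting on this address
        # Bus router step: wrap the data in an RDMA Write for the requester.
        return {"opcode": "RDMA_WRITE",
                "dest": request.remote_node,
                "virtual_address": request.remote_va,
                "payload": data}


ca = ChannelAdapterSketch(window_base=0x8000_0000, window_size=0x1_0000)
ca.expect(0x8000_0000, PendingRequest("hostA", 0x7F00_1000))
print(ca.on_local_bus_write(0x8000_0000, b"disk sector"))
```

In this model the data never lands in local memory: the write on the local bus is itself the trigger for building the outgoing packet.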
In another aspect, it is a feature of the present invention to provide an Infiniband channel adapter that includes a bus router that receives an Infiniband RDMA Read Response packet, having a payload of data, transmitted by a remote Infiniband node. The channel adapter also includes a local bus interface, in communication with the bus router, that provides the payload of data to an I/O controller coupled to the local bus interface by a local bus if a local bus address specified by the I/O controller is within a predetermined address range of the local bus address space.
In yet another aspect, it is a feature of the present invention to provide an Infiniband I/O unit that includes an Infiniband channel adapter, an I/O controller, coupled to the channel adapter by a local bus, and a processor. The processor programs the I/O controller to transfer data to the channel adapter on the local bus at an address within a predetermined address range of the local bus address space dedicated for direct data transfers from the I/O controller to the channel adapter. The channel adapter receives the data from the I/O controller and creates an Infiniband RDMA Write packet including the data for transmission to a remote Infiniband node only if the address is within the predetermined address range.
In yet another aspect, it is a feature of the present invention to provide an Infiniband I/O unit that includes an Infiniband channel adapter, for receiving an Infiniband RDMA Read Response packet including a payload of data transmitted from a remote Infiniband node, an I/O controller, coupled to the channel adapter by a local bus, and a processor. The processor programs the I/O controller to transfer the data in the payload from the channel adapter on the local bus at an address within a predetermined address range of the local bus address space dedicated for direct data transfers from the channel adapter to the I/O controller. The channel adapter provides the data to the I/O controller only if the address is within the predetermined address range.
It is also an object of the present invention to provide a method for translating virtual addresses of remote Infiniband nodes to local addresses on a local Infiniband node in a way that facilitates direct transfers between a local bus I/O controller and an Infiniband link of the local Infiniband node.
In yet another aspect, it is a feature of the present invention to provide a method for translating Infiniband remote virtual addresses to local addresses. The method includes a local Infiniband node receiving in a first Infiniband packet a first virtual address of a first memory location in a remote Infiniband node. The method further includes allocating a local address within a local address space of a local bus on the local node for transferring first data directly between an I/O controller of the local node and an Infiniband channel adapter of the local node in response to the receiving the first virtual address. The method further includes the local Infiniband node receiving in a second Infiniband packet a second virtual address of a second memory location in the remote Infiniband node, wherein the first and second virtual addresses are spatially disparate. The method further includes allocating the local address for transferring second data directly between the I/O controller and the channel adapter in response to the receiving the second virtual address.
An advantage of the present invention is that it enables translating of multiple different virtual addresses in a remote IB node into the same local bus address space. That is, the local address space is reusable with respect to the remote virtual address space that may be much larger than the local address space.
In yet another aspect, it is a feature of the present invention to provide a method for translating Infiniband remote virtual addresses to local addresses. The method includes a local Infiniband node receiving in a first Infiniband packet a first virtual address of a first memory location in a first remote Infiniband node. The method further includes allocating a local address within a local address space of a local bus on the local node for transferring first data directly between an I/O controller of the local node and an Infiniband channel adapter of the local node in response to the receiving the first virtual address. The method further includes the local Infiniband node receiving in a second Infiniband packet a second virtual address of a second memory location in a second remote Infiniband node. The method further includes allocating the local address for transferring second data directly between the I/O controller and the channel adapter in response to the receiving the second virtual address.
An advantage of the present invention is that it enables translating of virtual addresses of multiple different remote IB nodes into the same local bus address space. That is, the local address space is reusable with respect to the potentially large mapped virtual address spaces of many remote hosts accumulated together and potentially overlapping in their individual virtual address spaces.
In yet another aspect, it is a feature of the present invention to provide a method for translating Infiniband remote virtual addresses to local addresses. The method includes a local Infiniband node receiving in a first Infiniband packet a virtual address of a memory location in a remote Infiniband node. The method further includes allocating a first local address within a local address space of a local bus on the local node for transferring first data directly between an I/O controller of the local node and an Infiniband channel adapter of the local node in response to the receiving the virtual address in the first packet. The method further includes receiving in a second Infiniband packet the virtual address of the memory location in the remote Infiniband node, by the local Infiniband node. The method further includes allocating a second local address for transferring second data directly between the I/O controller and the channel adapter in response to the receiving the virtual address in the second packet.
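The three translation aspects above share one idea: local addresses are allocated on demand when a remote virtual address arrives, instead of being tied permanently to one remote location. A minimal sketch of such an allocator follows. The `LocalAddressPool` class, its fixed-size slots, and the explicit `release` step are assumptions made for illustration; the text does not specify how or when a local address becomes available again.

```python
class LocalAddressPool:
    """Hypothetical pool of local bus addresses for direct transfers."""

    def __init__(self, base: int, slot_size: int, slots: int):
        self.free = [base + i * slot_size for i in range(slots)]
        self.in_use = {}  # local address -> (remote node, remote virtual address)

    def allocate(self, remote_node: str, remote_va: int) -> int:
        """Allocate a local address in response to receiving a remote virtual address."""
        if not self.free:
            raise RuntimeError("no free local address")
        local_addr = self.free.pop(0)
        self.in_use[local_addr] = (remote_node, remote_va)
        return local_addr

    def release(self, local_addr: int) -> None:
        """Assumed step: return the address to the pool once the transfer completes."""
        del self.in_use[local_addr]
        self.free.insert(0, local_addr)


pool = LocalAddressPool(base=0x8000_0000, slot_size=0x1000, slots=2)

first = pool.allocate("hostA", 0x0000_1000)
pool.release(first)

# Same remote node, spatially disparate virtual address, same local address.
second = pool.allocate("hostA", 0x9000_0000)
pool.release(second)

# Different remote node, same local address again.
third = pool.allocate("hostB", 0x0000_1000)

# The original virtual address arrives again and gets a different local address.
fourth = pool.allocate("hostA", 0x0000_1000)

print(hex(first), hex(second), hex(third), hex(fourth))
```

Because the mapping is created per request, a small local window can serve remote virtual address spaces that are far larger than the window and that may overlap across hosts.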