The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for path resolution in InfiniBand networks.
InfiniBand™ is an industry-standard specification that defines an input/output architecture used to interconnect servers, communications infrastructure equipment, storage and embedded systems. A true fabric architecture, InfiniBand (IB) leverages switched, point-to-point channels with data transfers that generally lead the industry, both in chassis backplane applications as well as through external copper and optical fiber connections. Reliable messaging (send/receive) and memory manipulation semantics (remote direct memory access (RDMA)) without software intervention in the data movement path ensure the lowest latency and highest application performance.
This low-latency, high-bandwidth interconnect requires only minimal processing overhead and is ideal for carrying multiple traffic types (clustering, communications, storage, management) over a single connection. As a mature and field-proven technology, InfiniBand is used in thousands of data centers, high-performance compute clusters and embedded applications that scale from two nodes up to clusters utilizing thousands of nodes. Through the availability of long-reach InfiniBand over metro and wide area network (WAN) technologies, InfiniBand is able to move large amounts of data efficiently between data centers, whether across a campus or around the globe.
ROCE stands for RDMA over Converged Ethernet and allows InfiniBand APIs and transports to be used over an Ethernet physical layer. Applications written for InfiniBand can be deployed on Ethernet using ROCE with little or no software changes.
A subnetwork, commonly referred to as a subnet, is a logical subdivision of a Layer-3 network. Network ports of nodes within a given subnet share the same Layer-3 network address prefix. For example, in Internet Protocol (IP) networks, the ports in each subnet share the same most-significant bit-group in their IP address, so that the IP address is logically divided into two fields: a network or routing prefix, and the rest field or host identifier. Similarly, in InfiniBand™ (IB) networks, each subnet is uniquely identified with a subnet identifier known as the Subnet Prefix. For each port in the subnet, this prefix is combined with a respective Port Identifier to give the IB Layer-3 address of the port, known as the Global Identifier (GID). Each port has at least one GID in each network, namely the Subnet Prefix combined with the Globally Unique Port Identifier (GUID) assigned by the manufacturer. Non-default, software-defined port identifiers are also possible. ROCE networks also maintain the notion of IB networks and subnetworks, since they deploy InfiniBand protocols. The Subnet Prefix is present, while the default port address (GUID) is obtained from the Media Access Control (MAC) address using a standard translation. Software-defined GIDs based on software-defined MAC addresses or IP addresses are also possible.
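By way of illustration only, the composition of a GID from a Subnet Prefix and a port identifier derived from a MAC address may be sketched as follows. The sketch assumes that the "standard translation" is the modified EUI-64 mapping also used for IPv6 link-local addresses (insert `ff:fe` in the middle of the MAC address and flip the universal/local bit), and that the default Subnet Prefix `0xFE80000000000000` is in use. The function names and the sample MAC address are hypothetical.

```python
import ipaddress

# Default subnet prefix; assumed here for illustration.
DEFAULT_SUBNET_PREFIX = 0xFE80_0000_0000_0000


def mac_to_guid(mac: str) -> bytes:
    """Derive a 64-bit port identifier from a 48-bit MAC address.

    Assumption: modified EUI-64 translation (flip the universal/local
    bit of the first octet and insert ff:fe between the two halves).
    """
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    return bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]


def make_gid(subnet_prefix: int, guid: bytes) -> ipaddress.IPv6Address:
    """Concatenate a 64-bit subnet prefix and a 64-bit port identifier."""
    if not 0 <= subnet_prefix < 2**64:
        raise ValueError("subnet prefix must fit in 64 bits")
    if len(guid) != 8:
        raise ValueError("port identifier must be 64 bits")
    return ipaddress.IPv6Address((subnet_prefix << 64) | int.from_bytes(guid, "big"))


guid = mac_to_guid("00:02:c9:12:34:56")
gid = make_gid(DEFAULT_SUBNET_PREFIX, guid)
print(guid.hex())  # 0202c9fffe123456
print(gid)         # fe80::202:c9ff:fe12:3456
```

The GID is represented here as an IPv6 address object only because both are 128-bit values written in the same textual form; a software-defined GID would simply substitute a different 64-bit identifier in the low half.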
Typically, the logical subdivision of a Layer-3 network into subnets reflects the underlying physical division of the network into Layer-2 local area networks. The subnets are connected to one another by routers, which forward packets on the basis of their Layer-3 (IP or GID) destination addresses, while within a given subnet, packets are forwarded among ports by Layer-2 switches or bridges. These Layer-2 devices operate in accordance with the applicable Layer-2 protocol and forward packets within the subnet according to the Layer-2 destination address, such as the Ethernet™ medium access control (MAC) address or the IB link-layer Local Identifier (LID). In general, Layer-2 addresses in a given subnet are recognized only within that subnet, and routers will swap the Layer-2 address information of packets that they forward from one subnet to another.
In IB networks, a Subnet Manager (SM) in each subnet assigns a LID to each physical port of each host within the given subnet. A subnet administration (SA) function provides nodes with information gathered by the SM, including communication of the LID information to a Subnet Management Agent (SMA) in each node of the subnet. For simplicity and clarity in the description that follows, all of these subnet management and administration functions will be assumed to be carried out by the SM. Layer-2 switches within the subnet are configured by the SM to forward packets among the ports on the basis of the destination LID (D-LID) in the packet header. The SM is typically implemented as a software process running on a suitable computing platform in one of the nodes in the subnet, such as a host computer, switch or appliance.
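As a simplified illustration of the SM role described above, the following sketch assigns LIDs to host ports and derives, for one switch, a table that maps each destination LID to the neighbor toward which a packet is forwarded. It is a toy model under stated assumptions: the topology, node names and function names are hypothetical, LIDs are assigned sequentially, and the table lists next-hop neighbors instead of physical output port numbers.

```python
from collections import deque


def assign_lids(host_ports, first_lid=1):
    """Give every host port a unique LID, in discovery order."""
    return {port: first_lid + i for i, port in enumerate(host_ports)}


def build_forwarding_table(switch, links, lids, switches):
    """Map each destination LID to the first hop out of `switch`.

    links:    node -> list of directly connected neighbors
    switches: set of nodes that forward packets (hosts do not)
    """
    # Every direct neighbor is reached through itself.
    first_hop = {nbr: nbr for nbr in links[switch]}
    # Breadth-first search, expanding only through other switches.
    queue = deque(nbr for nbr in links[switch] if nbr in switches)
    while queue:
        node = queue.popleft()
        for nbr in links[node]:
            if nbr != switch and nbr not in first_hop:
                first_hop[nbr] = first_hop[node]
                if nbr in switches:
                    queue.append(nbr)
    return {lids[n]: hop for n, hop in first_hop.items() if n in lids}


links = {
    "sw1": ["hostA", "hostB", "sw2"],
    "sw2": ["sw1", "hostC"],
    "hostA": ["sw1"],
    "hostB": ["sw1"],
    "hostC": ["sw2"],
}
lids = assign_lids(["hostA", "hostB", "hostC"])
table_sw1 = build_forwarding_table("sw1", links, lids, {"sw1", "sw2"})
print(table_sw1)  # {1: 'hostA', 2: 'hostB', 3: 'sw2'}
```

A packet whose D-LID is 3 therefore leaves `sw1` toward `sw2`, which in turn delivers it to `hostC`.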
ROCE transports deployed over Ethernet maintain compatibility with InfiniBand physical transports by using GID addresses. The GID addresses remain Layer-3 addresses, while the Layer-2 addresses used by switches to forward packets from source to destination are the MAC addresses of the Ethernet ports. A MAC address can be assigned in hardware (the default, globally unique MAC address) or by software. Each port can use more than one MAC address.
Direct memory access (DMA) can also be used for “memory to memory” copying or moving of data within memory. Either the source or the destination memory can be I/O memory that belongs to a hardware device (for example, PCI I/O memory). DMA can offload expensive memory operations, such as large copies or scatter-gather operations, from the CPU to a dedicated DMA engine. One implementation example is Intel's I/O Acceleration Technology. Without DMA, when the CPU is using programmed input/output, it is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the DMA master first initiates the transfer and then does other operations while the transfer is in progress, and it finally receives notification from the DMA slave when the operation is done. I/O accelerators typically have dedicated DMA master engines, which allow the hardware to copy data without loading the CPU.
Technically, with an interconnect, it is not the application code that requests DMA but the adapter logic: when sending, the adapter requests DMA from system memory, and when receiving, it requests DMA to system memory. On modern systems, the memory controller and the DMA slave are part of the CPU, so only in that sense is the CPU involved. However, this is a much smaller overhead than copying data on the CPU, and it does not preempt computational work on the CPU. There is no CPU interrupt here, since the CPU acts as the slave rather than the master. The interconnect hardware (the IB adapter) knows when the transfer has completed.
This feature is useful at any time that the CPU cannot keep up with the rate of data transfer, or when the CPU needs to perform useful work while waiting for a relatively slow I/O data transfer. Many hardware systems use DMA, including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without DMA channels. Similarly, a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time, allowing computation and data transfer to proceed in parallel.
Remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory of a remote application, eliminating the need to copy data between application memory and the data buffers in the operating systems of source and destination. Such transfers require no intensive work to be done by CPUs, or context switches, and transfers continue in parallel with other system operations (both on local and remote nodes). When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer. However, this strategy presents several problems related to the fact that the target node is not notified of the completion of the request (single-sided communications).
RDMA-capable applications exchange messages via objects called queue pairs (QPs). Each QP comprises a send queue and a receive queue, and in order to exchange messages, the local and remote QPs must connect to each other. The process of connection establishment involves sending and receiving connection management (CM) management datagrams (MADs) and is covered by the InfiniBand™ specification. A path specification is part of the CM payload, and a CM request cannot be sent before the path is known. The path includes the source and destination Layer-2 and Layer-3 addresses. When an application wants to connect, it typically knows the remote application only by an address assigned by software (an IP address, LID or MAC address). To send a CM request, the global identifiers (GIDs) must first be resolved from these software addresses. This process is called path resolution.
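The dependency between path resolution and the CM request can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Path` record, the `PathResolver` class, the static directory that stands in for whatever query mechanism a real system uses, and the dictionary that stands in for a CM request. It is not the format defined by the InfiniBand specification; it only shows that a software-assigned address is first mapped to Layer-2 and Layer-3 addresses, and that a request cannot be built until that mapping exists.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Path:
    """Source and destination addresses carried in a CM request."""
    src_gid: str  # Layer-3 address of the local port
    dst_gid: str  # Layer-3 address of the remote port
    src_l2: str   # Layer-2 address: a LID on IB, a MAC address on ROCE
    dst_l2: str


class PathResolver:
    """Hypothetical resolver backed by a static directory.

    The directory maps a software-assigned address to a
    (GID, Layer-2 address) pair.
    """

    def __init__(self, local_gid: str, local_l2: str,
                 directory: Dict[str, Tuple[str, str]]) -> None:
        self.local_gid = local_gid
        self.local_l2 = local_l2
        self.directory = directory

    def resolve(self, remote_addr: str) -> Path:
        try:
            dst_gid, dst_l2 = self.directory[remote_addr]
        except KeyError:
            raise LookupError(f"no path to {remote_addr}") from None
        return Path(self.local_gid, dst_gid, self.local_l2, dst_l2)


def build_cm_request(path: Optional[Path], private_data: bytes) -> dict:
    """Refuse to build a request until the path is known."""
    if path is None:
        raise ValueError("path must be resolved before sending a CM request")
    return {"path": path, "private_data": private_data}


resolver = PathResolver(
    local_gid="fe80::202:c9ff:fe12:3456",
    local_l2="00:02:c9:12:34:56",
    directory={"192.0.2.7": ("fe80::202:c9ff:feab:cdef", "00:02:c9:ab:cd:ef")},
)
request = build_cm_request(resolver.resolve("192.0.2.7"), b"hello")
print(request["path"].dst_gid)  # fe80::202:c9ff:feab:cdef
```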
Applications can use RDMA technology only after they have established reliable connections, and establishing a reliable connection requires path resolution to complete. Modern RDMA adapters are powerful, and their power cannot be fully utilized without multiple hardware event queues and multiple application threads. For example, a dual-port 100 Gbit adapter can process 6 million sends and 6 million receives per second (using message sizes of 4 KB). Adapters with at least 100 event queues, and commodity servers with that many CPUs, are widely available. One scalable way to utilize interconnect and CPU performance is a multi-domain approach, in which each application thread opens its own device context and binds to its own device event queue. Each thread can be pinned to a given CPU, with its event queue pinned to deliver interrupts to the same CPU. This approach minimizes context switches, cross-CPU communication, and cross-CPU locks, allowing system performance to be maximized. At the same time, it requires each application thread to establish connections of its own. This multiplies the number of connections and the number of path queries in the system and calls for optimizations in both path queries and connection establishment.
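The figures above can be checked with simple arithmetic. The sketch below assumes that 4 KB means 4096 bytes and that each thread opens one connection per peer node; the thread and peer counts are hypothetical values chosen only to show how the multi-domain approach multiplies connections and path queries.

```python
SENDS_PER_SECOND = 6_000_000  # figure quoted in the text
MESSAGE_BYTES = 4 * 1024      # 4 KB, taken here as 4096 bytes
LINE_RATE_GBIT = 2 * 100      # dual-port 100 Gbit adapter

# Payload bandwidth in one direction, in Gbit/s.
send_gbit = SENDS_PER_SECOND * MESSAGE_BYTES * 8 / 1e9
utilization = send_gbit / LINE_RATE_GBIT


def connections_per_node(threads: int, peers: int) -> int:
    """Connections (and path queries) one node needs.

    Assumption: every thread connects once to every peer node.
    """
    return threads * peers


print(f"{send_gbit:.1f} Gbit/s per direction, {utilization:.0%} of line rate")
print(connections_per_node(threads=1, peers=16))    # 16
print(connections_per_node(threads=100, peers=16))  # 1600
```

Under these assumptions, 6 million 4 KB messages per second correspond to roughly 197 Gbit/s in each direction, close to the 200 Gbit/s aggregate line rate, and moving from one shared context to 100 per-thread contexts raises the per-node connection count a hundredfold.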