High-performance computing (HPC) has seen a substantial increase in usage and interest in recent years. Historically, HPC was generally associated with so-called “supercomputers.” Supercomputers were introduced in the 1960s, made initially and, for decades, primarily by Seymour Cray at Control Data Corporation (CDC), Cray Research, and subsequent companies bearing Cray's name or monogram. While the supercomputers of the 1970s used only a few processors, machines with thousands of processors began to appear in the 1990s, and more recently massively parallel supercomputers with hundreds of thousands of “off-the-shelf” processors have been implemented.
There are many types of HPC architectures, both implemented and research-oriented, along with various levels of scale and performance. However, a common thread is the interconnection of a large number of compute units, such as processors and/or processor cores, to cooperatively perform tasks in a parallel manner. Under recent System on a Chip (SoC) designs and proposals, dozens of processor cores or the like are implemented on a single SoC, using a 2-dimensional (2D) array, torus, ring, or other configuration. Additionally, researchers have proposed 3D SoCs under which hundreds or even thousands of processor cores are interconnected in a 3D array. Separate multicore processors and SoCs may also be closely spaced on server boards, which, in turn, are interconnected via a backplane or the like. Another common approach is to interconnect compute units in racks of servers (e.g., blade servers and modules) that are typically configured in a 2D array. IBM's Sequoia, reported to be the world's fastest supercomputer, comprises a 2D array of 96 racks of server blades/modules totaling 1,572,864 cores, and consumes 7.9 megawatts when operating at peak performance.
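To give a sense of the scale involved, the Sequoia figures quoted above can be broken down per rack and per core. The following is a minimal arithmetic sketch using only the numbers stated in the text; the variable names are illustrative, and the per-core power figure is simply total system power divided by core count, so it includes whatever non-core components contribute to the 7.9 megawatt total.

```python
# Figures taken from the text above; names are illustrative only.
RACKS = 96
TOTAL_CORES = 1_572_864
PEAK_POWER_WATTS = 7.9e6  # 7.9 megawatts

# Cores housed in each rack, assuming cores are spread evenly across racks.
cores_per_rack = TOTAL_CORES // RACKS

# Average system power attributable to each core at peak (not core-only power).
watts_per_core = PEAK_POWER_WATTS / TOTAL_CORES

# Average power drawn by each rack at peak, in kilowatts.
kilowatts_per_rack = PEAK_POWER_WATTS / RACKS / 1000

print(f"cores per rack:     {cores_per_rack}")
print(f"watts per core:     {watts_per_core:.2f}")
print(f"kilowatts per rack: {kilowatts_per_rack:.1f}")
```

Under the even-distribution assumption, this works out to 16,384 cores per rack and roughly 5 watts of system power per core.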
One of the performance bottlenecks for HPC systems is the latency incurred when transferring data over the interconnects between compute nodes. Typically, the interconnects are structured in an interconnect hierarchy, with the highest-speed and shortest interconnects within the processors/SoCs at the top of the hierarchy, while the latencies increase as one progresses down the hierarchy levels. For example, after the processor/SoC level, the interconnect hierarchy may include an inter-processor interconnect level, an inter-board interconnect level, and one or more additional levels connecting individual servers or aggregations of individual servers with servers/aggregations in other racks.
It is common for one or more levels of the interconnect hierarchy to employ different protocols. For example, the interconnects within an SoC are typically proprietary, while lower levels in the hierarchy may employ proprietary or standardized interconnects. The different interconnect levels also will typically implement different Physical (PHY) layers. As a result, it is necessary to employ some type of interconnect bridging between interconnect levels. In addition, bridging may be necessary within a given interconnect level when heterogeneous compute environments are implemented.
At lower levels of the interconnect hierarchy, standardized interconnects such as Ethernet (defined in various IEEE 802.3 standards) and InfiniBand are used. At the PHY layer, each of these standards supports wired connections, such as wire cables and backplanes, as well as optical links. Ethernet is implemented at the Link Layer (layer 2) in the OSI 7-layer model, and is fundamentally considered a link layer protocol. The InfiniBand standards define various OSI layer aspects for InfiniBand covering OSI layers 1-4.
Current Ethernet protocols do not have any inherent facilities to support reliable transmission of data over an Ethernet link. The same is true of the link-layer implementation of InfiniBand. Each addresses reliable transmission at a higher layer, such as TCP/IP. Under TCP, reliable delivery of data is implemented via explicit ACKnowledgements (ACKs) that are returned from a receiver (at an IP destination address) to a sender (at an IP source address) in response to receiving IP packets from the sender. Since packets may be dropped at one of the nodes along a route between a sender and receiver (or even at a receiver if the receiver has inadequate buffer space), the explicit ACKs are used to confirm successful delivery of each packet (noting that a single ACK response may confirm delivery of multiple IP packets). The transmit-ACK scheme requires significant buffer space to be maintained at each of the source and destination devices (in case a dropped packet or packets need to be retransmitted), and also adds processing and complexity to the network stack. For example, as it is possible for an ACK to be dropped, the sender also employs a timer that is used to trigger a retransmission of a packet for which an ACK has not been received within the timer's timeout period. Each ACK consumes precious link bandwidth and creates additional processing overhead. In addition, the use of timers sets an upper limit on link round trip delay.
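The transmit-ACK scheme described above can be illustrated with a small simulation. This is a minimal sketch, not an implementation of TCP: the `Sender`, `Receiver`, and `demo` names, the integer tick clock, and the fixed timeout are all assumptions made for illustration, and the receiver here simply discards out-of-order packets (a go-back-N style simplification) rather than buffering them as real stacks do. It shows the three costs the text names: a retransmission buffer held at the sender, a timer per unacknowledged packet, and ACK traffic returned for each arrival.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    timeout: int                                  # retransmission timeout, in ticks (hypothetical unit)
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)   # seq -> (payload, timer deadline)

    def send(self, payload, now):
        """Transmit a packet and keep a copy until it is acknowledged."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (payload, now + self.timeout)
        return seq, payload

    def on_ack(self, ack_seq):
        """Cumulative ACK: one ACK confirms every packet with seq <= ack_seq."""
        for seq in [s for s in self.unacked if s <= ack_seq]:
            del self.unacked[seq]

    def on_tick(self, now):
        """Return packets whose timers expired, re-arming each timer."""
        resend = []
        for seq, (payload, deadline) in sorted(self.unacked.items()):
            if now >= deadline:
                self.unacked[seq] = (payload, now + self.timeout)
                resend.append((seq, payload))
        return resend


@dataclass
class Receiver:
    expected: int = 0
    delivered: list = field(default_factory=list)

    def on_packet(self, seq, payload):
        """Accept only the next in-order packet; always answer with an ACK."""
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return self.expected - 1                  # highest in-order seq seen (-1 if none)


def demo():
    """Send three packets, drop the second in transit, recover via timeout."""
    sender, receiver = Sender(timeout=3), Receiver()
    packets = [sender.send(p, now=0) for p in ("a", "b", "c")]
    for seq, payload in packets:
        if seq == 1:
            continue                              # simulate a packet dropped along the route
        sender.on_ack(receiver.on_packet(seq, payload))

    for now in range(1, 10):
        for seq, payload in sender.on_tick(now):
            sender.on_ack(receiver.on_packet(seq, payload))
        if not sender.unacked:
            break
    return receiver.delivered, now


if __name__ == "__main__":
    delivered, finished_at = demo()
    print(delivered, finished_at)
```

In this sketch the loss of a single packet stalls delivery until the timer fires, and the sender must retain every unacknowledged payload in the meantime, which is the buffering and latency cost the paragraph above attributes to the scheme.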