High-performance computing (HPC) has seen a substantial increase in usage and interest in recent years. Historically, HPC was generally associated with so-called “supercomputers.” Supercomputers were introduced in the 1960s, made initially and, for decades, primarily by Seymour Cray at Control Data Corporation (CDC), Cray Research, and subsequent companies bearing Cray's name or monogram. While the supercomputers of the 1970s used only a few processors, in the 1990s machines with thousands of processors began to appear, and more recently massively parallel supercomputers with hundreds of thousands of “off-the-shelf” processors have been implemented.
There are many types of HPC architectures, both implemented and research-oriented, along with various levels of scale and performance. However, a common thread is the interconnection of a large number of compute units, such as processors and/or processor cores, to cooperatively perform tasks in a parallel manner. Under recent System on a Chip (SoC) designs and proposals, dozens of processor cores or the like are implemented on a single SoC, using a 2-dimensional (2D) array, torus, ring, or other configuration. Additionally, researchers have proposed 3D SoCs under which hundreds or even thousands of processor cores are interconnected in a 3D array. Separate multicore processors and SoCs may also be closely spaced on server boards, which, in turn, are interconnected in communication via a backplane or the like. Another common approach is to interconnect compute units in racks of servers (e.g., blade servers and modules) that are typically configured in a 2D array. IBM's Sequoia, alleged to be the world's fastest supercomputer, comprises a 2D array of 96 racks of server blades/modules totaling 1,572,864 cores, and consumes 7.9 megawatts when operating at peak performance.
One of the performance bottlenecks for HPC systems is the latency incurred in transferring data over the interconnects between compute nodes. Typically, the interconnects are structured in an interconnect hierarchy, with the highest-speed and shortest interconnects within the processors/SoCs at the top of the hierarchy, while latencies increase as one progresses down the hierarchy levels. For example, after the processor/SoC level, the interconnect hierarchy may include an inter-processor interconnect level, an inter-board interconnect level, and one or more additional levels connecting individual servers or aggregations of individual servers with servers/aggregations in other racks.
It is common for one or more levels of the interconnect hierarchy to employ different protocols. For example, the interconnects within an SoC are typically proprietary, while lower levels in the hierarchy may employ proprietary or standardized interconnects. The different interconnect levels also will typically implement different Physical (PHY) layers. As a result, it is necessary to employ some type of interconnect bridging between interconnect levels. In addition, bridging may be necessary within a given interconnect level when heterogeneous compute environments are implemented.
At lower levels of the interconnect hierarchy, standardized interconnects such as Ethernet (defined in various IEEE 802.3 standards) and InfiniBand are used. At the PHY layer, each of these standards supports wired connections, such as over wire cables and backplanes, as well as optical links. Ethernet is implemented at the Link Layer (Layer 2) in the OSI 7-layer model, and is fundamentally considered a link layer protocol. The InfiniBand standards define various OSI layer aspects for InfiniBand covering OSI layers 1-4.
A high-performance fabric can carry different types of traffic, where each type can have different requirements for latency. In particular, some traffic may consist of very large messages whose latency is not critical, and some traffic may consist of small messages whose latency directly impacts the performance of an application. Often, the performance of an application that runs on multiple nodes in the fabric is determined by the completion time of the last node in the cluster to complete its task. In such applications it is important to have a low minimum and average latency for these latency-sensitive messages, and it is just as critical to have a low maximum latency for these messages. The spread between the minimum and maximum latency, called the latency jitter, should be small.
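The latency figures discussed above can be made concrete with a short sketch. The function names and the sample values below are illustrative assumptions and are not taken from any particular fabric; the sketch only shows that jitter is the spread between the maximum and minimum latency, and that a multi-node task finishes when its slowest node finishes.

```python
def latency_stats(latencies_us):
    """Return (minimum, average, maximum, jitter) for a list of latencies.

    Jitter is taken here as the spread between the maximum and the
    minimum observed latency, as described in the text.
    """
    if not latencies_us:
        raise ValueError("at least one latency sample is required")
    lowest = min(latencies_us)
    highest = max(latencies_us)
    average = sum(latencies_us) / len(latencies_us)
    return lowest, average, highest, highest - lowest


def job_completion_time(node_times_us):
    """A task spread over several nodes completes when the last node does."""
    return max(node_times_us)


# Hypothetical samples: three messages, one of which was delayed.
samples = [1.0, 1.0, 4.0]
minimum, average, maximum, jitter = latency_stats(samples)
```

With these hypothetical samples, a single delayed message raises the maximum latency and the jitter even though the minimum is unchanged, and the same delayed value would set the completion time of the whole task.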
When small messages and large messages are mixed in a fabric, a small message may collide with a large packet if it arrives at a switch port just as the large packet begins transmission. In traditional fabrics, the small message cannot be transmitted until the large packet completes. This increases the switch latency seen by the small packet and significantly increases the latency jitter.
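The size of this collision-induced delay can be estimated with simple arithmetic. The sketch below assumes a worst case in which the small message arrives just as the large packet starts, so it waits for the full serialization time of that packet; the packet sizes and the 100 Gb/s link rate are hypothetical values chosen only for illustration.

```python
def blocking_delay_ns(packet_bytes, link_gbps):
    """Worst-case wait, in nanoseconds, behind one packet already in flight.

    One gigabit per second is one bit per nanosecond, so the time to
    serialize the packet is its size in bits divided by the link rate.
    """
    if packet_bytes < 0 or link_gbps <= 0:
        raise ValueError("packet size must be >= 0 and link rate > 0")
    return packet_bytes * 8 / link_gbps


# Hypothetical values: an 8192-byte packet versus a 256-byte packet
# on an assumed 100 Gb/s link.
large_wait = blocking_delay_ns(8192, 100.0)
small_wait = blocking_delay_ns(256, 100.0)
```

Under these assumptions the wait behind the larger packet is 32 times the wait behind the smaller one, which is the contribution to latency jitter that the text describes.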
Many fabrics address this problem by limiting the maximum size of the large packets, thus limiting the collision-induced delay. This solution negatively affects the efficiency of the fabric. Since smaller packets mean more packets are required to carry a message, and each packet requires a packet header, more total bits are needed to carry a given message.
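The efficiency cost of smaller packets can be sketched as follows. The 32-byte header, the 1 MiB message, and the two payload limits are assumed values for illustration only; real fabrics differ in header size and may add further per-packet overhead that this simple model ignores.

```python
import math


def packets_needed(message_bytes, max_payload_bytes):
    """Number of packets required to carry a message of the given size."""
    if max_payload_bytes <= 0:
        raise ValueError("maximum payload must be positive")
    return math.ceil(message_bytes / max_payload_bytes)


def wire_efficiency(message_bytes, max_payload_bytes, header_bytes):
    """Fraction of transmitted bytes that are message payload."""
    count = packets_needed(message_bytes, max_payload_bytes)
    total = message_bytes + count * header_bytes
    return message_bytes / total


MESSAGE = 1024 * 1024  # hypothetical 1 MiB message
HEADER = 32            # hypothetical per-packet header size in bytes

big_packets = wire_efficiency(MESSAGE, 8192, HEADER)
small_packets = wire_efficiency(MESSAGE, 256, HEADER)
```

In this model, the 8192-byte limit needs 128 packets and keeps efficiency above 99 percent, while the 256-byte limit needs 4096 packets and drops efficiency to about 89 percent. Limiting packet size therefore trades bandwidth for a lower worst-case delay.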
Proposals have been made to address this problem in Ethernet by defining two classes of traffic, time critical and non-time critical, and allowing time critical frames to preempt non-time critical frames. Different proposals allow the preempted frame to be restarted after preemption or resumed after preemption, with resumption being the preferred option.
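The difference between restarting and resuming a preempted frame can be illustrated with a simplified timing model. This sketch is not taken from any of the Ethernet proposals: it assumes a single link, one non-time-critical frame that starts at time zero, one time-critical frame that arrives while it is in flight, and no per-fragment overhead. All names and values are hypothetical.

```python
def completion_times(large, small, arrival, policy):
    """Completion times of both frames under a given preemption policy.

    large, small: transmission times of the two frames (same time unit).
    arrival: when the time-critical frame arrives, with 0 <= arrival < large.
    policy: "none", "restart", or "resume".
    """
    if not 0 <= arrival < large:
        raise ValueError("arrival must fall while the large frame is in flight")
    if policy == "none":
        # The small frame waits for the large frame to finish.
        return {"small": large + small, "large": large}
    if policy == "resume":
        # The large frame pauses, then continues where it left off.
        return {"small": arrival + small, "large": large + small}
    if policy == "restart":
        # The large frame is sent again from its start; the portion
        # already transmitted before preemption is wasted.
        return {"small": arrival + small, "large": arrival + small + large}
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical example: a 100-unit frame preempted at t=40 by a 5-unit frame.
results = {p: completion_times(100, 5, 40, p) for p in ("none", "restart", "resume")}
```

In this model both preemption policies give the time-critical frame the same latency, but restarting retransmits the portion of the large frame already sent, which is consistent with resumption being the preferred option.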