Historically used only in high-end supercomputers, interconnection networks are now found in systems of all sizes and types: from large supercomputers to small embedded systems-on-a-chip (SoC), and from inter-processor networks to router fabrics. Indeed, as system complexity and integration continue to increase, many designers are finding in interconnection network technology more efficient ways to route packets and more economical solutions for building computer clusters.
Interconnection networks for supercomputers in general, and particularly for recent massively parallel computing systems based on computer clusters, must meet demanding performance requirements. The fundamental design topics that determine the performance tradeoffs of an interconnection network are topology, routing, and flow control.
Interconnection networks are built up of switching elements, and the topology is the pattern in which the individual switches are connected to other elements, such as processors, memories and other switches. Among the known topologies, fat-trees have risen in popularity in the past few years and are used in many commercial high-performance switch-based point-to-point interconnects: for instance, InfiniBand connectivity products supplied by Mellanox Technologies (www.mellanox.com), Myrinet by Myricom (www.myri.com) and the networking products developed by Quadrics (www.quadrics.com).
The fat-tree topology is a particular case of a multistage interconnection/switching network, that is, a regular topology in which switches are identical and organized as a set of stages. Each stage is connected only to the previous and the next stage using regular connection patterns. A fat-tree topology is based on a complete tree: a set of processors is located at the leaves and each edge of the tree corresponds to a bidirectional channel. Unlike traditional trees, a fat-tree gets thicker near the root. Increasing the degree of the switches as they get nearer to the root would make the physical implementation unfeasible, so an alternative implementation with constant switch degree is the k-ary n-tree. In what follows, the term fat-tree also refers to k-ary n-trees.
A k-ary n-tree is composed of $N = k^n$ processing nodes and $n k^{n-1}$ switches with a constant degree $k \geq 1$: each switch has 2k input ports and 2k output ports, k of them being ascending ports (through which the switch is connected to a next-stage switch) and k of them descending ports (through which the switch is connected to a previous-stage switch). Each processing node is represented as an n-tuple in $\{0, 1, \ldots, k-1\}^n$, and each switch is defined as a pair $\langle s, o \rangle$, where $s \in \{0, \ldots, n-1\}$ is the stage at which the switch is located, stage 0 being the one closest to the processing nodes, and $o$ is an (n−1)-tuple in $\{0, 1, \ldots, k-1\}^{n-1}$. In a fat-tree, two switches $\langle s, o_{n-2}, \ldots, o_1, o_0 \rangle$ and $\langle s', o'_{n-2}, \ldots, o'_1, o'_0 \rangle$ are connected by an edge if and only if $s' = s + 1$ and $o_i = o'_i$ for all $i \neq s$. In addition, there is an edge between the switch $\langle 0, o_{n-2}, \ldots, o_1, o_0 \rangle$ and a processing node, the processing node being represented as a series of n links $p_{n-1}, \ldots, p_1, p_0$, if and only if $o_i = p_{i+1}$ for all $i \in \{n-2, \ldots, 1, 0\}$. The descending links of each switch are labelled from 0 to k−1, and the ascending links from k to 2k−1.
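The connection rules above can be checked on a small instance. The following Python sketch is illustrative only: the function name build_kary_ntree and the tuple layout (most significant digit first) are assumptions made for this example, not part of the described topology. It enumerates the nodes, switches and edges of a k-ary n-tree directly from the two "if and only if" conditions, so the resulting counts can be compared with $k^n$ and $n k^{n-1}$.

```python
from itertools import product


def build_kary_ntree(k, n):
    """Enumerate a k-ary n-tree from the edge rules stated in the text.

    Hypothetical helper for illustration. Tuples are stored most
    significant digit first, e.g. a node is (p_{n-1}, ..., p_1, p_0).
    """
    nodes = list(product(range(k), repeat=n))        # n-tuples over {0..k-1}
    labels = list(product(range(k), repeat=n - 1))   # (n-1)-tuples over {0..k-1}
    switches = [(s, o) for s in range(n) for o in labels]

    def digit(t, i):
        # Digit i of a tuple stored most significant digit first.
        return t[len(t) - 1 - i]

    # <s, o> -- <s+1, o'> iff o_i == o'_i for every i != s.
    switch_edges = [
        ((s, o), (s + 1, o2))
        for s in range(n - 1)
        for o in labels
        for o2 in labels
        if all(digit(o, i) == digit(o2, i) for i in range(n - 1) if i != s)
    ]

    # <0, o> -- node p iff o_i == p_{i+1} for every i in {0..n-2}.
    node_edges = [
        ((0, o), p)
        for o in labels
        for p in nodes
        if all(digit(o, i) == digit(p, i + 1) for i in range(n - 1))
    ]
    return nodes, switches, switch_edges, node_edges


if __name__ == "__main__":
    nodes, switches, switch_edges, node_edges = build_kary_ntree(2, 3)
    print(len(nodes), len(switches), len(switch_edges), len(node_edges))
```

For k = 2 and n = 3 this gives 8 processing nodes and 12 switches; every switch below the top stage has k ascending links, and every processing node is attached to exactly one stage-0 switch.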
Routing is one of the most important design issues of interconnection networks. Routing schemes can be broadly classified as source routing and distributed routing. In source routing, the entire path to the destination is known to the sender of a packet, so that the sender can specify, when sending data, the route that the packet takes through the network. Source routing is used in some networks, for instance in Myrinet, because it allows the routers to be very simple. Distributed routing, on the other hand, allows more flexibility, but the routers are more complex. Distributed routing can be implemented by fixed hardware specific to a routing function on a given topology, or by using forwarding tables, which are very flexible but suffer from a lack of scalability. Examples of commercial interconnection networks using distributed routing are InfiniBand and Quadrics.
For both source and distributed routing, the routing strategy determines the path that each packet follows between a source-destination pair, and may be adaptive, deterministic, or a combination of both. In deterministic routing, an injected packet traverses a fixed, predetermined path between source and destination, while in adaptive routing schemes the packet may traverse any one of the alternative paths available from the packet source to its destination. Adaptive routing takes into account the status of the network when making routing decisions and usually balances network traffic better, which allows the network to obtain a higher throughput. However, out-of-order packet delivery may be introduced, which is unacceptable for some applications. Deterministic routing algorithms usually do a very poor job of balancing traffic among the network links, but they are usually easier to implement, easier to make deadlock-free, and they guarantee in-order delivery.
An adaptive routing algorithm is composed of the routing and selection functions. The routing function supplies a set of output channels based on the current and destination nodes. The selection function selects an output channel from the set of channels supplied by the routing function. For example, the selection function may choose at each stage the link with the lowest traffic load.
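The split between the routing function and the selection function can be illustrated with a minimal Python sketch. All names here (ascending_ports, select_least_loaded, next_channel) and the per-port load dictionary are assumptions introduced for the example; the toy routing function simply offers every ascending port k to 2k−1 as a candidate, and ties in load are broken by the lowest port number, which is an arbitrary choice.

```python
def ascending_ports(k):
    """Toy routing function (assumed): offer all ascending ports k..2k-1.

    A real routing function would also depend on the current and
    destination nodes; that dependence is omitted in this sketch.
    """
    def routing_fn(current, destination):
        return list(range(k, 2 * k))
    return routing_fn


def select_least_loaded(candidates, load):
    """Selection function: pick the candidate channel with the lowest load.

    Ties are broken by the lowest port number (an arbitrary assumption).
    """
    return min(candidates, key=lambda port: (load[port], port))


def next_channel(routing_fn, selection_fn, current, destination, load):
    """One adaptive routing step: routing function, then selection function."""
    candidates = routing_fn(current, destination)
    return selection_fn(candidates, load)


if __name__ == "__main__":
    load = {4: 3, 5: 1, 6: 1, 7: 2}   # hypothetical per-port traffic load
    port = next_channel(ascending_ports(4), select_least_loaded,
                        current="sw", destination="node", load=load)
    print(port)  # 5
```

Keeping the two functions separate means the set of legal channels and the policy that picks among them can be changed independently.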
Routing in fat-trees is composed of two phases: an adaptive upwards phase and a deterministic downwards phase. The unique downwards path to the destination depends on the switch that has been reached in the upwards phase. In fat-trees, the decisions made in the upwards phase by the selection function can be critical, since they determine the switch reached in the ascending path and, hence, the unique downwards path to the destination. Therefore, the selection function in fat-trees has a strong impact on network performance.
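The two phases can be sketched for a k-ary n-tree as follows. This is a minimal illustration under stated assumptions, not a definitive routing implementation: the names is_ancestor and route are hypothetical, tuples are indexed by digit position (src[i] is digit $p_i$, label[i] is digit $o_i$), and taking an ascending link from stage s is modelled as freely choosing a new value for digit $o_s$, which is what the edge rule ($o_i = o'_i$ for all $i \neq s$) permits.

```python
def is_ancestor(stage, label, dst, n):
    """True if switch <stage, label> can reach dst using descending links only.

    Descending from stage s can still rewrite digits o_0..o_{s-1}, so only
    digits o_i with i >= s must already match the destination (o_i == p_{i+1}).
    """
    return all(label[i] == dst[i + 1] for i in range(stage, n - 1))


def route(src, dst, k, n, select=None):
    """Return the list of (stage, label) switches visited from src to dst.

    `select(stage, label, choices)` is the selection function used in the
    upwards phase; by default it takes the first choice (an assumption).
    """
    if select is None:
        select = lambda stage, label, choices: choices[0]
    stage, label = 0, list(src[1:])      # leaf switch of src: o_i = p_{i+1}
    path = [(stage, tuple(label))]

    # Upwards phase (adaptive): any ascending link is valid.
    while not is_ancestor(stage, label, dst, n):
        label[stage] = select(stage, tuple(label), list(range(k)))
        stage += 1
        path.append((stage, tuple(label)))

    # Downwards phase (deterministic): each step is forced by dst.
    while stage > 0:
        stage -= 1
        label[stage] = dst[stage + 1]
        path.append((stage, tuple(label)))
    return path


if __name__ == "__main__":
    src, dst = (0, 0, 0), (1, 1, 1)      # (p_0, p_1, p_2) in a 2-ary 3-tree
    print(route(src, dst, 2, 3))
    print(route(src, dst, 2, 3, select=lambda s, l, c: c[-1]))
```

In the 2-ary 3-tree example, the two selection policies reach different top-stage switches, (0, 0) and (1, 1), and therefore descend through different stage-1 switches, even though both paths end at the same leaf switch.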
A distributed deterministic routing strategy is implemented, for example, in InfiniBand, so that there is only one route per source-destination pair. Nonetheless, InfiniBand offers the possibility of using virtual destinations, and there can be a plurality of virtual destinations corresponding to a real destination, allowing the traffic for the same source-destination pair to be distributed through different adaptive routes determined between the source and each virtual destination. A severe drawback of this proposal [see “A Multiple LID Routing Scheme for Fat-Tree-Based InfiniBand Networks” by X. Lin, Y. Chung, and T. Huang, Parallel and Distributed Processing Symposium, April 2004], and of adaptive routing in general, arises when a given destination is congested, because the traffic keeps on being spread along the different adaptive routes, contributing to overall network congestion.