Router chips have historically been challenged to handle progressively higher data rates at progressively higher radix (i.e., the number of inputs and outputs) for at least two reasons. First, chip frequencies for routers are fundamentally limited by resistive-capacitive (RC)/wire delay; after all, there is some minimum amount of time required to get signals from one side of a chip to the other. Increasing the data rate under a fixed frequency therefore amounts to employing wider internal busses. However, this is not a scalable solution, since the quantum of data moved in a router chip is not large, or at least does not increase with increasing data rates. Second, it is preferable to have a router that is high-radix, that is, one with many narrow channels rather than a few wide channels. Again, this is to accommodate narrow native messages and to reduce the total hop-count across the network of router nodes (the average number of hops across the network is inversely proportional to the log of the radix). In fact, the optimal radix for a router chip is roughly linear in both the bandwidth of the router and the log of the number of nodes in the system. For example, very large high-performance computing (HPC) systems with thousands of sockets need very high bandwidth, high-radix routers.
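The relationship between radix and hop-count noted above can be illustrated with a minimal sketch. It assumes an idealized network in which every hop fans out to a number of choices equal to the radix, so that the hop-count is approximately log(N)/log(k) for N nodes and radix k. The function name `estimated_hops` and the example node count of 4096 are illustrative assumptions, not values taken from the text.

```python
import math


def estimated_hops(num_nodes: int, radix: int) -> float:
    """Idealized hop-count estimate (an assumption, not a measured value).

    Each hop is assumed to fan out to `radix` choices, so reaching
    `num_nodes` destinations takes about log(num_nodes) / log(radix) hops.
    """
    if num_nodes < 2 or radix < 2:
        raise ValueError("num_nodes and radix must both be at least 2")
    return math.log(num_nodes) / math.log(radix)


if __name__ == "__main__":
    # Hypothetical system size, chosen only for illustration.
    for radix in (8, 16, 64):
        print(radix, round(estimated_hops(4096, radix), 2))
```

Under this assumption, raising the radix from 8 to 64 halves the estimated hop-count for the same number of nodes, which is the motivation for preferring many narrow channels.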
Current approaches to building high-radix, high-bandwidth routers have largely focused on topology. A recent survey of router topologies has demonstrated a tendency to focus on the tradeoff between socket-wide topology and the complexity it implies for the internal switch crossbars and wiring. For instance, consider the simple 2D mesh, where each node is the location of a chip Input-Output (IO) pair (i.e., the input and output of a particular channel), such as shown in FIG. 1. As illustrated, there are 64 nodes, which correspond to 64×64 channels. This configuration also supports 64 IO's (in the absence of having multiple IO's sharing the same node).
This topology has the advantage that each node communicates with at most four neighbors and its local IO—requiring, at worst, a 5×5 switch. In addition, the wiring is very regular. However, the concentration of traffic is very uneven, drastically overburdening the central part of the chip, while the perimeter is comparatively underutilized. This asymmetry of bandwidth usage compromises the 2D-mesh's ability to deliver sufficient bandwidth in many situations.
To better understand how this asymmetry occurs, consider the following. Under most architectures, chip IO's (i.e., inputs and outputs to and from the router chip) come into the chip from the perimeter. At the same time, each node in the 2D mesh operates as a switch, receiving data at an input and forwarding it as an output to an adjacent node. As a result, for IO's that are associated with nodes that are not on the periphery, data is first transferred to the IO's associated node via wiring between the IO and the node. A routing determination is then made at the node, and the data is forwarded from that node to the node associated with the destination IO. Once forwarded to that node, the data is then transferred from the node via wiring to the destination IO. Under this scheme, the nodes toward the center are involved in forwarding more data than the nodes toward the edges, with the nodes along the periphery handling the least amount of traffic.
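The uneven concentration of traffic can be illustrated with a short sketch that counts how many routes pass through each node of an 8×8 mesh. The sketch makes two assumptions that are not stated in the text: every node sends to every other node exactly once, and routes follow dimension-order (X first, then Y) routing. The names `xy_route` and `mesh_load` are hypothetical.

```python
from itertools import product


def xy_route(src, dst):
    """Nodes visited under X-then-Y routing (assumed), endpoints included."""
    (sx, sy), (dx, dy) = src, dst
    x, y = sx, sy
    path = [(x, y)]
    while x != dx:  # travel along the row first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:  # then travel along the column
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def mesh_load(n=8):
    """Count, for each node, the routes that touch it under all-to-all traffic."""
    nodes = list(product(range(n), repeat=2))
    load = {node: 0 for node in nodes}
    for src, dst in product(nodes, repeat=2):
        if src == dst:
            continue
        for node in xy_route(src, dst):
            load[node] += 1
    return load


if __name__ == "__main__":
    load = mesh_load(8)
    print("corner node:", load[(0, 0)])
    print("central node:", load[(3, 3)])
```

Under these assumptions, a central node is touched by roughly three times as many routes as a corner node, consistent with the asymmetry described above. Other routing policies or traffic patterns would change the exact counts.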
This 2D-mesh topology has other drawbacks. As mentioned, wiring is implemented between each IO and its associated node (in addition to the wiring between nodes). This is expensive in terms of routing area and energy, and does not scale well. This approach is also inefficient. More precisely, for any topology that brings edge IO's to topological entry points (for routing decisions) distributed over the whole area of the chip, forwards the data to a centrally located exit point and finally sends it to an edge IO, the total expected distance travelled is 1.66 times the edge length of the chip (for a square chip and a uniformly random distribution of inputs to outputs). This is nearly double the expected Manhattan distance (i.e., the length of the shortest path between two points when travel is restricted to the horizontal and vertical directions of a 2D grid) between edge IO input/output pairs. The expected distance for the case where the entry points are located on the perimeter depends on the source edge and the destination edge, and is expressed here in units of the chip edge length. For instance, if the source and destination edges are perpendicular to one another, the expected distance is 1, whereas if the source and destination edges are the same, it is 0.33. Finally, if the source and destination edges are opposite one another, the expected distance is 1.33. For a uniformly random choice of edges, the perpendicular case occurs half of the time and the same-edge and opposite-edge cases each occur one quarter of the time, leading to an average distance of 11/12 (=0.917), which is the expected Manhattan distance and is thus optimal.
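The 11/12 figure for perimeter entry points can be checked with a brief sketch. It assumes a unit-square chip with the source and destination IO's each placed uniformly at random on the perimeter, which is one reading of the uniformly random distribution referred to above. The exact value is assembled from the three per-case expectations, and a seeded Monte Carlo estimate is included as a cross-check. The names `exact_mean`, `perimeter_point`, and `simulated_mean` are hypothetical.

```python
import random
from fractions import Fraction

# Expected Manhattan distance per case, in units of the chip edge length.
SAME = Fraction(1, 3)           # E|x - y| for x, y uniform on [0, 1]
PERPENDICULAR = Fraction(1)     # E[(1 - x) + y] = 1/2 + 1/2
OPPOSITE = 1 + Fraction(1, 3)   # one full edge plus E|x - y|


def exact_mean():
    """Weight the cases by how often each edge pairing occurs."""
    return (Fraction(1, 4) * SAME
            + Fraction(1, 2) * PERPENDICULAR
            + Fraction(1, 4) * OPPOSITE)


def perimeter_point(rng):
    """Pick a uniformly random point on the perimeter of the unit square."""
    edge, t = rng.randrange(4), rng.random()
    return [(t, 0.0), (1.0, t), (t, 1.0), (0.0, t)][edge]


def simulated_mean(samples=100_000, seed=0):
    """Monte Carlo estimate of the mean Manhattan distance (assumed model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        (x1, y1), (x2, y2) = perimeter_point(rng), perimeter_point(rng)
        total += abs(x1 - x2) + abs(y1 - y2)
    return total / samples


if __name__ == "__main__":
    print("exact:", exact_mean(), "=", float(exact_mean()))
    print("simulated:", simulated_mean())
```

The exact computation reproduces 11/12, and the simulated value should agree with it to within sampling error.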