1. Technical Field
The present invention generally relates to processing systems and, in particular, to methods for routing packets on a linear array of processors with nearest neighbor interconnection.
2. Background Description
As used herein, the term “ruler” refers to an in-line arrangement of processing elements, wherein each processing element of the arrangement is connected to each of its nearest neighbors, if any. The phrase “processing element” is hereinafter interchangeably referred to as “a node” or “a processor”. FIG. 1 is a diagram illustrating an elementary connection scheme (hereinafter referred to as a “direct” connection scheme or method) for an array of eight processors according to the prior art. Packets are injected by senders into left or right moving slots that advance one node per clock cycle. Packets are removed by receivers, freeing the slots. Packets on the top of the ruler move right (in the positive x direction) and packets on the bottom of the ruler move left (in the negative x direction). The links (inputs and outputs) to the left of node 1 and to the right of node 8 are not connected (and are thus not shown).
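The slot mechanism of the direct scheme can be sketched in a few lines of Python. This is only an illustrative model, not a description of the circuit of FIG. 1. The function name `step_right`, the `(source, destination)` tuple used for a packet, and the per-node `pending` queues are all hypothetical. The sketch also assumes that a node may inject into a slot in the same cycle in which it frees that slot, which the text above does not specify.

```python
from typing import List, Optional, Tuple

Packet = Tuple[int, int]  # (source node, destination node), nodes numbered from 1


def step_right(slots: List[Optional[Packet]],
               pending: List[List[Packet]]) -> List[Packet]:
    """Advance the right-moving slots of a ruler by one clock cycle.

    slots[i] holds the packet in the slot currently at node i+1 (None if free).
    pending[i] is the queue of right-bound packets waiting to be sent by node i+1.
    Returns the packets removed by their receivers during this cycle.
    """
    n = len(slots)
    # Each slot advances one node to the right. The slot arriving at node 1 is
    # always empty because the link to the left of node 1 is not connected.
    moved: List[Optional[Packet]] = [None] + slots[:-1]
    delivered: List[Packet] = []
    for i in range(n):
        pkt = moved[i]
        # The receiver removes a packet addressed to it, freeing the slot.
        if pkt is not None and pkt[1] == i + 1:
            delivered.append(pkt)
            moved[i] = None
        # Assumption: a sender may inject only into a free slot, and may reuse
        # a slot it has just freed in this same cycle.
        if moved[i] is None and pending[i]:
            moved[i] = pending[i].pop(0)
    slots[:] = moved
    return delivered
```

In this model, a packet injected at node 2 and addressed to node 4 is delivered on the third call to `step_right`. While a packet from node 1 is passing node 2, node 2 cannot inject. The left-moving slots would be modeled symmetrically.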
The nodes may be arranged in a two-dimensional array wherein communication between processors in different rows of the array is achieved by traveling first along horizontal rulers and then along vertical rulers. Each row has a corresponding horizontal ruler and each column has a corresponding vertical ruler. For example, in an exemplary 8 by 8 array of nodes, a packet sent from location (3,4) to location (6,7) enters the array at node (3,4), travels (3,4)->(4,4)->(5,4)->(6,4) along the horizontal ruler in row 4, hops to the column 6 vertical ruler at node (6,4), and travels (6,4)->(6,5)->(6,6)->(6,7) along the vertical ruler, terminating at location (6,7).
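The horizontal-then-vertical path described above can be enumerated with a short routine. The following Python sketch is illustrative only. The function name `xy_route` is hypothetical, and locations are assumed to be written as (column, row), which matches the example of row 4 and column 6 given above.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (column, row), as in the (3,4) -> (6,7) example


def xy_route(src: Node, dst: Node) -> List[Node]:
    """List the nodes visited when routing first along the horizontal ruler
    of the source row and then along the vertical ruler of the destination
    column."""
    x, y = src
    path = [(x, y)]
    # Travel along the horizontal ruler until the destination column is reached.
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    # Hop to the vertical ruler and travel to the destination row.
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


print(xy_route((3, 4), (6, 7)))
# [(3, 4), (4, 4), (5, 4), (6, 4), (6, 5), (6, 6), (6, 7)]
```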
When chips and boards are combined into machines with up to tens of thousands of processor chips, a straightforward generalization of this scheme to three dimensions routes packets first along “x” rulers, then “y” rulers, and finally along “z” rulers. Because of the short distances and constant regeneration by clocking, rulers achieve extremely high communication bandwidth.
Unfortunately, what would seem to be the obvious method for routing packets on a ruler has a serious drawback: unfairness, i.e., disparate bandwidth between the nodes of the ruler. In particular, nodes near the ends of the ruler get significantly more bandwidth than nodes near the center of the ruler. This is illustrated in the following example. Suppose that in a ruler with 8 nodes, packets are sent directly from source to destination. To get from node 2 to node 7, a packet travels 2->3->4->5->6->7, occupying a slot at each of the intermediate nodes 3 through 6 as it passes. Since nodes 1 and 8 are at the ends of the ruler, no packet ever passes through them; they are never blocked and get to inject traffic on every cycle. To a lesser extent, the same is true of nodes 2 and 7. In contrast, nodes 4 and 5, being near the center, are blocked a large fraction of the time.
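One rough way to see the disparity is to count, for each node, how many source-destination pairs route a packet through that node as an intermediate hop. The Python sketch below does this for the direct scheme. The function name `through_pairs` is hypothetical, and the count is only a proxy for blocking: it assumes that every pair is equally likely to communicate and does not model slot timing.

```python
from typing import List


def through_pairs(n: int) -> List[int]:
    """For each node k = 1..n of a ruler, count the ordered (source,
    destination) pairs whose direct path passes through k as an
    intermediate node, i.e. k lies strictly between source and destination."""
    counts = []
    for k in range(1, n + 1):
        c = sum(
            1
            for s in range(1, n + 1)
            for d in range(1, n + 1)
            if s != d and min(s, d) < k < max(s, d)
        )
        counts.append(c)
    return counts


print(through_pairs(8))
# [0, 12, 20, 24, 24, 20, 12, 0]
```

Under these assumptions, nodes 1 and 8 see no through traffic, nodes 2 and 7 see relatively little, and nodes 4 and 5 see the most. This is consistent with the ordering described above.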
If a large number of long wires were available, then this problem could be circumvented by a central arbitration scheme. However, the primary virtue of a ruler is that no wire spans more than one element, so that clock rates can be extremely high. In addition, the number of wires required for request/reply arbitration can potentially be as high as the number of wires used for data.
Thus, it would be desirable and highly advantageous to have methods for routing packets on a linear array of processors that provide fairness (no sender is preferred) with respect to all the processors of the array, without reducing bandwidth. Moreover, it would be desirable and highly advantageous to have methods for routing packets on a linear array of processors with reduced latency and power consumption with respect to the prior art.