A typical computer has a central processing unit (CPU) that controls the computer's processing. A computer system may contain more than one processor, or node. By working in parallel, such a system can process much more data in a given time than a computer system having a single processor.
An arrangement in which multiple processors are configured in a grid, with all the processors working simultaneously in parallel, is known as a mesh configuration. In a mesh configuration, the processors are connected to their neighboring processors in a mesh. In a two-dimensional mesh, each node or processor has 4 edges, or fewer if the processor is on a boundary of the mesh, with each edge connected to the next neighbor processor. If the edges of the mesh are wrapped around such that the processors on each boundary of the mesh are connected to the processors on the opposite boundary, a toroidal configuration results. This is known as a torus network.
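The difference between mesh and torus connectivity described above can be illustrated with a minimal sketch. The function name `neighbors` and its parameters are hypothetical and chosen only for this example; the sketch assumes a two-dimensional grid with at least 3 nodes per side, so that the wrapped neighbors of a node are distinct.

```python
def neighbors(x, y, width, height, wrap=False):
    """Return the neighbors of node (x, y) in a width x height 2D grid.

    wrap=False models a mesh: boundary nodes have fewer than 4 neighbors.
    wrap=True models a torus: boundary edges wrap to the opposite side.
    (Illustrative helper; assumes width >= 3 and height >= 3.)
    """
    result = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wrap:
            # Torus: coordinates wrap around modulo the grid size.
            result.append((nx % width, ny % height))
        elif 0 <= nx < width and 0 <= ny < height:
            # Mesh: drop neighbors that would fall outside the grid.
            result.append((nx, ny))
    return result
```

In a 4 x 4 grid, for example, the corner node (0, 0) has two neighbors in the mesh and four in the torus, since the wrapped edges reach nodes (3, 0) and (0, 3).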
Parallel computers with mesh and torus interconnection networks are known in the art because they are able to support many scientific and image processing applications very efficiently and have advantages in terms of ease of construction. A d-dimensional mesh or torus can be implemented with short wires in d-dimensions. In addition, mesh and torus networks can be constructed using identical boards each of which requires only a small number of pins for connections to other boards containing processor units. Because of this modularity a large number of distributed memory parallel computers utilize a mesh or torus interconnection network.
In terms of the differences between torus and mesh configured networks, for given d-dimensional mesh and torus computers of equal size, the torus computer has approximately half the diameter and twice the bisection bandwidth of the mesh computer. Furthermore, torus networks are node symmetric, i.e., all nodes in the torus are identical and therefore no region of the torus is particularly likely to suffer from congestion, which is the condition when the interconnection network becomes clogged with messages and begins to slow itself down. In contrast, mesh networks are not node symmetric and their lack of symmetry can cause certain regions of the mesh to suffer congestion. As a result, torus interconnection networks are expected to play an increasingly important role in future generations of parallel computers.
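The diameter and bisection comparison can be checked with a short calculation. The sketch below uses the standard closed-form expressions for a d-dimensional mesh or torus with k nodes along each dimension; it assumes k is even and counts bidirectional links cut when the network is split into two equal halves. The function names are hypothetical and used only for illustration.

```python
def mesh_diameter(k, d):
    """Longest shortest path in a d-dimensional mesh with k nodes per side."""
    return d * (k - 1)

def torus_diameter(k, d):
    """Longest shortest path in a d-dimensional torus with k nodes per side."""
    return d * (k // 2)

def mesh_bisection_links(k, d):
    """Links cut when halving the mesh across one dimension (k even)."""
    return k ** (d - 1)

def torus_bisection_links(k, d):
    """Links cut when halving the torus: the wraparound links double the count."""
    return 2 * k ** (d - 1)
```

For an 8 x 8 network (k = 8, d = 2), these expressions give a diameter of 14 for the mesh and 8 for the torus, and 8 bisection links for the mesh versus 16 for the torus.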
The processors in a parallel computer communicate with one another by sending or routing packets of data across the network to the other processors. These packets are sent through the interconnection network from their source processors (nodes) to their destination nodes by a packet routing algorithm. A fundamental requirement of any packet routing algorithm is that it must at the very least guarantee that all messages will eventually be delivered to their destinations. In order for the packet routing algorithm to satisfy this basic requirement, it must keep the interconnection network free from conditions known as deadlock, livelock, and starvation.
Deadlock is the condition of the interconnection network in which a set of buffers is completely occupied by messages all of which are only allowed to move to other buffers within the set. As a result, none of the messages in this set of buffers can make progress and none of them will ever be delivered. Livelock is the condition of the interconnection network in which a packet moves between buffers an unbounded number of times without being delivered to its destination processor. Thus, a routing algorithm which is subject to livelock may never deliver a packet to its destination processor even though the packet continues to move throughout the network amongst various nodes. Starvation is the condition of the interconnection network in which a packet waits for a buffer which becomes available an unbounded number of times without ever being granted access to that buffer. Thus a routing algorithm which is subject to starvation may fail to move a packet at all even though a buffer is available into which that packet could be moved.
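The deadlock definition above can be made concrete with a small sketch. It is an illustration of the definition only, not a routing algorithm, and it rests on simplifying assumptions: each buffer is identified by a name, `full` is the collection of completely occupied buffers, and `may_move_to` maps each buffer to the buffers that its messages are allowed to move to. All of these names are hypothetical.

```python
def deadlocked_buffers(full, may_move_to):
    """Return the largest set of full buffers whose messages may only
    move to other buffers within that same set.

    full        -- iterable of names of completely occupied buffers
    may_move_to -- dict: buffer name -> buffers its messages may enter
    (Assumes every message in a full buffer still needs to move.)
    """
    stuck = set(full)
    changed = True
    while changed:
        changed = False
        for buf in list(stuck):
            # A buffer can drain if any allowed target lies outside the
            # stuck set (it is not full, or it can itself drain).
            if any(nxt not in stuck for nxt in may_move_to.get(buf, ())):
                stuck.discard(buf)
                changed = True
    return stuck
```

For four full buffers arranged in a cycle, each allowed to move only to the next, the function returns all four. If one of them may also move to a buffer that is not full, the cycle can drain and the result is the empty set.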
A packet routing algorithm should also exhibit good performance characteristics. In order to provide good performance, a routing algorithm should avoid sending packets along unnecessarily long routes. A routing algorithm is said to be minimal if the routing algorithm sends each packet along the shortest possible route.
A packet routing algorithm should also be able to adapt to network congestion conditions. A packet routing algorithm is said to be adaptive if it allows packets to adapt to the various traffic conditions in the interconnection network and to select an alternative path based on the congestion any given packet encounters en route. By allowing packets to take alternate routes which avoid congestion, adaptive routing algorithms can greatly improve network communication performance. An adaptive, minimal routing algorithm that allows every packet to take any of its shortest routes to its destination node is said to be fully adaptive.
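The notions of minimal and fully adaptive routing can be illustrated on a torus. The following sketch computes the shortest hop count between two nodes and lists every single-hop move that lies on some shortest route; a fully adaptive minimal algorithm would be free to choose any of these moves at each node. The sketch assumes a torus with the same number of nodes, k, in every dimension, and the function names are hypothetical.

```python
def torus_distance(src, dst, k):
    """Shortest hop count between two nodes of a torus with k nodes per side."""
    return sum(min((t - s) % k, (s - t) % k) for s, t in zip(src, dst))

def productive_moves(src, dst, k):
    """List the (dimension, step) moves that bring a packet one hop closer.

    step is +1 or -1 along the given dimension. When the destination is
    exactly halfway around a ring, both directions are minimal.
    """
    moves = []
    for dim, (s, t) in enumerate(zip(src, dst)):
        forward = (t - s) % k        # hops needed going in the +1 direction
        if forward == 0:
            continue                 # already aligned in this dimension
        backward = k - forward       # hops needed going in the -1 direction
        if forward <= backward:
            moves.append((dim, +1))
        if backward <= forward:
            moves.append((dim, -1))
    return moves
```

On an 8 x 8 torus, a packet at (0, 0) destined for (2, 7) is 3 hops away and has two productive moves: +1 in dimension 0, or -1 in dimension 1 across the wraparound edge.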
Packet routing algorithms can be further classified by the type of switching mode or routing that they utilize. In store-and-forward routing, each packet is stored completely in a node before being sent to the next node along the path. In general, store-and-forward routing is a simple technique which works well when the packets are small in comparison with the channel widths. In contrast, wormhole routing breaks each packet into small pieces called flits. As soon as a flit has been received by a node, the flit is sent to the next node in its path without waiting for the remaining flits of the packet to arrive. This creates a worm of flits which follow one another from node to node through the network towards their destination node. If the head of this worm of flits encounters congestion, the entire worm is prevented from making progress. Another switching mode, which is similar to wormhole routing, is known as virtual cut-through routing. In virtual cut-through routing, each packet is sent as a worm of flits which follow one another through the network, with each node buffering the entire worm inside the node whenever congestion occurs on the interconnection network in order to reduce traffic. This requires the use of internal buffers in each node which are set aside for buffering packets that have encountered congestion.
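The effect of the switching mode on latency in an uncongested network can be estimated with a first-order model. The sketch below is a simplification, not a description of any particular hardware: it assumes each link transfers one flit per cycle, ignores router delay and contention, and uses hypothetical function names.

```python
def store_and_forward_cycles(hops, flits):
    """Cycles to deliver a packet when every node receives the whole
    packet before forwarding it: the full packet crosses each hop in turn."""
    return hops * flits

def pipelined_cycles(hops, flits):
    """Cycles to deliver a packet under wormhole or virtual cut-through
    routing without congestion: the head flit takes `hops` cycles and the
    remaining flits follow one cycle apart."""
    return hops + flits - 1
```

Under these assumptions, a 16-flit packet crossing 4 hops takes 64 cycles with store-and-forward routing and 19 cycles when the flits are pipelined, which is consistent with store-and-forward routing being best suited to small packets.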
Assuming relatively little message traffic across the interconnection network, wormhole routing and virtual cut-through routing perform well with long messages. However, under heavy traffic conditions, virtual cut-through routing performs significantly better than wormhole routing because each blocked message is stored in its entirety within one node, thereby removing the message from traffic.
One disadvantage of virtual cut-through routing is that it requires significantly more internal node storage than does wormhole routing. Large storage requirements are undesirable for two reasons. First, providing a large amount of internal storage is expensive in terms of space and overhead. Second, even if sufficient storage is available in the routing hardware, routing algorithms which require large amounts of internal storage in order to avoid deadlock place restrictions on how that storage can be used, thus leading to ineffective use of the limited available storage and resulting in poor network routing performance.
Many techniques have been developed to reduce the storage requirements of deadlock-free store-and-forward and virtual cut-through routing algorithms. These techniques can be divided into two classes: those which require only central buffers for storage, and those which require that each node have internal buffers that are associated with each edge that is incident to the node. Routing algorithms in the first class require that all packets entering a node be stored in a central buffer. If a large number of packets enter the node simultaneously, some of the packets will be forced to wait while the remaining packets are placed in the central buffers, because it may be impractical to design n-ported buffers for large values of n. As a result, the central buffers can become sequential bottlenecks which degrade network communication performance. In contrast, routing algorithms in the second class allow packets which enter a node simultaneously to be routed through the node in parallel because they do not require a single, central resource. Thus, algorithms in the second class, namely those that do not require central buffers, offer the potential for better network performance.