The invention relates to the routing of digital electronic data through network switches.
The Problem: Wormhole Routing Multicast Methods Deadlock
In wormhole routing, flow control is performed on units that are smaller than packets: flow-control digits, or "flits". FIG. 1 shows a packet broken down into flits: flit1, flit2, flit3 and last flit.
The header (first flit) of the packet advances immediately through each switching element (switch) unless it is blocked because of contention for an output port, and succeeding flits of the packet advance in pipelined fashion behind the header. This immediate forwarding minimizes the latency per switch. When the packet header is blocked, all flits of the packet are buffered in place until the output port is free. Thus, a single blocked packet may be blocked in place across many switches.
One prior approach for multidestination packet based multicast on unidirectional MINs uses strict wormhole routing. There are two general approaches to replicating a wormhole packet at a switch: synchronous and asynchronous.
Under synchronous replication, a multidestination packet's flit is forwarded (and simultaneously replicated) only if all the required output ports at that stage are free. Note that the required output ports at a stage may belong to more than one switch. If one or more of the required ports at the stage are busy, the packet is blocked in place without replication; replication of the flit to the output ports is done only when all the required output ports become free. The required output ports that are free are reserved, although no flit is transmitted to them until the other output ports become free. Thus the various copies of a packet's flit travel together from one stage to the next.
In contrast, under asynchronous replication, a multidestination packet's flit is forwarded to all the required output ports that are free when the flit arrives at a stage in the network. Thus, copies of a packet's flit may travel from one stage of the network to the next at different times. However, the flit cannot be discarded until all the required downstream output ports have received their respective copies of the flit.
If we consider a system adopting strict (pure) wormhole routing consisting of switches that have input buffers of size 1 flit, asynchronous replication does not prove very beneficial, since the packet's next flit will be blocked until the input buffer at the switch becomes free (and the input buffer becomes free only when the required but busy output ports become free). So the only benefit that asynchronous replication offers in such a system over synchronous replication is that a single flit can be forwarded on the output ports that have been successfully reserved by the packet. If the input buffer is of size f flits, using strict wormhole routing and asynchronous replication, up to f flits may be transmitted to the output ports that the packet has reserved before the packet blocks because of the required but busy output ports. Prior work has shown that hardware tree-based synchronous replication in networks adopting strict wormhole routing leads to deadlock, but suggested solutions to this have been extremely restrictive and inappropriate for variations of wormhole routing that provide more intermediate buffering.
The essential reason that wormhole methods deadlock is that the progress made by each output port at a replicating switch is dependent upon the progress of every other output port participating in the replication. If one output port is blocked or is currently sending another packet, then the flits to be sent by that output port must remain in the input port buffer, blocking subsequent flits from entering the input port. Therefore, free output ports are blocked by busy output ports. Two multicasts can easily block each other: multicast A could block multicast B in one switch, while simultaneously multicast B is blocking multicast A in another switch.
If the entire packet could be buffered at the input port, it would be possible for unblocked output ports to receive and transmit all flits of the packet, and this would decouple the dependence between output ports for this packet. Virtual cut-through (VCT) flow-control provides this guarantee. VCT allows the same low-latency pipelining as wormhole routing, but for VCT a switch only accepts a new packet when that switch can guarantee buffer space for the entire packet.
Review of SP2 Buffered Wormhole Routing
The buffered wormhole routing used in IBM's SP2 is a variation of wormhole routing wherein every switch in the network is equipped with a central buffer, as illustrated in FIG. 2.
When packets are blocked at a switch due to a busy output port, the switch attempts to store the packet in this central buffer, thus freeing the links held by the trailing packet flits. There may be enough space in the central buffer to store the entire packet. However, there is no guarantee that a packet arriving at a switch will find enough space in the central buffer to be completely stored. If the central buffer does not have adequate space to store the entire blocked packet, as many as possible of the packet flits are stored in the central buffer and the remainder of the packet is blocked in place. Note that in the absence of contention, packets may propagate through the network just as in a purely wormhole routed network, and the central buffers will remain empty. Alternately, a switch could be configured to force each packet through the central buffer, even when a packet encounters no contention.
Because there is no assurance that the central buffer can store an entire multidestination packet, the central buffer as implemented in SP2 cannot guarantee to prevent multicast deadlock. However, an SP2-like shared central buffer is an extremely attractive resource for packet replication. We will describe improvements to the basic central buffer free-space logic that are similar to virtual cut-through operation. Specifically, these improvements guarantee that any packet admitted to the central buffer can (eventually) be entirely stored. This guarantee effectively decouples the interdependence of the replicated output packets at a switch, eliminating the cause of multicast wormhole routing deadlock.
In the SP2 buffered wormhole implementation of the invention, the central buffer is constructed so as to effectively form a separate FIFO queue of packets for each output port. Each input port can write flits into the buffer, and each output port can read flits. Central buffer space is dynamically allocated to requesting input ports in a fair manner.
A number of flits are buffered into a chunk before being written into the central buffer, and chunks are read from the central buffer before being disassembled into flits again at the reading output port. This reduces the number of central buffer RAM read and write ports required. As an example, in the 8-ported SP2 routing elements, up to 1 flit is received or transmitted at each input port or output port every cycle. An SP2 chunk is 8 flits, and thus the central buffer only requires 1 RAM write port and 1 RAM read port to match the input and output bandwidth of the switch. The central buffer logic maintains a list of free chunk locations. A central buffer write allocates a free chunk, and a read returns a free chunk.
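The free-chunk bookkeeping described above can be sketched as follows. This is a minimal illustrative model, not the SP2 hardware: the class name, pool size, and payload representation are assumptions, and the single read/write port discipline is only implied by the sequential method calls.

```python
class CentralBuffer:
    """Illustrative model of a chunked central buffer with a free list.

    A write allocates a free chunk location; a read returns the
    location to the free list, as the text describes.
    """

    def __init__(self, num_chunks):
        self.data = [None] * num_chunks        # chunk payload storage
        self.free = list(range(num_chunks))    # list of free chunk locations

    def write_chunk(self, chunk):
        """Allocate a free location and store one chunk; return its location,
        or None if the buffer is full (caller must block in place)."""
        if not self.free:
            return None
        loc = self.free.pop()
        self.data[loc] = chunk
        return loc

    def read_chunk(self, loc):
        """Read the chunk at loc and return its location to the free list."""
        chunk = self.data[loc]
        self.data[loc] = None
        self.free.append(loc)
        return chunk
```

For example, a 2-chunk buffer accepts two writes, refuses a third, and accepts it again after one read frees a location.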
There must be a mechanism, which we will call the next-packet list, to order the packets within each packet queue. Each packet is divided into chunks, and thus there is also a mechanism, the next-chunk list, to order the chunks within a packet. First we describe the next-packet lists.
To record these two types of linking information, two pointer fields are associated with each chunk of data: the next-packet (NP[ ]) field 302 and the next-chunk (NC[ ]) field 304 (see FIG. 3).
In addition, each output port o maintains first-packet (firstP[o]) and last-packet (lastP[o]) pointers into its packet queue, and a first-chunk (firstC[o]) field that points to the next chunk to be read if output port o has not read the last chunk of the current packet. Each input port i maintains a last-chunk (lastC[i]) field that points to the last chunk written by input port i. All pointers are assumed to be nil when invalid. In the following discussion, we shall assume input port i is writing chunks to output port o.
The next-packet list is updated each time the first chunk (the header chunk) of a packet is written. If no packets are currently on the destination output port's packet queue (firstP[o] ≡ nil), then firstP[o] ← writeloc, where writeloc is the address where the header is written. Otherwise, NP[lastP[o]] ← writeloc. The last-packet pointer is updated (lastP[o] ← writeloc), and the packet list is terminated (NP[writeloc] ← nil).
The logical structure of a typical output port queue within the central buffer is shown in FIG. 4.
There are two packets shown, each with its associated chunks displayed in a column. The lightly-shaded fields indicate fields that are not currently valid (e.g., next-packet fields are not used except for header chunks).
When a header chunk is read from the central buffer, the next-packet list must also be updated (firstP[o]←NP[readloc]). It should be evident that the order of packets on a queue is entirely determined by the order of header chunk writes.
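The next-packet list updates on header writes and reads can be sketched as below. This is an illustrative software model under assumed names: NP, firstP, and lastP follow the text, None stands in for the nil pointer, and the chunk pool size is arbitrary.

```python
NUM_CHUNKS = 16
NP = [None] * NUM_CHUNKS   # next-packet field per chunk location
firstP = {}                # first-packet pointer, per output port
lastP = {}                 # last-packet pointer, per output port

def write_header(o, writeloc):
    """Enqueue a packet, identified by its header chunk location writeloc,
    on output port o's packet queue."""
    if firstP.get(o) is None:
        firstP[o] = writeloc       # empty queue: new packet becomes the head
    else:
        NP[lastP[o]] = writeloc    # link behind the current tail packet
    lastP[o] = writeloc            # new packet becomes the tail
    NP[writeloc] = None            # terminate the packet list

def read_header(o):
    """Dequeue the head packet of output port o's queue; return its
    header chunk location."""
    readloc = firstP[o]
    firstP[o] = NP[readloc]        # advance the head to the next packet
    return readloc
```

After two header writes at locations 5 and 9, firstP[o] points at 5, NP[5] links to 9, and reading the header advances firstP[o] to 9, matching the figure's queue structure.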
The next-chunk fields provide a similar linking function between packet chunks. On a write, when a valid last-chunk pointer exists, the central buffer next-chunk location pointed to by last-chunk is updated with the location of the currently written chunk (if lastC[i]≢nil, then NC[lastC[i]]←writeloc). When an input port writes a chunk, it also updates its last-chunk pointer with the write location (lastC[i]←writeloc). The one exception is a write of the last chunk of a packet: in this case the last-chunk pointer becomes invalid (lastC[i]←nil).
On the output port side, except when a header chunk is being read, the output port first-chunk field is used to determine the location of the next central buffer chunk read (readloc ← firstC[o]). For header chunk reads, the first-packet pointer is used (readloc ← firstP[o]). On every chunk read, the output port's first-chunk pointer is updated with the associated central buffer next-chunk pointer (firstC[o] ← NC[readloc]). If NC[readloc] ≡ nil and this chunk is not the last packet chunk, then the next packet chunk is not yet in the central buffer. In this case the output port becomes suspended: it cannot read any more chunks from the central buffer until the next chunk of this packet enters the central queue. During suspension the associated input port's last-chunk pointer also becomes invalid (lastC[i] ← nil). When the expected chunk is finally written, the first-chunk pointer is updated with the location of that chunk, unsuspending the output port (firstC[o] ← writeloc).
FIG. 5 shows the structure of the queue from FIG. 4 after the first two chunks of the first packet in the queue have been read by the output port.
Note that firstP and firstC have been updated, and firstC is now a valid pointer field required for retrieving the next chunk from the queue.
With only the mechanisms described, an output port may starve (input ports may not be able to forward chunks through the central queue to the output port). Starvation prevention methods are relatively straightforward but will not be described here in order to simplify the discussion; the replication methods to be described do not change the nature of this starvation scenario.
The invention is a method for replicating a packet within a switching unit utilizing wormhole flow control, comprising:
a) receiving a packet to be replicated and forwarded to an output port or ports, the packet containing data and destination address information;
b) storing the packet in a buffer;
c) notifying each target output port that the packet is destined for that output port;
d) forwarding the packet to each predetermined output port when the output port is available;
e) when the packet has been forwarded to each predetermined output port, deleting the packet from the buffer.
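Steps (a) through (e) above can be sketched as a small software model. This is a hedged illustration of the claimed method, not the hardware implementation: the class, the pending-port set, and the busy/sent bookkeeping are all assumed names introduced here.

```python
class ReplicatingSwitch:
    """Illustrative model of buffered packet replication: a packet is held
    once, forwarded to each target output port as that port becomes free,
    and deleted only after every copy has been sent."""

    def __init__(self, num_ports):
        self.busy = [False] * num_ports
        self.buffer = {}   # packet id -> (data, set of ports still pending)
        self.sent = []     # (port, data) log of forwarded copies

    def receive(self, pid, data, dest_ports):
        # Steps (a)-(c): store the packet and record every target output port.
        self.buffer[pid] = (data, set(dest_ports))

    def try_forward(self, pid):
        # Step (d): forward to each pending output port that is free.
        data, pending = self.buffer[pid]
        for port in list(pending):
            if not self.busy[port]:
                self.sent.append((port, data))
                pending.discard(port)
        # Step (e): delete the packet once every copy has been forwarded.
        if not pending:
            del self.buffer[pid]
```

Note how a busy port delays only its own copy: free ports receive the packet immediately, which is exactly the decoupling that eliminates the multicast deadlock described earlier.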