With respect to the design of large-scale packet switches and routers, it is well known that a pure output buffering strategy, while providing high switching efficiency, is not scalable to large switch dimensions. This is due to the requirement that the switch core operate faster than the individual switch ports by a factor equal to the number of ports. For this reason, large capacity switches are generally of the “input buffered” variety, with the input and output port modules being interconnected via a crossbar switch fabric.
On the other hand, experience shows that input queuing in conjunction with a first-in-first-out (FIFO) buffering arrangement can severely limit the switch throughput, owing to the so-called “head-of-line” (HoL) blocking problem. To overcome this problem, the buffer at each input port is organized into a set of “virtual output queues” (VOQs), each VOQ being dedicated to packets destined for a particular output port.
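The VOQ organization described above can be sketched as follows. This is a minimal illustration, not part of the disclosed embodiment; the class name `VOQInputPort` and its methods are hypothetical.

```python
from collections import deque

class VOQInputPort:
    """Input port buffer organized as one virtual output queue per output port."""
    def __init__(self, n_outputs):
        self.voqs = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, dest):
        # Packets are sorted by destination on arrival, so a backlogged
        # destination never blocks traffic headed elsewhere (no HoL blocking).
        self.voqs[dest].append(packet)

    def dequeue(self, dest):
        # Called when the scheduler matches this input to output `dest`.
        return self.voqs[dest].popleft() if self.voqs[dest] else None

    def requests(self):
        # The set of outputs for which this port holds backlogged cells,
        # i.e., the reservation requests it would send to the scheduler.
        return {d for d, q in enumerate(self.voqs) if q}
```

With a plain FIFO, a cell at the head of the queue bound for a busy output would block all cells behind it; here, each destination drains independently.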
FIG. 1 is a schematic diagram illustrating a VOQ switch 100 with N input ports 101 and N output ports 103. It is assumed that the time axis is divided into “slots” of equal length. A central switch scheduler 105, in conjunction with the VOQ arrangement, is activated once during each slot. Assuming fixed-length packets, or “cells”, each occupying exactly one slot, the central scheduler 105 identifies, during each slot, a set of matching input/output pairs between which cells are transmitted via a crossbar switch 107 without conflict.
With a cell-based arrangement, transmission of variable-length packets necessitates fragmentation of the packets into fixed-size cells prior to switching, with reassembly occurring after switching. This is a limitation of most switching methods currently available.
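The fragmentation and reassembly steps can be illustrated with a short sketch. The 64-byte cell payload and the end-of-packet/padding header fields are assumed values chosen for illustration only.

```python
CELL_PAYLOAD = 64  # bytes per fixed-size cell; an assumed value for illustration

def segment(packet: bytes):
    """Split a variable-length packet into fixed-size cells, padding the last."""
    cells = []
    for i in range(0, len(packet), CELL_PAYLOAD):
        chunk = packet[i:i + CELL_PAYLOAD]
        last = i + CELL_PAYLOAD >= len(packet)
        pad = CELL_PAYLOAD - len(chunk)
        # Each cell carries an end-of-packet flag and the pad length so the
        # egress side can strip the padding during reassembly.
        cells.append((chunk + b"\x00" * pad, last, pad))
    return cells

def reassemble(cells):
    """Rebuild the original packet from its cells after switching."""
    out = bytearray()
    for payload, last, pad in cells:
        out += payload[:CELL_PAYLOAD - pad] if pad else payload
        if last:
            break
    return bytes(out)
```

Note that a 150-byte packet occupies three cells and wastes 42 bytes of the last cell, which is one source of the overhead this fragmentation imposes.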
The central scheduler 105 resolves contention for input and output port access among competing traffic streams (i.e., the N² VOQs) during each slot. In accordance with the input/output matches made by the central scheduler 105 during each slot, the local scheduler 109 at each input port 101 routes the head-of-line (HoL) packet from the particular VOQ 111 selected.
To implement this functionality, the central scheduler 105 receives reservation “requests” during every slot from all of the switch input ports for accesses to the various switch output ports, and arbitrates these requests to issue a conflict-free set of “grants” to the successful input ports. The requests and grants may propagate on a distinct in-band or out-of-band signaling channel 113. The input/output matches identified for each slot are recorded in a connection matrix, and forwarded (at 115) to the crossbar fabric 107, which is configured accordingly.
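The per-slot request/grant/connection-matrix bookkeeping can be sketched as follows. The simple greedy arbiter used here is only a placeholder for the actual arbitration algorithm (PIM, RRM, iSLIP, etc.); the function name `schedule_slot` and the data layout are assumptions for illustration.

```python
def schedule_slot(requests, n):
    """One slot of the central scheduler: requests in, conflict-free grants out.

    `requests[i]` is the set of outputs for which input i has backlogged
    cells.  A greedy first-come arbiter stands in for the real algorithm.
    """
    matched_outputs = set()
    conn = [[0] * n for _ in range(n)]    # connection matrix for the crossbar
    grants = {}
    for i in range(n):
        for j in sorted(requests.get(i, ())):
            if j not in matched_outputs:  # each output granted at most once
                matched_outputs.add(j)
                grants[i] = j             # each input receives at most one grant
                conn[i][j] = 1
                break                     # each input matched at most once
    return grants, conn
```

The connection matrix returned here corresponds to what is forwarded (at 115) to configure the crossbar fabric 107 for the slot.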
The throughput efficiency of the switch 100 is dependent on the efficacy of the scheduling algorithm. An optimal way to perform the scheduling function may be based on a “maximum weight matching” (MWM) approach. However, this is known to have a complexity of O(N^(5/2)), and is not practical to implement at the switching speeds of interest. For this reason, a variety of scheduling algorithms based on various forms of sub-optimal heuristics are currently used in the industry.
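For small N, the MWM objective can be shown exactly by brute force, with the VOQ occupancies serving as the weights. This is purely illustrative (the enumeration below is O(N!), far worse than any practical method) and is not the algorithm a real scheduler would use.

```python
from itertools import permutations

def max_weight_match(weights):
    """Exact maximum weight matching by exhaustive search (illustration only).

    weights[i][j] might be the occupancy of VOQ (i, j); the returned
    permutation maps each input to the output it should serve this slot.
    """
    n = len(weights)
    best_w, best_perm = -1, None
    for perm in permutations(range(n)):       # every one-to-one input/output map
        w = sum(weights[i][perm[i]] for i in range(n))
        if w > best_w:
            best_w, best_perm = w, perm
    return best_w, best_perm
```

The cost of even polynomial exact methods at line rate is what motivates the heuristics discussed next.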
Three widely known heuristic algorithms for scheduling traffic in cell-based input-queued switches are “parallel iterative matching” (PIM), “round-robin matching” (RRM) and iSLIP. Each of these algorithms attempts to pick a conflict-free set of input/output matches during each cell slot, with the goal of attaining efficiency (i.e., maximizing the number of matches per cell slot), and fairness (i.e., providing equal bandwidth shares of each input and output port to competing backlogged traffic streams).
PIM achieves these goals by randomly selecting a candidate input for each output port in a first “output arbitration” phase, and then resolving conflicts among the plurality of outputs that may be picked for each input, in a second “input arbitration” phase which also employs a similar randomization strategy.
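A single PIM iteration can be sketched as below. The function name and data layout are assumptions for illustration; only the two randomized arbitration phases follow the description above.

```python
import random

def pim_iteration(requests, n, rng=random):
    """One PIM iteration: random output arbitration, then random input arbitration.

    `requests[i]` is the set of outputs requested by input i.
    """
    # Output arbitration: each requested output picks one requesting
    # input uniformly at random.
    picks = {}   # input -> list of outputs that picked it
    for j in range(n):
        requesters = [i for i in range(n) if j in requests.get(i, ())]
        if requesters:
            picks.setdefault(rng.choice(requesters), []).append(j)
    # Input arbitration: an input picked by several outputs accepts one
    # of them, again at random; the result is conflict-free by construction.
    return {i: rng.choice(outs) for i, outs in picks.items()}
```

Because each output picks at most one input and each input accepts at most one output, the returned matches never conflict, regardless of the random choices.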
RRM achieves the same goals in a similar sequence of output and input arbitration phases, except that the selections are made in a deterministic fashion using a round-robin arbitration pointer implemented at each output and input. In their single-iteration versions (i.e., the sequence of output arbitration followed by input arbitration being performed only once), the switch throughput under both PIM and RRM subject to full traffic backlog is known to saturate at a little over 60%.
iSLIP operates in a way similar to RRM, except that the movement of the output and input round-robin pointers is conditioned on successful matches, whereas it is unconditional in the case of RRM. With this modification, iSLIP is able to achieve 100% saturation throughput with a single iteration in fully backlogged systems.
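The RRM and iSLIP arbitration passes differ only in the pointer-update rule, which the following sketch makes explicit. The function name `rr_pass` and the `slip` flag are illustrative assumptions.

```python
def rr_pass(requests, n, out_ptr, in_ptr, slip=False):
    """One round-robin output/input arbitration pass (RRM, or iSLIP if slip=True)."""
    # Output arbitration: each output grants the first requesting input
    # found scanning round-robin from its pointer.
    grant_of = {}                        # output j -> granted input i
    granted = {}                         # input i -> outputs granting it
    for j in range(n):
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if j in requests.get(i, ()):
                grant_of[j] = i
                granted.setdefault(i, []).append(j)
                break
    # Input arbitration: each granted input accepts the first granting
    # output found scanning round-robin from its pointer.
    matches = {}
    for i, outs in granted.items():
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if j in outs:
                matches[i] = j
                break
    # Pointer update: RRM advances an output's pointer past the input it
    # granted unconditionally; iSLIP advances it only if that grant was
    # accepted, which desynchronizes the pointers over successive slots.
    for j, i in grant_of.items():
        if not slip or matches.get(i) == j:
            out_ptr[j] = (i + 1) % n
    for i, j in matches.items():
        in_ptr[i] = (j + 1) % n
    return matches
```

In the case where two outputs grant the same input, RRM moves both pointers while iSLIP moves only the pointer of the accepted output, so the rejected output retries the same input next slot.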
With multiple iterations (i.e., the arbitration sequence being repeated p times to increase the number of matches), however, all three schemes attain very nearly 100% throughput under full backlog, and the distinctions among them in terms of other performance attributes, such as delay, also become largely indiscernible.