Switches and routers have traditionally employed output-queuing. When packets or cells arrive at an input port, they are immediately transferred by a high-speed switching fabric to the correct output port and stored in output queues. Various queue management policies which have been considered, such as virtual clock algorithms, deficit round robin, weighted fair queuing or generalized processor sharing, and many variations, have attempted to control precisely the time of departure of packets belonging to different virtual circuits (VCs) or flows or sessions, thus providing various quality-of-service (QoS) features such as delay, bandwidth and fairness guarantees.
However, for these pure output-queuing schemes to work, the speed of the switching fabric and output buffer memory must be N times faster than the input line speed where N is the number of input lines, or the sum of the line speeds if they are not equal. This is because all input lines could have incoming data arriving at the same time, all needing to be transferred to the same output port. As line speeds increase to the Gb/s range and as routers have more input ports, the required fabric speed becomes infeasible unless very expensive technologies are used.
To overcome this problem, switches with input-queuing have been used in which incoming data are first stored in queues at the input ports. The decision of which packets to transfer across the fabric is made by a scheduling algorithm. A relatively slower fabric transfers some of the packets or cells to the output ports, where they might be transmitted immediately, or queued again for further resource management. The present invention only considers the problem from the viewpoint of designing a fabric fast enough to manage input queues, regardless of whether there are also output queues.
The ratio of the fabric speed to the input speed is called the "speedup." An output-queued switch essentially has a speedup of N (whereupon input queues become unnecessary), whereas an input-queued switch typically has a much lower speedup, as low as the minimum value of one, i.e., no speedup. The main advantage of input queuing with low speedup is that the slower fabric speed makes such a switch more feasible and scalable, in terms of current technology and cost. The main disadvantage is that packets are temporarily delayed in the input queues, especially by other packets in the same queues destined for different outputs. In contrast, with output-queuing a packet is never affected by other packets destined for different outputs. This additional input-side queuing delay must be understood or quantified in order for an input-queued switch to provide QoS guarantees comparable to those of an output-queued switch.
One problem with input-queued switches is that if the next cell to be transmitted (that is, the cell at the head of the queue) is blocked because its destination port is busy, or perhaps because it has a low priority, all other cells queued up behind it are also blocked. This is known as head-of-line blocking. This problem is commonly resolved by allowing per-output queuing, in which each input has not one but M queues corresponding to M outputs. Thus the unavailability of one output does not affect the scheduling of cells bound for other outputs.
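Per-output queuing at a single input port can be sketched as follows. This is a minimal illustration only; the class name `InputPort`, its methods, and the representation of cells as plain payload values are assumptions made for the example and do not come from the text.

```python
from collections import deque


class InputPort:
    """One input port holding a separate FIFO queue per output (hypothetical name)."""

    def __init__(self, num_outputs):
        self.queues = [deque() for _ in range(num_outputs)]

    def enqueue(self, output, payload):
        # A cell joins the queue dedicated to its destination output.
        self.queues[output].append(payload)

    def eligible_outputs(self, busy_outputs):
        # Outputs that have a waiting cell at this input and are not busy.
        return [j for j, q in enumerate(self.queues)
                if q and j not in busy_outputs]


port = InputPort(num_outputs=3)
port.enqueue(0, "a")  # earlier cell, bound for output 0
port.enqueue(2, "b")  # later cell, bound for output 2
# Output 0 is busy, yet the cell for output 2 remains schedulable.
eligible = port.eligible_outputs(busy_outputs={0})
```

With a single FIFO queue the cell bound for output 2 would wait behind the blocked cell; with one queue per output it does not.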
Graph theory concepts have been used to develop algorithms in attempts to efficiently select input/output pairs for transmission across the switch fabric. Inputs are treated as one set of nodes, outputs as the second set of nodes, and the paths between input/output pairs having data to transmit, are treated as the edges of the graph. A subset of edges such that each node is associated with only one edge is called a matching.
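The definition of a matching can be checked mechanically. The sketch below assumes edges are given as (input, output) index pairs; the function name `is_matching` is chosen for illustration.

```python
def is_matching(edges):
    """Return True if no input and no output appears in more than one edge."""
    inputs_seen, outputs_seen = set(), set()
    for i, j in edges:
        if i in inputs_seen or j in outputs_seen:
            return False
        inputs_seen.add(i)
        outputs_seen.add(j)
    return True
```

For example, the edges (0, 1) and (1, 0) form a matching, whereas (0, 1) and (0, 2) do not, because both use input 0.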
L. Tassiulas, A. Ephremides, "Stability properties of constrained queuing systems and scheduling policies for maximum throughput in multihop radio networks," IEEE Trans. Automatic Control, vol. 37, no. 12, December 1992, pp. 1936-1948, presented a scheduling algorithm using queue lengths as edge weights and choosing a matching with the maximum total weight at each timeslot. The expected queue lengths are bounded, i.e., they do not exceed some bound, assuming of course that no input or output port is overbooked. This holds even if the traffic pattern is non-uniform, and even if any or all ports are loaded arbitrarily close to 100%. Hence, this "maximum weighted matching" algorithm, using queue lengths as weights, achieves 100% throughput. For an overview of the maximum weighted matching problem, see, e.g., Ahuja, et al., Network flows: theory, algorithms, and applications, Englewood Cliffs, N.J., Prentice Hall, 1993.
No speedup is required for this result. However, a main drawback preventing the practical application of this theoretical result is that maximum weighted matching algorithms are complex and slow, and are therefore not suitable for implementation in high-speed switches. Most algorithms have O(N^3) or comparable complexity, and large overhead.
To overcome this problem, faster algorithms have recently been proved to achieve the same result of bounding expected queue lengths, and though not necessarily prior art, are presented here for a description of the present state of the art. For example, Mekkittikul and McKeown, "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches," IEEE INFOCOM 98, San Francisco, April 1998, uses maximum weighted matchings. However, the weights are "port occupancies" defined by w(e_ij) = sum of queue lengths of all VCs at input port i and all VCs destined to output port j. By using these edge weights, a faster algorithm, with complexity on the order of N^2.5 (O(N^2.5)), can be used to find maximum weighted matchings.
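The port-occupancy weight can be illustrated with a short sketch. It follows one reading of the definition above, in which the weight of edge (i, j) is the total occupancy of input i plus the total occupancy of output j, so that the queue from i to j is counted in both sums. That reading, the function name, and the matrix layout `queue_len[i][j]` are assumptions; the cited paper should be consulted for the exact definition.

```python
def port_occupancy_weights(queue_len):
    """queue_len[i][j] is the number of cells queued at input i for output j."""
    n_in = len(queue_len)
    n_out = len(queue_len[0])
    # Total occupancy of each input port (row sums).
    row = [sum(queue_len[i]) for i in range(n_in)]
    # Total occupancy destined to each output port (column sums).
    col = [sum(queue_len[i][j] for i in range(n_in)) for j in range(n_out)]
    return [[row[i] + col[j] for j in range(n_out)] for i in range(n_in)]
```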
L. Tassiulas, "Linear complexity algorithms for maximum throughput in radio networks and input queued switches," IEEE INFOCOM 98, San Francisco, April 1998, goes one step further and shows that, with the original queue lengths as edge weights, expected queue lengths are bounded by a large class of randomized algorithms. Moreover, some of these algorithms have O(N^2) complexity or "linear complexity", i.e., linear in the number of edges.
Mekkittikul and McKeown, "A Starvation-free Algorithm for Achieving 100% Throughput in an Input-Queued Switch," ICCCN 1996, also uses a maximum weighted matching algorithm on edge weights which are waiting times of the oldest cell in each queue. As a result, the expected waiting times, or cell delays, are bounded. This implies queue lengths are bounded, and hence is a stronger result.
All of these results are based on Lyapunov stability analysis, and consequently, all of the theoretically established bounds are very loose. While the algorithm of Tassiulas and Ephremides, and McKeown, Anantharam and Walrand, "Achieving 100% Throughput in an Input-Queued Switch," Proc. IEEE INFOCOM, San Francisco, March 1996, exhibits relatively small bounds in simulations, the sample randomized algorithm given in L. Tassiulas, "Linear complexity algorithms for maximum throughput in radio networks and input queued switches," IEEE INFOCOM 98, San Francisco, April 1998, which is the only "linear-complexity" algorithm above, still exhibits very large bounds in simulations. To the best of our knowledge, no linear-complexity algorithm has been shown to have small bounds in simulations and also provide some kind of theoretical guarantee.
Several new works have appeared recently dealing with QoS guarantees with speedup. The earliest of these, Prabhakar and McKeown, "On the speedup required for combined input and output queued switching," Computer Science Lab Technical Report, Stanford University, 1997, provides an algorithm that, with a speedup of four or more, allows an input-queued switch to exactly emulate an output-queued switch with FIFO queues. In other words, given any cell arrival pattern, the output patterns in the two switches are identical. Stoica, Zhang, "Exact Emulation of an Output Queuing Switch by a Combined Input Output Queuing Switch," IWQoS 1998, and Chuang, Goel, McKeown, Prabhakar, "Matching Output Queuing with a Combined Input Output Queued Switch," Technical Report CSL-TR-98-758, Stanford University, April 1998, strengthen this result in two ways. First, their algorithms require only a speedup of two. Second, their algorithms allow emulation of other output-queuing disciplines besides FIFO. These results can therefore be used with many of the common output fair queuing schemes that have known QoS guarantees.
Charny, Krishna, Patel, Simcoe, "Algorithms for Providing Bandwidth and Delay Guarantees in Input-Buffered Crossbars with Speedup," IWQoS 1998, and Krishna, Patel, Charny, Simcoe, "On the Speedup Required for Work-Conserving Crossbar Switches," IWQoS 1998, presented several new algorithms that are not emulation-based but provide QoS guarantees that are comparable to those achievable in well-known output-queuing schemes. For example, delay bounds independent of the switch size are obtained with a speedup of six. Delay bounds dependent on the switch size are obtained with a speedup of four. Finally, 100% throughput can be guaranteed with a speedup of two.
While theoretical studies have concentrated on the goals of bounding expected queue lengths and waiting times, various simulation studies have been carried out to investigate other aspects as well, such as average delay and packet loss or blocking probabilities. Some of these studies also investigated the advantage of having a small speedup of about two to five (much smaller than N). The scheduling algorithms used may be based on matching algorithms such as those of the theoretical works cited above, e.g., maximum weighted matching, maximum size (unweighted) matching, and randomized matchings.
The present invention focuses on three QoS features: bandwidth reservations, cell delay guarantees, and fair sharing of unreserved switch capacity in an input-queued switch with no speedup. Several embodiments employing fast, practical, linear-complexity scheduling algorithms are presented which, in simulations, support large amounts of bandwidth reservation (up to 90% of switch capacity) with low delay, facilitate approximate max-min fair sharing of unreserved capacity, and achieve 100% throughput.
In accordance with the present invention, a method for scheduling transmission of cells through a data switch, preferably a crossbar switch, having a plurality of inputs and outputs, provides a plurality of buffers at each input, each buffer corresponding to an output. The buffers temporarily hold incoming cells. A weight is assigned to each buffer; and buffers are selected according to a weighted matching of inputs and outputs. Finally, cells are transmitted from the selected buffers to the corresponding outputs.
Preferably, the matching requires that each buffer which is not selected must share an input or output with a selected buffer whose weight is greater than or equal to the unselected buffer's weight.
Preferably, the matching is a maximal weighted matching and is determined by using a stable marriage algorithm. Buffers having the greatest weight are selected first, followed by buffers having the next greatest weight, and so on, until buffers having a least positive weight are assigned.
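The heaviest-first selection described above can be sketched as a greedy procedure. This is an illustrative sketch rather than the claimed stable marriage algorithm itself: it sorts the edges for brevity, and the tie-breaking rule (lowest input index, then lowest output index) and the function name are assumptions.

```python
def greedy_maximal_matching(weights):
    """weights[i][j] is the weight of the buffer at input i for output j."""
    # Only buffers with positive weight are candidates.
    edges = [(w, i, j)
             for i, row in enumerate(weights)
             for j, w in enumerate(row) if w > 0]
    # Heaviest first; ties broken by (input, output) index, an assumption.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))
    used_inputs, used_outputs, selected = set(), set(), []
    for w, i, j in edges:
        # Select a buffer only if its input and output are both still free.
        if i not in used_inputs and j not in used_outputs:
            selected.append((i, j))
            used_inputs.add(i)
            used_outputs.add(j)
    return selected
```

Every buffer left unselected by this procedure was skipped because a buffer of greater or equal weight had already taken its input or its output, which is the property stated for the preferred matching.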
In a preferred embodiment, assigning weights, selecting buffers and transmitting cells are performed repeatedly over consecutive timeslots. Within each timeslot, credits are assigned to each buffer according to a guaranteed bandwidth for that buffer. The weights associated with each buffer are set based on an accumulated number of credits associated with the buffer. Preferably, credits are assigned in integral units, including zero units.
In another preferred embodiment, the weight associated with a buffer is zero if the buffer is empty, regardless of actual credit.
In yet another preferred embodiment, a credit bucket size is assigned to each buffer. If a buffer is empty and has a number of credits exceeding its associated credit bucket size, the buffer receives no further credits.
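The credit-related embodiments above can be combined in one small sketch: integral credits, a weight of zero for an empty buffer, and a credit bucket that stops accrual for an empty buffer. The class name `CreditedBuffer`, its fields, and the decision to combine the three embodiments in one class are assumptions made for illustration; how credits are consumed on transmission is not modeled.

```python
class CreditedBuffer:
    """Tracks cells and accumulated credits for one input/output buffer."""

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.credits = 0
        self.cells = 0

    def add_credits(self, units):
        # An empty buffer whose credits already exceed its bucket size
        # receives no further credits.
        if self.cells == 0 and self.credits > self.bucket_size:
            return
        self.credits += units

    def weight(self):
        # An empty buffer has weight zero regardless of its actual credit.
        return self.credits if self.cells > 0 else 0


buf = CreditedBuffer(bucket_size=2)
for _ in range(5):
    buf.add_credits(1)       # one credit per timeslot while idle
idle_credits = buf.credits   # accrual stops once credits exceed the bucket
idle_weight = buf.weight()
buf.cells = 1                # a cell arrives
busy_weight = buf.weight()
```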
In still another preferred embodiment, each weight associated with a buffer is set to either the buffer's length, or to the number of credits associated with the buffer, preferably whichever is less. In an enhancement to this embodiment, the age of each cell is maintained, and if the age for some cell exceeds a predefined threshold for the corresponding buffer, an exception mechanism is employed to decide whether to select the buffer. In another enhancement, cells are flushed out of the buffer with phantom cells during long idle periods.
In yet another preferred embodiment, each buffer's weight is set to a validated waiting time associated with an oldest cell in the buffer. Validated waiting time for a cell is determined by validating a cell when there is a credit available, and recording the time of validation for each cell. The validated waiting time for that cell is then calculated based on the difference between the current time and the validation time.
Alternatively, the validated waiting time of the oldest cell in a buffer is determined to be either the actual waiting time of the oldest cell, or the age of the oldest credit associated with the buffer, whichever is less.
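This alternative reduces to taking the lesser of two ages. The sketch below assumes times are measured in timeslots and that the buffer holds at least one cell and at least one credit; the function and parameter names are hypothetical.

```python
def validated_waiting_time(now, oldest_cell_arrival, oldest_credit_arrival):
    """Lesser of the oldest cell's actual wait and the oldest credit's age."""
    actual_wait = now - oldest_cell_arrival
    credit_age = now - oldest_credit_arrival
    return min(actual_wait, credit_age)
```

For instance, at time 10 a cell that arrived at time 2 and whose oldest credit arrived at time 6 has a validated waiting time of 4, since the cell could not have been validated before the credit existed.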
In yet another alternative, the validated waiting time of the oldest cell is estimated. The estimate is based on the actual waiting time of the oldest cell, the number of credits associated with the buffer, and the rate at which credits are accrued.
In still another preferred embodiment, each buffer's weight is scaled by a constant which is inversely proportional to a predetermined tolerable delay. Preferably, the tolerable delay associated with a buffer is the inverse of the guaranteed bandwidth associated with the buffer.
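When the tolerable delay is the inverse of the guaranteed bandwidth, the scaling amounts to multiplying the weight by the guaranteed bandwidth. A minimal sketch follows; the proportionality constant `k`, the function name, and the expression of bandwidth as cells per timeslot are assumptions.

```python
def scaled_weight(raw_weight, guaranteed_rate, k=1.0):
    """Scale a weight by a constant inversely proportional to tolerable delay."""
    # Preferred choice: tolerable delay is the inverse of the guaranteed rate.
    tolerable_delay = 1.0 / guaranteed_rate
    return raw_weight * k / tolerable_delay
```

A buffer with a raw weight of 8 and a guaranteed rate of 0.25 cells per timeslot has a tolerable delay of 4 timeslots and a scaled weight of 2.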
In yet another preferred embodiment, a weighted matching is computed at each timeslot and a corresponding total edge weight for the matching is determined. The total edge weight of the current matching is compared with that of the matching selected in the previous timeslot, and the matching having the larger total edge weight is selected.
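The comparison between the current and previous matchings can be sketched as below. The sketch assumes that the previous matching is re-evaluated using the current timeslot's weights and that ties favor the current matching; neither detail is stated in the text, and the function names are hypothetical.

```python
def total_weight(matching, weights):
    """Sum of weights[i][j] over the (input, output) pairs of a matching."""
    return sum(weights[i][j] for i, j in matching)


def keep_heavier(current, previous, weights):
    # Assumption: both matchings are scored with the current weights,
    # and the current matching wins ties.
    if total_weight(previous, weights) > total_weight(current, weights):
        return previous
    return current
```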
In still another preferred embodiment, fairness is provided in any leftover bandwidth by determining a second matching between remaining inputs and outputs. Buffers are selected according to the second matching, and cells are transmitted from the selected buffers to the corresponding outputs. Preferably, max-min fairness is used to determine the second matching. Alternatively, during a second phase of weight assignments, additional paths are chosen based on usage weights. In yet another alternative, fairness is implemented by assigning weights based on both usage and credits.
In yet another preferred embodiment, several virtual connections share the same input-output pair. Each virtual connection has its own guaranteed rate. At each input, a buffer is provided for each virtual connection passing through that input. For each input/output pair, the virtual connection with the maximum weight is determined, and that weight is assigned to the corresponding input/output pair. Input/output pairs are then selected based on the assigned weights, and according to a maximal weighted matching. Finally, cells are transmitted from the selected inputs to the corresponding outputs.
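The reduction from per-connection weights to per-pair weights can be sketched as follows. The dictionary layout, the function name `pair_weights`, and the choice to also return the winning connection's identifier are assumptions for illustration.

```python
def pair_weights(vc_weights):
    """vc_weights maps (input, output) to a dict of {vc_id: weight}.

    Returns a map from (input, output) to (max weight, vc_id holding it).
    """
    result = {}
    for pair, vcs in vc_weights.items():
        if vcs:
            # The virtual connection with the maximum weight represents the pair.
            best_vc = max(vcs, key=vcs.get)
            result[pair] = (vcs[best_vc], best_vc)
    return result
```

The resulting per-pair weights can then be passed to whatever maximal weighted matching procedure is in use, and the cell transmitted for a selected pair is taken from the winning connection's buffer.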
In still another preferred embodiment, a data structure of linked lists is provided. Each list is associated with a weight, and holds references to buffers which have that weight. In addition, each list has links to next and previous lists associated respectively with weights one greater and one less than the subject list's associated weight. Each buffer reference is placed in a list associated with the weight of the buffer. Upon incrementing a buffer's weight by one, its reference is moved from its current list to the next list. Similarly, upon decrementing a buffer's weight by one, its reference is moved from its current list to the previous list. Finally, for each list in order of descending weights, buffers are selected which do not share input or output nodes with buffers which have already been selected.
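The weight-indexed lists can be sketched as below. As a simplification, a dictionary of sets keyed by weight stands in for the doubly linked lists with next and previous links, and the weights are sorted at selection time, which a true linked structure would avoid. The class name `WeightLists`, its methods, and the deterministic ordering within a list are assumptions.

```python
class WeightLists:
    """Buffer references grouped by weight (dict of sets as a stand-in)."""

    def __init__(self):
        self.lists = {}      # weight -> set of (input, output) references
        self.weight_of = {}  # (input, output) -> current weight

    def insert(self, buf, weight):
        self.weight_of[buf] = weight
        self.lists.setdefault(weight, set()).add(buf)

    def _move(self, buf, delta):
        # Remove the reference from its current list and place it in the
        # list for the adjacent weight.
        old = self.weight_of[buf]
        self.lists[old].discard(buf)
        if not self.lists[old]:
            del self.lists[old]
        self.insert(buf, old + delta)

    def increment(self, buf):
        self._move(buf, +1)

    def decrement(self, buf):
        self._move(buf, -1)

    def select(self):
        used_in, used_out, chosen = set(), set(), []
        # Visit lists in order of descending weight.
        for weight in sorted(self.lists, reverse=True):
            for i, j in sorted(self.lists[weight]):
                if i not in used_in and j not in used_out:
                    chosen.append((i, j))
                    used_in.add(i)
                    used_out.add(j)
        return chosen


wl = WeightLists()
wl.insert((0, 0), 2)
wl.insert((0, 1), 3)
wl.insert((1, 1), 3)
wl.insert((1, 0), 1)
first = wl.select()
wl.increment((1, 1))   # buffer (1, 1) moves from the weight-3 list to weight 4
second = wl.select()
```

Because a weight changes by at most one per update, moving a reference only ever touches a neighboring list, which is what makes the linked-list form inexpensive to maintain.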