1. Field of the Invention
The present invention relates to network systems and switches that control the flow of data around the network, and more particularly to schedulers that manage the flow of data through high capacity switches.
2. Description of the Related Art
Input-queued switch architectures have long been an attractive alternative for high speed switching systems, mainly because the memory access speed of the input buffers scales with the speed of a single input line, not with the total switching capacity. However, an input buffered switch has long been known to suffer from head-of-line blocking, which places a theoretical limit of 58.6% on its total throughput. See, M. J. Karol, M. G. Hluchyj, S. P. Morgan, “Input Versus Output Queuing on a Space-Division Packet Switch”, IEEE Transactions on Communications, Vol. COM-35, No. 12, pp. 1347–1356, December 1987.
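The 58.6% figure is the large-N saturation limit, 2 − √2 ≈ 0.586, for uniform random traffic. The following is a minimal simulation sketch of that effect, not taken from the cited reference. It assumes every input is always backlogged, destinations are uniformly random, and each contended output picks one winner at random. The function name and parameters are illustrative only.

```python
import random


def hol_saturation_throughput(n_ports, n_slots, seed=0):
    """Estimate the saturation throughput of a FIFO input-queued switch.

    Assumptions of this sketch: every input always has a packet waiting,
    destinations are uniform at random, and each output serves one randomly
    chosen contender per slot.
    """
    rng = random.Random(seed)
    # hol[i] is the destination output of the head-of-line packet at input i.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group the inputs by the output their head-of-line packet wants.
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        # Each requested output serves exactly one contender; the losers stay
        # blocked at the head of their FIFO, together with everything behind them.
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)  # next packet moves to the head
            served += 1
    return served / (n_ports * n_slots)


throughput = hol_saturation_throughput(n_ports=32, n_slots=5000)
print(f"estimated saturation throughput: {throughput:.3f}")
```

For moderately large N the estimate settles somewhat above the asymptotic limit and approaches it as N grows.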
More recently, an input queuing technique called Virtual Output Queuing (VOQ) has been proposed to overcome the head-of-line blocking problem of input buffered switches. See, Y. Tamir and G. Frazier, “High Performance Multi-queue Buffers for VLSI Communication Switches”, Proceedings of 15th Ann. Symp. on Comp. Arch., pp. 343–354, June 1988 and T. Anderson, S. Owicki, J. Saxe, C. Thacker, “High Speed Switch Scheduling for Local Area Networks,” ACM Transactions on Computer Systems, pp. 319–352, November 1993. The idea is to keep, at each input port, a separate queue for each output port of the switch, so that a packet destined to an available output port can no longer be blocked by a head-of-line packet that cannot proceed due to contention for a different port. Thus, an N×N switch has N queues per input port, or N² queues in total. As discussed by others (A. Mekkittikul, N. McKeown, “A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches”, Proceedings of Infocom98, April 1998), further exploration of the VOQ technique has shown that 100% throughput is indeed achievable through the design of smart schedulers.
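The queue organization described above can be sketched as follows. This is an illustrative data structure only; the class and method names are hypothetical and do not come from the cited references.

```python
from collections import deque


class VOQInputPort:
    """One input port holding a separate FIFO per output port (N queues)."""

    def __init__(self, n_outputs):
        self.queues = [deque() for _ in range(n_outputs)]

    def enqueue(self, packet, output):
        self.queues[output].append(packet)

    def backlogged_outputs(self):
        """Outputs for which this input currently holds at least one packet."""
        return [out for out, q in enumerate(self.queues) if q]

    def dequeue(self, output):
        return self.queues[output].popleft()


def total_queues(n_ports):
    """An N x N switch keeps N VOQs at each of its N inputs."""
    return n_ports * n_ports


port = VOQInputPort(n_outputs=4)
port.enqueue("a", output=0)  # arrives first, destined to a contended output
port.enqueue("b", output=2)  # arrives second, destined to an idle output
busy_outputs = {0}           # assumed state: output 0 is taken by another input
eligible = [o for o in port.backlogged_outputs() if o not in busy_outputs]
sent = port.dequeue(eligible[0])  # "b" is served although "a" arrived first
```

With a single FIFO per input, packet "b" would have had to wait behind "a"; with per-output queues it does not.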
Schedulers for VOQ input buffered switches thus become one of the key design points of a high speed input buffered switch. With VOQ, a scheduler has multiple choices for switching packets from backlogged input ports to output ports, many more than in a regular First-In-First-Out (FIFO) input queuing architecture. Every input/output pair of ports can be selected, among the backlogged input ports. Most work devoted to such schedulers can be classified as follows. Centralized schedulers are those for which the scheduler is a single entity, which has information about all N² VOQs and makes a scheduling decision about all possible input/output pairs of ports per packet slot. See, for example, A. Mekkittikul, N. McKeown, “A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches”, Proceedings of Infocom98, April 1998. Distributed schedulers, on the other hand, are those for which the scheduler is partitioned into functional blocks, usually one or two blocks per input or output port, or even one block per input/output cross point. See, for example, N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE Micro, January/February 1997, pp. 26–32 and Y. Tamir and H-C Chi, “Symmetric Crossbar Arbiters for VLSI Communication Switches”, IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 1, pp. 13–27, 1993.
Centralized schedulers require access to N² pieces of information before scheduling decisions can be made. Such schedulers are generally not scalable, in the sense that the hardware to implement them is highly dependent on the number of switch lines N. FIG. 1 illustrates one such scheduler. Distributed schedulers have the potential to make the hardware more independent of the number of switch ports. However, the ones proposed so far still require a communication mechanism which provides information about all N² queues before a scheduling decision can be made for a particular packet slot. This communication can take place either in a parallel way (as in the SLIP scheduler, see, N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, “The Tiny Tera: A Packet Switch Core”, IEEE Micro, January/February 1997, pp. 26–32), or in a round-robin way (see, Y. Tamir and H-C Chi, “Symmetric Crossbar Arbiters for VLSI Communication Switches”, IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 1, pp. 13–27, 1993). Both architectures are shown in FIG. 2. The parallel communication architecture makes each block explicitly dependent on the size of the switch, since each block must receive N² messages. The round-robin architecture overcomes this problem, but creates another one: in order to reach a scheduling decision about all output ports, the message passing must complete a full round within a single packet slot. This requires messages to be processed at least N times faster than scheduling decisions are made.
More recently, a Round-Robin Greedy Scheduler (RRGS) was proposed, a scheduler based on message passing in which each input port makes a scheduling decision and passes this information, in a round-robin fashion, to its next neighbor. See, A. Smiljanic, R. Fan, G. Ramamurthy, “RRGS-Round-Robin Greedy Scheduling for Electronic/Optical Terabit Switches”, NEC C & C Research Laboratories, Technical Report TR 98-C063-4-5083-2, 1998. See, also, co-pending U.S. application Ser. No. 09/206,975. In order to reduce message passing speed requirements, RRGS introduces a pipeline feature. Input ports make scheduling decisions about future slots, far enough into the future to allow enough time for the message passing mechanism to disseminate this information among the other input ports. RRGS can therefore provide high speed scheduling.
Before the present invention is described, the general pipelined scheduler architecture will be discussed. For the switch architecture, it is assumed that the scheduling is applied to a pure non-blocking N×N crossbar switch. It is also assumed that Virtual Output Queues (VOQs) are used to eliminate the head-of-line (HOL) blocking problem. FIG. 3 shows one such switch.
In addition, fixed size packets and uniform link speeds are assumed. Time is slotted, where a slot is defined as the time taken for the transmission of a single packet by an output link. A non-blocking crossbar can thus switch up to N packets per time slot if no output port contention exists. The basic task of the scheduler is to determine, on a per slot basis, which of the non-empty queues among the N² VOQs will have access to the output ports. For efficiency, the scheduler must resolve all contentions among the backlogged queues within a single time slot.
As line speeds continue to grow, it is paramount that scheduling algorithms be scalable to large capacity switches. Therefore, a distributed architecture seems attractive, since it alleviates the tight processing time required for packet scheduling in a high speed switch. For instance, for a 16×16 port switch with a 10 Gbit/s line speed, scheduling decisions must be made at each packet transmission time, which is 42 ns for a 424 bit ATM cell. If a sequential scheduler is used, each decision must be made in about 0.16 ns, since N² = 256 decisions must be made per slot. If an optical core is used, it makes sense to distribute the electronic hardware on a per port basis, leaving the total switching bandwidth requirement to the optical core. Moreover, a distributed scheduler should naturally scale to any number of lines. FIG. 4 illustrates such a scheduler.
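The timing figures quoted above for the 16×16 example can be reproduced with a few lines of arithmetic. The function name below is illustrative; the inputs are the values given in the text.

```python
def slot_budget(line_rate_bps, cell_bits, n_ports):
    """Return (slot time, per-decision time) in nanoseconds.

    The per-decision time assumes a sequential scheduler that must make
    n_ports * n_ports decisions within one slot.
    """
    slot_ns = cell_bits / line_rate_bps * 1e9
    per_decision_ns = slot_ns / (n_ports * n_ports)
    return slot_ns, per_decision_ns


slot_ns, per_decision_ns = slot_budget(
    line_rate_bps=10e9,  # 10 Gbit/s line speed
    cell_bits=424,       # 53-byte ATM cell
    n_ports=16,
)
print(f"slot: {slot_ns:.1f} ns, per decision: {per_decision_ns:.3f} ns")
```

The slot time comes out at roughly 42 ns and the per-decision budget at roughly 0.16 ns, in line with the figures in the text.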
Each crossbar input port has an Input Port Scheduler Module (SM). Each SM has a distinct ID, SM-ID. In order to maintain scalability with the number of lines, a SM is allowed to communicate with a single immediate neighbor only. This ensures that the SM hardware block can be used with any N×N crossbar fabric. The SM communication chain is shown in FIG. 4. It is used to communicate scheduling information, such as time slot, slot ownership, and output port reserved. The only interaction between the crossbar module and the SMs is via a global clock, which tells every SM which slot is the Current system Time Slot (CTS), and via the current decision table, which holds the pairs of input/output ports to be switched at CTS (not shown in the figure). The decision table can be implemented as a global memory that is written by the schedulers and read by the crossbar fabric.
For every time slot, each SM is assumed to be free to choose which output port it requests access to. SMs that make the same choice generate what is hereafter called a collision, which needs to be resolved before a global scheduling pattern can be determined for a given slot. If a SM is to have current information about all other requests, the communication chain must operate at a speed N times faster than the speed of scheduling decisions. Namely, SMs would need to be able to receive N messages before making a single scheduling decision. In order to keep the speed of the SM hardware scalable with the line speed, an N look-ahead scheduling scheme may be employed. Namely, each SM makes a scheduling decision about a time slot that is at least N slots ahead of the current slot. This feature ensures that a SM knows about the scheduling decisions already made by others for the same slot before making its own scheduling decision. Moreover, this feature comes without the need to speed up the communication chain to N times the input line speed. As described above, RRGS has the features of distributed scheduling, pipeline scheduling, and N look-ahead scheduling.
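The look-ahead idea can be illustrated with a deliberately simplified visiting order. The order used below is a hypothetical straight pipeline chosen for this sketch; it is not the RRGS order of the cited report. The check confirms that each future slot is handed along the chain one SM per slot time, and that every decision is made at least N slots ahead, so each SM handles only one message per slot.

```python
def scheduled_slot(sm, now, n):
    """Future slot that SM `sm` (0-based) schedules at current slot `now`.

    Hypothetical order for illustration: SM 0 looks furthest ahead and the
    last SM in the chain looks exactly n slots ahead.
    """
    return now + n + (n - 1 - sm)


def check_pipeline(n, horizon):
    """Return (ok, count) over the future slots visited by all n SMs."""
    visits = {}  # future slot -> list of (current slot, sm) in visiting order
    for now in range(horizon):
        for sm in range(n):
            visits.setdefault(scheduled_slot(sm, now, n), []).append((now, sm))
    complete = {s: v for s, v in visits.items() if len(v) == n}
    ok = all(
        # the record travels along the chain in SM order ...
        [sm for _, sm in v] == list(range(n))
        # ... with exactly one hop per slot time ...
        and [t for t, _ in v] == list(range(v[0][0], v[0][0] + n))
        # ... and every decision is made at least n slots ahead.
        and all(s - t >= n for t, _ in v)
        for s, v in complete.items()
    )
    return ok, len(complete)


ok, fully_visited = check_pipeline(n=4, horizon=20)
print(ok, fully_visited)
```

Because each SM receives one record per slot time, the chain runs at the slot rate rather than N times faster.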
FIG. 5 is a time chart showing an example of RRGS scheduling employed in a 4×4 crossbar switch. More specifically, it shows the relationship between four SMs (SM1–SM4) and the future time slots T6, T7, . . . , at which each of SM1–SM4 reserves an output port for its own input.
For example, at time slot T5 of FIG. 5, SM1 performs the scheduling of future time slot T10, that is, it chooses an output port for transmission at future time slot T10, and SM3 performs the scheduling of future time slot T9. At time slot T6, which follows T5, SM1 performs the scheduling of future time slot T8, and so on.
In this way, each SM performs its scheduling and then transfers the resulting schedule to the next SM, ensuring that each SM receives from the previous SM, in time, the scheduling information about output ports that have already been reserved. Therefore, if a SM avoids choosing output ports that have already been picked by previous “visitors”, collisions can be completely avoided.
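The collision-avoidance rule described above can be sketched for a single future slot as follows. The selection policy (lowest-numbered free output) and all names are assumptions made for this illustration; the text only requires that a SM not pick an output already reserved by an earlier visitor.

```python
def greedy_reserve(backlog, visit_order):
    """Build the reservation table for one future slot.

    backlog[i] is the set of outputs for which input i has queued packets.
    visit_order is the order in which the SMs visit this slot.
    Returns a dict mapping each reserved output to the input that holds it.
    """
    reserved = {}
    for inp in visit_order:
        # Assumed policy: take the lowest-numbered backlogged output still free.
        for out in sorted(backlog.get(inp, ())):
            if out not in reserved:
                reserved[out] = inp
                break
    return reserved


backlog = {0: {0, 1}, 1: {0}, 2: {0, 2}, 3: set()}
first = greedy_reserve(backlog, [0, 1, 2, 3])
second = greedy_reserve(backlog, [1, 0, 2, 3])
print(first)   # input 1 finds its only output already taken
print(second)  # visiting input 1 first lets three inputs transmit
```

No output is ever reserved twice, so the resulting table is collision free. The outcome does depend on the visiting order, which is why the order in which SMs visit slots matters for service rates.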
According to RRGS, however, the sequence of time slots for a SM to pick output ports becomes complicated.
In FIG. 6, more specifically, the respective sequences of time slots for SM1–SM4 are shown, which are obtained by converting the time chart of FIG. 5 into a form suitable for a sequence of visits to time slots for each SM. For SM1, for instance, the sequence of time slots is T10, T8, T11, T9, . . . , which are not systematically arranged in time sequence or reverse time sequence. This causes the implementation and control of RRGS to become complicated.
Further, RRGS performs different scheduling operations depending on whether the number of SMs is even or odd (see, A. Smiljanic, R. Fan, G. Ramamurthy, “RRGS-Round-Robin Greedy Scheduling for Electronic/Optical Terabit Switches”, NEC C & C Research Laboratories, Technical Report TR 98-C063-4-5083-2, 1998). Therefore, each time a SM is added, the control must be changed, resulting in complicated implementation and control.
Furthermore, SMs are restricted to picking output ports which have not yet been chosen. Therefore, VOQ service rates become difficult to predict, and a serious fairness problem arises. Assume, in FIG. 4, that SMs #1 and #2 have their queues for a given output port constantly backlogged, while the other SMs have their corresponding queues empty. In this case, three out of four slots will be picked by SM #1 in FIG. 5, since it visits three out of the four slots prior to SM #2 in the sequence of visits defined in FIG. 5 (see, A. Smiljanic, R. Fan, G. Ramamurthy, “RRGS-Round-Robin Greedy Scheduling for Electronic/Optical Terabit Switches”, NEC C & C Research Laboratories, Technical Report TR 98-C063-4-5083-2, 1998).
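The three-out-of-four outcome follows directly from who visits each slot first. The sketch below counts, for a set of per-slot visiting orders, how often each contending SM reaches the slot before the other. The rotating orders used here are illustrative stand-ins and are not the actual sequence of FIG. 5; they merely share the property that SM #1 precedes SM #2 in three of four slots.

```python
def service_shares(visit_orders, contenders):
    """Fraction of slots won by each contender for one shared output port.

    visit_orders holds, per future slot, the order in which SMs visit it.
    Under greedy selection the first backlogged visitor takes the port.
    """
    wins = {sm: 0 for sm in contenders}
    for order in visit_orders:
        first = next(sm for sm in order if sm in contenders)
        wins[first] += 1
    return {sm: count / len(visit_orders) for sm, count in wins.items()}


# Hypothetical visiting orders for four consecutive future slots.
orders = [
    [1, 2, 3, 4],
    [4, 1, 2, 3],
    [3, 4, 1, 2],
    [2, 3, 4, 1],
]
shares = service_shares(orders, contenders={1, 2})
print(shares)
```

Although both SMs are equally backlogged, one of them receives three times the service of the other purely because of its position in the chain.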
As described above, although the RRGS scheduler can advantageously achieve high-speed scheduling, it has the disadvantages that its implementation and control become complicated and that predictable and adjustable service rates cannot be realized. As discussed above, there is also a problem of fairness, in which some of the VOQs are prevented from being scheduled because of the states of the other VOQs.