Switches are important components of Internet Protocol routers, optical routers, wireless routers, ATM and MPLS switches, computing systems and many other systems. Three basic types of switch architectures exist: the Input-Queued (IQ) switches, the Output-Queued (OQ) switches, and the Crosspoint Queued (XQ) switches. The Internet carries variable-size Internet Protocol (IP) packets which typically vary in size from 64 bytes up to a maximum of 1500 bytes. In synchronous Internet routers and switches employing fixed-sized cells, variable-size IP packets are reformatted into multiple fixed-sized cells which are stored in queues at the input side of the switch. These cells are scheduled for transmission through the switch by a scheduler, and are eventually switched to the output side where they may be stored in output queues. At the output side of the switch, the variable-size IP packets may be reconstructed from the fixed-sized cells, and scheduled for transmission to the next router.
OQ switches place all the cell buffers (queues) at the output side of the switch. In each time-slot, each input port of the switch may receive up to one cell. Each cell has a tag which identifies the desired destination output port. Each input port simply forwards any cell it may receive to the desired output port in every time-slot. In an OQ switch, each output port (OP) may receive up to N cells simultaneously from all N input ports in each time-slot. A speedup of O(N) is required at each output port, to move up to N cells simultaneously into the output queue at each output port in one time-slot. Speedup is typically implemented by adding extra wires to the output ports of the switch, and by running the queue memories at the output ports N times faster than the queue memories at the input ports. The speedup is costly, and is usually avoided in practical switches. OQ switches can achieve up to 100% throughput with very simple scheduling algorithms, but they require an output ‘speedup’ of O(N) which renders them impractical for large switches. OQ switches are described in a paper by M. Hluchyj, M. Karol and S. Morgan, entitled “Input Versus Output Queueing on a Space Division Switch”, IEEE Trans. Commun., vol. 35, 1987, which is hereby incorporated by reference.
In contrast, IQ switches place all the cell buffers at the input side of the switch. Each input port typically has N ‘Virtual Output Queues’ identified as VOQ(j,k), for 1<=j<=N and 1<=k<=N. An N×N IQ switch therefore has N-squared (N^2) VOQs. In each time-slot, each input port of the switch may receive up to one cell, which contains a tag which identifies the desired destination output port. At each input port, an arriving cell is moved into a VOQ associated with the desired output port. IQ switches typically are built with no extra speedup. IQ switches with no speedup operate under 2 constraints, called the Input Constraint and the Output Constraint. The input constraint requires that every input port transmits at most 1 cell per time-slot to the switch. The output constraint requires that every output port receives at most 1 cell per time-slot from the switch. These constraints make the scheduling of traffic through an IQ switch challenging. In each time-slot, a scheduler should find a set of up to N packets to transmit through the switch, which satisfies both the input and output constraints. A set of packets which satisfies these two constraints can be represented as a matching in a bipartite graph, or as a permutation matrix. A permutation matrix is defined herein as a matrix whose elements are only 0 or 1, where the sum of every row is <=1, and where the sum of every column is <=1. It has been shown in theory that IQ switches can achieve up to 100% throughput, but they require a complex scheduling algorithm to schedule the traffic through the switch subject to the input constraints and the output constraints. A paper by N. McKeown, A. Mekkittikul, V. Anantharam, J. Walrand, entitled “Achieving 100% Throughput in an Input-Queued Switch”, IEEE Transactions on Communications, Vol. 47, No. 8, August 1999, pp. 1260-1267, is hereby incorporated by reference. This paper proposes a complex scheduling algorithm to achieve 100% throughput in an IQ switch.
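The input and output constraints can be illustrated with a minimal Python sketch that tests whether a candidate set of transmissions forms a permutation matrix as defined above. The function name `is_conflict_free` and the example matrices are hypothetical and chosen only for illustration; the sketch checks a given matching and does not compute one.

```python
from typing import List


def is_conflict_free(match: List[List[int]]) -> bool:
    """Return True if `match` satisfies the input and output constraints.

    match[i][j] == 1 means input port i transmits one cell to output
    port j in this time-slot. The matrix is accepted when it is square,
    all elements are 0 or 1, every row sum is <= 1 (input constraint)
    and every column sum is <= 1 (output constraint).
    """
    n = len(match)
    if any(len(row) != n for row in match):
        return False
    if any(x not in (0, 1) for row in match for x in row):
        return False
    rows_ok = all(sum(row) <= 1 for row in match)
    cols_ok = all(sum(match[i][j] for i in range(n)) <= 1 for j in range(n))
    return rows_ok and cols_ok


# Inputs 0 and 1 send to distinct outputs; input 2 is idle.
valid = [[0, 1, 0],
         [1, 0, 0],
         [0, 0, 0]]

# Inputs 0 and 1 both send to output 1, violating the output constraint.
conflict = [[0, 1, 0],
            [0, 1, 0],
            [0, 0, 1]]

print(is_conflict_free(valid))     # True
print(is_conflict_free(conflict))  # False
```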
Scheduling for IQ switches is known to be a difficult problem. The selection of a conflict-free set of up to N cells to transfer per time-slot is equivalent to finding a matching in a bipartite graph. Assuming a 40 Gbps link rate with 64-byte cells, the duration of a time-slot is 12.8 nanoseconds. Therefore, a scheduler for an IQ switch with 40 Gbps links computes a new bipartite graph matching every 12.8 nanoseconds. As Internet link rates increase to 160 or 640 Gbps, the time-slot duration would decrease to 3.2 and 0.8 nanoseconds respectively. The best known algorithms for computing a bipartite graph matching require O(N^2.6) or O(N^3) time which renders them too complex for use in Internet routers. Therefore, existing schedulers for IQ switches typically use heuristic or sub-optimal algorithms. Heuristic algorithms cannot achieve 100% throughput and cannot typically provide adequate bounds or guarantees on the performance and Quality of Service (QoS) of the switch.
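The time-slot durations quoted above follow from dividing the cell size in bits by the link rate. A short Python check of that arithmetic is given here; the helper name `time_slot_ns` is hypothetical, and the calculation assumes the whole 64-byte cell is serialized at the stated link rate with no additional framing overhead.

```python
def time_slot_ns(cell_bytes: int, link_gbps: float) -> float:
    """Time to transmit one cell, in nanoseconds.

    bits / (Gbit/s) gives nanoseconds directly, since 1 Gbit/s is
    1 bit per nanosecond.
    """
    return cell_bytes * 8 / link_gbps


for rate in (40, 160, 640):
    print(f"{rate} Gbps -> {time_slot_ns(64, rate):.1f} ns")
# 40 Gbps -> 12.8 ns
# 160 Gbps -> 3.2 ns
# 640 Gbps -> 0.8 ns
```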
Recently, an algorithm for scheduling traffic in IQ switches which can achieve 100% throughput while providing guarantees on the rate, delay, jitter and service lag was described in a US patent application by T. H. Szymanski, entitled ‘Method and Apparatus to Schedule Traffic Through a Crossbar Switch with Delay Guarantees’, application Ser. No. 11/802,937, Pub. No. US 2007/0280261 A1, which is hereby incorporated by reference. The document describes a recursive and fair method to decompose an N×N traffic rate matrix R, which describes the traffic requirements to be realized in an IQ switch in a scheduling frame of length F time-slots. Each matrix element R(i,j) equals the requested number of connections between input port i and output port j, in the scheduling frame. An admissible traffic rate matrix is defined as a traffic rate matrix which does not overload the input ports or the output ports of the switch. Such a matrix has non-negative elements where the sum of every row is <=F and where the sum of every column is <=F. The algorithm described in the patent application Ser. No. 11/802,937 will process an admissible traffic rate matrix and compute F bipartite graph matchings which are guaranteed to realize the traffic requirements in the traffic rate matrix. The method schedules N-squared traffic flows through an N×N IQ switch with guarantees on the performance and QoS. The algorithm has a computational complexity of O(NFlogNF) time to compute the F bipartite graph matchings for a scheduling frame, which is considerably more efficient than previously proposed scheduling algorithms for IQ switches. The algorithm eliminates all conflicts at the input ports and output ports of an IQ switch, by decomposing an N×N traffic rate matrix which reflects the coupled dependencies between the IO ports of the switch, in a recursive and fair manner.
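The admissibility condition on a traffic rate matrix can be sketched in Python as follows. This is only a check of the row and column sums against the frame length F; it is not the recursive decomposition of the referenced application. The function name `is_admissible` and the example matrices are hypothetical, and the sketch assumes the matrix elements are non-negative integers.

```python
from typing import List


def is_admissible(R: List[List[int]], F: int) -> bool:
    """Return True if traffic rate matrix R is admissible for frame length F.

    R[i][j] is the requested number of cell transmissions from input
    port i to output port j within a scheduling frame of F time-slots.
    Assumption: elements are non-negative. The matrix is admissible when
    no row sum (input port load) and no column sum (output port load)
    exceeds F.
    """
    n = len(R)
    if any(len(row) != n for row in R):
        return False
    if any(x < 0 for row in R for x in row):
        return False
    rows_ok = all(sum(row) <= F for row in R)
    cols_ok = all(sum(R[i][j] for i in range(n)) <= F for j in range(n))
    return rows_ok and cols_ok


# Every row and column sums to exactly F = 8: fully loaded but admissible.
R_ok = [[4, 4, 0],
        [2, 2, 4],
        [2, 2, 4]]

# Input port 0 requests 9 transmissions in an 8-slot frame: overloaded.
R_bad = [[8, 1, 0],
         [0, 4, 4],
         [0, 3, 4]]

print(is_admissible(R_ok, 8))   # True
print(is_admissible(R_bad, 8))  # False
```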
The challenges of IQ switches and OQ switches have led to research on combined switches. Combined Input and Output Queued switches, denoted CIOQ switches, can achieve 100% throughput typically with a speedup of 2 or 4, but they also require complex scheduling algorithms which are considered too complex for Internet routers. A paper by H. Lee and S. W. Seo, entitled “Matching Output Queueing with a Multiple Input/Output-Queued Switch”, IEEE Transactions on Networking, Vol. 14, No. 1, February 2006, pp. 121-131, describes CIOQ switches and is hereby incorporated by reference. The paper describes a CIOQ switch which requires a speedup of 2 and which can exactly emulate the performance of an OQ switch. More recently, the research community has been exploring Combined Input and Crosspoint Queued switches, denoted CIXQ switches. CIXQ switches contain queues at the input ports and at each crosspoint of the switching matrix. They may contain reassembly queues at the output ports, but these are inherent in most switches. A CIXQ switch contains N-squared (denoted N^2) VOQs at the input side, and N-squared crosspoint queues (XQs) at the crosspoints of the switching matrix. In principle these switches can achieve up to 100% throughput, but they also require efficient scheduling algorithms.
The scheduling of traffic in a CIXQ switch is simplified relative to scheduling for an IQ switch, since the input and output ports are decoupled in the CIXQ switch. Scheduling consists of 2 independent processes. In step 1, cells are scheduled for transmission from the VOQs at the input side of the switch, into the XQs of the switching matrix. There is a one-to-one correspondence between the N-squared VOQs at the input side of the switch, and the N-squared XQs within the switching matrix. In step 2, cells are scheduled from the XQs of the switching matrix to the output ports of the switch. Once cells arrive at the output ports, the variable-size IP packets may be reconstructed at the output queues (if necessary) and transmitted to the next router towards the destination. The scheduling is simplified since the addition of the N^2 XQs in the switching matrix makes the scheduling of the input and output ports decoupled and independent. The input constraints and output constraints associated with an IQ switch do not need to be simultaneously satisfied by the N cells which are transmitted into the CIXQ switch in each time-slot. In principle, to achieve 100% throughput in a CIXQ switch, in each time-slot each input port can transmit to any non-full XQ, and each output port can receive from any non-empty XQ. Several prior papers present scheduling algorithms for CIXQ switches which examine the states of the N^2 VOQs and the N^2 XQs and make instantaneous scheduling decisions based upon the instantaneous states of the VOQs and/or the XQs. One such scheduling algorithm for buffered crossbar switches is described in the U.S. Patent Application by H. J. Chao et al, “Low Complexity Scheduling Algorithm for a Buffered Crossbar Switch with 100% Throughput”, U.S. patent application Ser. No. 11/967,125, Pub. No. 2008/015259 A1, which is hereby incorporated by reference.
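The decoupling of the two scheduling steps can be illustrated with a toy Python model of a CIXQ switch. This is a sketch under stated assumptions, not any of the scheduling algorithms cited in this document: the class `CIXQSwitch`, its methods, and the lowest-index-first selection rule are hypothetical, and the model drains the XQs before filling them so that each cell spends at least one time-slot in a crosspoint queue.

```python
from collections import deque


class CIXQSwitch:
    """Toy N x N CIXQ switch with N^2 VOQs and N^2 crosspoint queues."""

    def __init__(self, n: int, xq_capacity: int = 1):
        self.n = n
        self.cap = xq_capacity
        # voq[i][j]: cells waiting at input i for output j.
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]
        # xq[i][j]: crosspoint queue between input i and output j.
        self.xq = [[deque() for _ in range(n)] for _ in range(n)]

    def enqueue(self, i: int, j: int, cell) -> None:
        """Place an arriving cell into VOQ(i, j)."""
        self.voq[i][j].append(cell)

    def step(self):
        """Advance one time-slot; return the (output, cell) pairs delivered."""
        delivered = []
        # Step 2: each output port receives from one non-empty XQ in its
        # column. Only the column is examined, not the input ports.
        for j in range(self.n):
            for i in range(self.n):
                if self.xq[i][j]:
                    delivered.append((j, self.xq[i][j].popleft()))
                    break
        # Step 1: each input port moves one cell from a non-empty VOQ into
        # its XQ, provided that XQ is not full. Only the row is examined.
        for i in range(self.n):
            for j in range(self.n):
                if self.voq[i][j] and len(self.xq[i][j]) < self.cap:
                    self.xq[i][j].append(self.voq[i][j].popleft())
                    break
        return delivered


# Both inputs hold a cell for output 1. In an IQ switch only one of them
# could transmit in a time-slot; here both enter their XQs in slot 1.
sw = CIXQSwitch(2)
sw.enqueue(0, 1, "a")
sw.enqueue(1, 1, "b")
print(sw.step())  # []
print(sw.step())  # [(1, 'a')]
print(sw.step())  # [(1, 'b')]
```

In this example the output constraint is still respected, since output 1 receives one cell per time-slot, but the two input ports never need to coordinate with each other.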
The throughput of an N×M switch is defined as the average number of cells transmitted from the IPs per time-slot, or received at the OPs per time-slot, assuming no cells are dropped within the CIXQ switch. An ideal N×N CIXQ switch will maintain a sustained transmission rate of N cells per time-slot, equivalent to 100% throughput, provided the traffic demands through the switch do not violate the IP or OP capacity constraints. A sub-optimal scheduling algorithm for a CIXQ switch with XQs of finite size will occasionally find that (1) an IP cannot transmit a cell because all XQs in the row are full, and (2) an OP cannot receive a cell because all XQs in the column are empty.
The throughput efficiency of a CIXQ switch with a sub-optimal scheduling algorithm may be improved by making the XQs larger, for example increasing the XQ capacity to 4 or 8 cells per crosspoint. However, a major problem with this approach is cost. Increasing the capacity of each of the N-squared XQs in the switching matrix to 4 or 8 cells would result in a significant increase in hardware cost, compared to a switch with 1 cell buffer per crosspoint. A 64×64 switch with an XQ capacity of 1 cell will require 4K cell buffers in the switching matrix. A 64×64 switch with an XQ capacity of 4 cells will require 16K cell buffers in the switching matrix. The larger XQs will result in significant increases in the VLSI area of the switch and the cost of the switch. They will also result in (a) a larger number of cells queued within each switch on average, (b) larger average delays for cells traversing the switch, (c) larger delay jitter for cells traversing the switch, and (d) a larger service lag for traffic traversing the switch.
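The buffer counts quoted above are simply N-squared times the XQ capacity. A brief Python check, with the hypothetical helper name `crosspoint_cell_buffers`:

```python
def crosspoint_cell_buffers(n: int, xq_capacity: int) -> int:
    """Total cell buffers in the switching matrix of an n x n CIXQ switch."""
    return n * n * xq_capacity


for capacity in (1, 4, 8):
    total = crosspoint_cell_buffers(64, capacity)
    print(f"64x64, {capacity} cell(s) per XQ: {total} buffers ({total // 1024}K)")
# 64x64, 1 cell(s) per XQ: 4096 buffers (4K)
# 64x64, 4 cell(s) per XQ: 16384 buffers (16K)
# 64x64, 8 cell(s) per XQ: 32768 buffers (32K)
```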
Several prior papers describe dynamic scheduling algorithms wherein input ports make scheduling decisions based upon the instantaneous states of VOQs and/or XQs. However, this approach is impractical for large routers or switches. In a large router, the IO ports and the switching matrix may be physically separated by distances of 10-100 feet. The design of a large buffered crossbar switch with a capacity of 4 Terabits per second by IBM (hereafter called the IBM switch) is described in the paper by F. Abel et al, “A Four-Terabit Packet Switch Supporting Long Round-Trip Times”, IEEE Micro, Vol. 23, No. 1, January/February 2003, pp. 10-24, which is hereby incorporated by reference. This paper discusses the packaging of large switches and the impact of the large Round-Trip-Time (RTT) on the transmission lines associated with a large switch.
Electronic cables or short-distance parallel optical fiber ribbons are typically used to realize the transmission lines which interconnect the Input/Output Ports and the switching matrix. In the 4 Tbps IBM switch, the cables between the line-cards and switching matrix cards could be several hundred feet long. It can take up to 64 time-slots for a cell of data to traverse the cables from the IO ports to the switching matrix and vice versa. Therefore, any dynamic scheduling algorithm where an IO port makes instantaneous scheduling decisions based upon the instantaneous states of the VOQs and/or XQs is impractical, as any information at an IP or OP on the states of the XQs can be many time-slots old and rendered useless, due to the large round-trip-time.
The design of a large buffered crossbar switch in CMOS VLSI is described in the paper by D. Simos, I. Papaefstathiou and M. G. H. Katevenis, “Building an FOC Using Large, Buffered Crossbar Cores”, IEEE Design & Test of Computers, November/December 2008, pp. 538-548, which is hereby incorporated by reference. This switch uses credit-based dynamic schedulers, where buffer overflow in the switch is reduced by having queues transmit ‘credits’ to traffic sources. The credit schedulers and output schedulers operate in a round-robin order. This paper indicates that buffer overflow is a problem in CIXQ switches, due to the limited sizes of the XQs. This paper also indicates that a basic IQ switching matrix will require much smaller silicon VLSI area than a CIXQ switching matrix. The XQs in the CIXQ switch occupy the majority of the VLSI area in a CIXQ switch. It is well known that the final cost of a silicon CMOS chip is some exponential power of its VLSI area.
Ideally, an optimal scheduling algorithm for a CIXQ switch would achieve 5 requirements simultaneously: (1) it would sustain up to 100% throughput given any admissible traffic pattern, (2) it would minimize the amount of queueing in the IO ports and in the XQs in the switching matrix, (3) it would not make instantaneous scheduling decisions based upon the instantaneous states of the VOQs or XQs in the switching matrix, (4) it would have acceptable computational complexity, and (5) it would provide guarantees on the delay, jitter and QoS for all traffic traversing the switch. An optimal scheduling algorithm for a CIXQ switch would require small XQs with a capacity of approximately 1 cell buffer per XQ. To date, no distributed scheduling algorithm for a CIXQ switch has been proposed in the literature which can achieve essentially 100% throughput and provide QoS guarantees while requiring XQ sizes of approximately 1 cell buffer per crosspoint. The IQ switch scheduling algorithm described in the US patent application Pub. No. US 2007/0280261 A1 by T. H. Szymanski referenced earlier can be used to schedule traffic in a CIXQ switch while requiring XQs with a maximum capacity of 1 cell buffer per crosspoint. While that algorithm is very efficient, it schedules N-squared traffic flows through an input-queued N×N switch, and it recursively decomposes and schedules an N×N traffic rate matrix in a centralized processor, due to the coupling of the input and output ports. For a CIXQ switch where the input and output ports are decoupled, it is desirable to find a simpler scheduling algorithm. In this application, a new scheduling algorithm and new designs of the CIXQ switch are presented to achieve the above goals.
One scheduling algorithm for CIXQ switches is described in the paper “On Guaranteed Smooth Scheduling for Buffered Crossbar Switches”, by S M He, S T Sun, H T Guan, Q Zheng, Y J Zhao and W Gao, in the IEEE Transactions on Networking, Vol. 16, No. 3, June 2008, pp. 718-731, which is hereby incorporated by reference. This paper describes a scheduling algorithm called ‘sMUX’ to schedule the traffic on the N input ports and the N output ports of a CIXQ switch. However, the paper has several significant technical difficulties, which are summarized below.
(1) The iterative sMUX scheduling algorithm is identical to the well-known iterative ‘Generalized Processor Sharing-Weighted Fair Queueing’ (GPS-WFQ) scheduling algorithm, when the GPS algorithm is adapted for the situation of fixed-sized cells with guaranteed traffic rates. The well-known GPS-WFQ algorithms are currently used in the Internet to provide fairness guarantees to traffic flows passing through an outgoing link or transmission line. The GPS-WFQ algorithms were developed by Parekh in his PhD thesis at MIT, and described in the paper by A. K. Parekh and R. G. Gallager, entitled “A Generalized Processor Sharing Approach to Flow Control in Integrated Service Networks: The Single Node Case”, IEEE/ACM Trans. Networking, vol. 1, pp. 344-357, 1993, which is incorporated by reference. A second paper by the same authors entitled “A Generalized Processor Sharing Approach to Flow Control in Integrated Service Networks: The Multiple Node Case”, IEEE/ACM Trans. Networking, vol. 2, no. 2, pp. 137-150, 1994, is incorporated by reference.
(2) They present a theorem that a CIXQ switch can achieve essentially 100% throughput, while guaranteeing that each XQ has a capacity of 2 cells per crosspoint. The theorem assumes that a bounded delay jitter implies a bounded queue size. Our own simulations of their scheduling algorithm indicate that for large switches (i.e., 64×64 switches) the XQs should have a capacity of approximately 5 or 6 cells per crosspoint queue to achieve essentially 100% throughput, when using the proposed scheduling algorithm.
Several prior papers also advocate the use of variable-size packets in CIXQ switches. IP packets typically vary in size from 64 bytes up to a maximum of 1500 bytes. The typical maximum IP packet size of 1500 bytes is equivalent to about 24 fixed-sized cells of 64 bytes each. The first problem with this approach is memory cost. In CIXQ switches supporting variable-size packets, each XQ should contain enough memory to buffer the largest size IP packet, i.e., 24 cells. Therefore, the amount of memory required in a CIXQ switch with variable-size IP packets is at least 24 times that of a CIXQ switch with a single cell buffer per crosspoint. The second problem is the increase in jitter and service lag when variable-size IP packets traverse the switch. A large packet reserves an IP or an OP (i.e., an IO port) for its entire duration, which increases the delay jitter and service lag experienced by all other packets contending for the same IO ports. In this document, we will primarily focus on synchronous CIXQ switches with fixed-sized cells, although our scheduling algorithm and switch designs also apply to variable-size IP packets and switches supporting variable-size IP packets.
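The figure of about 24 cells for a maximum-size packet can be checked with a short Python calculation. The helper name `cells_per_packet` is hypothetical, and the calculation assumes the full 64 bytes of each cell carry packet data, with no per-cell header overhead.

```python
import math


def cells_per_packet(packet_bytes: int, cell_bytes: int = 64) -> int:
    """Number of fixed-sized cells needed to carry one packet.

    Assumption: every byte of a cell carries packet data, so the count
    is the packet size divided by the cell size, rounded up.
    """
    return math.ceil(packet_bytes / cell_bytes)


print(cells_per_packet(64))    # 1
print(cells_per_packet(1500))  # 24  (1500 / 64 = 23.4375, rounded up)
```

Under the same assumption, an XQ sized for one maximum-size packet holds 24 cells, which gives the factor of 24 in crosspoint memory relative to a single cell buffer per crosspoint.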