Input-buffered cell switches and packet routers potentially offer the highest bandwidth of any switch for a given fabric and memory technology, but such devices require scheduling algorithms to resolve input and output contention. Two approaches to packet or cell scheduling exist (see, for example, A Hung et al, “ATM input-buffered switches with the guaranteed-rate property,” Proc. IEEE ISCC '98, Athens, July 1998, pp 331-335). The first approach applies at the connection level, where bandwidth guarantees are required. A suitable algorithm must satisfy two conditions: firstly, it must ensure that none of the input ports or output ports is overbooked, and secondly, it must solve the fabric arbitration problem by allocating all the requests to time slots in the frame.
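The first of these two conditions can be illustrated with a minimal sketch. It assumes, purely for illustration, that the connection-level requests are held in an N×N matrix whose entry (i, j) is the number of time slots per frame requested from input port i to output port j; the function name and the example values are hypothetical.

```python
def is_overbooked(requests, frame_slots):
    """Return True if any input or output port is asked for more
    time slots than one frame contains."""
    n = len(requests)
    # Slots requested from each input port (row sums).
    input_load = [sum(row) for row in requests]
    # Slots requested towards each output port (column sums).
    output_load = [sum(requests[i][j] for i in range(n)) for j in range(n)]
    return any(load > frame_slots for load in input_load + output_load)


# Hypothetical 3x3 switch; requests[i][j] = slots from input i to output j.
requests = [
    [2, 1, 1],
    [1, 2, 0],
    [1, 1, 2],
]

print(is_overbooked(requests, 4))  # every port needs at most 4 slots
print(is_overbooked(requests, 3))  # a frame of 3 slots is too short
```

Passing this check only establishes the first condition; assigning each request to a specific time slot is the separate fabric arbitration problem.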
Fabric arbitration has to date been proposed by means of the Slepian-Duguid approach and Paull's theorem for rearrangeably non-blocking, circuit-switched Clos networks (see Chapter 3, J Y Hui, Switching and traffic theory for integrated broadband networks, Kluwer Academic Press, 1990). This connection-level algorithm can be summarised as firstly ensuring no overbooking and secondly performing fabric arbitration by means of circuit-switching path-search algorithms. It has been assumed that this algorithmic approach could only be applied at the connection level, because of its large computational complexity. For this reason, proposals for scheduling of connectionless, best-efforts packets or cells employ various matching algorithms, many related to the “marriage” problem (see D Gale and L S Shapley, “College admissions and the stability of marriage,” American Mathematical Monthly, 69, 9-15 (1962) and D Gusfield and R W Irving, The Stable Marriage Problem: Structure and Algorithms, MIT Press, 1989). In these algorithms the input-output connections for each time slot or phase of the switch are handled independently, i.e. a frame of time slots (and hence phases) is not employed. Such algorithms choose a set of conflict-free connections between inputs and outputs for each time slot and are based on maximum size and maximum weight bipartite graph matching algorithms. Although they can achieve 100% throughput (N McKeown et al, “Achieving 100% throughput in an input-queued switch,” Proc. IEEE Infocom '96, March 1996, vol. 3, pp. 296-302), they are also impractically slow, requiring running times of complexity O(N³ log N) for every time slot (R E Tarjan, “Data structures and network algorithms,” Society for Industrial and Applied Mathematics, Pennsylvania, November 1983).
Iterative, heuristic, parallel algorithms such as iSLIP are known, which reduce the computing complexity (i.e. the time required to compute a solution) for best-efforts packets or cells (N McKeown et al, “The Tiny Tera: a packet switch core,” IEEE Micro January/February 1997, pp 26-33). The iSLIP algorithm is guaranteed to converge in at most N iterations, and simulations suggest that on average it converges in fewer than log₂ N iterations. Since no guarantees are needed, this and similar algorithms currently represent the preferred scheduling technique for connectionless data at the cell level in input-buffered cell switches and packet routers with large numbers of ports (e.g. N≧10). The iSLIP algorithm is applied to the Tiny Tera packet switch core, which employs Virtual Output Queueing (VOQ), in which each input port has a separate FIFO (First In, First Out) queue for each output, i.e. N² FIFOs for an N×N switch. If we assume that each FIFO queue stores at least a number of cells equal to the average cell latency L, and that each cell is a 53-byte (424-bit) ATM cell, then the total input FIFO queue hardware count is O(424LN²). With each element capable of clocking out 424f bits per frame, this gives a complexity product of O(424²fLN²), which is very large. Fortunately, by employing a single queue in the form of RAM in each port, acting as N virtual queues, the hardware count can be reduced to O(424LN), and with parallel readout reducing the number of steps per frame to just f, the overall complexity product can be reduced to O(424fLN). Table 1 gives the hardware and “computing” steps for these queues to provide f cells within a frame.
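The difference between the two queue organisations can be made concrete with a short calculation. The sketch below treats the order-of-magnitude expressions above as exact counts, which is an assumption, and the values chosen for L and f are hypothetical; only N = 32 (the Tiny Tera size) comes from the text.

```python
CELL_BITS = 53 * 8  # one 53-byte ATM cell is 424 bits


def separate_fifos(n, latency_cells, frame_slots):
    """N*N separate FIFOs, each clocking out 424*f bits per frame."""
    hardware = CELL_BITS * latency_cells * n ** 2
    bits_per_frame = CELL_BITS * frame_slots
    return hardware, hardware * bits_per_frame


def shared_ram(n, latency_cells, frame_slots):
    """One RAM per port acting as N virtual queues, read out in f steps."""
    hardware = CELL_BITS * latency_cells * n
    return hardware, hardware * frame_slots


# N = 32 as in the Tiny Tera; L = 10 cells and f = 16 slots are assumed values.
n, latency_cells, frame_slots = 32, 10, 16
sep_hw, sep_prod = separate_fifos(n, latency_cells, frame_slots)
ram_hw, ram_prod = shared_ram(n, latency_cells, frame_slots)

print("separate FIFOs:", sep_hw, sep_prod)
print("shared RAM:    ", ram_hw, ram_prod)
print("product ratio: ", sep_prod // ram_prod)  # equals 424 * N
```

Under these assumptions the ratio of the two complexity products is 424N, independent of L and f.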
For unicast packets the iSLIP algorithm converges in at most N iterations, where N is the number of input and output ports. On average the algorithm converges in fewer than log₂ N iterations. The physical hardware implementation employs N round-robin grant arbiters for the output ports and N identical accept arbiters for the input ports. Each arbiter has N input ports and N output ports, making N² links altogether. The total amount of hardware depends on the precise construction of the round-robin arbiters. N McKeown et al, op cit, employ a priority encoder to identify the request from the port nearest to a pre-determined highest-priority port (see FIG. 1). The priority encoder reduces the number of links to log₂ N parallel links, in order to change the pointer if required. The log₂ N parallel links are then expanded back up to N links through a decoder. Details of the hardware complexity of the arbiters are given in N McKeown, Scheduling Algorithms for Input-Queued Cell Switches, PhD Thesis, University of California, Berkeley, 1995. The growth rate for the complete scheduler is O(N⁴), each arbiter being O(N³). For a 32×32 cell switch (which is the size of the Tiny Tera switch), 421,408 2-input gates are required. This may be quite acceptable for such a small switch, but the O(N⁴) growth rate is extremely large.
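The request, grant and accept phases carried out by the two sets of round-robin arbiters can be modelled in software as follows. This is only a sketch: it assumes the pointer rule of the published iSLIP descriptions (a pointer moves one position beyond the matched port, and only for matches made in the first iteration), and the function names and the 3×3 example are hypothetical rather than taken from the cited hardware.

```python
def round_robin_pick(candidates, pointer, n):
    """Return the first port at or after `pointer` (wrapping round)
    that is in `candidates`, or None if there is no candidate."""
    for offset in range(n):
        port = (pointer + offset) % n
        if port in candidates:
            return port
    return None


def islip(voq_occupied, grant_ptr, accept_ptr):
    """Compute one time slot's matching; voq_occupied[i][j] is True
    when input i holds a cell for output j. Pointers are updated in place."""
    n = len(voq_occupied)
    match_in, match_out = {}, {}  # input -> output, output -> input
    for iteration in range(n):  # at most N iterations are needed
        grants = {}  # input -> set of outputs granting to it
        for out in range(n):
            if out in match_out:
                continue
            # Request phase: unmatched inputs with a cell for this output.
            requesters = {i for i in range(n)
                          if i not in match_in and voq_occupied[i][out]}
            # Grant phase: the output arbiter picks one requester.
            chosen = round_robin_pick(requesters, grant_ptr[out], n)
            if chosen is not None:
                grants.setdefault(chosen, set()).add(out)
        if not grants:
            break  # no new grants: the matching has converged
        # Accept phase: each input arbiter accepts one grant.
        for inp, offered in grants.items():
            out = round_robin_pick(offered, accept_ptr[inp], n)
            match_in[inp], match_out[out] = out, inp
            if iteration == 0:  # assumed pointer-update rule
                grant_ptr[out] = (inp + 1) % n
                accept_ptr[inp] = (out + 1) % n
    return match_in


voq_occupied = [
    [True, True, False],
    [True, False, True],
    [False, True, True],
]
grant_ptr, accept_ptr = [0, 0, 0], [0, 0, 0]
matching = islip(voq_occupied, grant_ptr, accept_ptr)
print(matching)
```

In this example the first iteration matches input 0 to output 0 and input 1 to output 2, and the second iteration adds input 2 to output 1, so the matching is complete after two iterations.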
In order to minimise the overall hardware and computing complexity, the best structure for constructing the encoder is a binary tree, which requires O(2N) elements (for large N) and only log₂ N steps per iteration, whilst the decoder needs only O(N) elements. Pipelining cannot be employed to reduce this to one step per iteration, because the pointers cannot be updated until the single-bit requests have passed through both sets of arbiters to the decision register. The total hardware and computing complexities are given below in Table 1. The hardware complexity now grows as O(N²) rather than O(N⁴), due to the binary tree encoder and decoder.
TABLE 1
Hardware and computing complexities of the iSLIP algorithm for scheduling f packets per port in a frame of f time slots.

| | Hardware Count | Computing Steps per Frame | Hardware × Computing Complexity Product |
| --- | --- | --- | --- |
| Input RAM queues | 424LN | f | 424fLN |
| Average Convergence | O(6N²) | O(4f log₂ N(1 + log₂ N)) | O(24fN² log₂ N(1 + log₂ N)) |
| Guaranteed Convergence | O(6N²) | O(4fN(1 + log₂ N)) | O(24fN³(1 + log₂ N)) |

The overall hardware × computing complexity product O(24fN³(1 + log₂ N)) of the iSLIP algorithm for scheduling f packets per port would be no less than that of the maximum size and weight matching algorithms of N McKeown et al, “Achieving 100% throughput in an input-queued switch,” Proc. IEEE Infocom '96, March 1996, vol. 3, pp. 296-302, if convergence must be guaranteed. There is a reduction to O(24fN² log₂ N(1 + log₂ N)) for the average number of computing steps. The major benefit of the iSLIP algorithm is its parallel nature, which allows the number of computing steps to be traded against hardware complexity, thus reducing computing times by a factor N² at the expense of increasing the hardware by the same factor. It is interesting to note that hardware quantities for the input RAM queues far exceed those needed for the scheduling electronics.
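The scheduler rows of Table 1 can be evaluated numerically to see how the guaranteed and average cases compare. The sketch below reads the O(·) entries as exact expressions, which is an assumption, and uses a hypothetical frame length f = 16 with N = 32; the function name is illustrative.

```python
import math


def islip_scheduler(n, frame_slots, guaranteed):
    """Return (hardware count, steps per frame, product) for the
    scheduler rows of Table 1, read as exact expressions."""
    log_n = math.log2(n)
    hardware = 6 * n ** 2
    # N iterations if convergence must be guaranteed, log2 N on average.
    iterations = n if guaranteed else log_n
    steps = 4 * frame_slots * iterations * (1 + log_n)
    return hardware, steps, hardware * steps


n, frame_slots = 32, 16  # f = 16 is an assumed value
avg_hw, avg_steps, avg_prod = islip_scheduler(n, frame_slots, guaranteed=False)
gua_hw, gua_steps, gua_prod = islip_scheduler(n, frame_slots, guaranteed=True)

print("average:   ", avg_hw, avg_steps, avg_prod)
print("guaranteed:", gua_hw, gua_steps, gua_prod)
print("ratio of products:", gua_prod / avg_prod)  # N / log2 N
```

With the hardware count identical in both rows, the two products differ only by the factor N / log₂ N in the number of iterations.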