1. Field of the Invention
The present invention relates generally to a cell based switch fabric. More specifically, a system and method for a cell based wrapped wave front arbiter (WWFA) with bandwidth reservation are disclosed.
2. Description of Related Art
In a network, various network stations such as computers are connected, for example by a central network communication hub or a network communication switch. A network hub is simply a repeater that receives a data packet from any one network station and broadcasts or repeats the data packet to all the network stations to which it is connected. Each data packet contains a header to indicate the intended destination station for the packet and each network station receives the packet and examines the packet header to determine whether to accept or ignore the packet. One disadvantage of the hub is that the connection between each network station and the hub carries all packets broadcast by the hub, including not only packets directed to that network station but also packets directed to all other network stations.
In contrast, a network switch routes an incoming packet only to the intended destination station of the packet, so that each network station receives only the packet traffic directed to it. Network switches are therefore capable of handling multiple packet transmissions concurrently. A network switch generally has input and output ports for receiving and transmitting packets from and to network stations and a switching mechanism to selectively route each incoming packet from an input port to the appropriate output port. The input port typically stores an incoming packet, determines the destination output port from the routing data included in the packet header, and then arbitrates for a switch connection between the input port and the destination output port. When the connection is established, the input port sends the packet to the output port via the switch.
A key component of interconnection networks is n×n switches. Interconnection networks that provide high bandwidth, low latency interprocessor communication enable multiprocessors and multicomputers to achieve high performance and efficiency. It is important for the n×n communication switches to have efficient internal architecture to achieve high performance communication.
A typical communication switch includes n input ports, n output ports, an n×n crossbar switching mechanism, and buffer memory. The communication switch receives packets arriving at its input ports and routes them to the appropriate output ports. The bandwidth of the input ports is typically equal to the bandwidth of the output ports.
However, there may be conflicting demands for resources such as buffer space or output ports resulting in delays in the traffic through switches. For example, when two packets destined for the same output port arrive at the input ports of the switch simultaneously, the packets cannot both be forwarded and at least one is buffered.
Because input ports may receive competing connection requests, a network communication switch typically provides an arbitration system that determines the order in which requests are granted. The arbitration system resolves conflicting resource demands and schedules the resources efficiently and fairly in order to increase performance in interconnection networks.
The arbitration system may include an arbiter that receives connection requests from the input ports, monitors the busy status of the output ports, and determines the order in which pending requests are granted when an output port becomes idle. When a request is granted by the arbiter, the arbiter transmits an acknowledgment to the corresponding input port, and may transmit control data to the switching mechanism to cause the switch to make the desired connection between the input and output ports. Upon receiving the acknowledgment, the input port transmits the data to the output port via the switching mechanism. The central arbiter typically assigns a priority level to each input and/or output port. The arbiter may, for example, rotate input and output port priority in order to fairly distribute connection rights over time.
An arbiter may employ a crossbar switching mechanism to allow arbitrary permutations in connecting the input buffers and/or inputs to output ports. An n×n crossbar has n horizontal buses (rows), each connected to an input buffer, and n vertical buses (columns), each connected to an output port. A horizontal bus intersects a vertical bus at a crosspoint where there is a switch that can be closed to form a connection between the corresponding input buffer and output port.
The design and implementation of various crossbar arbiters are discussed in Tamir and Chi, “Symmetric Crossbar Arbiters for VLSI Communication Switches,” IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 1, 1993, pp. 13-27, the entirety of which is incorporated by reference herein. Tamir discloses a centralized wave front arbiter for an n×n crosspoint switch routing data between n network stations. The arbiter includes an n×n array of arbitration units (which Tamir refers to as cells), one for each possible connection of the crosspoint switch. Each input port corresponds to one row of arbitration units and supplies a separate request signal to each arbitration unit of the row. Each output port corresponds to one column of arbitration units and supplies a separate busy signal to each arbitration unit of the column. The arbitration units are ranked according to priority. When an input port seeks a connection to an output port, it asserts the one of its n request signals that corresponds to that output port. The asserted request signal drives the arbitration unit in the column corresponding to the output port. That arbitration unit grants a request when not otherwise inhibited from doing so by a higher priority arbitration unit. Priority is periodically shifted from arbitration unit to arbitration unit using token passing rings to provide equitable allocation of connection rights to both input and output ports.
FIG. 1 illustrates an exemplary communication switch 50 that supports three input ports 58 and three output ports 60. The communication switch 50 includes a 3×3 crossbar switching mechanism 52, a 3×3 arbiter 54, and a set of nine virtual output queues 56. Each path from a given input port i 58 to a given output port j 60 can be referred to as a virtual link VLij. In a communication switch supporting a single service (queuing) priority, there is one virtual output queue Qij 56 associated with each virtual link VLij. Information that is to be transferred to output port j 60 arriving at input port i 58 is temporarily stored in virtual output queue Qij 56. Since the communication switch 50 shown switches fixed length packets, referred to as cells, information arriving at the communication switch 50 is converted into cells prior to being stored in a corresponding virtual output queue.
A cell period is associated with the communication switch 50. During a given cell period, one cell may be transferred to each of the three output ports 60 from the virtual output queues 56. The three by three crossbar switching mechanism 52 is used to perform this transfer. As shown, the three by three crossbar switching mechanism 52 includes nine switch units (SU0,0 through SU2,2) arranged in a three by three matrix. A given switch unit SUij is used to transfer the contents of virtual queue Qij to output port j, thereby establishing virtual link VLij. Thus, switch unit SUij is dedicated to the establishment of virtual link VLij.
The switches within the crossbar 52 may be reconfigured every cell period, such that during each cell period, a maximum of three cells are able to be transferred from the virtual output queues to the three output ports (one cell per output port). Based upon the construction of the crossbar switching mechanism shown, during any given cell period, only one cell may be transferred from a given input port, and only one cell may be transferred to a given output port. Otherwise, two cells would conflict with one another within the switching mechanism. It follows then that during any given cell period, only one switch within any given column of the crossbar matrix may be closed, and (for the non-broadcast case) only one switch within any given row of the crossbar matrix may be closed. For example, if the switches in both switch unit SU0,0 and switch unit SU2,0 were closed simultaneously, then neither the cell from input port 0 nor the cell from input port 2 would be successfully transferred to output port 0. Thus, a maximum of three crossbar switches may be closed within the three by three crossbar matrix 52 during any given cell period.
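The row and column constraint described above can be expressed as a short check. The following is a minimal sketch only; the function name `is_valid_configuration` and the boolean matrix representation are assumptions made for illustration and do not appear in the figures.

```python
def is_valid_configuration(closed):
    """Check a crossbar setting for the non-broadcast case.

    closed[i][j] is True when the switch in switch unit SU i,j is closed.
    A setting is valid when no row (input port) and no column (output port)
    has more than one closed switch.
    """
    n = len(closed)
    rows_ok = all(sum(1 for j in range(n) if closed[i][j]) <= 1 for i in range(n))
    cols_ok = all(sum(1 for i in range(n) if closed[i][j]) <= 1 for j in range(n))
    return rows_ok and cols_ok


# SU0,0 and SU2,0 closed together: both drive output port 0, so the setting is invalid.
conflict = [[True, False, False],
            [False, False, False],
            [True, False, False]]

# One closed switch per row and per column: three cells can be transferred.
matching = [[True, False, False],
            [False, False, True],
            [False, True, False]]
```

In this sketch, `is_valid_configuration(conflict)` is False and `is_valid_configuration(matching)` is True.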
It is the responsibility of the three by three arbiter 54 to determine which three crossbar switches are closed during any given cell period. In essence, the arbiter 54 attempts to “match” three input ports to three output ports during each cell period. As is evident from FIG. 1, each virtual output queue Qij has an associated switch unit SUij and an associated arbitration unit AUij. If a given virtual output queue Qij contains at least one cell, then the corresponding request signal Rij is activated at its corresponding arbitration unit AUij. During every cell period, the arbiter 54 examines all the activated requests and selects up to three non-conflicting crossbar switch settings. The three selected crossbar switches are then closed and, by activating the corresponding Grant Gij signals, the three corresponding virtual output queues are granted permission to transfer a cell. Once the cells are transferred to the output ports, the process is then repeated during the following cell period.
The arbiter shown in FIG. 1 may be implemented as a wave front arbiter (as described in Tamir) in order to make its selections. FIG. 2 is a block diagram of an exemplary 3×3 wave front arbiter 60. In its simplest implementation, priority is given based upon arbitration unit location. In such an implementation, arbitration units above have higher priority than arbitration units below, and arbitration units to the left have higher priority than arbitration units to the right.
Each arbitration unit of the 3×3 arbiter 60 has a top (North) and a left (West) input. When deactivated, the top input of an arbitration unit indicates that the output port of the associated column has been granted to some other input port during the current cell period, while the left input, when deactivated, indicates that the input port of the associated row has been matched to some other output port during the current cell period. If either the top input or the left input of a given arbitration unit is deactivated, then that arbitration unit will not issue a grant for that cell period. If the top input, the left input, and the request signal are all activated at a given arbitration unit, then that arbitration unit will issue a grant for that cell period.
In the 3×3 arbiter 60, the top left arbitration unit has the highest priority and thus the arbitration process starts at AU0,0. Since the top and left inputs of arbitration unit AU0,0 are permanently activated, if request signal R0,0 is activated, grant signal G0,0 will be activated and AU0,0's bottom (South) and right (East) output signals will be deactivated, thus indicating that input 0 and output 0 are no longer available for matching. If request signal R0,0 is deactivated, then AU0,0's bottom and right output signals will be activated, thus indicating to the diagonally downstream arbitration units that input 0 and output 0 are available for matching. After arbitration unit AU0,0 finishes its processing, its bottom and right output signals are forwarded to arbitration units AU0,1 and AU1,0, and a similar process is performed at each of those two arbitration units. Likewise, once AU0,1 and AU1,0 finish their processing, their output signals are forwarded to arbitration units AU0,2, AU1,1, and AU2,0.
As is evident, the order of processing of input information within the arbiter can be likened to a wave front moving diagonally from the top left arbitration unit down to the bottom right arbitration unit. When the bottom right arbitration unit finishes its processing, then the arbitration cycle for that cell period has been completed. For the 3×3 wave front arbiter 60, the wave front moves through the arbiter in five steps. At each step, each arbitration unit can independently perform its processing since at each step no two arbitration units share the same input row or output column.
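The diagonal processing order described above may be modeled in software as follows. This is a behavioral sketch under stated assumptions, not a description of the hardware: the function name `wave_front_arbitrate` and the lists `row_free` and `col_free` (standing in for the West and North signals) are illustrative names.

```python
def wave_front_arbitrate(requests):
    """Model a simple n x n wave front arbiter with fixed top-left priority.

    requests[i][j] is True when request signal R i,j is activated.
    Returns (grants, steps), where grants[i][j] mirrors grant signal G i,j
    and steps is the number of wave front steps taken.
    """
    n = len(requests)
    row_free = [True] * n  # West signal per row: input port i still unmatched
    col_free = [True] * n  # North signal per column: output port j still unmatched
    grants = [[False] * n for _ in range(n)]
    steps = 0
    for d in range(2 * n - 1):  # each anti-diagonal i + j == d is one step
        steps += 1
        for i in range(max(0, d - n + 1), min(n, d + 1)):
            j = d - i
            # Units on one diagonal never share a row or column,
            # so they can be evaluated independently within the step.
            if requests[i][j] and row_free[i] and col_free[j]:
                grants[i][j] = True
                row_free[i] = False
                col_free[j] = False
    return grants, steps


all_requests = [[True] * 3 for _ in range(3)]
grants, steps = wave_front_arbitrate(all_requests)
```

With all nine requests activated, this model grants AU0,0, AU1,1, and AU2,2 and takes five steps, consistent with the five-step figure given for the 3×3 case.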
The simple wave front arbiter 60 shown in FIG. 2 distributes bandwidth unfairly since the top left arbitration unit will always receive as much bandwidth as it needs and the other arbitration units receive decreasing amounts of bandwidth depending upon their physical relationship relative to the top left arbitration unit. Tamir somewhat rectifies this situation by rotating the top priority from one arbitration unit to another over some period of time. Rotating priority has the effect of distributing bandwidth to each virtual link equally. However, in real world applications, virtual links typically do not all require the same amount of bandwidth, and for such applications the wave front arbiter with rotating priority still does not distribute bandwidth fairly.
As noted by Tamir, it is possible to decrease the processing time of the overall arbitration cycle by utilizing a “wrapped” wave front arbiter. A wrapped wave front arbiter can be formed from a wave front arbiter by feeding the bottom outputs of the bottom row arbitration units into the top inputs of the top row arbitration units, and similarly feeding the right outputs of the rightmost column arbitration units into the left inputs of the leftmost column of arbitration units. In an n×n wrapped wave front arbiter, n arbitration units that do not share a common row or a common column within the arbiter can be grouped together to form an arbitration stage. FIG. 3 is a block diagram of a 3×3 wrapped wave front arbiter 64 with three arbitration stages. As can be observed, arbitration units AU0,0, AU2,1, and AU1,2 can be grouped into one stage (stage 0), since none of these arbitration units share a common input row or output column. Similarly, arbitration units AU0,1, AU1,0, and AU2,2 can be grouped into another common stage (stage 1) and arbitration units AU0,2, AU1,1, and AU2,0 can be grouped into yet another common stage (stage 2). In an n×n wrapped wave front arbiter, all n arbitration units within a given stage simultaneously process their input signals. Once one stage has completed its processing, the arbitration wave front moves to the next stage, and the n arbitration units of that next stage process their input signals. This continues until the arbitration units of all stages complete their processing. As is evident, an n×n wrapped wave front arbiter requires only n steps to complete its processing. This is less than the 2n−1 steps required for the n×n wave front arbiter of FIG. 2.
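The stage grouping listed for FIG. 3 is consistent with assigning arbitration unit AUi,j to stage (i + j) mod n. That formula is inferred from the 3×3 example and is an assumption of the following sketch, as are the names `stage_members` and `wrapped_wave_front_arbitrate` and the `first_stage` parameter, which selects the stage that processes its inputs first.

```python
def stage_members(n):
    """Group the n*n arbitration units into n stages using (i + j) mod n."""
    stages = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            stages[(i + j) % n].append((i, j))
    return stages


def wrapped_wave_front_arbitrate(requests, first_stage=0):
    """Model an n x n wrapped wave front arbiter; returns granted (i, j) pairs."""
    n = len(requests)
    row_free = [True] * n  # input port i not yet matched in this cell period
    col_free = [True] * n  # output port j not yet matched in this cell period
    grants = []
    for step in range(n):  # n steps, one per stage
        stage = (first_stage + step) % n
        for i in range(n):
            j = (stage - i) % n  # the one unit of this stage in row i
            if requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = False
                col_free[j] = False
    return grants


print(stage_members(3)[0])  # prints [(0, 0), (1, 2), (2, 1)]
```

Because the n units of a stage occupy distinct rows and distinct columns, the inner loop corresponds to work that the hardware performs simultaneously, which is why the model completes in n steps.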
In a wrapped wave front arbiter with fixed stage priority, one stage is declared to have priority over all other stages, and therefore during each cell period the n arbitration units of that highest priority stage always process their inputs first. For instance, if stage 0 is declared to have the highest priority in the arbiter 64 shown in FIG. 3, then the arbitration units of stage 0 process their inputs at the start of each arbitration cycle, and then forward their outputs to the arbitration units of stage 1. Once the arbitration units of stage 1 finish processing their inputs, the stage 1 arbitration units forward their outputs to the arbitration units of stage 2. After the arbitration units of stage 2 finish processing their inputs, the arbitration cycle is complete.
A wrapped wave front arbiter with fixed stage priority also distributes bandwidth unfairly since the arbitration units associated with the highest priority stage always receive as much bandwidth as they want, and the arbitration units of the other stages receive decreasing amounts of bandwidth depending upon their physical relationship to the highest priority stage. Tamir once again somewhat rectifies this situation by rotating the top priority from one stage to another over some period of time. As was the case for the wave front arbiter with rotating priority, rotating stage priority within the wrapped wave front arbiter (as shown in FIG. 4) has the effect of distributing bandwidth to each virtual link equally. However, as noted previously, in real world applications, virtual links typically do not all require the same amount of bandwidth, and for such applications, the wrapped wave front arbiter with rotating stage priority will not distribute bandwidth fairly.
In order to gain an understanding of how a wrapped wave front arbiter may distribute bandwidth unfairly, consider the following case. Assume that all the virtual output queues in the FIG. 3 system contain an unlimited number of cells (i.e., there is constantly backlogged traffic). In such a situation, all nine request signals Rij will always be activated, as each virtual link will attempt to garner as much bandwidth as possible. Assume also that stage priority is rotated every cell period, such that stage 0 has the highest priority during cell period 0, stage 1 has the highest priority during cell period 1, stage 2 has the highest priority during cell period 2, etc. (as illustrated in FIGS. 4A-4C). In this situation, virtual links VL0,0, VL2,1, and VL1,2 will be granted one cell of bandwidth during cell period 0, virtual links VL0,1, VL1,0, and VL2,2 will be granted one cell of bandwidth during cell period 1, virtual links VL0,2, VL1,1, and VL2,0 will be granted one cell of bandwidth during cell period 2, etc. Thus it can be observed that each of the three virtual links that desire the bandwidth of a given output port will receive exactly one third of the bandwidth of that port.
Now assume that the three virtual links associated with stage 0 (i.e., VL0,0, VL2,1, and VL1,2) no longer have any cells in their associated virtual output queues, while the virtual output queues associated with the other two stages still contain an unlimited number of cells. In this situation, when an arbitration wave front is initiated at stage 0 during cell period 0, none of the arbitration units associated with stage 0 will claim bandwidth, and therefore the wave front will pass on to the stage 1 arbitration units. Since the virtual links associated with stage 1 desire unlimited bandwidth, the bandwidth associated with the stage 0 initiated wave front is granted to the three “stage 1” virtual links. Following the arbitration cycle associated with cell period 0, an arbitration wave front is initiated at stage 1 during cell period 1, and the bandwidth associated with that wave front is also granted to the three virtual links associated with the stage 1 arbitration units. During cell period 2, an arbitration wave front is initiated at stage 2, and the bandwidth associated with that wave front is granted to the three virtual links associated with the stage 2 arbitration units. As can be observed, the three virtual links associated with stage 1 unfairly garner two thirds of the bandwidth of the output ports, while the three virtual links associated with stage 2 garner only one third of the bandwidth of the output ports.
In the example, the stage 1 virtual links benefit from the fact that the stage 1 arbitration units are physically located downstream from the stage 0 arbitration units, while the stage 2 virtual links suffer from the fact that the stage 2 arbitration units are not physically located downstream from the stage 0 arbitration units. It should be noted that the effects illustrated above can be somewhat mitigated by randomizing the application of the requests to the arbiter. However, randomizing the application of the requests to the arbiter further complicates the communication switch while still not providing a mechanism for allocating bandwidth in a flexible manner. For instance, in the previous example, there may be a desire to allocate three fifths of the output bandwidth to the virtual links associated with stage 2 and two fifths of the bandwidth to the virtual links associated with stage 1. Randomizing the application of the requests to the arbiter does not address this problem.
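The two cases above can be reproduced with a small simulation. The following is a sketch under assumptions: units are assigned to stages by (i + j) mod n (inferred from the FIG. 3 grouping), stage priority rotates once per cell period, and the names `arbitrate` and `bandwidth_share_by_stage` are illustrative.

```python
from fractions import Fraction


def arbitrate(requests, first_stage):
    """One cell period of a wrapped wave front arbiter starting at first_stage."""
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for step in range(n):
        stage = (first_stage + step) % n
        for i in range(n):
            j = (stage - i) % n
            if requests[i][j] and row_free[i] and col_free[j]:
                grants.append((i, j))
                row_free[i] = False
                col_free[j] = False
    return grants


def bandwidth_share_by_stage(requests, periods):
    """Fraction of all granted cells that went to each stage's virtual links."""
    n = len(requests)
    counts = [0] * n
    for t in range(periods):
        # Rotating stage priority: stage (t mod n) goes first in cell period t.
        for i, j in arbitrate(requests, first_stage=t % n):
            counts[(i + j) % n] += 1
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


# Case 1: every virtual output queue is constantly backlogged.
all_backlogged = [[True] * 3 for _ in range(3)]

# Case 2: the stage 0 queues (VL0,0, VL2,1, VL1,2) are empty.
stage0_idle = [[(i + j) % 3 != 0 for j in range(3)] for i in range(3)]

print(bandwidth_share_by_stage(all_backlogged, 3))
print(bandwidth_share_by_stage(stage0_idle, 3))
```

Over three cell periods the first case yields one third per stage, and the second yields zero, two thirds, and one third for stages 0, 1, and 2 respectively, matching the shares stated above.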
Several schemes have been proposed to allocate bandwidth through a crossbar switch using a variety of methods. Static scheduling is discussed in T. Anderson, S. Owicki, J. Saxe, and C. Thacker, “High Speed Switch Scheduling for Local Area Networks,” Proc. Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), October 1992, pp. 98-110 (also appeared in ACM Transactions on Computer Systems 11, 4, November 1993, pp. 319-352 and as Digital Equipment Corporation Systems Research Center Technical Report #99); weighted probabilistic iterative matching is discussed in D. Stiliadis and A. Varma, “Providing Bandwidth Guarantees in an Input-Buffered Crossbar Switch” Proceedings of INFOCOM'95, Boston, Mass., April 1995, and matching using weighted edges is discussed in A. Kam and K. Siu, “Linear Complexity Algorithms for QoS Support in Input-queued Switches with no Speedup,” IEEE Journal on Selected Areas in Communications, June 1999, vol. 17, (no. 6):1040-56.
In Anderson, a scheduling frame with a fixed number of slots is defined, and bandwidth is statically allocated to input/output pairs (i.e., virtual links) by manually configuring a table. The table, which spans the period of time associated with the scheduling frame, is then used to reconfigure the switch fabric every cell period. The pattern associated with reconfiguring the switch fabric repeats itself every scheduling frame period. There are two potential problems associated with the static scheduling method presented by Anderson: 1) reconfiguring the tables is complex and time consuming and thus may limit the connection establishment rate associated with the switch, and 2) bandwidth may go unused during reserved slots where connections have no data available.
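The second problem noted above, reserved slots going unused, can be illustrated with a small model of a static scheduling frame. This is a sketch only: the frame contents, the queue lengths, and the function name `run_static_frame` are hypothetical and are not taken from Anderson.

```python
def run_static_frame(frame, queue_lengths):
    """Play one scheduling frame against the current queue occupancy.

    frame is a list of slots; each slot is a list of (input, output) pairs
    that the table reserves for that cell period.
    queue_lengths maps (input, output) to the number of queued cells.
    Returns (sent, wasted): cells transferred and reserved slots left unused.
    """
    queues = dict(queue_lengths)  # work on a copy so the caller's data is unchanged
    sent = 0
    wasted = 0
    for slot in frame:
        for link in slot:
            if queues.get(link, 0) > 0:
                queues[link] -= 1
                sent += 1
            else:
                wasted += 1  # reservation exists but the link has no data
    return sent, wasted


# Hypothetical two-slot frame for a 3 x 3 switch.
frame = [[(0, 0), (1, 1), (2, 2)],
         [(0, 1), (1, 2), (2, 0)]]

# Only two virtual links currently hold cells.
queue_lengths = {(0, 0): 1, (0, 1): 5}

print(run_static_frame(frame, queue_lengths))  # prints (2, 4)
```

In this example four of the six reserved transfer opportunities are wasted, even though virtual link (0, 1) still has cells waiting.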
In Stiliadis, an input/output matching algorithm is presented that uses two passes through the arbiter per cell period. During the first pass, only those virtual links that have “bandwidth credits” are allowed to compete for bandwidth, while during the second pass, those virtual links with and without bandwidth credit are allowed to compete for bandwidth. A frame period is defined, and each virtual link is given an amount of bandwidth credits at the start of each frame period. Once a given virtual link runs out of credits, it can no longer compete for bandwidth during the first pass through the arbiter. There are three potential problems associated with Stiliadis's algorithm: 1) an iterative matching algorithm is used and therefore the implementation of the algorithm might be considered more complex than the implementation of the simple wrapped wave front arbiter algorithm, 2) credits are issued to each virtual link at the start of each frame period associated with the arbiter and therefore (as discussed by Stiliadis) bursty connections may increase the delay of other connections through the associated switch, and 3) the algorithm does not allow connections with low delay requirements (and reserved bandwidth) to take priority over connections with no delay requirements (and reserved bandwidth).
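The two-pass, credit-gated structure described above can be sketched as follows. This sketch does not reproduce the iterative matching algorithm attributed to Stiliadis; a simple greedy matcher is substituted as a stand-in. The names `greedy_match` and `two_pass_cell_period`, and the choice to consume a credit only on a first-pass grant, are assumptions made for illustration.

```python
def greedy_match(candidates, row_free, col_free):
    """Stand-in matcher: grant candidates in list order when both ports are free."""
    grants = []
    for i, j in candidates:
        if row_free[i] and col_free[j]:
            grants.append((i, j))
            row_free[i] = False
            col_free[j] = False
    return grants


def two_pass_cell_period(backlogged, credits, n):
    """One cell period with two passes through the matcher.

    backlogged is a list of (input, output) links that have a cell queued.
    credits maps a link to its remaining bandwidth credits (updated in place).
    """
    row_free = [True] * n
    col_free = [True] * n
    # Pass 1: only links that still hold credits may compete.
    with_credit = [link for link in backlogged if credits.get(link, 0) > 0]
    grants = greedy_match(with_credit, row_free, col_free)
    for link in grants:
        credits[link] -= 1  # assumption: a first-pass grant consumes one credit
    # Pass 2: all links compete for whatever ports remain unmatched.
    grants += greedy_match(backlogged, row_free, col_free)
    return grants


credits = {(1, 0): 2}
grants = two_pass_cell_period([(0, 0), (1, 0), (2, 1)], credits, n=3)
print(grants)  # prints [(1, 0), (2, 1)]
```

Here link (1, 0) wins output port 0 ahead of link (0, 0) because it holds credits, and link (2, 1) picks up an otherwise idle port in the second pass.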
In Kam, a method of matching inputs to outputs using weighted virtual links is presented. Kam uses a matching algorithm (the Central Queue Algorithm) that matches inputs to outputs in sequential order based upon weights that are attached to each input/output pair. Since each match is performed sequentially starting with the highest weighted pair, when applied to a switch with n inputs, the algorithm may require n² steps to complete. (It should be noted that the wrapped wave front arbiter requires a maximum of n steps to complete its matches, since it attempts to match n pairs in parallel at each step. Furthermore, as pointed out in H. Chi and Y. Tamir “Decomposed Arbiters for Large Crossbars with Multi-Queue Input Buffers,” Proceedings of the International Conference on Computer Design, Cambridge, Mass., October 1991, pp. 233-238, the wrapped wave front arbiter may be easily pipelined, and therefore practical wrapped wave front arbiters can be constructed for switches containing large values of n.) Kam also presents an algorithm that uses two passes through the arbiter per cell period. In the first pass, those input/output pairs with credits are allowed to compete for bandwidth, while during the second pass all the input/output pairs without credit are allowed to compete for bandwidth. In addition, Kam offers a method of giving higher precedence to those input/output pairs requiring lower delay guarantees by rescaling the credit weights. There are two potential problems associated with Kam's algorithms: 1) the long arbitration cycle associated with the Central Queue Algorithm may make it impractical to implement when the number of inputs is large, and 2) all of the algorithms presented in Kam require a sorting process (of varying complexity) prior to performing the matching algorithm, thus adding to the complexity of the overall implementation.
U.S. Pat. Nos. 6,430,181, 6,044,061, and 6,072,772 also describe various other arbitration mechanisms which also suffer from many of the drawbacks discussed above such as complex sorters, sequential matching algorithms, elements of unfairness, etc.
An example of an arbiter is a wrapped wave front arbiter. Unlike some conventional arbiters, a wrapped wave front arbiter is simple to implement and therefore offers an attractive and practical method of performing arbitration within a communication switch. However, for cases where not all input ports are requesting bandwidth on all output ports, the wrapped wave front arbiter does not resolve conflicting resource demands in a fair manner. Conventionally, the unfairness problem is addressed by under-utilizing the switch fabric or by providing switch fabric speedup. However, these options are not ideal and may not be able to be incorporated into certain systems. In addition, conventional wave front arbiters do not guarantee proportional amounts of bandwidth to the inputs and outputs of a switch fabric when nonuniform traffic patterns are applied to the switch inputs.
Thus, it is desirable to provide a wave front arbiter that addresses the unfairness problem associated with conventional wave front arbiters and/or allows bandwidth reservation within a wave front arbiter-based switch fabric.