1. Field of the Invention
The invention relates to packet-based switching fabrics, and more particularly to a load balancing method and apparatus for selecting an appropriate next-stage module for transmission of a data packet in the presence of multicast capability.
2. Description of Related Art
A switch fabric for a data network is a device that allows data from any of several input ports to be communicated switchably to any of several output ports. Early data networks were based on circuit switching, in which fixed routes were established through the fabric for each session. The peak bandwidth demand of each session was allocated to the route for the entire duration of the session. When session traffic was bursty, however, circuit switching resulted in under-utilization of network resources during the time between bursts. Packet switching was developed to overcome this disadvantage, thus improving the network utilization for bursty traffic.
Packet switched networks dynamically allocate bandwidth according to demand. By segmenting the input flow of information into units called “packets,” and processing each packet as a self-contained unit, packet switched networks allow scheduling of network resources on a per-packet basis. This enables multiple sessions to share the fabric resources dynamically by allowing their packets to be interleaved across the fabric. Typically each packet includes a header indicating its destination port, and the fabric includes a routing mechanism for determining a route through the fabric, on a per-packet basis. The present invention is concerned primarily with a routing mechanism for packet switched networks rather than circuit switched networks.
Small switching fabrics can be constructed from crossbar switches, in which input ports are connected to the rows of a grid and the output ports are connected to the columns of the grid (or vice-versa). Each input port then can be connected to any output port merely by activating the switch at the grid junction at which they intersect. Multicast data flow can be supported just as easily, by turning on more than one junction switch to connect more than one output port to a single input port.
Crossbar switches do not scale well to larger fabrics. Many larger fabrics therefore use a multi-stage network topology, in which switching from a number of input ports to a number of output ports is accomplished through one or more intermediate stages. Each stage can have one or more modules, each implementing its own internal switch. In addition, in a fully connected network, all of the modules in each stage of the network have respective communication paths to all of the modules in the next stage. A basic network of this sort has three stages (input, intermediate and output), but networks with any odd number of stages theoretically can be constructed by replacing the modules in any given stage with smaller multi-stage networks in recursive fashion.
A special case of multi-stage switch networks was studied by Clos in C. Clos, “A Study of Non-Blocking Switching Networks”, Bell System Technical Journal, March 1953, vol. 32, No. 3, pp. 406-424, incorporated by reference herein. A so-called Clos network has three stages, any of which can be recursed to create effectively a network with a larger odd number of stages. All input stage modules (sometimes simply called “input modules”) of the network have an equal number of input ports, all output stage modules (sometimes simply called “output modules”) have an equal number of output ports, and all input and output modules are fully interconnected with all intermediate stage modules (sometimes simply called “intermediate modules”). Clos networks can be symmetric, in which case the number of modules and the number of ports per module on the input side match the corresponding values on the output side, or they can be asymmetric, in which case the number of modules or the number of ports per module on the input side does not necessarily match the corresponding value for the output side. A symmetric Clos network, therefore, can be characterized by a triple (m, n, r) where m is the number of modules in the intermediate stage, n is the number of input ports on each input module (the same as the number of output ports on each output module), and r is the number of modules in the input stage (the same as the number of modules in the output stage). An asymmetric Clos network must be characterized by a quintuple (m, nI, rI, nO, rO), where the subscripts I and O denote the input side and the output side, respectively. The invention is most useful in Clos networks, but under proper circumstances it can also be used in multi-stage networks that do not strictly meet the definition of a Clos network.
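By way of illustration only, the triple (m, n, r) of a symmetric Clos network can be captured in a small data structure. The following Python sketch is not taken from the Clos reference; the class name SymmetricClos and its two helper methods are hypothetical names introduced here, and the link count assumes exactly one link between each input module and each intermediate module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymmetricClos:
    """Parameters (m, n, r) of a symmetric three-stage Clos network."""
    m: int  # number of intermediate stage modules
    n: int  # input ports per input module (= output ports per output module)
    r: int  # number of input modules (= number of output modules)

    def fabric_input_ports(self) -> int:
        # Every one of the r input modules contributes n input ports.
        return self.n * self.r

    def input_to_intermediate_links(self) -> int:
        # Assumption: full interconnection with one link per module pair.
        return self.r * self.m


# Example: two intermediate modules, two ports per module, two input modules.
fabric = SymmetricClos(m=2, n=2, r=2)
print(fabric.fabric_input_ports())           # 4
print(fabric.input_to_intermediate_links())  # 4
```

Because the network is symmetric, the same two figures apply to the output side.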
Multi-stage networks scale better than pure crossbar switch networks, to a point, but also introduce the possibility of blocking operation. That is, because data from more than one input port have to share the same intermediate modules, a possibility exists that when data is ready for transmission, all possible routes to the output module having the desired destination output port might be blocked by other data flows. Theoretical formulas exist for calculating the minimum required number of intermediate stage modules and stage-to-stage data link rates in order to provide non-blocking operation given specified maximum input and output port numbers and data rates, but these minimum requirements are only necessary conditions; they are not necessarily sufficient by themselves to achieve non-blocking operation. Networks also must be designed to choose appropriate routes through the intermediate stage modules for individual data packets, and to backpressure them properly.
For example, consider a 3-stage Clos network having two input modules, two output modules, two input ports on each input module, and two output ports on each output module. Assume further that the maximum data rate per input port, the maximum data rate per output port, and the stage-to-stage link data rate, are all R. Then a necessary condition for non-blocking operation is that there be at least two intermediate stage modules. This can be seen because, with two intermediate stage modules, the total output capacity of a given one of the input modules is 2R (R to each of the two intermediate stage modules), which is no less than the maximum total input data rate of the input module, which in this case is also 2R (R from each of the two input ports to the module). The same is true for every other module in the network. However, assume now the extreme case that the routing algorithm employed by a given one of the input modules is to always send all input packets to the first intermediate stage module and never to the second. In this case, since the data rate from an input module to a single intermediate stage module is only R, the fabric will be able to transport only half the combined data rate that was promised to the two input ports of that module, and the fabric will have to block packets from one or the other or both of such input ports whenever their combined input data rate exceeds R.
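The arithmetic of this example can be checked with a short Python sketch. It is offered only as an illustration; the function name input_module_throughput and its parameter names are hypothetical, and the model simply caps the carried rate at the aggregate rate of the links that the routing algorithm actually uses.

```python
def input_module_throughput(port_rate, ports_per_module, links_used, link_rate):
    """Return (offered, carried) data rates for a single input module.

    offered: maximum combined rate arriving on the module's input ports.
    carried: the portion that fits on the links the striper actually uses.
    """
    offered = port_rate * ports_per_module
    capacity = link_rate * links_used
    return offered, min(offered, capacity)


R = 1.0  # normalized data rate

# Both intermediate stage modules are used: 2R offered, 2R carried.
balanced = input_module_throughput(R, ports_per_module=2, links_used=2, link_rate=R)

# All packets are sent to the first intermediate stage module only.
skewed = input_module_throughput(R, ports_per_module=2, links_used=1, link_rate=R)

print(balanced)  # (2.0, 2.0)
print(skewed)    # (2.0, 1.0)
```

In the second case only half of the offered rate can be carried, which is the blocking condition described above.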
The algorithm used by an input module to decide which intermediate module to send the next packet to is known variously as a load balancing, channel balancing, or striping algorithm. Much research has been conducted into optimum load balancing algorithms. Many of the algorithms apply only to the older circuit switched networks, but many others apply to packet switched networks. The algorithms applicable to packet switched networks are the ones of interest in the present discussion.
It will be appreciated that striping algorithms are different from “fair queuing” algorithms, or queue scheduling algorithms, the purpose of which is to select which of a plurality of non-empty input queues the next packet is to be taken from for transmission across the fabric. Typically an input module requires both kinds of algorithms: a fair queuing algorithm to determine which input queue to service next, and then a striping algorithm to determine how to route the next packet from the input queue chosen by the fair queuing algorithm. A duality does exist between the two kinds of algorithms, but only in certain circumstances can a fair queuing algorithm be converted directly to a load balancing algorithm or vice versa. For example, whereas it might be desired to formulate a striping algorithm that will achieve certain goals under a particular set of striping conditions, there may be no useful dual of such a striping algorithm in the fair queuing arena because there is no useful dual of the goals or set of conditions in the fair queuing arena. In such a situation, it might not be intuitive that direct conversion of any known fair queuing algorithms will be optimal as a load balancing algorithm under the set of conditions for which a striping algorithm is being developed.
A good striping algorithm should be able to minimize the probability of blocking operation while utilizing all of the available channels in proportion to their respective capacities. One way to achieve these goals might be through the use of a global supervisor that is continually aware of queue lengths in all channels, and uses this information to choose the best route for the next packet. This solution does not scale well, however, for a number of reasons. First, as the number of input and output ports grows, and channel data rates increase, it becomes increasingly difficult to design logic circuitry that is fast enough to make all the required calculations in time for each packet. Second, it also becomes increasingly difficult to design in sufficient control signal capacity to transmit the information from all the various queues in the network back to the supervisor. The latter problem is only exacerbated when the various ports, queues and routes are spread out over multiple chips, boards or systems.
Because of these problems, a number of different striping algorithms have been developed for three-stage networks which do not require direct knowledge of downstream queue lengths. These algorithms therefore avoid (or at least reduce the amount of) control signaling required across the network. Because these algorithms rely on probabilities rather than deterministic calculations, they achieve the goals of non-blocking operation and fair channel usage with varying degrees of success in different circumstances.
In one such algorithm, known as round robin (RR) striping, packets are sent from the input stage to the intermediate stage modules in a round-robin order. This algorithm is generally simple to implement, but it does not take account of different bandwidth capacities available on different channels. For switching fabrics having different capacities on different channels, a weighted round robin (WRR) striping algorithm is known, in which during each round robin cycle, the number of packets transmitted on each channel is proportional to the capacity of that channel. Both round robin and weighted round robin striping algorithms achieve the goals of non-blocking operation and fair channel usage best when the algorithm can be implemented globally across all input queues. In many kinds of fabrics, however, the input queues are distributed across multiple input modules. Coordination among the input queues becomes more and more difficult as the number of input modules increases, thereby stifling the scalability of the network. In this case it is known to allow each input module to implement its own round robin or weighted round robin striping, without coordinating with the other input modules. This leaves open a small risk that two or more modules will synchronize, but that risk is accepted or otherwise avoided in various implementations.
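A minimal sketch of weighted round robin striping, as it might run independently in a single input module, is given below in Python. The function name weighted_round_robin is hypothetical, and the sketch assumes that channel capacities are expressed as positive integer weights; with all weights equal to one it reduces to plain round robin.

```python
from itertools import islice


def weighted_round_robin(weights):
    """Yield channel indices forever.

    In each round robin cycle, channel i is chosen weights[i] times, so the
    number of packets per channel is proportional to its weight.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive integers")
    while True:
        for channel, weight in enumerate(weights):
            for _ in range(weight):
                yield channel


# Channel 0 has twice the capacity of channel 1.
print(list(islice(weighted_round_robin([2, 1]), 6)))  # [0, 0, 1, 0, 0, 1]
```

The sketch counts packets only and ignores packet length.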
Round robin and weighted round robin striping algorithms, however, do not optimize load balancing when the packet size is variable. As an example, consider a Clos network having two modules in the intermediate stage, equal data rates on all channels, and a sequence of packets to send which alternate in size between large and small. In this case an input module implementing a round-robin striping algorithm will alternate striping between the two intermediate stage modules and will do so synchronously with the packet size. All the large size packets will therefore be sent through one of the intermediate stage modules (call it intermediate stage module #1) while all the small size packets will be sent through the other intermediate stage module (call it intermediate stage module #2). The algorithm therefore does not maximally utilize all of the available channels in proportion to their respective capacities. Nor does it ensure non-blocking operation, because the fabric might have to hold up a large size packet while it waits for the output queue of intermediate stage module #1 to empty. If the small size packet behind the large size packet has already arrived into the input module, its transmission will be blocked even if the route through intermediate stage module #2 is clear. Still further, if the traffic is not well balanced across the links, then some links may be oversubscribed, i.e., presented with traffic whose rate exceeds that of the link. In the event that this imbalance persists for long enough, the node that oversubscribes the link can accumulate excess traffic until it overflows and is forced to drop packets.
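The imbalance in this example can be reproduced with a few lines of Python. The sketch is illustrative only; the function name round_robin_bytes and the packet lengths of 1500 and 64 bytes are assumptions chosen for the demonstration.

```python
def round_robin_bytes(packet_lengths, num_channels):
    """Stripe packets round robin and return the bytes sent on each channel."""
    totals = [0] * num_channels
    for index, length in enumerate(packet_lengths):
        totals[index % num_channels] += length
    return totals


# Packets alternate between large (1500 bytes) and small (64 bytes).
lengths = [1500, 64] * 100
totals = round_robin_bytes(lengths, num_channels=2)
print(totals)  # [150000, 6400]
```

Although each channel carries the same number of packets, the first channel carries more than twenty times as many bytes as the second.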
In order to address issues of variable packet size, a striping algorithm known as deficit round robin (DRR) has been developed. According to the DRR algorithm, a deficit count is maintained for each channel. Before packets are sent on a current channel, a quantum is added to the deficit count for that channel. If channel capacities differ, then the quantum for each channel can be proportional to the relative capacity of that channel (deficit weighted round robin, or DWRR). If the length of the next packet does not exceed the deficit count for the current channel, the packet is sent on that channel and the deficit count for that channel is reduced by the length of the packet. The sender continues sending packets on the current channel, concomitantly reducing the deficit count for that channel, until the length of the next packet to send is greater than the deficit count for the current channel. The sender then moves on to the next channel in round robin sequence, adds the quantum to the deficit count for the new channel, and tests the count against the length of the new packet. As with RR and WRR, DRR and DWRR algorithms can be implemented in a distributed manner to thereby improve scalability.
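The DRR procedure described above might be sketched in Python as follows. This is an illustrative reading of the description, not a reference implementation: the function name drr_stripe is hypothetical, a packet is treated as sendable when its length does not exceed the deficit count, and all quanta are assumed to be positive so that every packet is eventually sent. Passing unequal quanta gives the weighted (DWRR) variant.

```python
def drr_stripe(packet_lengths, quanta):
    """Assign each packet to a channel using deficit round robin.

    quanta[i] is the quantum added to channel i's deficit count each time the
    sender moves to that channel. Returns one channel index per packet.
    """
    if not quanta or any(q <= 0 for q in quanta):
        raise ValueError("quanta must be a non-empty list of positive numbers")
    deficits = [0] * len(quanta)
    assignment = []
    channel = 0
    deficits[channel] += quanta[channel]  # quantum for the first channel
    for length in packet_lengths:
        # Move on until the current channel's deficit count covers the packet.
        while length > deficits[channel]:
            channel = (channel + 1) % len(quanta)
            deficits[channel] += quanta[channel]
        deficits[channel] -= length
        assignment.append(channel)
    return assignment


# Alternating large and small packets, equal quanta on two channels.
print(drr_stripe([1500, 64, 1500, 64, 1500, 64], [1500, 1500]))
# [0, 1, 0, 1, 1, 1]
```

Unlike plain round robin, the channel that received only small packets accumulates deficit and then takes a large packet, so the bytes carried by the channels stay close to one another over time.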
DRR and DWRR can be very good at avoiding blocking situations and using all channels in proportion to their respective capacities, but it is not believed that these algorithms have been considered for use in a switching fabric. An example of DRR striping is described in H. Adiseshu, G. Parulkar, and G. Varghese, “A Reliable and Scalable Striping Protocol,” in Proceedings of ACM SIGCOMM '96, pp. 131-141 (1996), incorporated by reference herein, but only for the problem of selecting among multiple parallel routes from a single source node to a single destination node. It is not clear from this paper how to adapt the algorithm for use in a multi-stage switching fabric, which usually includes multiple input nodes, multiple destination nodes, and multiple routes from each input node to each destination node, some of which share common data paths for part of the route (e.g., the part of the route from the input module to the intermediate stage modules).
The DRR and DWRR load balancing algorithms also do not address the problems created by a multicast replication capability in downstream modules. In many situations it is desirable for one node of a network to communicate with some subset (proper or improper) of all the nodes in the network. For example, multi-party audio and video conferencing capabilities and audio and video broadcasting to limited numbers of nodes are of considerable interest to users of packet-switched networks. To satisfy such demands, packets destined for several recipients typically are transmitted from a source to a point in a network at which the packets are replicated and forwarded on to all recipients in the multicast group. Multicast routers have been developed which perform the replication service. Since demand for these kinds of services is increasing, it would be desirable to design a new switch fabric architecture for use in many different kinds of equipment including multicast routers and other multicasting elements. Thus it would be extremely desirable if the switch fabric architecture would include multicast replication capability.
Multicast replication is advantageously performed as close as possible to the output ports of the fabric. If the replication were to be performed in the input modules, then each replica could be considered as a separate packet and striped effectively using DRR or DWRR. But then multiple identical packets would be traversing the fabric unnecessarily and a significant fraction of the fabric's overall capacity could be impacted. Thus if two or more members of the multicast group are reached through output ports on a single output module, then replication of the packets for those members of the group is advantageously delayed until the packets reach that output module. If two members of the multicast group are reached through different output modules, then replication of the packets for those members must be performed in the intermediate stage modules. In a fully connected multi-stage switching fabric, it is rarely necessary to replicate packets in the input modules.
Because multicast replication is rarely performed at the input ports of a fabric, multicast capability in a switch fabric can be problematic for a striping algorithm. When a packet flow is replicated in an intermediate stage module and then sent to two or more different output modules, the bandwidth utilization of the paths from the intermediate stage to the output stage differs from that of a fabric that does not perform multicast replication. In addition, whereas in a unicast fabric only one intermediate stage output queue is affected by each packet sent from an input module, in a multicast fabric, many intermediate stage output queues can be affected. Neither of these considerations is taken into account in the DRR and DWRR load balancing algorithms. Without modification, therefore, a fabric that is capable of multicast replication will not achieve the goals of minimum risk of blocking operation and fair utilization of channel capacity if it attempts to use a known DRR or DWRR striping algorithm.
One might consider developing a global supervisor that directly observes the queue lengths and the packets in-flight to each output module, and selects the best route for each next packet in dependence upon this information. Such a supervisor could successfully achieve full throughput and full usage of channel capacity, but as previously mentioned, a global supervisor does not scale well. Thus whereas a striper implemented in a global supervisor might be adequate for small fabrics, it would not be adequate for larger fabrics. A switching fabric architecture that relied on such a striper therefore would be limited in application only to small systems.
Accordingly, there is an urgent need for a switch fabric architecture that can achieve full throughput and maximum channel usage, and that is applicable to a wide variety of network elements, including satisfaction of the increasing need for fabrics supporting multicast replication. As an important part of achieving these goals, there is an urgent need for a new striping algorithm that continues to minimize the blocking risk and maximize fair channel utilization, whether or not the fabric's multicast capability is exercised. Preferably such a striping algorithm can be implemented in a distributed manner, so as to find applicability in both small and large fabrics.