The explosive demand for bandwidth over all sorts of communications networks has driven the development of very high-speed switch fabric devices. Those devices have allowed the practical implementation of network nodes capable of handling aggregate data traffic in a large range of values i.e., with throughputs from a few gigabit (109) to multi-terabit (1012) per second. To carry out switching at network nodes, today's preferred solution is to employ, irrespective of the higher communications protocols actually in use to link the end-users, fixed-size packet (or cell) switching devices. These devices, which are said to be protocol agnostic, are considered to be simpler and more easily tunable for performances than other solutions especially, those handling variable-length packets. Thus, N×N switch fabrics, which can be viewed as black boxes with N inputs and N outputs, have been made capable of moving short fixed-size packets (typically 64-byte packets) from any incoming link to any outgoing link. Hence, communications protocol packets and frames need to be segmented in fixed-size packets while being routed at a network node. Although short fixed-size packet switches are thus often preferred, the segmentation and subsequent necessary reassembly (SAR) they assume have a cost. Switch fabrics that handle variable-size packets are thus also available. They are designed so that they do not require or limit the amount of SAR needed to route higher protocol frames.
Whichever type of packet switch is considered they have however in common the need of an efficient flow control mechanism which must attempt to prevent all forms of congestion. To this end, all modern packet switches use a scheme referred to as ‘virtual output queuing’ or VOQ. As sketched in FIG. 1, all ingress port adapters or IA's (100) to a switch core (110) are temporarily storing incoming packets (105) in a ‘first come first served’ or FCFS order, generally in the form of linked list of packets (120) however, sorted on a per destination basis and more generally on a per flow basis (125). Depending on the type of application considered, flows can have to be differentiated not only by their destinations but also according to priorities or ‘class of service’ (CoS) and possibly according to other traffic characteristics such as of being a multicast (MC) flow of packets that must be replicated to be forwarded to multiple destinations as opposed to unicast (UC) flows. Hence, flows are differentiated with flow-ID's which include destinations and possibly many more parameters especially, a CoS.
Organizing input queuing as a VOQ has the great advantage of preventing any form of ‘head of line’ or HoL blocking. HoL blocking is potentially encountered each time incoming traffic, on one input port, has a packet destined for a busy output port, and which cannot be admitted in the switch core, because flow control mechanism has determined it is better to do so e.g., to prevent an output queue (OQ) such as (130) from over filling. Hence, other packets waiting in line are also blocked since, even though they would be destined for an idle output port, they just cannot enter the switch core. To prevent this from ever occurring, IA's input queuing is organized as a VOQ (115). Incoming traffic on each input port i.e., in each IA, is sorted per port destination (125) and in general per class of service or flow-ID, so that if an output port is experiencing congestion, traffic for other ports, if any, can be selected instead thus, has not to wait in line.
This important scheme of switch fabrics which authorizes input queuing without its drawback i.e., HoL blocking, was first introduced by Y. Tamir and G. Frazier, “High performance multi-queue buffers for VLSI communication switches,” in Proc. 15th Annu. Symp. Comput. Arch., June 1988, pp. 343-354. It is universally used in all kinds of switch fabrics that rely on input-queuing and is described, or simply assumed, in numerous publications dealing with this subject. As an example, a description of the use of VOQ and of its advantages can be found in “The iSLIP Scheduling Algorithm for Input-Queued Switches” by Nick McKeown, IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 7, NO. 2, April 1999.
The implementation of a packet switching function brings a difficult challenge which is the overall control of all the flows of data entering and leaving it. Whichever method is adopted for flow control, this always assumes that packets can be temporarily held at various stages of the switching function so as to handle flows on a priority basis thus supporting QoS (Quality of Service) and preventing switch to get congested. VOQ scheme fits well with this, allowing packets to be preferably held in input queues i.e., in IA's (100), before entering the switch core (110) while not introducing any blocking of higher priority flows.
As an example of this, FIG. 1 shows a shared-memory switch core (112) equipped with port OQ's (135) whose filling is monitored so as incoming packets can be held in VOQ's to prevent output congestion to occur. To prevent OQ's from ever overflowing, packets are no longer admitted when an output congestion is detected. Congestion occurs because too much traffic, destined for one output port or a set of output ports, is entering the switch core. As an elementary example of this one may consider two input ports each receiving 75% of their full traffic destined for a same given output port. This latter can only drain 100% of the corresponding traffic (ports IN and OUT typically, have identical speed) thus, the traffic in excess (50%) must be stored in the shared-memory and starts to build-up. If congestion lasts, and if nothing is done, the shared-memory fills up and the related OQ (130) soon overflows. Therefore, all OQ's are watched so as, if they tend to fill up, a feedback mechanism (140) prevents packets for a congested switch core output from leaving the corresponding VOQ's of IA's. This is easily achieved since VOQ's, in each ingress IA, are organized per destination as discussed above. Obviously, this is done on a priority basis i.e., lower priority packets are held first (although, for a sake of clarity, this is not shown in FIG. 1, VOQ's are organized per priority too) according to a series of thresholds (145) associated to the set of OQ's (135).
This scheme works well as long as the time to feed the information back to the source of traffic i.e., the VOQ's of IA's (100), is short when expressed in packet-times. However, packet-time reduces dramatically in the most recent implementation of switch fabrics where the demand for performance is such that aggregate throughput must be expressed in tera (1012) bits per second. As an example, packet-time can be as low as 8 ns (nanoseconds i.e.: 10−9 sec.) for 64-byte packets received on OC-768 or 40 Gbps (109 bps) switch port having a 1.6 speedup factor thus, actually operating at 64 Gbps. As a consequence, round trip time (RTT) of the flow control information is far from being negligible as this used to be the case with lower speed ports. As an example of a worst case traffic scenario, all input ports of a 64-port switch may have to forward packets to the same output port eventually creating a hot spot. It will take RTT time to detect and block the incoming traffic in all VOQ's involved. If RTT is e.g.: 16 packet-times then, 64×16=1024 packets may have to accumulate for the same output in the switch core. A RTT of 16 packet-times corresponds to the case where, for practical considerations and mainly because of packaging constraints, distribution of power, reliability and maintainability of a large system, port adapters cannot be located in the same shelf and have to interface with the switch core ports through cables. Then, if cables (150) are 10 meter long, because light is traveling at 5 nanoseconds per meter, it takes 100 nanoseconds or about 12 8-ns packet-times to go twice through the cables. Then, adding the internal processing time of the electronic boards, including the multi-Gbps serializer/deserializer (SERDES), this may easily add up to the 16 packet-times used in the above example.
Hence, when the performance of a large switching equipment approaches or crosses the 1 Tbps level, typically with 40 Gbps (OC-768) ports, RTT expressed in packet-time is becoming too high to continue to use a standard or back-pressure flow control mechanism such as the ones briefly discussed in FIG. 1. Because this type of flow control assumes that all IA's, independently, keep forwarding traffic to the switch core, and relies on the feedback (140) of flow control information to stop sending if a congestion is detected clearly, the reaction time becomes too high. When a congestion is detected, by the time it is reported to the sources, the situation may have dramatically worsen up to a point where it is no longer containable forcing to discard packets especially, if for an extended period of time, the traffic is biased to a single or a few output ports (hot spot).