Packet Switch
Packet switch equipment (hereinafter, also switches) transfers packets from their input ports to specified output ports. In a distributed switching arrangement, a switch architecture includes line cards that host the input/output (I/O) ports, and a switch fabric, responsible for transporting packets from ingress line cards (hereinafter, sources) to egress line cards (hereinafter, destinations, dests). A line card is connected to the switch fabric using one or more fabric ports.
A cost-effective switch fabric could be realized using an off-the-shelf packet processor capable of switching packets among its fabric ports and enhanced with congestion control mechanisms to throttle the sources, in order to prevent congestion at the switch fabric.
FIG. 1 illustrates a basic architecture of a distributed switch with N sources. Each line card is subdivided into an ingress block (source) and an egress block (dest). The source block receives packets from I/O ports and sends them to the switch fabric via fabric ports. The packets would typically be appended with a header that comprises forwarding info (e.g., the fabric port to which the switch fabric should send the respective packet), as well as QoS info (e.g., packet priority). The dest block receives packets on fabric ports and sends them to I/O ports. For the sake of simplicity, it shall be assumed that each source is connected to the switch fabric via a single fabric port.
QoS & TM
Advanced switches support quality of service (QoS) for service differentiation. A service could be regarded as a logical flow from one service endpoint to another. The flow could be carried by a traffic engineered MPLS tunnel for QoS support while it is propagated across a provider network.
QoS support requires traffic management (TM) mechanisms, such as buffering for burst absorption, shaping for rate (capacity, bandwidth) limiting and classification (committed/excess traffic), and scheduling for prioritization and bandwidth (BW) fairness. The QoS challenge of a distributed switch is to provide end-to-end (E2E) BW guarantees per service, from the source I/O port to the destination I/O port.
For the sake of simplicity, it shall be assumed that a switch supports two priority grades, a high (H) priority and a low (L) priority. H priority traffic expects improved delay performance (e.g. minimal delay), while L priority traffic can tolerate higher delays. However, traffic of both priorities should be provided with E2E BW guarantees.
The following TM mechanisms may be considered in general:
A virtual output queue (VoQ) holds traffic in memory buffers. A dedicated VoQ per destination avoids the so-called head of line (HoL) blocking that would otherwise occur when one destination is congested while another is not. The size of the VoQ indicates the number of packets that can be stored in it. A VoQ whose traffic is “bursty” by nature would typically require a larger size to absorb traffic bursts effectively. The VoQ buffering is split into a guaranteed portion, typically set according to the committed information rate (“CIR”), and an excess portion, typically set according to the excess information rate (“EIR”).
A shaper limits the rate of traffic that goes to the same Dest-Prio (i.e., the same destination with the same priority). The shaper may be, for example, of a dual-rate type per IETF RFC 2698, which provides two rates: (1) Committed information rate (CIR) is the guaranteed rate per Dest-Prio. Note that the CIR must not be oversubscribed, i.e., the sum of the CIRs of all VoQs must not exceed the outgoing port rate; (2) Peak information rate (“PIR”) is the maximum allowed rate per Dest-Prio. Note that when PIR is larger than CIR, only the rate CIR is guaranteed, while PIR-CIR (a.k.a. excess information rate, EIR) is not, and would be provided only if free resources are available. For example, when some VoQs on a port do not fully utilize their CIR, the unused BW (a.k.a. excess BW) could be allocated to other VoQs on that port.
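The two rate constraints described above can be expressed as simple checks. The following is a minimal sketch only, assuming rates given in Mbit/s; the function names `cir_is_admissible` and `excess_rate` are hypothetical and are used here for illustration.

```python
def cir_is_admissible(cirs_mbps, port_rate_mbps):
    """Return True when the committed rates of all VoQs fit within the port rate."""
    return sum(cirs_mbps) <= port_rate_mbps


def excess_rate(cir_mbps, pir_mbps):
    """Return EIR = PIR - CIR, the non-guaranteed part of the peak rate."""
    if pir_mbps < cir_mbps:
        raise ValueError("PIR must be at least CIR")
    return pir_mbps - cir_mbps
```

For instance, three VoQs with CIRs of 300, 400 and 300 Mbit/s exactly fill a 1000 Mbit/s port, whereas CIRs of 600 and 500 Mbit/s would oversubscribe it.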
A shaper is further configured with two additional parameters: (1) Committed burst size (CBS) is the guaranteed burstiness; (2) Peak burst size (PBS) is the maximum burstiness. A VoQ whose traffic is “bursty” by nature would require larger CBS and PBS values to pass its traffic effectively across the switch fabric.
A shaper may be implemented by using two token buckets, wherein a CIR (PIR) bucket accumulates “tokens” or bits at a rate of CIR (PIR), up to CBS (PBS), respectively.
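By way of illustration, a two-bucket shaper of this kind might be sketched as follows. This is a simplified sketch in the spirit of RFC 2698, not a definitive implementation: the class name `DualRateShaper`, the result labels, the choice to start with full buckets, and the explicit `now` timestamp argument are all assumptions made for the example.

```python
class DualRateShaper:
    """Two token buckets: one filled at CIR up to CBS, one at PIR up to PBS."""

    def __init__(self, cir, cbs, pir, pbs):
        self.cir, self.cbs = cir, cbs      # bits per second, bits
        self.pir, self.pbs = pir, pbs      # bits per second, bits
        self.c_tokens = cbs                # assumption: buckets start full
        self.p_tokens = pbs
        self.last = 0.0                    # time of the previous refill, seconds

    def _refill(self, now):
        elapsed = now - self.last
        self.last = now
        # Each bucket accumulates tokens at its rate, capped at its burst size.
        self.c_tokens = min(self.cbs, self.c_tokens + self.cir * elapsed)
        self.p_tokens = min(self.pbs, self.p_tokens + self.pir * elapsed)

    def classify(self, size_bits, now):
        """Classify a packet as 'committed', 'excess' or 'over_pir'."""
        self._refill(now)
        if self.p_tokens < size_bits:
            return "over_pir"              # exceeds the peak rate, not eligible
        self.p_tokens -= size_bits
        if self.c_tokens < size_bits:
            return "excess"                # within PIR but beyond CIR
        self.c_tokens -= size_bits
        return "committed"                 # within CIR, guaranteed
```

With CIR = 1000 bit/s, CBS = 1000 bits, PIR = 2000 bit/s and PBS = 2000 bits, three back-to-back 1000-bit packets at time zero would be classified as committed, excess and over the peak rate, in that order.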
A scheduler is used to schedule traffic arriving at multiple VoQs, with the following precedence: (1) VoQs with higher priority, which is referred to as “strict priority for H over L priority”; (2) VoQs within their CIR limits, with a configurable so-called committed weighting among multiple such VoQs; (3) VoQs within their PIR limits, with a configurable so-called excess weighting among multiple such VoQs.
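The scheduling precedence can be sketched as a selection function over the VoQs of one port. The sketch below is illustrative only: the `VoQ` fields, the `pick_next` name, and the use of a served-to-weight ratio as the weighting rule are assumptions, and a real scheduler may apply its weights differently.

```python
from dataclasses import dataclass


@dataclass
class VoQ:
    name: str
    priority: str              # "H" or "L"
    backlog: int               # packets waiting
    within_cir: bool           # CIR bucket currently has enough tokens
    within_pir: bool           # PIR bucket currently has enough tokens
    committed_weight: int = 1
    excess_weight: int = 1
    served: int = 0            # packets served so far (hypothetical bookkeeping)


def pick_next(voqs):
    """Return the next VoQ to serve, or None when nothing is eligible."""
    for prio in ("H", "L"):                    # strict priority of H over L
        for committed in (True, False):        # committed traffic before excess
            eligible = [
                q for q in voqs
                if q.priority == prio and q.backlog > 0
                and (q.within_cir if committed else q.within_pir)
            ]
            if eligible:
                # Assumed weighting rule: serve the VoQ that has received
                # the least service relative to its weight.
                def share(q):
                    w = q.committed_weight if committed else q.excess_weight
                    return q.served / w
                return min(eligible, key=share)
    return None
```

In this sketch an H VoQ that is only within its PIR limit is still served before an L VoQ that is within its CIR limit, since priority is evaluated first.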
Source TM
FIG. 2 illustrates a reference TM scheme for an outgoing fabric port at a source. The source maintains a VoQ and a shaper per egress fabric port per priority. The shapers are connected to the fabric port scheduler. A VoQ can connect multiple Service VoQs. Each Service VoQ undergoes optional shaping at the Service Shaper, and is then scheduled by the Service Scheduler into the VoQ for fairness and prioritization.
The VoQ and the shaper are adaptive, so that their configuration may be tuned based on congestion report messages (hereinafter flow control messages, FC messages) that are generated by the switch fabric and broadcast to all sources. An FC message would typically contain an indication/command for each egress fabric VoQ. The FC message rate may be limited to a maximum value, so as to prevent these messages from consuming too much of the system resources. On the other hand, a minimum FC message rate may be maintained, even if there are no congestion state changes.
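The combination of a maximum and a minimum FC message rate can be sketched as a pacing rule. The following is a minimal sketch under stated assumptions: the class name `FcPacer`, the interval parameters (the inverses of the maximum and minimum rates), and the decision to hold a pending state change until the minimum interval has elapsed are hypothetical choices for the example.

```python
class FcPacer:
    """Decide when an FC message may be sent, given rate bounds."""

    def __init__(self, min_interval, max_interval):
        self.min_interval = min_interval   # 1 / maximum FC message rate
        self.max_interval = max_interval   # 1 / minimum FC message rate
        self.last_sent = float("-inf")     # no message sent yet
        self.pending = False               # unreported congestion state change

    def should_send(self, now, state_changed):
        self.pending = self.pending or state_changed
        since = now - self.last_sent
        # Send on a pending change once the maximum rate allows it, or
        # unconditionally when the minimum rate requires a refresh.
        if (self.pending and since >= self.min_interval) or since >= self.max_interval:
            self.last_sent = now
            self.pending = False
            return True
        return False
```

A state change that arrives too soon after the previous message is not lost in this sketch; it is reported at the next opportunity that respects the maximum rate.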
The fabric port scheduler handles traffic as described in the QoS & TM section. That is, it first schedules H VoQs and then L VoQs. Among VoQs of the same priority, it first schedules those that are within their CIR limits (with a configurable weight per VoQ), and then those that are within their PIR limits (with a configurable weight per VoQ).
Switch Fabric TM
FIG. 3 illustrates a reference TM scheme for the switch fabric. The switch fabric maintains two fabric VoQs per outgoing fabric port: one for H priority (H fabric VoQ) and the other for L priority (L fabric VoQ). The two fabric VoQs are connected to the fabric port scheduler, which schedules the H fabric VoQ with strict priority (SP) over the L fabric VoQ. Namely, the L fabric VoQ is allowed to transmit only when no packets are queued at the H fabric VoQ, thus providing smaller delay to H VoQs, as is usually desired, though other scheduling schemes could also be used.
The switch fabric also has two shared buffer pools (H and L fabric pools, for H and L priority, respectively). These pools are maintained per switch fabric, rather than per fabric port. The switch fabric would first try to queue a packet at the appropriate fabric VoQ, which has pre-assigned dedicated (guaranteed) memory buffers. If there is not enough space there, it would try to queue the packet at the pool of the same priority, and if there is not enough space there either, it would discard the packet.
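The three-step admission decision described above can be sketched as follows. This is an illustrative sketch only: the class name `FabricBuffers`, the dictionary layout, and the counting of space in abstract buffer units are assumptions, and the release of buffers on transmission is omitted for brevity.

```python
class FabricBuffers:
    """Dedicated buffers per fabric VoQ plus one shared pool per priority."""

    def __init__(self, dedicated_per_voq, pool_per_priority):
        # {(fabric_port, priority): free dedicated buffers}
        self.dedicated_free = dict(dedicated_per_voq)
        # {priority: free shared buffers}, maintained per switch fabric
        self.pool_free = dict(pool_per_priority)

    def admit(self, port, prio, buffers_needed):
        """Return where the packet is queued: 'dedicated', 'pool' or 'discard'."""
        key = (port, prio)
        if self.dedicated_free.get(key, 0) >= buffers_needed:
            self.dedicated_free[key] -= buffers_needed
            return "dedicated"
        if self.pool_free.get(prio, 0) >= buffers_needed:
            self.pool_free[prio] -= buffers_needed
            return "pool"
        return "discard"
```

For example, with two dedicated H buffers on port 0 and a one-buffer H pool, a two-buffer packet would use the dedicated space, the next one-buffer packet would fall back to the pool, and a further packet would be discarded.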
The FC block monitors the packet buffer consumption of the fabric VoQs and pools, and upon congestion, as indicated by crossing the buffering threshold, would generate and broadcast FC messages to the ingress line cards.
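The threshold monitoring performed by the FC block might be sketched as below. The sketch assumes a single threshold per fabric VoQ or pool and hypothetical names (`congestion_state`, `state_changes`, and the queue labels); the actual thresholds and message contents are not specified here.

```python
def congestion_state(occupancy, thresholds):
    """Map each fabric VoQ or pool to True when its buffer use exceeds its threshold."""
    return {name: occupancy.get(name, 0) > limit for name, limit in thresholds.items()}


def state_changes(previous, current):
    """Return only the entries whose congestion state differs from the previous report."""
    return {name: flag for name, flag in current.items() if previous.get(name) != flag}
```

The changed entries would be the natural content of a broadcast FC message, while an empty result would indicate that no threshold has been crossed since the previous report.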
It should be understood that switch fabrics in general, and particularly those realized using off-the-shelf packet switches, rely on built-in (internal) packet memories in order to achieve high capacity switching. In order to reduce cost and space, these memories are typically extremely small, and accordingly so are the fabric VoQs and pools. As a matter of fact, the switch fabric memory could be three orders of magnitude smaller than the packet memory maintained by a single line card. This requires a highly efficient and accurate congestion control algorithm.
There are a number of known prior art solutions which try to solve similar problems of traffic management at packet switches.
US 2001026551 describes an arrangement and a method for controlling a flow of signals. The flow includes a number of information packets in a communications network, e.g. an ATM network. The arrangement includes a device for separating first traffic signals from second traffic signals. The first traffic signals are signals that have a higher proportion of guaranteed resources, i.e. bandwidth, than the second traffic signals. The first traffic signals are also given a lower priority than the second traffic signals. The first and second traffic signals are handled separately. The feedback arrangement in that solution is a so-called ABR arrangement, which is rate based and assumes loss of packets.
US 2002075883 describes a switch fabric for routing data, which has a switching stage configured between an input stage and an output stage. The input stage forwards the received data to the switching stage, which routes the data to the output stage, which transmits the data towards destinations. In one aspect, at least one input port can be programmably configured to store data in two or more input routing queues that are associated with a single output port, and at least one output port can be programmably configured to receive data from two or more output routing queues that are associated with a single input port. In another aspect, the output stage transmits status information about the output stage to the input stage, which uses the status information to generate bids to request connections through the switching stage. In yet another aspect, the switching stage transmits a grant/rejection signal to the input stage identifying (1) whether each bid is accepted or rejected and, if rejected, (2) a reason for rejecting the bid, and the input stage determines how to react to a rejected bid based on the reason the bid was rejected.
The above-described solution is quite complex, since it requires generating bids, requesting connections and negotiating grants/rejections in order to pass data through the switch.
U.S. Pat. No. 7,133,399 describes a centralized arbitration mechanism wherein a router switch fabric is configured in a consistent fashion. Remotely distributed packet forwarding modules determine which data chunks are ready to go through the optical switch and communicate this information to a central arbiter. Each packet forwarding module has an ingress ASIC containing packet headers in roughly four thousand virtual output queues. Algorithms choose at most two chunk requests per chunk period to be sent to the arbiter, which queues up to roughly 24 requests per output port. Requests are sent through a Banyan network, which models the switch fabric and scales on the order of N log N, where N is the number of router output ports. Therefore a crossbar switch function can be modeled up to the 320 output ports physically in the system, and yet have the central arbiter scale with the number of ports in a much less demanding way. An algorithm grants at most two requests per port in each chunk period and returns the grants to the ingress ASIC. Also for each chunk period the central arbiter communicates the corresponding switch configuration control information to the switch fabric. Still, the above solution requires the arbitration mechanism, the central arbiter, and sending requests to the arbiter for obtaining grants.
U.S. Pat. No. 6,714,517 discloses a packet-switched communication network which provides a guaranteed minimum bandwidth between pairs of Packet Switches, by defining Service Level Agreements (SLAs). An SLA is defined by at least a source identifier, a destination identifier and a minimum data rate, although other information may also be used. Upon arrival at certain networked nodes, packets are classified according to an SLA by reading the source and destination addresses in the packet. Once classified, the packets are placed in a queue and scheduled for transmission. A scheduler ensures that packets are transmitted at the minimum defined data rate for the SLA. The scheduler may use a statistical multiplexing method, such as deficit round robin, or deficit golden ratio. The deficit golden ratio method assures a minimum rate to packets for a particular SLA, but minimizes jitter and delay. Further, the solution implements congestion control that does not require nodes to be entirely turned off in congested conditions. However, the solution is not intended for a switch fabric assembly, as it handles queue congestion caused by only a single source sending to that queue, while a switch fabric is generally required to handle congestion caused by multiple sources.
In summary, none of the above-mentioned prior art solutions achieves the objectives of an efficient and accurate congestion control algorithm as formulated below, simultaneously and cost effectively.