§1.1 Field of the Invention
The present invention concerns communications. In particular, the present invention concerns maintaining packet sequence with load balancing, and avoiding head-of-line (HOL) blocking in large scale switches used in communications networks.
§1.2 Related Art
To keep pace with Internet traffic growth, researchers continually explore transmission and switching technologies. For instance, it has been demonstrated that hundreds of signals can be multiplexed onto a single fiber with a total transmission capacity of over 3 Tbps and that an optical cross-connect system (OXC) can have a total switching capacity of over 2 Pbps. However, the capacity of today's (Year 2003) core Internet Protocol (IP) routers remains at a few hundred Gbps, or a couple of Tbps in the near future.
It still remains a challenge to build a very large IP router with a capacity of tens of Tbps or more. The complexity and cost of building such a large-capacity router are much higher than those of building an OXC. This is because packet switching may require processing (e.g., classification and table lookup), storing, and scheduling packets, and performing buffer management. As the line rate increases, the processing and scheduling time available for each packet is proportionally reduced. Also, as the router capacity increases, the time for resolving output contention becomes more constrained.
Demands on memory and interconnection technologies are especially high when building a large-capacity packet switch. Memory technology very often becomes a bottleneck of a packet switch system. Interconnection technology significantly affects a system's power consumption and cost. As a result, designing a good switch architecture that is both scalable to handle a very large capacity and cost-effective remains a challenge.
The numbers of switch elements and interconnections are often critical to the switch's scalability and cost. Since the number of switch elements of single-stage switches is proportional to the square of the number of switch ports, a single-stage architecture is not attractive for large switches. On the other hand, multi-stage switch architectures, such as a Clos network type switch, are more scalable, require fewer switch elements and interconnections, and are therefore more cost-effective.
FIG. 1 shows a core router (CR) architecture 100 which includes line cards 110, 120, a switch fabric 130, and a route controller (not shown) for executing routing protocols, maintenance, etc. The router 100 has up to N ports and each port has one line card. (Note though that some switches have ports that multiplex traffic from multiple input line cards at the ingress and de-multiplex the traffic from the switch fabric to multiple line cards at the egress.) A switch fabric 130 usually includes multiple switch planes 140 (e.g., up to p in the example of FIG. 1) to accommodate high-speed ports.
A line card 110, 120 usually includes ingress and/or egress functions and may include one or more of a transponder (TP) 112, 122, a framer (FR) 114, 124, a network processor (NP) 116, 126, and a traffic manager (TM) 118, 128. A TP 112, 122 may be used to perform optical-to-electrical signal conversion and serial-to-parallel conversion at the ingress side. At the egress side, it 112, 122 may be used to perform parallel-to-serial conversion and electrical-to-optical signal conversion. An FR 114, 124 may be used to perform synchronization, frame overhead processing, and cell or packet delineation. An NP 116, 126 may be used to perform forwarding table lookup and packet classification. Finally, a TM 118, 128 may be used to store packets and perform buffer management, packet scheduling, and any other functions performed by the router architecture (e.g., distribution of cells or packets in a switching fabric with multiple planes).
Switch fabric 130 may be used to deliver packets from an input port to a single output port for unicast traffic, and to multiple output ports for multicast traffic.
When a packet arrives at CR 100, CR 100 determines the outgoing line on which the packet is to be transmitted. Variable length packets may be segmented into fixed-length data units, called “cells” without loss of generality, when entering CR 100. The cells may be re-assembled into packets before they leave CR 100. Packet segmentation and reassembly are usually performed by NP 116, 126 and/or TM 118, 128.
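The segmentation and reassembly described above can be sketched as follows. This is only an illustrative sketch: the 64-byte cell payload, the zero padding of the last cell, and the function names are assumptions that do not come from the text, which does not specify a cell format.

```python
CELL_PAYLOAD_BYTES = 64  # assumed cell payload size; not given in the text


def segment(packet: bytes, cell_size: int = CELL_PAYLOAD_BYTES) -> list:
    """Split a variable-length packet into fixed-length cells.

    The last cell is zero-padded so that every cell has the same length.
    """
    cells = []
    for offset in range(0, len(packet), cell_size):
        chunk = packet[offset:offset + cell_size]
        cells.append(chunk.ljust(cell_size, b"\x00"))
    return cells


def reassemble(cells: list, packet_length: int) -> bytes:
    """Concatenate cells in order and strip the padding.

    The original packet length is assumed to be carried alongside the cells.
    """
    return b"".join(cells)[:packet_length]


packet = bytes(range(150))            # a 150-byte example packet
cells = segment(packet)               # three 64-byte cells
restored = reassemble(cells, len(packet))
```

Reassembly in this sketch relies on the cells of a packet arriving in order, which is why cell sequence matters in the discussion that follows.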
FIG. 2 illustrates a multi-plane multi-stage packet switch architecture 200. The switch fabric 230 may include p switch planes 240. In this exemplary architecture 200, each plane 240 is a three-stage Benes network. Modules in the first, second, and third stages are denoted as Input Module (IM) 242, Center Module (CM) 244, and Output Module (OM) 246, respectively. IM 242, CM 244, and OM 246 have many common features and may be referred to generally as a Switch Module (SM).
Traffic enters the switch 200 via an ingress traffic manager (TMI) 210 and leaves the switch 200 via an egress traffic manager (TME) 220. The TMI 210 and TME 220 can be integrated on a single chip. Therefore, the number of TM chips may be the same as the number of ports (denoted as N) in the system 200. Cells passing through the switch 200 via different paths may experience different queuing delays. These different delays may result in cells arriving at a TME 220 out of sequence. However, if packets belonging to the same flow traverse the switch via the same path (i.e., the same switch plane and the same CM) until they have all left the switch fabric, there should be no cell out-of-sequence problem. FIG. 2 illustrates multiple paths between TMI(0) 210a and TME(0) 220a. The TMI 210 may determine the path ID (PID) of each flow using a flow ID (FID). The PID may correspond to a switch fabric plane 240 number and a CM 244 number in the plane 240.
In the embodiment 200 illustrated in FIG. 2, the first stage of a switch plane 240 includes k IMs 242, each of which has n inputs and m outputs. The second stage includes m CMs 244, each of which has k inputs and k outputs. The third stage includes k OMs 246, each of which has m inputs and n outputs. If n, m, and k are equal to each other, the three modules 242, 244, 246 may have identical structures.
From the TMI 210 to the TME 220, a cell traverses four internal links: (i) a first link from a TMI 210 to an IM 242; (ii) a second link from the IM 242 to a CM 244; (iii) a third link from the CM 244 to an OM 246; and (iv) a fourth link from the OM 246 to a TME 220.
In such a switch 200, as well as other switches, a number of issues may need to be considered. Such issues may include maintaining packet sequence, load balancing and HOL blocking. Section 1.2.1 discusses packet out-of-sequence and load balancing problems. Section 1.2.2 discusses the problem of HOL blocking.
§1.2.1 Packet Out-Of-Sequence and Load Balancing
A switch fabric cross-connects packets from an input port (i.e., packet arriving port) to an output port (i.e., packet departing port) at very high speed (e.g., a new configuration every 200 nsec). One requirement, or at least an important feature, of a switch fabric is that packets belonging to the same flow be delivered in order. A flow refers to a virtual connection from a source end system to a destination end system. In other words, a flow is a stream of data traveling across a network between two endpoints. An example of a flow is a stream of packets traveling between two computers that have established a TCP connection. If packets belonging to the same flow are not delivered in order through the switch fabric, the switch fabric is said to have a packet out-of-sequence problem. Although some applications may be tolerant of packet out-of-sequence problems, it is desirable to avoid such problems.
A switch fabric can be classified as one of (a) a single-path switch fabric, or (b) a multi-path switch fabric. A single-path switch fabric has only one path for a given input port-output port pair. A single-path switch fabric avoids packet out-of-sequence problems because all packets of a flow arriving at a given input port take the same path through the switch. However, a single-path switch fabric may not be scalable enough to meet the increasing demands of Internet traffic.
A multi-path switch fabric, such as the one 230 illustrated in FIG. 2 for example, has more than one path for an input port-output port pair. A multi-path switch fabric can be further classified as either (a) a memory-less switch fabric, or (b) a buffered switch fabric. A memory-less switch fabric does not store packets in the switch fabric. Therefore, a memory-less multi-path switch fabric should have no packet out-of-sequence problems, or at least no severe packet out-of-sequence problems, because the propagation delays through different paths are comparable. However, for a switch fabric to be memory-less, the potential for contention among packets destined for the same output link must be resolved before the packet enters the switch fabric. Unfortunately, this could be a complicated process.
A buffered multi-path switch fabric may have a packet out-of-sequence problem because packets sent to different paths may experience different queuing delays due to output link contention. Two known techniques for solving this problem, as well as shortcomings of these known techniques, are introduced in §1.2.1.1 below.
§1.2.1.1 Previous Approaches to Solve Packet Out-Of-Sequence Problems in Buffered Multi-Path Switch Fabrics, and Limitations of Such Approaches
Two methods have been proposed to solve the packet out-of-sequence problem in the buffered multi-path switch fabric. The first method re-sequences packets at the output port. The packet re-sequencing may require several conditions. First, each packet should carry a sequence number. One exemplary sequence number is a time-stamp based on the arrival time of the packet at the input port. If the sequence number is large, the overhead ratio (of sequence number size to cell or packet size) can be too big to be practical. A high overhead ratio can cause increased implementation costs, performance degradation due to reduced internal speedup, or both. Second, the degree of packet out-of-sequence should be bounded to ensure successful re-sequencing. Since Internet traffic is very complicated, it is difficult to estimate the degree of packet out-of-sequence that will occur. Even when the degree of packet out-of-sequence is bounded, implementing the re-sequencing circuits increases costs.
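A minimal sketch of output-port re-sequencing, under stated assumptions, is given below. It assumes each packet of a flow carries a consecutive integer sequence number starting at 0 (the text mentions arrival time-stamps as one option; a counter is used here only for simplicity), and the class name, method name, and buffer bound are hypothetical. The bound on buffered packets reflects the condition above that the degree of out-of-sequence must be bounded.

```python
class Resequencer:
    """Holds early packets of one flow until the missing ones arrive."""

    def __init__(self, max_buffered: int):
        self.max_buffered = max_buffered  # assumed bound on out-of-sequence degree
        self.next_seq = 0                 # next sequence number to release
        self.pending = {}                 # seq -> payload, for packets that arrived early

    def receive(self, seq: int, payload):
        """Accept one packet; return the payloads that can now leave in order."""
        self.pending[seq] = payload
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if len(self.pending) > self.max_buffered:
            raise OverflowError("out-of-sequence degree exceeds the buffer bound")
        return released


resequencer = Resequencer(max_buffered=4)
delivered = []
# Packets arrive out of order because they took different paths.
for seq, payload in [(1, "b"), (0, "a"), (3, "d"), (2, "c")]:
    delivered.extend(resequencer.receive(seq, payload))
```

Even in this small form, the sketch shows the two costs named above: every packet needs a sequence field, and the output port needs storage sized to the worst-case reordering.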
The second method to solve the packet out-of-sequence problem is to send all packets belonging to the same flow over the same path. This emulates a single-path switch fabric for a given flow, thus avoiding packet out-of-sequence problems altogether. This idea is attractive in the sense that the packet out-of-sequence problem only matters for packets belonging to the same flow. This scheme is referred to as “static hashing.” Static hashing advantageously eliminates the re-sequencing buffer at the output port. Since packets belonging to the same flow take the same path in the multi-path switch fabric, they will arrive at the output port in the proper sequence.
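One way static hashing might be realized is sketched below. The choice of CRC-32 as the hash, the byte-string flow identifier, and the function name are assumptions for illustration only; the text does not specify a hash function. A path is taken to be a (switch plane, CM) pair, as in the architecture of FIG. 2.

```python
import zlib


def path_for_flow(flow_id: bytes, num_planes: int, num_cms: int):
    """Map a flow ID to a fixed (plane, CM) pair.

    The mapping depends only on the flow ID, so every packet of a flow
    takes the same path. CRC-32 is an assumed hash, not one from the text.
    """
    path_id = zlib.crc32(flow_id) % (num_planes * num_cms)
    plane, cm = divmod(path_id, num_cms)
    return plane, cm


# Hypothetical flow identifier built from addresses and ports.
flow = b"10.0.0.1:5000->10.0.0.2:80/tcp"
first = path_for_flow(flow, num_planes=8, num_cms=64)
second = path_for_flow(flow, num_planes=8, num_cms=64)  # same flow, same path
```

Because the mapping ignores how much traffic each flow carries, nothing in it prevents several heavy flows from landing on the same path.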
Note that re-sequencing is different from re-assembly. Re-sequencing is a term used to describe an operation to correct the situation when packets belonging to the same flow arrive at the output port out-of-sequence. Re-assembly is a term used to describe reconstituting packets when the packets are segmented into cells and are interleaved in the switch fabric. For purposes of this discussion, it is assumed that packets are not interleaved in the switch fabric. In other words, all cells belonging to the same packet will be sent back-to-back, without any intervening cells. Therefore, with static hashing, the output port has no re-sequencing buffer, nor does it have a re-assembly buffer.
One problem of the static hashing scheme is the potential for load imbalance. Since each flow may have different bandwidth, it is possible that one path will be more congested than another path, or other paths. This may complicate choosing proper paths to route packets from an input port to an output port. If paths are not properly chosen, the probability of congesting one path increases, adversely impacting switch performance.
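The imbalance can be illustrated with a small numeric sketch. The flow names, rates, and the flow-to-path assignment below are invented for illustration and are not taken from the text; they only show that pinning flows to paths can leave one path far more loaded than another even when each path carries the same number of flows.

```python
def per_path_load(flow_rates: dict, flow_to_path: dict, num_paths: int) -> list:
    """Sum the offered rate of the flows pinned to each path."""
    load = [0.0] * num_paths
    for flow, rate in flow_rates.items():
        load[flow_to_path[flow]] += rate
    return load


# Hypothetical flows and rates (arbitrary units): one heavy flow, three light ones.
flow_rates = {"f0": 9.0, "f1": 1.0, "f2": 1.0, "f3": 1.0}
# Assumed static assignment: two flows on each of two paths.
flow_to_path = {"f0": 0, "f1": 1, "f2": 0, "f3": 1}

loads = per_path_load(flow_rates, flow_to_path, num_paths=2)
```

Here path 0 carries 10.0 units and path 1 carries 2.0 units, although each path serves two flows.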
To summarize, since a multi-path buffered packet switch has multiple paths from an input port to an output port, there can be packet out-of-sequence problems. If packets of the same flow take the same path (as in static hashing), the packet order is maintained. However, the load on each path might not be balanced. On the other hand, if packets of the same flow take different paths, there can be an out-of-sequence problem between packets. One way to overcome this problem is to have a re-sequencing buffer at the egress line card. However, adding re-sequencing functionality adds costs, and in a large system, the degree of out-of-sequence could be too large to re-sequence. In view of the foregoing, improved techniques for maintaining packet sequence in switches are desired.
§1.2.2 HOL Blocking
If one queue contains cells with different destinations, there can be a head-of-line (HOL) blocking problem. That is, an HOL cell losing arbitration for a contested output port can block cells behind it, even if those cells are destined for an idle (uncontested) output port.
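The effect can be shown with a toy sketch. Cells are represented only by their destination port number, and the function names and the "busy output" set are assumptions made for illustration. A single first-in first-out queue is compared with separate queues per destination.

```python
from collections import deque


def fifo_send(queue: deque, busy_outputs: set):
    """Single FIFO: only the head cell is eligible to be sent."""
    if queue and queue[0] not in busy_outputs:
        return queue.popleft()
    return None  # head cell is blocked, so nothing behind it can move


def per_destination_send(queues: dict, busy_outputs: set):
    """One queue per destination: any non-busy destination's head cell may go."""
    for dest, queue in queues.items():
        if queue and dest not in busy_outputs:
            return queue.popleft()
    return None


busy = {0}  # output 0 is contested; output 1 is idle

# One shared queue: a cell for output 0 sits ahead of a cell for output 1.
shared = deque([0, 1])
fifo_result = fifo_send(shared, busy)

# The same two cells, held in separate per-destination queues.
separate = {0: deque([0]), 1: deque([1])}
separate_result = per_destination_send(separate, busy)
```

In the shared queue nothing is sent even though output 1 is idle, while with separate queues the cell for output 1 is sent.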
§1.2.2.1 Previous Approaches to Solve HOL Blocking and Limitations of Such Approaches
The following example focuses on packets at an input port of a multi-plane multi-stage switch fabric, such as that 200 of FIG. 2, and serves to illustrate the limits of using general queues, virtual path queues (VPQs) and virtual output queues (VOQs) to eliminate the possibility of HOL blocking. Packets arriving at the switch can be destined for any output port, can have any class of service, and can be routed through any path in the switch fabric. Therefore, the number of queues required to ensure that HOL blocking is completely eliminated is equal to the switch size (i.e., the number of output ports), multiplied by the number of scheduling priorities (i.e., the number of different classes of service supported by the switch), and further multiplied by the number of possible paths for an {input port, output port} pair. In this example, the number of queues required would be q*p*m*n*k, where q is the number of scheduling priorities, n*k is the number of output ports, and p*m is the number of paths (one per switch plane and CM) for an {input port, output port} pair.
Unfortunately, if a switch fabric has a large number of paths for an {input port, output port} pair, the number of required queues at the input port may be too large to be practical. Recall that in the multi-plane multi-stage switch fabric shown in FIG. 2, the number of queues necessary to completely eliminate the possibility of HOL blocking is p*q*n*k*m. Thus, for example, if p=8, q=2, n=m=k=64, then the required number of queues becomes 4,194,304 (about 4 million), which may be too large for a practical implementation.
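The arithmetic of this example can be checked with a few lines. The variable names are chosen here for illustration. For comparison, the sketch also computes the much smaller counts obtained when queues are kept only per output port and priority, or only per path and priority.

```python
p = 8            # switch planes
q = 2            # scheduling priorities
n = m = k = 64   # module dimensions from the example

num_output_ports = n * k     # 4096 output ports
paths_per_pair = p * m       # 512 paths: one per (plane, CM) choice

# One queue per (priority, output port, path): removes HOL blocking entirely.
full_queue_count = q * num_output_ports * paths_per_pair

# One queue per (priority, output port) only.
per_output_queue_count = q * num_output_ports

# One queue per (priority, path) only.
per_path_queue_count = q * paths_per_pair

print(full_queue_count, per_output_queue_count, per_path_queue_count)
```

The full structure needs 4,194,304 queues per input port, against 8,192 and 1,024 for the two reduced structures, which is why the reduced structures are attractive despite their blocking behavior.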
If it is assumed that the input port has only queues corresponding to the output ports and the scheduling priority (i.e., if it is assumed that the input port has a virtual output queue (VOQ) structure), packets routed to different paths can be stored at the same VOQ. Therefore, if the HOL packet is routed to a congested path, the HOL packet will block the packets behind it and prevent them from entering the switch fabric. Consequently, packets routed to another path that is idle can be blocked by the HOL packet routed to the congested path. This HOL blocking degrades the system throughput.
If, on the other hand, it is assumed that the input port has only queues corresponding to the path and the scheduling priority (i.e., if it is assumed that the input port has a virtual path queue (VPQ) structure), packets destined for different output ports can be stored at the same VPQ. If the HOL packet is destined for a “hot spot” output port and the HOL packet loses a contention, the HOL packet will block the packets behind it and prevent them from entering the switch fabric. Consequently, packets destined for other ports that are idle can be blocked by the HOL packet destined for the hot spot output port. This HOL blocking degrades system throughput.
In view of the foregoing, improved techniques for avoiding HOL blocking that do not require too many queues are needed.