§ 1.1 Field of the Invention
The present invention concerns communications. In particular, the present invention concerns large scale switches used in communications networks.
§ 1.2 Background Information
To keep pace with Internet traffic growth, researchers continually explore transmission and switching technologies. For instance, it has been demonstrated that hundreds of signals can be multiplexed onto a single fiber with a total transmission capacity of over 3 Tbps and an optical cross-connect system (OXC) can have a total switching capacity of over 2 Pbps. However, the capacity of today's core Internet Protocol (IP) routers remains at a few hundred Gbps, and is expected to reach only a couple of Tbps in the near future.
It still remains a challenge to build a very large IP router with a capacity of tens of Tbps or more. The complexity and cost of building such a large-capacity router are much higher than those of building an OXC. This is because packet switching may require processing (e.g., classification and table lookup), storing, and scheduling packets, and performing buffer management. As the line rate increases, the processing and scheduling time available for each packet is proportionally reduced. Also, as the router capacity increases, the time for resolving output contention becomes more constrained.
Demands on memory and interconnection technologies are especially high when building a large-capacity packet switch. Memory technology very often becomes a bottleneck of a packet switch system. Interconnection technology significantly affects a system's power consumption and cost. As a result, designing a good switch architecture that is both cost-effective and scalable to have a very large capacity remains a challenge.
The numbers of switch elements and interconnections are often critical to the scalability and cost of a switch fabric. Since the number of switch elements of single-stage switch fabrics is proportional to the square of the number of switch ports, single-stage switch fabric architectures are not attractive for large switches. On the other hand, multi-stage switch architectures, such as a Clos network for example, are more scalable and require fewer switch elements and interconnections, and are therefore more cost-effective.
FIG. 1 shows a core router (CR) architecture 100 which includes line cards 110,120, a switch fabric 130, and a route controller (not shown) for executing routing protocols, maintenance, etc. The router 100 has up to N ports and each port has one line card. (Note though that some switches have ports that multiplex traffic from multiple input line cards at the ingress and de-multiplex the traffic from the switch fabric to multiple line cards at the egress.) A switch fabric 130 usually includes multiple switch planes 140 (e.g., up to p) to accommodate high-speed ports.
A line card 110,120 usually includes ingress and/or egress functions and may include one or more of a transponder (TP) 112,122, a framer (FR) 114,124, a network processor (NP) 116,126, and a traffic manager (TM) 118,128. A TP 112,122 may be used, for example, to perform optical-to-electrical signal conversion and serial-to-parallel conversion at the ingress side. At the egress side, the TP 112,122 may be used, for example, to perform parallel-to-serial conversion and electrical-to-optical signal conversion. An FR 114,124 may be used, for example, to perform synchronization, frame overhead processing, and cell or packet delineation. An NP 116,126 may be used, for example, to perform forwarding table lookup and packet classification. Finally, a TM 118,128 may be used, for example, to store packets and perform buffer management, packet scheduling, and any other functions performed by the router architecture (e.g., distribution of cells or packets in a switching fabric with multiple planes).
Switch fabric 130 may be used to deliver packets from an input port to a single output port for unicast traffic, and to multiple output ports for multicast traffic.
When a packet arrives at CR 100, the CR 100 determines an outgoing line to which the packet is to be transmitted. Variable-length packets may be segmented into fixed-length data units, called “cells” without loss of generality, when entering CR 100. The cells may be reassembled into packets before they leave CR 100. Packet segmentation and reassembly are usually performed by NP 116,126 and/or TM 118,128.
FIG. 2 illustrates a multi-plane multi-stage packet switch architecture 200. The switch fabric 230 may include p switch planes 240. In this exemplary architecture 200, each plane 240 is a three-stage Benes network. Modules in the first, second, and third stages are denoted as Input Module (IM) 242, Center Module (CM) 244, and Output Module (OM) 246. IM 242, CM 244, and OM 246 often have many common features and may be referred to generally as a Switch Module (SM).
Traffic enters the switch 200 via an ingress traffic manager (TMI) 210 and leaves the switch 200 via an egress traffic manager (TME) 220. The TMI 210 and TME 220 can be integrated on a single chip. Therefore, the number of TM chips may be the same as the number of ports (denoted as N) in the system 200. Cells passing through the switch 200 via different paths may experience different queuing delays. However, if packets belonging to the same flow traverse the switch via the same path (i.e., the same switch plane and the same CM) until they have all left the switch fabric, there should be no packet out-of-sequence problem. FIG. 2 illustrates multiple paths between TMI(0) 210a and TME(0) 220a. The TMI 210 may determine the path ID (PID) of each flow using its flow ID (FID). The PID may correspond to a switch fabric plane 240 number and a CM 244 number in the plane 240.
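The mapping from a flow ID to a path ID can be illustrated with a minimal sketch. The text does not specify how the TMI derives the PID from the FID, so the modulo mapping below is only an assumption, and the function name `path_id` and its parameters are hypothetical. The only property taken from the description is that every cell of a given flow gets the same switch plane and the same CM.

```python
def path_id(fid, p, m):
    """Map a flow ID to a (plane number, CM number) pair.

    Assumption: a plain modulo mapping over the p*m possible paths.
    The source text does not define the actual mapping function.
    """
    pid = fid % (p * m)      # one of the p*m paths between a TMI and a TME
    return divmod(pid, m)    # (plane index, CM index within that plane)


# All cells of the same flow resolve to the same plane and CM,
# so they cannot overtake one another on different paths.
same_flow = {path_id(515, p=8, m=64) for _ in range(4)}
```

Any deterministic function of the FID would preserve cell order within a flow; how evenly the flows spread over the paths depends on the function chosen.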
In the embodiment 200 illustrated in FIG. 2, the first stage of a switch plane 240 includes k IMs 242, each of which has n inputs and m outputs. The second stage includes m CMs 244, each of which has k inputs and k outputs. The third stage includes k OMs 246, each of which has m inputs and n outputs. If n, m, and k are equal to each other, the three modules 242,244,246 may have identical structures.
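The dimensions of one switch plane follow directly from n, m, and k. The sketch below tallies them under the module sizes stated above; the function name `plane_dimensions` and the dictionary keys are hypothetical and chosen only for illustration.

```python
def plane_dimensions(n, m, k):
    """Tally the size of one three-stage switch plane.

    k IMs (n inputs, m outputs), m CMs (k inputs, k outputs),
    k OMs (m inputs, n outputs), as described in the text.
    """
    return {
        "ports": n * k,            # plane inputs (and outputs)
        "modules": k + m + k,      # IMs + CMs + OMs
        "im_to_cm_links": k * m,   # each IM has one link to each CM
        "cm_to_om_links": m * k,   # each CM has one link to each OM
    }


dims = plane_dimensions(n=64, m=64, k=64)
```

With n = m = k = 64, one plane serves 4096 ports with 192 modules and 4096 links between each pair of adjacent stages.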
From the TMI 210 to the TME 220, a cell traverses four internal links: (i) a first link from a TMI 210 to an IM 242; (ii) a second link from the IM 242 to a CM 244; (iii) a third link from the CM 244 to an OM 246; and (iv) a fourth link from the OM 246 to a TME 220.
In such a switch 200, as well as other switches, a number of issues may need to be considered. Such issues may include packet reassembly and deadlock avoidance.
§ 1.2.1 introduces the need for packet reassembly, as well as known packet reassembly techniques and their limitations.
§ 1.2.1 Packet Reassembly
When building a packet switch, it is a common practice to segment each arriving packet into multiple fixed-length cells (e.g., 64 bytes) at the input port, pass them through the switch fabric, and reassemble them back into packets with reassembly queues (RAQs) at the output port.
Cells may be classified into four categories: Beginning of Packet (BOP) cells; End of Packet (EOP) cells; Continue of Packet (COP) cells; and Single Cell Packet (SCP) cells. A BOP cell is the first cell of a packet. An EOP cell is the last cell of a packet. A COP cell is a cell between a BOP cell and an EOP cell. An SCP cell carries an entire packet whose size is equal to or smaller than the cell payload size (e.g., 52 bytes).
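The four cell categories can be illustrated by a minimal segmentation sketch. It assumes the 52-byte example payload size, represents a cell as a (type, payload) pair, and omits the cell header and any padding of the last cell; the function name `segment` is hypothetical.

```python
def segment(packet, payload_size=52):
    """Split a packet into (cell_type, payload) pairs.

    Assumptions: no cell header is modeled and the last cell is
    not padded; only the BOP/COP/EOP/SCP labeling is shown.
    """
    chunks = [packet[i:i + payload_size]
              for i in range(0, len(packet), payload_size)] or [b""]
    if len(chunks) == 1:
        return [("SCP", chunks[0])]        # whole packet fits in one cell
    cells = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            cell_type = "BOP"              # first cell of the packet
        elif i == len(chunks) - 1:
            cell_type = "EOP"              # last cell of the packet
        else:
            cell_type = "COP"              # any cell in between
        cells.append((cell_type, chunk))
    return cells


types_156 = [t for t, _ in segment(bytes(156))]
```

A 156-byte packet yields one BOP, one COP, and one EOP cell, while any packet of 52 bytes or fewer yields a single SCP cell.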
When cells are routed through the switch fabric, if more than one packet is contending for the same output link, and if output port contention arbitration is performed on a per cell basis rather than on a per packet basis, the cells can be interleaved in the switch fabric. Consequently, the output port may receive many partial packets and may need to store the partial packets until the last cell of the packet (i.e., EOP cell) arrives at the output port so that the packet can be reassembled from its constituent cells.
A cell is transferred over a link (such as one of the four internal links listed in § 1.2 above) from a queue at the upstream side to a queue at the downstream side. The term source queue (SQ) is used to denote the queue at the upstream side of a link, and the term destination queue (DQ) is used to denote the queue at the downstream side of a link.
Cells waiting at SQs attached to the same output link compete with each other. In the switch fabric described above, one link can send at most one cell in each cell time slot. If more than one cell is waiting at the SQs associated with the output link, an arbiter associated with the link should choose one of them for transmission in the next time slot and all the other cells have to wait at the SQs until they win the contention (assuming there are still other cells competing for the desired outgoing link).
This section explains scheduling algorithms from the perspective of the output link. Output links of TMI, IM, CM, and OM may have the same scheduling policy. One link has multiple SQs where cells are queued to be transmitted to multiple DQs in the next stage. The challenge is to deliver cells from the SQ to the DQ so that cell sequence integrity is maintained, while also providing high throughput and fairness.
§ 1.2.1.1 Previous Approaches and Their Limitations
FIG. 3 shows one possible scheduling scheme. The SQs are labeled A, B, and C, while the DQs are labeled X, Y, and Z. In this example, SQ(A) stores a three-cell packet destined for DQ(X), SQ(B) stores a two-cell packet destined for DQ(X), and SQ(C) stores a four-cell packet destined for DQ(Z).
As illustrated in FIG. 3, a simple way to send packets from SQ to DQ is to schedule “cells” in a round-robin fashion. The switch fabric can interleave cells without consideration of packet boundaries. That is, regardless of the cell type (i.e., BOP, COP, EOP, or SCP), the switch fabric can interleave cells in a round-robin fashion. This scheme is referred to as the complete cell interleaving (CCI) scheduling scheme. The required number of reassembly queues (RAQs) in the CCI scheduling scheme is equal to the switch size (i.e., the number of input ports) multiplied by the number of scheduling priorities and the number of possible paths for a pair of input port and output port.
FIG. 4 is a flow diagram of an exemplary method 400 that may be used to implement the CCI scheduling scheme. Assume there are 64 SQs in a case with single priority and unicast mode. (If two priorities and both unicast and multicast are supported, the number of SQs becomes 256=64×2×2.) A counter is initialized (e.g., set to 0), and an index is set to a round-robin (RR) pointer. (Block 410) The arbiter scans the 64 SQs beginning from the queue indicated by RR. More specifically, if the SQ is not empty and the DQ of the head-of-line (HOL) cell at the SQ has a (or enough) free space (i.e., eligible?=YES), the HOL cell is sent over the link and the RR pointer of the arbiter is updated to the next SQ. (Blocks 430, 450, 460, 470) If the SQ is empty or its HOL cell is not eligible, the arbiter scans the next SQ until it finds a non-empty SQ with an eligible HOL cell. (Blocks 430, 440 and 450) With CCI, whether or not a HOL cell is eligible may be determined by checking whether or not the destination queue (DQ) has a (or enough) free space. This may be tracked using buffer and queue outstanding cell counters (BOC and QOC), and comparing those counts to switch module (SM) buffer and queue size constants (B_sm and Q_sm) as described in the '733 provisional. In other words, if BOC is less than B_sm and QOC is less than Q_sm, the HOL cell is eligible. Otherwise it is not eligible.
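The round-robin scan of method 400 can be sketched as follows. This is an illustrative model, not the implementation of FIG. 4: the class name `CCIArbiter` and its methods are hypothetical, and the eligibility test is simplified to a single free-space count per DQ instead of the BOC/QOC counters compared against B_sm and Q_sm.

```python
from collections import deque


class CCIArbiter:
    """Per-link arbiter that interleaves cells in round-robin order."""

    def __init__(self, num_sqs, dq_free):
        self.sqs = [deque() for _ in range(num_sqs)]
        self.dq_free = dict(dq_free)   # assumed: free cell slots per DQ
        self.rr = 0                    # round-robin pointer

    def enqueue(self, sq, dq, cell):
        self.sqs[sq].append((dq, cell))

    def select(self):
        """Return (sq, dq, cell) for the next time slot, or None."""
        n = len(self.sqs)
        for count in range(n):                 # scan each SQ at most once
            idx = (self.rr + count) % n
            queue = self.sqs[idx]
            # Eligible: SQ not empty and the HOL cell's DQ has free space.
            if queue and self.dq_free[queue[0][0]] > 0:
                dq, cell = queue.popleft()
                self.dq_free[dq] -= 1
                self.rr = (idx + 1) % n        # move pointer to the next SQ
                return idx, dq, cell
        return None


# FIG. 3 example: SQ(A)=0, SQ(B)=1, SQ(C)=2 with ample DQ space.
arb = CCIArbiter(3, {"X": 8, "Y": 8, "Z": 8})
for sq, dq, cells in [(0, "X", ["A1", "A2", "A3"]),
                      (1, "X", ["B1", "B2"]),
                      (2, "Z", ["C1", "C2", "C3", "C4"])]:
    for cell in cells:
        arb.enqueue(sq, dq, cell)

order = []
while (pick := arb.select()) is not None:
    order.append(pick[2])
```

For the three packets of the FIG. 3 example, the cells leave in the order A1, B1, C1, A2, B2, C2, A3, C3, C4, so the two packets destined for DQ(X) arrive interleaved and must be separated again during reassembly.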
If there is only one path for an input port-output port pair, the required number of RAQs is equal to the number of input ports multiplied by the number of scheduling priorities. Therefore, a virtual input queue (VIQ) can be used to reassemble the packet. This VIQ approach is adopted in many multi-plane single-stage switch fabrics, where the cells of a packet can be striped among the multiple planes.
The CCI scheduling scheme has a major drawback in that the number of reassembly queues (RAQs) can become very large in certain switch fabrics. Since cells are interleaved without any consideration of packet boundaries, when they arrive at the TME, they should be separated per packet. To ensure proper packet reassembly, the TME must have as many RAQs as the number of TMIs (i.e., N=n*k) multiplied by the number of scheduling priorities (i.e., q) and the number of possible paths between TMI and TME (i.e., p*m). For a multi-plane multi-stage switch such as the one illustrated in FIG. 2, the number of possible paths between an input-output pair is equal to the number of switch planes (i.e., p) multiplied by the number of center-stage switch modules (e.g., m in a Clos-network switch). Therefore, to ensure packet reassembly, the required number of RAQs in the CCI scheduling scheme is p*q*n*k*m. For example, if p=8, q=2, n=m=k=64, then the required number of RAQs becomes 4,194,304 (about 4 million) queues, which is too large to be feasible.
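The arithmetic behind the example can be checked with a short sketch. The function names `raq_count_cci` and `raq_count_single_path` are hypothetical; the formulas are the ones stated in the text.

```python
def raq_count_cci(p, q, n, k, m):
    """RAQs needed per TME under CCI: TMIs x priorities x paths."""
    num_tmis = n * k      # N = n*k input ports
    num_paths = p * m     # switch planes x center modules per plane
    return num_tmis * q * num_paths


def raq_count_single_path(q, n, k):
    """RAQs needed per TME when each input-output pair has one path."""
    return n * k * q


cci_raqs = raq_count_cci(p=8, q=2, n=64, k=64, m=64)
single_path_raqs = raq_count_single_path(q=2, n=64, k=64)
```

With p=8, q=2, and n=m=k=64, `cci_raqs` evaluates to 4,194,304, compared with 8,192 for the single-path case, a factor of p*m = 512.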
As can be appreciated from the foregoing, although the CCI scheduling scheme has the best load-balancing among the possible paths and minimum cell transmission delays through the switch fabric (i.e., IM, CM, and OM), it may require too many queues at the TME to reassemble packets in large multi-plane, multi-stage switch fabrics.
In view of the foregoing, better packet scheduling and reassembly schemes are needed, particularly for large scale devices with multiple-stage, multiple switch plane switch fabrics. In any such scheme, deadlock situations should be avoided.