§1.1 Field of the Invention
The present invention concerns communications. In particular, the present invention concerns large scale switches used in communications networks.
§1.2 Background Information
To keep pace with Internet traffic growth, researchers continually explore transmission and switching technologies. For instance, it has been demonstrated that hundreds of signals can be multiplexed onto a single fiber with a total transmission capacity of over 3 Tbps and that an optical cross-connect system (OXC) can have a total switching capacity of over 2 Pbps. However, the capacity of today's core Internet Protocol (IP) routers remains at a few hundred Gbps, or a couple of Tbps in the near future.
It remains a challenge to build a very large IP router with a capacity of tens of Tbps or more. The complexity and cost of building such a large-capacity router are much higher than those of building an OXC. This is because packet switching may require processing (e.g., classification and table lookup), storing, and scheduling packets, and performing buffer management. As the line rate increases, the processing and scheduling time available for each packet is proportionally reduced. Also, as the router capacity increases, the time for resolving output contention becomes more constrained.
Demands on memory and interconnection technologies are especially high when building a large-capacity packet switch. Memory technology very often becomes a bottleneck of a packet switch system. Interconnection technology significantly affects a system's power consumption and cost. As a result, designing a good switch architecture that is both scalable, to handle a very large capacity, and cost-effective remains a challenge.
The numbers of switch elements and interconnections are often critical to the scalability and cost of a switch fabric. Since the number of switch elements of single-stage switch fabrics is proportional to the square of the number of switch ports, single-stage switch fabric architectures are not attractive for large switches. On the other hand, multi-stage switch architectures, such as a Clos network for example, are more scalable and require fewer switch elements and interconnections, and are therefore more cost-effective.
FIG. 1 shows a core router (CR) architecture 100 which includes line cards 110,120, a switch fabric 130, and a route controller (not shown) for executing routing protocols, maintenance, etc. The router 100 has up to N ports and each port has one line card. (Note though that some switches have ports that multiplex traffic from multiple input line cards at the ingress and de-multiplex the traffic from the switch fabric to multiple line cards at the egress.) A switch fabric 130 usually includes multiple switch planes 140 (e.g., up to p) to accommodate high-speed ports.
A line card 110,120 usually includes ingress and/or egress functions and may include one or more of a transponder (TP) 112,122, a framer (FR) 114,124, a network processor (NP) 116,126, and a traffic manager (TM) 118,128. A TP 112,122 may be used, for example, to perform optical-to-electrical signal conversion and serial-to-parallel conversion at the ingress side. At the egress side, the TP 112,122 may be used, for example, to perform parallel-to-serial conversion and electrical-to-optical signal conversion. An FR 114,124 may be used, for example, to perform synchronization, frame overhead processing, and cell or packet delineation. An NP 116,126 may be used, for example, to perform forwarding table lookup and packet classification. Finally, a TM 118,128 may be used, for example, to store packets and perform buffer management, packet scheduling, and any other functions performed by the router architecture (e.g., distribution of cells or packets in a switching fabric with multiple planes).
Switch fabric 130 may be used to deliver packets from an input port to a single output port for unicast traffic, and to multiple output ports for multicast traffic.
When a packet arrives at CR 100, the CR 100 determines an outgoing line to which the packet is to be transmitted. Variable-length packets may be segmented into fixed-length data units, called “cells” without loss of generality, when entering CR 100. The cells may be reassembled into packets before they leave CR 100. Packet segmentation and reassembly are usually performed by NP 116,126 and/or TM 118,128.
FIG. 2 illustrates a multi-plane multi-stage packet switch architecture 200. The switch fabric 230 may include p switch planes 240. In this exemplary architecture 200, each plane 240 is a three-stage Benes network. Modules in the first, second, and third stages are denoted as Input Module (IM) 242, Center Module (CM) 244, and Output Module (OM) 246. IM 242, CM 244, and OM 246 often have many common features and may be referred to generally as a Switch Module (SM).
Traffic enters the switch 200 via an ingress traffic manager (TMI) 210 and leaves the switch 200 via an egress traffic manager (TME) 220. The TMI 210 and TME 220 can be integrated on a single chip. Therefore, the number of TM chips may be the same as the number of ports (denoted as N) in the system 200. Cells passing through the switch 200 via different paths may experience different queuing delays. However, if packets belonging to the same flow traverse the switch via the same path (i.e., the same switch plane and the same CM) until they have all left the switch fabric, there should be no cell out-of-sequence problem. FIG. 2 illustrates multiple paths between TMI(0) 210a and TME(0) 220a. The TMI 210 may determine the path ID (PID) of each flow using its flow ID (FID). The PID may correspond to a switch fabric plane 240 number and a CM 244 number in the plane 240.
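The FID-to-PID mapping described above can be illustrated with a minimal sketch. The modulo mapping, the function name path_for_flow, and the PathId type are assumptions made for illustration only; the text above states only that the TMI may determine the PID from the FID and that the PID may correspond to a plane number and a CM number.

```python
from typing import NamedTuple


class PathId(NamedTuple):
    """Hypothetical PID representation: a switch plane and a CM within it."""
    plane: int
    cm: int


def path_for_flow(fid: int, num_planes: int, cms_per_plane: int) -> PathId:
    """Map a flow ID to a fixed (plane, CM) pair.

    Assumption: a simple modulo over all p*m paths. Any deterministic
    mapping would keep the cells of one flow on one path.
    """
    if num_planes <= 0 or cms_per_plane <= 0:
        raise ValueError("num_planes and cms_per_plane must be positive")
    pid = fid % (num_planes * cms_per_plane)
    plane, cm = divmod(pid, cms_per_plane)
    return PathId(plane, cm)
```

Because the mapping is deterministic, every cell of a given flow follows the same plane and the same CM, which is the condition stated above for avoiding out-of-sequence cells.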
In the embodiment 200 illustrated in FIG. 2, the first stage of a switch plane 240 includes k IMs 242, each of which has n inputs and m outputs. The second stage includes m CMs 244, each of which has k inputs and k outputs. The third stage includes k OMs 246, each of which has m inputs and n outputs. If n, m, and k are equal to each other, the three modules 242,244,246 may have identical structures.
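As a rough illustration of why such a three-stage plane scales better than a single-stage fabric, the following sketch counts crosspoints under the assumption that every IM, CM, and OM is itself a crossbar. That assumption, and the function names, are illustrative and are not taken from the description above.

```python
def three_stage_crosspoints(n: int, m: int, k: int) -> int:
    """Crosspoints in one plane, assuming each module is a crossbar."""
    im_total = k * (n * m)   # k IMs, each n inputs x m outputs
    cm_total = m * (k * k)   # m CMs, each k inputs x k outputs
    om_total = k * (m * n)   # k OMs, each m inputs x n outputs
    return im_total + cm_total + om_total


def single_stage_crosspoints(num_ports: int) -> int:
    """Crosspoints in a single-stage crossbar with num_ports ports."""
    return num_ports * num_ports
```

With n = m = k = 64 (4096 ports per plane), the three-stage count is 786,432 crosspoints, compared with 16,777,216 for a single 4096-port crossbar.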
From the TMI 210 to the TME 220, a cell traverses four internal links: (i) a first link from a TMI 210 to an IM 242; (ii) a second link from the IM 242 to a CM 244; (iii) a third link from the CM 244 to an OM 246; and (iv) a fourth link from the OM 246 to a TME 220.
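The four links above can be enumerated with a short sketch. It assumes that ports are numbered consecutively and that port i attaches to IM (or OM) number i // n; this numbering and the function name cell_links are hypothetical and are used only to make the example concrete.

```python
def cell_links(src_port: int, dst_port: int, cm: int,
               n: int, m: int, k: int) -> list[tuple[str, str]]:
    """Return the four internal links a unicast cell traverses in one plane.

    Assumption: port i is served by IM/OM number i // n.
    """
    num_ports = n * k
    if not (0 <= src_port < num_ports and 0 <= dst_port < num_ports):
        raise ValueError("port out of range")
    if not 0 <= cm < m:
        raise ValueError("CM index out of range")
    im = src_port // n   # first-stage module serving the source port
    om = dst_port // n   # third-stage module serving the destination port
    return [
        (f"TMI({src_port})", f"IM({im})"),
        (f"IM({im})", f"CM({cm})"),
        (f"CM({cm})", f"OM({om})"),
        (f"OM({om})", f"TME({dst_port})"),
    ]
```

For a given source and destination, only the CM index (and the plane) is free to vary, so the choice of CM is what distinguishes the multiple paths between a TMI and a TME.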
In such a switch 200, as well as other switches, a number of issues may need to be considered. Such issues may include supporting multicast. Section 1.2.1 introduces the need for multicasting.
§1.2.1 Cell and Flow Level Multicasting
Multicasting may involve sending a packet from one point (or multiple points) to multiple points. In the context of a switch or router, multicasting may involve sending a packet or cell from one input port to multiple output ports.
Traditionally a multicast function has been implemented using a multicast bitmap in the cell header (i.e., at the cell level) or using a multicast table in the switch fabric (i.e., at the flow level). However, these two approaches do not work well in some large systems as explained below.
Implementing multicasting at the cell level does not work well in some large systems because the required bitmap may be too large to carry in the cell header. For example, if the number of ports is 4096 and multicasting is performed in two stages, the bitmap size should be 128 bits (64 bits in each stage). Thus, in a 40-Tb/s system such as that described in the '733 provisional, the cell header would need to carry a 128-bit bitmap (64 bits for the CM and 64 bits for the OM), which is larger than the entire 96-bit cell header.
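The bitmap arithmetic can be checked with a minimal sketch. It assumes that the CM bitmap carries one bit per OM (k bits) and the OM bitmap one bit per output of the OM (n bits), with k = n = 64 for 4096 ports; these assignments are inferred from the 64-bit figures above rather than stated explicitly. The function and constant names are hypothetical.

```python
CELL_HEADER_BITS = 96  # cell header size quoted in the text


def two_stage_bitmap_bits(k: int, n: int) -> int:
    """Bits needed for a two-stage multicast bitmap.

    Assumption: k bits select OMs at the CM stage, and n bits select
    output ports at the OM stage.
    """
    return k + n


bitmap_bits = two_stage_bitmap_bits(k=64, n=64)
fits_in_header = bitmap_bits <= CELL_HEADER_BITS
```

Here bitmap_bits evaluates to 128 and fits_in_header to False, matching the conclusion that the bitmap alone exceeds the 96-bit header.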
The flow-level approach does not work well in some large systems because the multicast table must maintain so many flows that the required memory is too large to implement using (year 2003) state-of-the-art VLSI technology. For example, if the number of ports is 4096, each port maintains up to 100 multicast flows, and the number of CMs is 64, the number of flows going through an OM can be 26,214,400 (=64×4096×100) and the required memory size for the multicast table is about 1.6 Gbit. More generally, since each OM can receive a packet from any TMI through any CM in the same plane, the number of flows is 4096×64×X, where X is the number of multicast flows from one TMI to the OM through one CM. Even if it is assumed that X is equal to 1, each OM should support 256 k (262,144) multicast flows, leading to a 16-Mbit memory, which is too challenging with current (year 2003) technology.
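The table-size figures above can be reproduced with a short sketch. The 64-bit entry width is an inference from the quoted totals (1.6 Gbit for 26,214,400 flows, and 16 Mbit for 256 k flows), not a value stated in the text; the function names are hypothetical.

```python
def om_multicast_flows(num_ports: int, num_cms: int, flows_per_path: int) -> int:
    """Flows an OM may need to track: one per (TMI, CM, flow) combination."""
    return num_ports * num_cms * flows_per_path


def table_bits(num_flows: int, entry_bits: int = 64) -> int:
    """Multicast table size in bits; entry_bits=64 is an inferred assumption."""
    return num_flows * entry_bits


worst_case_flows = om_multicast_flows(4096, 64, 100)   # X = 100
worst_case_bits = table_bits(worst_case_flows)

minimal_flows = om_multicast_flows(4096, 64, 1)        # X = 1
minimal_bits = table_bits(minimal_flows)
```

With X = 100 this gives 26,214,400 flows and 1,677,721,600 bits (about 1.6 Gbit); with X = 1 it gives 262,144 flows and 16,777,216 bits (16 Mbit).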
In view of the foregoing, a new multicasting approach suitable for the multi-plane multi-stage switch architecture would be useful.