1. Field of the Invention
The present invention concerns communications. In particular, the present invention concerns structures of switch module cell memory, and preventing cell memory overflow in large scale switches used in communications networks.
2. Related Art
To keep pace with Internet traffic growth, researchers continually explore transmission and switching technologies. For instance, it has been demonstrated that hundreds of signals can be multiplexed onto a single fiber with a total transmission capacity of over 3 Tbps and an optical cross-connect system (OXC) can have a total switching capacity of over 2 Pbps. However, the capacity of today's (Year 2003) core Internet Protocol (IP) routers remains at a few hundred Gbps, or a couple Tbps in the near future.
It still remains a challenge to build a very large IP router with a capacity of tens of Tbps or more. The complexity and cost of building such a large-capacity router are much higher than those of building an optical cross-connect system (OXC). This is because packet switching may require processing (e.g., classification and table lookup), storing, and scheduling packets, and performing buffer management. As the line rate increases, the processing and scheduling time available for each packet is proportionally reduced. Also, as the router capacity increases, the time for resolving output contention becomes more constrained.
Demands on memory and interconnection technologies are especially high when building a large-capacity packet switch. Memory technology very often becomes a bottleneck of a packet switch system. Interconnection technology significantly affects a system's power consumption and cost. As a result, designing a good switch architecture that is both scalable to handle a very large capacity and cost-effective remains a challenge.
The numbers of switch elements and interconnections are often critical to the switch's scalability and cost. Since the number of switch elements of single-stage switches is proportional to the square of the number of switch ports, a single-stage architecture is not attractive for large switches. On the other hand, multi-stage switch architectures, such as a Clos network type switch, are more scalable, require fewer switch elements and interconnections, and are therefore more cost-effective.
FIG. 1 shows a core router (CR) architecture 100 which includes line cards 110,120, a switch fabric 130, and a route controller (not shown) for executing routing protocols, maintenance, etc. The router 100 has up to N ports and each port has one line card. (Note though that some switches have ports that multiplex traffic from multiple input line cards at the ingress and de-multiplex the traffic from the switch fabric to multiple line cards at the egress.) A switch fabric 130 usually includes multiple switch planes 140 (e.g., up to p in the example of FIG. 1) to accommodate high-speed ports.
A line card 110,120 usually includes ingress and/or egress functions and may include one or more of a transponder (TP) 112,122, a framer (FR) 114,124, a network processor (NP) 116,126, and a traffic manager (TM) 118,128. A TP 112,122 may be used to perform optical-to-electrical signal conversion and serial-to-parallel conversion at the ingress side. At the egress side, the TP 112,122 may be used to perform parallel-to-serial conversion and electrical-to-optical signal conversion. An FR 114,124 may be used to perform synchronization, frame overhead processing, and cell or packet delineation. An NP 116,126 may be used to perform forwarding table lookup and packet classification. Finally, a TM 118,128 may be used to store packets and perform buffer management, packet scheduling, and any other functions performed by the router architecture (e.g., distribution of cells or packets in a switching fabric with multiple planes).
Switch fabric 130 may be used to deliver packets from an input port to a single output port for unicast traffic, and to multiple output ports for multicast traffic.
When a packet arrives at CR 100, it determines an outgoing line to which the packet is to be transmitted. Variable length packets may be segmented into fixed-length data units, called “cells” without loss of generality, when entering CR 100. The cells may be re-assembled into packets before they leave CR 100. Packet segmentation and reassembly is usually performed by NP 116,126 and/or TM 118,128.
FIG. 2 illustrates a multi-plane multi-stage packet switch architecture 200. The switch fabric 230 may include p switch planes 240. In this exemplary architecture 200, each plane 240 is a three-stage Benes network. Modules in the first, second, and third stages are denoted as Input Module (IM) 242, Center Module (CM) 244, and Output Module (OM) 246, respectively. IM 242, CM 244, and OM 246 have many common features and may be referred to generally as a Switch Module (SM).
Traffic enters the switch 200 via an ingress traffic manager (TMI) 210 and leaves the switch 200 via an egress traffic manager (TME) 220. The TMI 210 and TME 220 can be integrated on a single chip. Therefore, the number of TM chips may be the same as the number of ports (denoted as N) in the system 200. Cells passing through the switch 200 via different paths may experience different queuing delays if the switch fabric has a queuing buffer in it. These different delays may result in cells arriving at a TME 220 out of sequence. However, if packets belonging to the same flow traverse the switch via the same path (i.e., the same switch plane and the same CM) until they have all left the switch fabric, there should be no cell out-of-sequence problem. FIG. 2 illustrates multiple paths between TMI(0) 210a and TME(0) 220a. The TMI 210 may determine the path ID (PID) of each flow using a flow ID (FID). The PID may correspond to a switch fabric plane 240 number and a CM 244 number in the plane 240.
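The mapping from a flow ID to a path ID described above may be sketched as follows. This is only an illustrative sketch: the text does not specify how the PID is derived from the FID, so the modulo mapping and the names path_id, num_planes, and cms_per_plane are assumptions made here for illustration.

```python
from typing import Tuple


def path_id(fid: int, num_planes: int, cms_per_plane: int) -> Tuple[int, int]:
    """Map a flow ID to a (plane number, CM number) pair.

    Every cell of a flow gets the same pair, so all cells of the flow
    traverse the same switch plane and the same CM and stay in sequence.
    """
    num_paths = num_planes * cms_per_plane
    # Assumption: a simple modulo spreads flows over the available paths.
    pid = fid % num_paths
    # Split the PID into a plane number and a CM number within that plane.
    return divmod(pid, cms_per_plane)


# Example with 8 planes (p = 8) and 64 CMs per plane (m = 64).
print(path_id(65, num_planes=8, cms_per_plane=64))
```

Because the mapping is deterministic, two cells carrying the same FID always receive the same plane and CM, which is the property the text relies on to avoid out-of-sequence delivery.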
In the embodiment 200 illustrated in FIG. 2, the first stage of a switch plane 240 includes k IMs 242, each of which has n inputs and m outputs. The second stage includes m CMs 244, each of which has k inputs and k outputs. The third stage includes k OMs 246, each of which has m inputs and n outputs. If n, m, and k are equal to each other, the three modules 242,244,246 may have identical structures.
From the TMI 210 to the TME 220, a cell traverses four internal links: (i) a first link from a TMI 210 to an IM 242; (ii) a second link from the IM 242 to a CM 244; (iii) a third link from the CM 244 to an OM 246; and (iv) a fourth link from the OM 246 to a TME 220.
In such a switch 200, as well as other switches, a number of issues may need to be considered. Such issues may include buffering in (switch modules of) the switch fabric and flow control. Section 1.2.1 compares several buffering strategies at SMs and explains their shortcomings. Section 1.2.2 describes the need for flow control and limitations of known flow control schemes.
§ 1.2.1 Switch Module Memory Structures
A switch module (SM), such as IM 242, CM 244 and OM 246 for example, can be memory-less, or can buffer cells. Each of these options is introduced below.
A memory-less SM may require global coordination among all TMIs 210 to resolve any output port contention before a TMI advances a cell into the switch plane. The complexity of the global coordination can be very high and the arbitration time for such global coordination can be quite long. Accordingly, a switch fabric with memory-less SMs might not be feasible for a large scale, high-speed switch.
An SM may buffer cells in a number of ways. FIG. 3A illustrates a cross-point SM having a number of inputs 330, a number of outputs 340, and some means for arbitrating 320 cells contending for the same output port. Each cross-point may include a queue 310 for each priority. A queue may be thought of as a logical construct. That is, a number of logical queues can be combined on a physical memory. Indeed, it is possible to have logical queues provided as needed, such that a logical queue not currently needed reserves no memory. A buffer is a physical memory area. A memory may be partitioned into a number of buffers. An SM may use one of a number of buffering techniques, such as cross-point-buffered, shared-memory, output-buffered, input-buffered, etc. These buffering techniques are introduced below with reference to FIGS. 3B-3E. The memory size in the SM affects system performance. Generally, the larger the memory size, the better the performance. However, memory size is limited by VLSI technology.
§ 1.2.1.1 Memory Size
Still referring to FIG. 3A, the following examples assume that the SM has 4096 cross-points (i.e., 64 inputs*64 outputs) and each cross-point has two (2) queues 310 (i.e., high and low scheduling priorities). Each of the two queues 310 corresponds to a scheduling priority level. Therefore, in this exemplary embodiment, each SM has 8192 queues 310. Each queue 310 can receive and send at most one cell in each time slot. An incoming cell (i.e., a cell arriving from one of the input ports 330) is stored at one of the queues 310 according to its destination (i.e., the one of the output ports 340 for which it is destined) and scheduling priority level.
One way to serve cells with different priorities is to use strict priority. If strict priority serving is used, cells in the lower priority queues are served only when there are no cells in the higher priority queues waiting to be served. An arbiter 320 chooses a cell to send to the next stage in every time slot using a packet-scheduling algorithm.
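Strict priority serving may be sketched as follows for the queues of a single cross-point. This sketch is an illustration only: the function name strict_priority_pick and the list-of-queues layout are assumptions, and the arbitration among different inputs contending for the same output (performed by the arbiter 320) is not modeled.

```python
from collections import deque


def strict_priority_pick(queues_by_priority):
    """Return the next cell to serve, or None if every queue is empty.

    queues_by_priority[0] is the highest priority queue. A lower priority
    queue is served only when all higher priority queues are empty.
    """
    for queue in queues_by_priority:
        if queue:
            return queue.popleft()
    return None


# One high priority queue and one low priority queue, as in the example SM.
high = deque(["h1"])
low = deque(["l1", "l2"])
served = [strict_priority_pick([high, low]) for _ in range(4)]
print(served)
```

The low priority cells are served only after the single high priority cell has left, and the final pick finds both queues empty.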
§ 1.2.1.2 Sizes of Queues
The size of each queue 310, expressed in cells, should be equal to or greater than the number of time slots in the round-trip time (RTT) between the source queue (SQ) (i.e., the queue at the upstream SM of a link) and the destination queue (DQ) (i.e., the queue at the downstream SM of a link). In addition to the RTT, the DQ should also accommodate alignment delay because cells may arrive at different phases of the DQ clocks.
Regarding alignment delays, as illustrated in FIG. 4, the clock signals on each TMI 210 can differ by as much as one time slot. At a specific time, one TMI 210a may be at the beginning of a time slot while another TMI 210b may be at the end of a time slot.
Regarding RTT delay, this delay is a function of distance between a source queue and a destination queue and the speed at which the cell can travel. In the following examples, it is assumed that a cell can travel at 5 nsec per meter because the propagation speed of an electromagnetic signal is known to be about 200,000 km per second.
Accordingly, if the distance between the SQ and the DQ is 100 m (for example, an IM chip and a CM chip can be located on circuit packs in different racks, and the two racks can be placed on different floors of the same building or in different buildings), the RTT is 1 μsec (i.e., 100 m*2*5 nsec/m). If one time slot is assumed to be 204.8 nsec (i.e., 512 bits/2.5 Gbps), the RTT is five (5) time slots (i.e., 1,000 nsec/204.8 nsec, rounded up). Thus, in this example, the queue size should hold at least five (5) cells. In addition, clock misalignments should be accounted for. Therefore, it is reasonable to assume that the RTT between the TMI 210 and the IM 242 could be as much as two (2) time slots. If the IM 242 and the OM 246 are on the same circuit pack, it is similarly reasonable to assume that the RTT between the OM 246 and the TME 220 could be two (2) time slots. (In the foregoing example, the 100 m distance is the distance between the IM chip and the CM chip. It was assumed that the TM chip and the IM chip are on the same shelf, or at least in the same rack. Therefore, the distance between TMI and IM will be at most a few meters, which translates to less than one (1) time slot. The RTT between TMI and IM is accordingly assumed to be two (2) time slots: one (1) time slot for the distance and another (1) time slot for the misalignment.)
If the distance between TMI 210 and IM 242 is less than a few meters (which should be the case if it is assumed that TMI 210 and IM 242 are on the same shelf), the queue size at IM 242 doesn't need to be large. Since one time slot is 204.8 nsec, one time slot corresponds to 41 m. Since the distance between TMI 210 and IM 242 is assumed to be less than 41 m, the minimum queue size at IM 242 is two (2) cells—one (1) cell for RTT and another (1) cell for misalignment.
The queue size at CM 244 and OM 246 should be big enough to accommodate the RTT and the timer misalignment. If the distance between IM 242 and CM 244 is 280 m (Note that although IM and CM are on the same switch plane, they can be put into different racks, which can be a few hundred meters apart.), the queue size at CM should be at least 15 cells (=14 cells for RTT and 1 cell for the timer misalignment). If IM 242 and OM 246 are on the same circuit pack, the queue size at OM 246 should also be at least 15 cells. (Note that in this example, OM is the DQ and CM is the SQ, and the distance between OM and CM may be the same as the distance between IM and CM.)
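The queue sizing arithmetic of this section may be reproduced with the short sketch below. The constants come from the text (5 nsec per meter, and a time slot of 512 bits at 2.5 Gbps); the function names rtt_slots and min_queue_cells, and the choice to round a partial time slot up to a whole cell, are assumptions made for illustration.

```python
import math

NSEC_PER_METER = 5.0        # about 200,000 km/s propagation speed
TIME_SLOT_NSEC = 512 / 2.5  # 512 bits at 2.5 Gbps = 204.8 nsec


def rtt_slots(distance_m):
    """Round-trip time over distance_m, in whole time slots (rounded up)."""
    rtt_nsec = 2 * distance_m * NSEC_PER_METER
    return math.ceil(rtt_nsec / TIME_SLOT_NSEC)


def min_queue_cells(distance_m, alignment_cells=1):
    """Minimum queue size: cells in flight during the RTT plus alignment delay."""
    return rtt_slots(distance_m) + alignment_cells


print(rtt_slots(100))        # 100 m between SQ and DQ
print(min_queue_cells(280))  # IM to CM distance of 280 m
print(min_queue_cells(3))    # TMI and IM a few meters apart on one shelf
```

With these inputs the sketch yields 5 time slots for 100 m, 15 cells for 280 m, and 2 cells for a distance of a few meters, matching the figures given above.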
Having introduced the design of memory size and buffer size, various buffering techniques are now described in § 1.2.1.3 below.
§ 1.2.1.3 Buffering Techniques
In a cross-point-buffered SM such as the one illustrated in FIG. 3B, each queue has a dedicated memory space and no memory space is shared. Therefore, the number of buffers 350 is the same as the number of queues 310. A cross-point-buffered SM (e.g., having 64 inputs, 64 outputs, and two priorities) will require 8192 (i.e., 64 inputs*64 outputs*2 priorities) separate buffers 350. Assuming that the RTT for one SQ is equivalent to 15 cells, the memory size will be 122,880 (=8192 queue buffers×15 cells/queue) cells (which may be 64 Bytes each). A memory of this size is far beyond the limit of year 2003 state-of-the-art ASIC technology. Current (Year 2003) state-of-the-art technology limits the memory size implemented in a single chip to a few Mbits. If the cell size is 64 Bytes, one memory can contain a few thousand cells.
FIG. 3D illustrates an output-buffered SM in which 64 queues for 64 input ports are grouped into a buffer 370 associated with an output port. Thus, in each buffer, 64 queues share the memory space of the buffer. However, queues in different buffers do not share memory space. An output-buffered SM may require a large memory to accommodate the cells in flight during the RTT from many senders transmitting simultaneously. For example, if all 64 SQs are sending to the same DQ, the DQ size must be at least 960 cells (i.e., 15*64 cells) to prevent buffer overflow. Since a single memory may need to accommodate 128 buffers (128=64 ports*2 priority levels), the memory size must be at least 122,880 cells (i.e., 128 buffers*960 cells/buffer), which is the same as for the cross-point-buffered SM. If the buffer size of the output-buffered SM is smaller than 960 cells, the senders should communicate with each other in order to avoid buffer overflow. Such communication between senders makes the implementation difficult.
FIG. 3C illustrates a shared-memory SM, in which all queues share the whole memory space and there is only one buffer 360. Shared-memory SMs have two problems. First, the memory would have to write 64 cells and read 64 cells within a cell time slot. Since the time slot is 204.8 nsec, the memory cycle time would be 1.6 nsec (i.e., 204.8 nsec/(64 writes+64 reads)), which is too challenging for current (Year 2003) state-of-the-art technology. A second problem is ensuring that the queues don't overflow, which may require some sort of flow control.
As noted above, a design consideration for most buffered switch fabrics is flow control. More specifically, a buffer may become full for a large packet. If the buffer is full, the SM should discard cells or implement a handshaking scheme so that the buffer does not overflow. If the upstream modules do not communicate with each other, they can send cells destined for the same DQ at the same time. Under hot-spot traffic, cells destined for the hot-spot port can monopolize the buffer and the other cells destined for a non-hot-spot port can be blocked. This can lead to a deadlock situation. Although the senders may coordinate with each other to prevent buffer overflow, a deadlock situation can still occur. For example, assume that the shared memory size is 2048 cells. In the exemplary switch fabric described above, although one memory may include 8192 queues and each queue size must be at least 15 cells, most queues can be empty. If the number of cells in the shared memory exceeds a threshold (e.g., 1024 cells), the receiver may send a backpressure signal to all senders and in response, all senders may stop transmission until the receiver informs them that the shared memory is no longer congested. This can cause a deadlock under a packet interleaving scheme, such as the one described in the '733 provisional.
FIG. 3E illustrates an input-buffered SM in which 64 queues for 64 output ports are grouped into a buffer 370 associated with an input port. An input-buffered SM may require only a small memory because each input port receives at most one cell in a time slot. Since the 64 queues can share the same memory space, the buffer size need be only a little larger than 15 cells (e.g., 32 cells). Flow control for the input-buffered SM is simple, as described in § 1.2.2, because the DQ needs to report its status to only one upstream SM.
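The memory figures quoted for the four buffering techniques may be compared with the sketch below. The 64-port, two-priority, 15-cell parameters are taken from the text. The input-buffered total is an assumption: the text gives only the per-buffer example of 32 cells, and one such buffer per input port, shared by both priorities, is assumed here. All variable names are illustrative.

```python
PORTS = 64              # inputs and outputs per SM
PRIORITIES = 2          # high and low scheduling priority
RTT_CELLS = 15          # 14 cells for RTT plus 1 for misalignment
TIME_SLOT_NSEC = 204.8  # 512 bits at 2.5 Gbps

# Cross-point-buffered: one dedicated buffer per queue.
cross_point_cells = PORTS * PORTS * PRIORITIES * RTT_CELLS

# Output-buffered: each buffer must absorb one RTT of cells from all senders.
cells_per_output_buffer = RTT_CELLS * PORTS
output_buffered_cells = PORTS * PRIORITIES * cells_per_output_buffer

# Shared-memory: one memory performs 64 writes and 64 reads per time slot.
shared_memory_cycle_nsec = TIME_SLOT_NSEC / (PORTS + PORTS)

# Input-buffered (assumption: one 32-cell buffer per input port).
input_buffered_cells = PORTS * 32

print(cross_point_cells, output_buffered_cells, input_buffered_cells)
print(shared_memory_cycle_nsec)
```

Under these assumptions the cross-point-buffered and output-buffered designs each need 122,880 cells, the shared-memory design needs a 1.6 nsec memory cycle, and the input-buffered design needs on the order of a few thousand cells, which is within the single-chip limit stated above.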
§ 1.2.2 Flow Control
As can be appreciated from the foregoing, one of the challenges in building a large capacity switch is to use memory efficiently. Off-chip memory takes a long time to access, while present (Year 2003) on-chip memory can contain at most a few thousand 64 Byte cells. Providing a large memory at line cards and a small memory at switch fabric cards is common. When a cell is transmitted over a link, the receiver should have free memory space to store the cell until it is transmitted to the next stage. If the receiver's limited memory becomes full, the sender should hold the cells until the receiver has free space. Therefore, a flow control mechanism may be implemented to avoid cell loss in the switch fabric.
§ 1.2.2.1 Previous Approaches to Flow Control and Their Limitations
The paper, H. T. Kung, Trevor Blackwell, and Alan Chapman, “Credit-Based Flow Control for ATM Networks: Credit Update Protocol, Adaptive Credit Allocation, and Statistical Multiplexing,” Proceedings of ACM SIGCOMM '94, pp. 101-114 (September 1994) (incorporated herein by reference and referred to as “the Kung paper”), proposed a credit-based, per virtual connection (VC), link-by-link flow control scheme for ATM networks, called the N23 scheme. Flow control based on credits is an efficient way of implementing per VC, link-by-link, flow control. The credit-based flow control method generally works over each flow-controlled VC link. Before forwarding any data cell over the link, the sender first needs to receive credits for the VC via credit cells sent by the receiver. Periodically, the receiver sends credits to the sender indicating availability of buffer space for receiving data cells of the VC. After having received credits, the sender is eligible to forward some number of data cells of the VC to the receiver according to the received credit information. Each time the sender forwards a data cell of a VC, it decrements its current credit balance for the VC by one.
The receiver is eligible to send a credit cell to the sender each time it has forwarded N2 data cells since the previous credit cell for the same VC was sent. The credit cell will contain a credit value for the combined area consisting of the N2 and N3 zones. N2 and N3 are fixed numbers. The sum of N2 and N3 is equal to the buffer size. If the credit is less than N2, it is in the N2 zone. Otherwise, it is in the N3 zone. Upon receiving a credit cell with a credit value of C for a VC, the sender is permitted to forward up to C-E data cells of the VC before the next successfully transmitted credit cell for the VC is received, where E is the number of data cells the sender has forwarded over the VC during the past RTT (where RTT is the round-trip time of the link expressed in number of cell time slots, including the processing delays at sender and receiver). The subtraction of E from C accounts for in-transit cells from the sender to the receiver, of which the receiver had no knowledge when it sent the credit cell. Specifically, the sender maintains a counter, called Credit_Balance, for the VC. Initially, Credit_Balance is set to the VC's credit allocation, N2+N3. Each time the sender forwards a data cell, it decrements the Credit_Balance by one. When the Credit_Balance reaches zero, the sender stops forwarding data cells. When receiving a new credit cell resulting in a positive value of C-E, the sender will be eligible to forward data cells again. More specifically, when receiving a credit cell for a VC, the sender will immediately update its Credit_Balance for the VC using Credit_Balance=C-E.
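The sender and receiver behavior of the N23 scheme for a single VC may be sketched as follows. This is a simplified illustration rather than the implementation of the Kung paper: the class and method names are hypothetical, one call to tick represents one cell time slot, and the credit value C is taken to be the free space in the combined N2+N3 buffer, which is an assumption about how the credit value is computed.

```python
from collections import deque


class N23Sender:
    """Sender side of one VC: tracks Credit_Balance and recent transmissions."""

    def __init__(self, n2, n3, rtt_slots):
        self.credit_balance = n2 + n3          # initial credit allocation
        self.recent = deque(maxlen=rtt_slots)  # 1 per slot in which a cell was sent

    def tick(self, has_cell):
        """Advance one time slot; return True if a data cell is forwarded."""
        send = has_cell and self.credit_balance > 0
        if send:
            self.credit_balance -= 1
        self.recent.append(1 if send else 0)
        return send

    def on_credit_cell(self, c):
        """Apply Credit_Balance = C - E, where E counts cells sent in the past RTT."""
        e = sum(self.recent)
        self.credit_balance = c - e


class N23Receiver:
    """Receiver side of one VC: issues a credit value after every N2 forwarded cells."""

    def __init__(self, n2, n3):
        self.n2 = n2
        self.buffer_size = n2 + n3
        self.occupied = 0
        self.forwarded_since_credit = 0

    def on_data_cell(self):
        if self.occupied >= self.buffer_size:
            raise OverflowError("receiver buffer overflow")
        self.occupied += 1

    def forward_cell(self):
        """Forward one buffered cell; return a credit value C if one is due, else None."""
        if self.occupied == 0:
            return None
        self.occupied -= 1
        self.forwarded_since_credit += 1
        if self.forwarded_since_credit >= self.n2:
            self.forwarded_since_credit = 0
            # Assumption: C is the free space in the combined N2+N3 area.
            return self.buffer_size - self.occupied
        return None


sender = N23Sender(n2=2, n3=3, rtt_slots=4)
sent = [sender.tick(True) for _ in range(6)]
print(sent, sender.credit_balance)
```

In the usage example the sender forwards five cells, exhausts its initial allocation of N2+N3=5, and is blocked in the sixth time slot until a credit cell arrives.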
The N2 value can be a design or engineering choice. Suppose that x is the number of credit transactions a credit cell can incorporate. Note that one credit cell can have credits for many VCs. Thus, for example, if one credit cell has credits for 6 VCs, x=6. Then the bandwidth overhead of transmitting credit cells is 100/(N2*x+1) percent. If one credit cell can incorporate 6 credits (i.e., x=6) and the credit cell is sent right after 60 data cells are transmitted (i.e., N2=10), the bandwidth overhead is 1.64% (i.e., one credit cell in every 61 cells). The larger N2 is, the lower the bandwidth overhead, but each VC will use more buffer.
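The overhead expression above may be evaluated with the following sketch. The formula 100/(N2*x+1) is taken from the text; the function name credit_overhead_percent is illustrative only.

```python
def credit_overhead_percent(n2, x):
    """Bandwidth overhead of credit cells: one credit cell per n2 * x data cells."""
    return 100.0 / (n2 * x + 1)


# Example from the text: N2 = 10 and x = 6 give one credit cell per 60 data cells.
print(round(credit_overhead_percent(10, 6), 2))

# Doubling N2 roughly halves the overhead, at the cost of more buffer per VC.
print(round(credit_overhead_percent(20, 6), 2))
```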
The N3 value is determined by the bandwidth requirement of the VC. Let BVC be the targeted average bandwidth of the VC over the time RTT, expressed as a percentage of the link bandwidth. Then it can be shown that, to prevent data and credit underflow, it suffices to choose N3 to be BVC*RTT. By increasing the N3 value, the VC can transport data cells at a proportionally higher bandwidth.
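The choice N3=BVC*RTT may be illustrated with the sketch below. Here BVC is passed as a fraction of the link bandwidth (e.g., 0.5 for 50 percent) and RTT is given in cell time slots; the function name n3_cells and the rounding up to a whole cell are assumptions made for illustration.

```python
import math


def n3_cells(bvc_fraction, rtt_slots):
    """N3 = BVC * RTT, rounded up to a whole number of cells (assumption)."""
    return math.ceil(bvc_fraction * rtt_slots)


# A VC targeting half of the link bandwidth over an RTT of 14 time slots.
print(n3_cells(0.5, 14))

# A VC targeting the full link bandwidth needs one cell per RTT time slot.
print(n3_cells(1.0, 14))
```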
Unfortunately, the N23 scheme discussed in the Kung paper will not work with a multi-plane, multi-stage, packet switch because the number of VCs can be too big to be practical. A VC is defined as a TCP connection. If the number of TCP connections for a link is 100,000, the SM should have a separate queue for each VC, and the downstream SM should send a credit cell for each VC periodically. This may consume a significant portion of the link bandwidth. However, large-scale, high-speed, switches often use a multi-plane, multi-stage architecture. Accordingly, an improved flow control scheme that will work with multi-plane, multi-stage architectures is needed.