§1.1. Field of the Invention
The present invention concerns communications. In particular, the present invention concerns packet cell re-sequencing in a switch.
§1.2. Background Information
To keep pace with Internet traffic growth, researchers continually explore transmission and switching technologies. For instance, it has been demonstrated that hundreds of signals can be multiplexed onto a single fiber with a total transmission capacity of over 3 Tbps and that an optical cross-connect system (OXC) can have a total switching capacity of over 2 Pbps. However, the capacity of today's (Year 2003) core Internet Protocol (IP) routers remains at a few hundred Gbps, or a couple of Tbps in the near future.
It still remains a challenge to build a very large IP router with a capacity of tens of Tbps or more. The complexity and cost of building such a large-capacity router are much higher than those of building an OXC. This is because packet switching may require processing (e.g., classification and table lookup), storing, and scheduling packets, and performing buffer management. As the line rate increases, the processing and scheduling time available for each packet is proportionally reduced. Also, as the router capacity increases, the time for resolving output contention becomes more constrained.
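The inverse relationship between line rate and per-packet time budget can be illustrated with a short calculation. The following is only a sketch: the 64-byte cell size and the listed line rates are assumptions chosen for illustration and are not taken from this specification.

```python
def cell_time_ns(cell_bytes: int, line_rate_gbps: float) -> float:
    """Time to transmit one cell at the given line rate, in nanoseconds.

    bits / (Gbit/s) yields nanoseconds, since 1 Gbit/s = 1 bit/ns.
    """
    return cell_bytes * 8 / line_rate_gbps


# Hypothetical values: a 64-byte cell at three example line rates.
for rate_gbps in (10, 40, 160):
    print(f"{rate_gbps:>4} Gbps -> {cell_time_ns(64, rate_gbps):.1f} ns per cell")
```

Under these assumptions, each fourfold increase in line rate cuts the time available to process and schedule one cell to a quarter.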
Demands on memory and interconnection technologies are especially high when building a large-capacity packet switch. Memory technology very often becomes a bottleneck of a packet switch system. Interconnection technology significantly affects a system's power consumption and cost. As a result, designing a good switch architecture that is both scalable to handle a very large capacity and cost-effective remains a challenge.
The numbers of switch elements and interconnections are often critical to the switch's scalability and cost. Since the number of switch elements of single-stage switches is proportional to the square of the number of switch ports, a single-stage architecture is not attractive for large switches. On the other hand, multi-stage switch architectures, such as a Clos network type switch, are more scalable, require fewer switch elements and interconnections, and are therefore more cost-effective.
FIG. 1 shows a core router (CR) architecture 100 which includes line cards 110,120, a switch fabric 130, and a route controller (not shown) for executing routing protocols, maintenance, etc. The router 100 has up to N ports and each port has one line card. (Note though that some switches have ports that multiplex traffic from multiple input line cards at the ingress and de-multiplex the traffic from the switch fabric to multiple line cards at the egress.) A switch fabric 130 usually includes multiple switch planes 140 (e.g., up to p in the example of FIG. 1) to accommodate high-speed ports.
A line card 110,120 usually includes ingress and/or egress functions and may include one or more of a transponder (TP) 112,122, a framer (FR) 114,124, a network processor (NP) 116,126, and a traffic manager (TM) 118,128. A TP 112 may be used to perform optical-to-electrical signal conversion and serial-to-parallel conversion at the ingress side. At the egress side, a TP 122 may be used to perform parallel-to-serial conversion and electrical-to-optical signal conversion. An FR 114,124 may be used to perform synchronization, frame overhead processing, and cell or packet delineation. An NP 116,126 may be used to perform forwarding table lookup and packet classification. Finally, a TM 118,128 may be used to store packets and perform buffer management, packet scheduling, and any other functions performed by the router architecture (e.g., distribution of cells or packets in a switching fabric with multiple planes).
A switch fabric is a device that cross-connects packets from an input port (i.e., packet arriving port) to an output port (i.e., packet departing port) for unicast traffic, and to multiple output ports for multicast traffic. The switch fabric may operate at very high speed (e.g., a new configuration every 200 nsec).
When a packet arrives at CR 100, CR 100 determines an outgoing line to which the packet is to be transmitted. Variable length packets may be segmented into fixed-length data units, called “cells” without loss of generality, when entering CR 100. The cells may be re-assembled into packets before they leave CR 100. Packet segmentation and reassembly are usually performed by NP 116,126 and/or TM 118,128.
FIG. 2 illustrates a multi-plane multi-stage packet switch architecture 200. The switch fabric 230 may include p switch planes 240. In this exemplary architecture 200, each plane 240 is a three-stage Benes network. Modules in the first, second, and third stages are denoted as Input Module (IM) 242, Center Module (CM) 244, and Output Module (OM) 246, respectively. IM 242, CM 244, and OM 246 have many common features and may be referred to generally as a Switch Module (SM).
Traffic enters the switch 200 via an ingress traffic manager (TMI) 210 and leaves the switch 200 via an egress traffic manager (TME) 220. The TMI 210 and TME 220 can be integrated on a single chip. Therefore, the number of TM chips may be the same as the number of ports (denoted as N) in the system 200. Cells passing through the switch 200 via different paths may experience different queuing delays if the switch fabric has a queuing buffer in it. These different delays may result in cells arriving at a TME 220 out of sequence. FIG. 2 illustrates multiple paths between TMI(0) 210a and TME(0) 220a. 
In the embodiment 200 illustrated in FIG. 2, the first stage of a switch plane 240 includes k IMs 242, each of which has n inputs and m outputs. The second stage includes m CMs 244, each of which has k inputs and k outputs. The third stage includes k OMs 246, each of which has m inputs and n outputs. If n, m, and k are equal to each other, the three modules 242,244,246 may have identical structures.
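To make the scalability argument concrete, the following sketch compares crosspoint counts for a single-stage crossbar and for one three-stage plane with the dimensions given above. It assumes, purely for illustration, that each module is itself a crossbar and that crosspoints are a reasonable proxy for "switch elements"; the function names are hypothetical.

```python
def single_stage_crosspoints(num_ports: int) -> int:
    """Crosspoints in one N x N crossbar: proportional to N squared."""
    return num_ports * num_ports


def three_stage_crosspoints(n: int, m: int, k: int) -> int:
    """Crosspoints in one plane, assuming every module is a crossbar.

    k IMs of size n x m, m CMs of size k x k, k OMs of size m x n.
    """
    im_total = k * (n * m)
    cm_total = m * (k * k)
    om_total = k * (m * n)
    return im_total + cm_total + om_total


# Example with n = m = k = 64, i.e. N = n * k = 4096 ports per plane.
n = m = k = 64
print(single_stage_crosspoints(n * k))    # 16777216
print(three_stage_crosspoints(n, m, k))   # 786432
```

Under these assumptions, the three-stage plane needs roughly one twentieth of the crosspoints of a single crossbar serving the same 4096 ports.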
From the TMI 210 to the TME 220, a cell traverses four internal links: (i) a first link from a TMI 210 to an IM 242; (ii) a second link from the IM 242 to a CM 244; (iii) a third link from the CM 244 to an OM 246; and (iv) a fourth link from the OM 246 to a TME 220.
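The multiple paths between one TMI and one TME can be enumerated with a short sketch. It assumes (this is not stated in the text above) that port i attaches to IM i // n and to OM i // n in every plane, so that a path is fully determined by the choice of plane and CM. The function name and tuple layout are hypothetical.

```python
from itertools import product


def enumerate_paths(tmi: int, tme: int, n: int, m: int, p: int):
    """List every (plane, IM, CM, OM) path from TMI `tmi` to TME `tme`.

    Assumes port i connects to IM (i // n) and OM (i // n) in each of
    the p planes; the only free choices are the plane and the CM.
    """
    im = tmi // n
    om = tme // n
    return [(plane, im, cm, om) for plane, cm in product(range(p), range(m))]


# Example: 2 planes, 4 CMs per plane, 4 ports per IM/OM.
paths = enumerate_paths(tmi=0, tme=0, n=4, m=4, p=2)
print(len(paths))  # 8, i.e. p * m distinct paths
```

Because each of these p × m paths may see different queuing delays, cells of one flow spread across them can arrive out of order.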
In such a switch 200, as well as other switches, a number of issues may need to be considered. Such issues may include packet cell re-sequencing.
The switch fabric may be required to deliver packets belonging to the same flow in order. Generally speaking, a flow refers to a virtual connection from a source end system to a destination end system. However, in this specification, a “flow” will be used to refer to a packet stream with the same input port and the same output port. If packets belonging to the same flow are not delivered in order through the switch fabric, the switch fabric is assumed to have a packet out-of-sequence problem.
An input port normally sends cells in order for all flows. However, if the switch fabric has multiple paths and each path may have a different delay due to the contention for the same output link at each stage of the switch fabric, the output port may receive cells out-of-order. Therefore, the output port needs to re-sequence cells according to their sequence number (SN) at each virtual input queue (VIQ).
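A minimal sketch of re-sequencing by sequence number at a single VIQ is given below. It is an illustration only and not the circuit of any particular switch: it assumes sequence numbers start at 0, increase by one per cell, and never wrap around, and it places no bound on the buffer size. The class and method names are hypothetical.

```python
class Resequencer:
    """Re-sequencing buffer for one virtual input queue (VIQ).

    Assumptions (for illustration): SNs start at 0, increase by 1 per
    cell, and do not wrap; the buffer is unbounded.
    """

    def __init__(self):
        self.expected_sn = 0   # next SN that may be released in order
        self.pending = {}      # SN -> cell, for cells that arrived early

    def receive(self, sn, cell):
        """Store an arriving cell and return any cells now deliverable in order."""
        self.pending[sn] = cell
        released = []
        while self.expected_sn in self.pending:
            released.append(self.pending.pop(self.expected_sn))
            self.expected_sn += 1
        return released


viq = Resequencer()
print(viq.receive(1, "cell-1"))  # []  (cell 0 has not arrived yet)
print(viq.receive(0, "cell-0"))  # ['cell-0', 'cell-1']
```

An output port would keep one such buffer per VIQ, so that cells from different input ports are re-sequenced independently.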
§1.3. Previous Approaches to Solve Packet Out-Of-Sequence Problems in Buffered Multi-Path Switch Fabrics, and Limitations of Such Approaches
Two methods have been proposed to solve the packet out-of-sequence problem in the buffered multi-path switch fabric. The first method re-sequences packets at the output port. The packet re-sequencing may require several conditions. First, each packet should carry a sequence number. One exemplary sequence number is a time-stamp based on the arrival time of the packet at the input port. If the sequence number is large, the overhead ratio (of sequence number size to cell or packet size) can be too big to be practical. A high overhead ratio can cause increased implementation costs, performance degradation due to reduced internal speedup, or both. Second, the degree of packet out-of-sequence should be bounded to ensure successful re-sequencing. Since Internet traffic is very complicated, it is difficult to estimate the degree of packet out-of-sequence that will occur. Even when the degree of packet out-of-sequence is bounded, implementing the re-sequencing circuits increases costs.
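The overhead ratio mentioned above is simply the sequence number size divided by the cell (or packet) size. The sketch below computes it for a few values; the 64-byte cell and the 16-, 32-, and 64-bit sequence number widths are assumptions chosen for illustration.

```python
def overhead_ratio(sn_bits: int, cell_bytes: int) -> float:
    """Fraction of each cell consumed by the sequence number."""
    return sn_bits / (cell_bytes * 8)


# Hypothetical sizes: a 64-byte cell with 16-, 32-, and 64-bit SNs.
for sn_bits in (16, 32, 64):
    print(f"{sn_bits}-bit SN: {overhead_ratio(sn_bits, 64):.3%} of the cell")
```

With these example values, a 64-bit time-stamp would consume one eighth of every cell, which suggests why a large sequence number may be impractical.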
The second method to solve the packet out-of-sequence problem is to send all packets belonging to the same flow over the same path. This emulates a single-path switch fabric for a given flow, thus avoiding packet out-of-sequence problems altogether. This idea is attractive in the sense that the packet out-of-sequence problem only matters for packets belonging to the same flow. This scheme is referred to as “static hashing.” Static hashing advantageously eliminates the re-sequencing buffer at the output port. Since packets belonging to the same flow take the same path in the multi-path switch fabric, they will arrive at the output port in the proper sequence.
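Static hashing can be sketched as a deterministic mapping from a flow, identified here by its input port and output port, to a path index. The particular hash below (a multiply-and-add followed by a modulo) is a hypothetical choice for illustration; the text above does not specify any hash function.

```python
def static_path(input_port: int, output_port: int, num_paths: int) -> int:
    """Map a flow (input port, output port) to a fixed path index.

    The hash itself is an illustrative assumption; any deterministic
    function of the flow identifier gives the same in-order property.
    """
    return (input_port * 31 + output_port) % num_paths


# Every cell of flow (3, 7) is sent over the same path, so the cells
# cannot overtake one another inside the fabric.
print(static_path(3, 7, num_paths=8))  # 4
print(static_path(3, 7, num_paths=8))  # 4 again
```

Because the mapping depends only on the flow identifier, no per-cell sequence number or output re-sequencing buffer is needed in this scheme.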
Note that re-sequencing is different from re-assembly. Re-sequencing is a term used to describe an operation to correct the situation when packets belonging to the same flow arrive at the output port out-of-sequence. Re-assembly is a term used to describe reconstituting packets when the packets are segmented into cells and are interleaved in the switch fabric. For purposes of this discussion, it is assumed that packets are not interleaved in the switch fabric. In other words, all cells belonging to the same packet will be sent back-to-back, without any intervening cells. Therefore, with static hashing, the output port has no re-sequencing buffer, nor does it have a re-assembly buffer.
One problem of the static hashing scheme is the potential for load imbalance. Since each flow may have different bandwidth, it is possible that one path will be more congested than another path, or other paths. This may complicate choosing proper paths to route packets from an input port to an output port. If paths are not properly chosen, the probability of congesting one path increases, adversely impacting switch performance.
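The load imbalance can be illustrated by summing per-flow bandwidth on each path under a fixed flow-to-path mapping. In the sketch below, the flows, their bandwidths, the two-path fabric, and the hash function are all hypothetical values chosen to make the effect visible.

```python
def path_loads(flow_bandwidth: dict, num_paths: int) -> list:
    """Sum the bandwidth of all flows pinned to each path.

    Flows are keyed by (input port, output port); the flow-to-path
    hash is an illustrative assumption, not a prescribed function.
    """
    loads = [0.0] * num_paths
    for (input_port, output_port), bandwidth in flow_bandwidth.items():
        path = (input_port * 31 + output_port) % num_paths
        loads[path] += bandwidth
    return loads


# Four flows to output port 5; one of them is much heavier than the rest.
flows = {(0, 5): 9.0, (1, 5): 1.0, (2, 5): 1.0, (3, 5): 1.0}
print(path_loads(flows, num_paths=2))  # [2.0, 10.0]
```

Although the four flows are split evenly by count (two per path), one path carries five times the load of the other, because the mapping ignores per-flow bandwidth.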
U.S. Provisional Application Ser. No. 60/479,733 (incorporated herein by reference), titled “A HIGHLY SCALABLE MULTI-PLANE MULTI-STAGE BUFFERED PACKET SWITCH”, filed on Jun. 19, 2003, and listing Hung-Hsiang Jonathan Chao and Jinsoo Park as inventors, and U.S. patent application Ser. No. 10/776,574 (incorporated herein by reference) titled “PACKET SEQUENCE MAINTENANCE WITH LOAD BALANCING, AND HEAD-OF-LINE BLOCKING AVOIDANCE IN A SWITCH” and listing Hung-Hsiang Jonathan Chao and Jinsoo Park as inventors, describe some approaches to the packet (or cell) out-of-sequence problem. Although such approaches represent an advance in solving packet and cell out-of-sequence problems, better solutions would be useful.