1. Field of the Invention
The present invention relates to cell switching architectures, particularly fault-tolerant cell switching architectures.
2. State of the Art
As changes in the field of networking and telecommunications have occurred, it has become increasingly evident that existing time division switches are inadequate for handling the bandwidth requirements of emerging cell switching technologies such as Asynchronous Transfer Mode (ATM). Cell switching involves breaking data into small, fixed-size units. A standard ATM cell has a payload of 48 bytes. A packet, by contrast, may be considerably longer and is not fixed length. A cell switch accommodates packet data by breaking packets up into cells.
New technologies are needed to provide the ultra high bandwidth switching capability being sought in the near future. A challenge is to efficiently implement switches with a large number of physical ports (128-2K ports) operating at gigabit data rates (0.5-10 Gbps/port) and having 0.1 to 10 terabits/second aggregate bandwidth capacities.
Current telecommunications switch systems are typically based on the crossbar, shared memory, or shared medium (e.g. bus and ring) switch architectures. While these architectures are adequate for today's networking applications, scaling them to meet future switching demands presents a formidable challenge. There are substantial engineering tradeoffs to take into consideration when deciding on a switch architecture that has to scale to over 1000 physical ports and operate at gigabit port data rates. Physical packaging issues become very important. Technologies, architectures and systems which have worked well for a 64 port switch operating at 155 Mbps per port are impractical for a 1000 port switch operating at 1 Gbps per port. For example, both the interconnect and circuit complexity of a crossbar switch with N input/output ports grows as O(N^2), making it impractical for network sizes of 1000 ports and above. Likewise, both shared memory and shared medium architectures become impractical if not infeasible beyond a given switch size due to speed limitations in the sequential access of a single shared resource.
Self-routing multistage networks have also been proposed as the basis for high-performance packet networks for telecommunications in the form of ATM switches. The basic appeal of multistage interconnection networks lies in their inherent simplicity and their scalability to large numbers of ports. For example, U.S. Pat. No. 5,541,914, incorporated herein by reference, describes a class of packet-switched, extended-generalized-shuffle, self-routing multistage interconnection networks (MINs). The network provides a performance/cost trade-off between, on the one hand, the knockout switch or buffered crossbar and, on the other hand, the tandem banyan network. Multiple copies of the network may be serially cascaded back-to-back, and connected in parallel. Applications to broadband telecommunications switching are described.
MIN-based switching architectures, however, do not enjoy inherent fault tolerance. Achieving fault tolerance generally requires over-dimensioning the switch, cascading multiple MIN switching networks, etc. These solutions are complex, expensive, and inelegant.
A different problem is that of providing an interconnection network for massively parallel processing (MPP) computers having thousands of compute nodes. MPP interconnection networks, like switching networks, require high bandwidth and fault tolerance. One MPP interconnection network is that of Danny Hillis's well-known Connection Machine, described in U.S. Pat. No. 4,598,400. In the Connection Machine, a hypercube architecture is used for communication between clusters of processors. In that patent, a hybrid form of circuit and cell switching is used. On each routing cycle, an attempt is made to form a path from the source of a message packet to the destination. When successful, a message travels the entire path in a single routing cycle. In the event that a complete route is unavailable, the packet is delayed until the next routing cycle.
Some background regarding hypercubes is required for an understanding of the prior art and of the present invention.
A binary hypercube is defined in terms of graph theory. A graph is a set G={V, E} where V is a set of nodes (also called vertices) and E is a set of edges connecting the nodes. In general, a graph can have any number of nodes and any number of edges up to the number of edges in a completely connected graph, which is limited to v(v−1)/2 where v=|V| (the size of the set V). A binary hypercube is defined in the following way:
A D-dimensional binary hypercube is a graph G={V, E}. V is a set of 2^D nodes where each node is given a unique D-digit binary number as an address. An edge exists between two nodes if their addresses differ in exactly one digit.
FIG. 25 shows the first three non-trivial binary hypercubes. A 0-dimensional binary hypercube is simply a single node with no edges. FIG. 25(a) shows a 1-dimensional binary hypercube, which has two nodes and one edge connecting them. FIG. 25(b) shows a 2-dimensional binary hypercube. In this hypercube, there are four nodes and four edges. There is an edge between nodes 01 and 00 since their addresses disagree only in the second position. There is no edge between nodes 10 and 01 since the nodes' addresses disagree in both positions. Edges generated because addresses differ in the first position are said to be in the zeroth dimension; edges generated because addresses differ in the second position are said to be in the first dimension; etc. FIG. 25(c) shows a 3-dimensional binary hypercube. The hypercube addresses are constructed from right to left. Edges that exist because addresses differ in the right-most digit are said to be connecting nodes in the first dimension (or sometimes zeroth dimension if counting starts from zero); edges that exist because addresses differ in the second digit from the right are said to be connecting nodes in the second dimension; and so on. In the diagrams, edges in the first dimension are drawn vertically, edges in the second dimension are drawn horizontally, and edges in the third dimension are drawn diagonally.
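The adjacency rule in the definition (an edge wherever two addresses differ in exactly one digit) can be sketched in a few lines. This is an illustrative sketch only; the function names `hypercube_neighbors` and `hypercube_edges` are hypothetical and addresses are represented as plain integers.

```python
def hypercube_neighbors(address: int, dimensions: int) -> list[int]:
    """Return the addresses that differ from `address` in exactly one bit."""
    return [address ^ (1 << bit) for bit in range(dimensions)]


def hypercube_edges(dimensions: int) -> set[tuple[int, int]]:
    """Return every undirected edge as a (low, high) address pair."""
    edges = set()
    for node in range(1 << dimensions):
        for neighbor in hypercube_neighbors(node, dimensions):
            edges.add((min(node, neighbor), max(node, neighbor)))
    return edges


# The 2-dimensional case: four nodes, four edges.
edges_2d = hypercube_edges(2)
print(len(edges_2d))            # 4
print((0b00, 0b01) in edges_2d)  # True: addresses differ in one digit
print((0b01, 0b10) in edges_2d)  # False: addresses differ in both digits
```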
Hypercubes have several useful properties. The number of edges in a D-dimensional binary hypercube is D(2^D)/2, since there is one edge per dimension ending at each node, there are 2^D nodes, and each edge has two ends (being bidirectional). The significance of this formula is that the number of edges grows proportionally to the number of nodes times the log of the number of nodes. If distance between nodes is measured as the number of edges in the shortest path between the two nodes, then the longest distance in a D-dimensional hypercube is D. If p is the distance between two nodes, then p! is the number of shortest paths between the nodes.
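These properties can be checked numerically for small hypercubes. The sketch below is a verification aid under the assumption that addresses are integers; the helper names are hypothetical. It counts shortest paths with a breadth-first search and compares the result with p!.

```python
from collections import deque
from math import factorial


def edge_count(dimensions: int) -> int:
    """Number of edges: D * 2^D / 2."""
    return dimensions * 2**dimensions // 2


def distance(a: int, b: int) -> int:
    """Shortest-path length: the number of differing address digits."""
    return bin(a ^ b).count("1")


def count_shortest_paths(source: int, target: int, dimensions: int) -> int:
    """Count shortest paths from source to target by breadth-first search."""
    dist = {source: 0}
    ways = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for bit in range(dimensions):
            nxt = node ^ (1 << bit)
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                ways[nxt] = ways[node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                ways[nxt] += ways[node]
    return ways[target]


print(edge_count(3))                          # 12
print(distance(0b000, 0b111))                 # 3, the maximum for D = 3
print(count_shortest_paths(0b000, 0b111, 3))  # 6, which equals 3!
```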
Comparing hypercubes to completely connected networks, although path length is always a constant 1 in a completely connected network, a completely connected network of n nodes requires n(n−1)/2 edges, meaning the number of edges grows proportionally to the square of the number of nodes. A hypercube, as already mentioned, has many fewer edges for large n. Comparing hypercubes to sparse networks such as rings, a ring of n nodes may require only n edges, but the maximum distance between two nodes is n/2 and there are only two paths between any two nodes. In a hypercube, the maximum distance between any two nodes is the log of the number of nodes, and there are many available paths. In addition, every node in a hypercube has the same structure: there are no special nodes like the nodes at the center of a star network, where the entire network is disconnected if the center is removed.
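To make the comparison concrete, the following sketch tabulates edge count and maximum distance for the three topologies at a given node count. The function name `topology_costs` is hypothetical, and the node count is assumed to be a power of two so that a full hypercube exists.

```python
def topology_costs(n: int) -> dict[str, dict[str, int]]:
    """Edge count and maximum distance for n nodes (n must be a power of two)."""
    d = n.bit_length() - 1
    if n < 2 or (1 << d) != n:
        raise ValueError("n must be a power of two, at least 2")
    return {
        "complete": {"edges": n * (n - 1) // 2, "max_distance": 1},
        "ring": {"edges": n, "max_distance": n // 2},
        "hypercube": {"edges": d * n // 2, "max_distance": d},
    }


for name, cost in topology_costs(1024).items():
    print(name, cost)
# complete {'edges': 523776, 'max_distance': 1}
# ring {'edges': 1024, 'max_distance': 512}
# hypercube {'edges': 5120, 'max_distance': 10}
```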
Two approaches to switching in hypercubes are circuit switching and cell switching. Circuit switching typically reserves an entire path for a data flow through the hypercube for an arbitrary duration, much like the end-to-end connection of a telephone call. The advantage of circuit switching is that arbitrary amounts of data may be transferred at wire speed along the circuit in-order and with low latency. The disadvantage is that the circuit consumes resources that will not be available to other data flows until the circuit is released, regardless of how much data is actually being transmitted on the circuit. Cell switching requires all data packets to be of a fixed size (called a cell). No fixed paths through the hypercube are allocated. Instead, every node receives cells and determines locally on which edge to transmit the cell to move the cell closer to its destination. In cell switching, different cells of the same data flow may traverse different paths from source to destination, and different data flows may share the same path. This form of switching is also called store-and-forward or hop-by-hop routing.
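The local, hop-by-hop decision can be illustrated with a minimal sketch. It assumes, as an illustration only, that each node compares its own address with the destination address and forwards along the lowest dimension in which they differ; a real switching element may pick any differing dimension, and the names `next_hop` and `route` are hypothetical.

```python
from typing import Optional


def next_hop(current: int, destination: int) -> Optional[int]:
    """Pick a neighbor one step closer to the destination, or None on arrival."""
    differing = current ^ destination      # set bits mark dimensions still to cross
    if differing == 0:
        return None                        # the cell has reached its destination
    lowest_bit = differing & -differing    # assumption: always cross the lowest dimension
    return current ^ lowest_bit


def route(source: int, destination: int) -> list[int]:
    """Follow next_hop decisions node by node and record the path taken."""
    path = [source]
    while (hop := next_hop(path[-1], destination)) is not None:
        path.append(hop)
    return path


print(route(0b000, 0b101))  # [0, 1, 5]
```

Each hop removes one differing digit, so the path length equals the distance between the two addresses.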
Although hypercube architectures have been used in MPP interconnection networks, the technical requirements within these two different application spaces (telecommunications switching and MPP interconnection) are considerably different. Reliability and speed are key issues for telecommunications. Telecommunications equipment must meet standards of reliability required within the telecommunications industry. This includes the reliability of service experienced by a customer as well as the dependability. If a customer has paid for data communications to support video transmission, then corrupted, late, or dropped data can result in unsatisfactory service. In an MPP network, late data may slow the speed of computation, but typically it does not invalidate a particular computation. In a telecommunications setting, a device must also be protocol capable to the extent needed by a particular context. Typically, telecommunications protocols are more complex than MPP communications protocols. Because the requirements of telecommunications switching are the more stringent, an excellent technology for telecommunications switching might also provide an excellent basis for an MPP interconnection. The telecommunications switching technologies described herein are therefore believed to be equally applicable to MPP interconnection networks.
There is a need for a switching architecture that offers a powerful, simple, and in many ways elegant solution to the problem of providing cost-effective, high-bandwidth, fault-tolerant data switching.
Generally speaking, the switching architecture of the present invention offers a powerful, simple, and in many ways elegant solution to the problem of providing cost-effective, high-bandwidth, fault-tolerant cell switching. The architecture is based on a network of switching elements connected in a hypercube topology to form a switch fabric.
In the exemplary implementation, a node in the graph of a hypercube corresponds to a switching element. The edges in the graph of a hypercube correspond to the wires in the internal fabric of the switch. Each switching element is in some way attached to the outside world, which in a network is another switch or a communicating end station. A Source Sink Element (SSE) is a portion of the switch element that passes cells between the outside world and the switch fabric. A Saturated Constant Shuffle Router (SCS Router) is a portion of the switch element that together with connections to other SCS Routers implements the switch fabric.
The present invention is an example of cell switching, providing an efficient means for the transfer of data through a hypercube. A variety of features contribute to high Quality of Service (QoS) as measured by cell loss, cell delay and cell jitter (variance of cell delay). Zero cell loss is possible in theory, and in practice cell loss can be reduced to an arbitrarily-chosen statistical bound, as may be seen from an analysis of the inputs and outputs of an SCS router. Considering first internal links (neighbor to neighbor), D cells may be output from a switching element every cycle. Likewise, D cells may be input to an SCS router every cycle, one on each of D links. A small number Q of additional cells may be buffered within each SCS router, providing additional candidates for routing and thereby increasing wire utilization. During a given cycle, the contents of a queue buffer may be delivered, transmitted on some wire, or returned to the queue. The contents of a queue buffer may also be replaced by a cell received during the previous cycle. A switching element is therefore able to handle at least D+Q cells every cycle. The switching element's data source may also, on any given cycle, have a cell to inject into the fabric, for a worst-case maximum of D+Q+1 cells to be handled by the switching element during a given cycle. During the same given cycle, the switching element may be unable to output a cell to its data sink, because no cell with a zero routing code is present. If injection were allowed in this case, then a cell would have to be dropped by the router since there are only D+Q output channels available and D+Q+1 cells to route. Cell loss is avoided in this case by disallowing cell injection during the cycle. The data source is immediately informed of cell injection failure and may re-attempt cell injection during the following CEC. This behavior is distinct from and more efficient than head-of-line blocking encountered in many other switches.
Fundamentally then, the switch is capable of zero cell loss.
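The counting argument behind zero cell loss can be sketched as a simple admission check. This is a simplified model, not the router's actual logic: it assumes at most one cell is delivered to the data sink per cycle, treats a routing code of zero as "this element is the destination", and uses the hypothetical name `injection_allowed`.

```python
def injection_allowed(routing_codes: list[int], dimensions: int, queue_slots: int) -> bool:
    """Decide whether the data source may inject one more cell this cycle.

    routing_codes: one code per cell already held (received or queued);
                   a code of zero means the cell can go to the local sink.
    dimensions:    D, the number of outgoing links.
    queue_slots:   Q, the number of queue buffers.
    """
    places = dimensions + queue_slots      # D link outputs plus Q buffers
    if any(code == 0 for code in routing_codes):
        places += 1                        # assumption: one cell leaves via the sink
    # Admit the new cell only if every cell, including it, still has a place.
    return len(routing_codes) + 1 <= places


D, Q = 3, 2
print(injection_allowed([1, 2, 3, 4, 5], D, Q))  # False: D+Q cells, none deliverable
print(injection_allowed([0, 2, 3, 4, 5], D, Q))  # True: one cell exits to the sink
print(injection_allowed([1, 2], D, Q))           # True: spare capacity remains
```

Refusing the injection, rather than accepting it and dropping a cell, is what keeps the total number of cells within the D+Q places available.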
Furthermore, by speedup of the switch fabric, the switch may be tuned to run at a point where cell delay is statistically bounded. A cell reordering mechanism prevents out-of-order cell delivery while maintaining acceptable limits of cell jitter. By tuning the mechanism appropriately, cell jitter may also be statistically bounded. A single data stream may send cells faster than a destination port can deliver them, because that port is also receiving cells from other input ports. When data streams contract for specific QoS and inject cells only at or below the contracted rate, overloading an output port will not happen. When a data stream is not bound to a specific QoS, it may burst cells that would take resources from better behaved data streams of the same priority. Mechanisms to prevent adverse effects to better behaved data streams at a destination port are implemented using per-VC (Virtual Channel) queueing at an SSE.
The switching architecture provides for fault tolerance and congestion avoidance. In an exemplary embodiment, every switching element checks each of its links at regular intervals, preferably every cycle. (Dummy cells, or bubble cells, are exchanged on idle links.) In the case of repeated failure to receive a cell with a valid header on a given input link, a corresponding output link is marked as faulty, and traffic is routed around the faulty link. That is, if a switching element notices that it is unable to receive cells from a neighboring switching element, then it will not send cells to that neighboring switching element. This same fault tolerance feature allows for incremental switch sizing, partial hypercube configurations, and the ability to dynamically add to or remove from the fabric. Switching elements that are not physically present have their links marked as faulty by those switching elements that are present such that no traffic is routed to or through the missing switching elements.
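The link-checking behavior can be sketched as a per-link counter of consecutive failures. The threshold value, the class name `LinkMonitor`, and the rule that a valid header clears the fault (which would let elements be added back dynamically) are assumptions made for illustration; the text specifies only that repeated failures mark the corresponding output link as faulty.

```python
class LinkMonitor:
    """Tracks, per link, whether recent cells arrived with valid headers."""

    def __init__(self, dimensions: int, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold    # assumed value
        self.consecutive_failures = [0] * dimensions
        self.faulty = [False] * dimensions

    def observe(self, link: int, header_valid: bool) -> None:
        """Record the result of one cycle's check on an input link."""
        if header_valid:
            # Assumption: a valid cell (data or bubble) clears the fault.
            self.consecutive_failures[link] = 0
            self.faulty[link] = False
        else:
            self.consecutive_failures[link] += 1
            if self.consecutive_failures[link] >= self.failure_threshold:
                self.faulty[link] = True              # stop sending on this link

    def usable_links(self) -> list[int]:
        """Links on which cells may still be sent."""
        return [link for link, bad in enumerate(self.faulty) if not bad]


monitor = LinkMonitor(dimensions=3)
for _ in range(3):
    monitor.observe(link=1, header_valid=False)
print(monitor.usable_links())  # [0, 2]
```

A neighbor that is simply absent never sends a valid header, so its link is marked faulty by the same rule, which matches the partial-hypercube behavior described above.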
In any switch, output channel congestion can occur when multiple sources send to a single destination and the combined data rate of the sources exceeds the destination's maximum bandwidth. The ATM data protocol attempts to minimize this possibility by providing bandwidth contracting and policing of data sources. Important legacy data protocols, like IP, typically do not make use of these facilities and suffer output channel congestion. Standard strategies for handling output channel congestion, including injection packet discard (IPD), early packet discard (EPD), and partial packet discard (PPD), take advantage of packetized data to minimize packet retransmission after cell loss due to output channel congestion. Data is packetized when a data protocol segments arbitrarily large amounts of data into streams of contiguous, fixed-length cells. In ATM, the last cell of a data packet is indicated in the ATM cell header by the packet boundary indicator (PBI). When a single cell of a packet is lost, the entire packet must be retransmitted. Since the packet will be retransmitted, the various packet discard strategies minimize congestion by actively dropping the remaining cells of a packet (not including the last cell with the PBI) to eliminate current congestion. In the exemplary embodiment, an output channel congestion avoidance mechanism is implemented using trouble indicator bits contained within a header of each cell. Output channel congestion is evident when a switching element retrogrades cells for which it is the destination because its delivery channels are full. Congestion is relieved by sending messages to the sources of such traffic to turn them off. Messaging may be accomplished through a central facility or in a distributed manner by flooding the cube with messages containing information that a particular switching element is congested.
Sources receiving a congestion message for a particular switching element enter a packet discard mode in which packets destined for that switching element are discarded.
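The source-side reaction to a congestion message can be sketched as follows. The class and method names are hypothetical, packets are modeled as (destination, payload) pairs, and the way a source later leaves packet discard mode is not described in this passage, so it is not modeled here.

```python
class CellSource:
    """A data source that discards packets bound for congested elements."""

    def __init__(self):
        self.congested_destinations: set[int] = set()

    def on_congestion_message(self, switching_element: int) -> None:
        """Enter packet discard mode for the named switching element."""
        self.congested_destinations.add(switching_element)

    def filter_packets(self, packets: list[tuple[int, str]]) -> list[tuple[int, str]]:
        """Return only the packets that may still be injected into the fabric."""
        return [
            (destination, payload)
            for destination, payload in packets
            if destination not in self.congested_destinations
        ]


source = CellSource()
source.on_congestion_message(5)
print(source.filter_packets([(5, "a"), (2, "b")]))  # [(2, 'b')]
```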