A system architecture that is suitable for high capacity switches includes a set of nodes, each node containing external input/output (I/O) ports as well as being part of a distributed switching fabric.
U.S. Pat. No. 6,370,145 (Dally, et al.) describes an example of a switching system (an internet router) composed of a multi-hop network of fabric routers (nodes or switch elements), which effectively constitute a distributed switch fabric providing connectivity between the I/O ports contained within the fabric routers.
User data traffic may enter the system at an I/O port of one of the nodes (the ingress node) and leave through an I/O port of another node (the egress node). Traffic may be routed from one I/O port of a node to an I/O port on the same node, but the case of greater interest is where the egress node differs from the ingress node. If the ingress node does not have a direct link to the egress node, data traffic is switched through a number of intermediate nodes acting as tandem nodes.
In a distributed fabric architecture, all nodes are of equal or similar design and contain means to fulfill the roles of ingress, tandem, and egress nodes dynamically as required.
To switch traffic, virtual circuits (VCs) are set up between ingress nodes and egress nodes, where the forward channel is used to transmit user data, and the reverse channel carries flow control (back pressure) signals. The reverse channel may also carry user data in the opposite direction, in which case the flow control signals are usually combined with the user traffic.
Such a system architecture relies on large input buffers and output buffers associated with the I/O ports of each node, and on an end-to-end flow control regime, to guarantee a high quality of service. On its way from an ingress node to an egress node, however, traffic bypasses the I/O port buffers in the nodes that are acting as tandem nodes. As in any multi-stage fabric, the internal links between the nodes can be overloaded if appropriate measures are not taken.
A commonly used measure to prevent data loss is to provide link-by-link flow control on the internal links between the nodes. This is a second type of backpressure or flow control, in addition to the end-to-end flow control regime provided between ingress and egress nodes. For cost and delay reasons, the sizes of buffers in the tandem nodes are kept small, requiring a very fast flow control mechanism capable of providing rapid backpressure to the port buffers.
U.S. Pat. No. 6,285,679 (Dally, et al.) describes a multi-hop distributed switch system in which virtual circuits (VCs) are set up between ingress and egress nodes, through tandem nodes containing small buffers, one per VC, that can fill up quickly. A credit-based flow control scheme is employed to propagate the state of the tandem buffers back to the ingress nodes.
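The general idea of credit-based flow control can be illustrated with a minimal sketch. This is a generic illustration and not the scheme of the cited patent; the class name `CreditSender` and its methods are hypothetical. It assumes that each credit corresponds to one free buffer slot at the downstream node for a given VC.

```python
class CreditSender:
    """Transmitter side of a generic credit scheme for one VC (illustrative only)."""

    def __init__(self, initial_credits):
        # Assumption: initial credits equal the downstream buffer slots for this VC.
        self.credits = initial_credits

    def try_send(self):
        """Consume one credit and report True if a cell may be sent now."""
        if self.credits == 0:
            return False  # downstream buffer may be full; hold the cell
        self.credits -= 1
        return True

    def on_credit_return(self, count=1):
        """Called when the receiver reports that it has freed `count` slots."""
        self.credits += count


sender = CreditSender(initial_credits=2)
sent = [sender.try_send() for _ in range(3)]  # third attempt is blocked
sender.on_credit_return()
sent_after_return = sender.try_send()
```

Because the sender never transmits without a credit, a small per-VC buffer cannot be overrun, provided credits are returned reliably.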
Complications may arise in the design of the nodes and the backpressure mechanism when the links between the nodes are not direct but are made up of multiple links in parallel. This arrangement may be chosen to provide a higher transmission capacity between nodes than is possible, or economically viable, with a direct (back plane or fiber link) connection. However, the higher capacity that becomes available must be utilized effectively to carry both the traffic stream and the flow control signals.
Ribbon fiber cables and high-speed multi-fiber electro-optical transceiver modules have recently become available to enable such a system design. U.S. Pat. No. 6,307,906 (Tanji, et al.) describes the basic concept of using a ribbon fiber cable for module interconnect, including a clock and data recovery scheme. Unfortunately, using a ribbon fiber cable as a parallel bus to interconnect the modules of a system has some disadvantages, particularly when errors or failures of individual links within the cable are considered. For example, when the cable is used as a simple parallel bus, the loss of an individual link renders the entire bus unusable.
Another way to use a ribbon fiber cable is to treat each fiber as a serial channel (carrying complete cells or packets), and then use an inverse multiplexing scheme to distribute the traffic over the fibers in the cable, typically in a round-robin mode. With this method, the failure of a single link results only in some loss of capacity.
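The round-robin distribution described above can be sketched as follows. This is a simplified illustration, assuming that the set of working fibers is known to the transmitter; the function name `distribute_round_robin` and the `link_up` flags are hypothetical, and cell resequencing at the receiver is not shown.

```python
def distribute_round_robin(cells, link_up):
    """Assign each cell to the next working link in round-robin order.

    cells   -- sequence of cells (any objects) in transmission order
    link_up -- list of booleans, one per fiber; False marks a failed link
    Returns a dict mapping link index to the list of cells sent on it.
    """
    active = [index for index, up in enumerate(link_up) if up]
    if not active:
        raise RuntimeError("no working links in the bundle")
    per_link = {index: [] for index in active}
    for position, cell in enumerate(cells):
        per_link[active[position % len(active)]].append(cell)
    return per_link


all_up = distribute_round_robin(list(range(6)), [True, True, True])
one_failed = distribute_round_robin(list(range(6)), [True, False, True])
```

With one of three links failed, the same six cells are carried by the two remaining links, so the bundle stays usable at reduced capacity rather than failing as a whole.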
Inverse multiplexing was first proposed on a network scale, to bundle multiple lower speed links into a single higher speed logical link. Network scale inverse multiplexing is described in numerous U.S. patents, among them U.S. Pat. No. 5,608,733 (Vallee, et al.) and U.S. Pat. No. 5,875,192 (Cam, et al.).
The use of inverse multiplexing on a module-to-module scale is described in U.S. Pat. No. 6,188,699 (Lang, et al.). In such a scheme each physical link uses individual transmit and receive circuits for conveying data from the transmitter to the receiver, and common management circuits and packet buffer processors for coordinating the transfer over the group of physical links.
However, existing inverse multiplexing schemes are only adapted to the transfer of data between nodes that are capable of terminating packet (cell or ATM) protocols in the case of network scale inverse multiplexing, or contain network processors or the like in the case of module scale inverse multiplexing.
In a large switching system with a large number of internal virtual circuits (VCs) each requiring a queue per VC in each tandem node through which the VC passes, and with very high speed links connecting the nodes to each other, there are two important requirements: the cost of the intermediate buffers must be kept as low as possible, but their sizes must be adequate to handle the feedback volume.
Feedback volume is a term used to describe the amount of traffic (number of data packets) that will arrive at a receiver after the receiver has sent a backpressure signal to the transmitter. The feedback volume depends on the link speed, and on the delay of both the data path from the transmitter to the receiver, and the feedback path from the receiver to the transmitter.
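The dependence of the feedback volume on link speed and delay can be illustrated with a simple estimate. This sketch assumes that the transmitter sends continuously at the full link rate until the backpressure signal takes effect, and that the relevant delay is the sum of the feedback path delay and the data path delay; the function name, the parameter names, and the fixed cell size are assumptions made for illustration.

```python
import math


def feedback_volume_cells(link_rate_bps, data_path_delay_s,
                          feedback_path_delay_s, cell_bits):
    """Estimate the number of cells that arrive after backpressure is sent.

    Assumes a transmitter sending back-to-back cells at the full link rate.
    """
    loop_delay_s = data_path_delay_s + feedback_path_delay_s
    return math.ceil(link_rate_bps * loop_delay_s / cell_bits)


# Hypothetical figures: 10 Gb/s link, 1 microsecond each way, 512-bit cells.
short_loop = feedback_volume_cells(10e9, 1e-6, 1e-6, 512)
# Doubling both delays roughly doubles the buffer space that must be reserved.
long_loop = feedback_volume_cells(10e9, 2e-6, 2e-6, 512)
```

Under these assumptions the receiver must reserve at least this many cell spaces above the point at which it signals backpressure.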
A large number of VCs implies a large number of queues, and also a large number of flow control signals, which must be conveyed rapidly from the receivers to the transmitters. A large amount of flow control traffic requires a significant amount of bandwidth that is then not available for data traffic. If less bandwidth is made available for flow control, the end-to-end delay for flow control signals from receiver to transmitter is increased, which has the effect of increasing the required size of buffers at the receiver.
As a consequence, a very careful design decision must be made to provide sufficiently rapid flow control without using up an inordinate amount of bandwidth for control signals.
A reliable method of flow control is based on the concept of continuously reporting the receiver's queue and buffer status to the transmitters. The queue status may be the number of buffer spaces available to the queue of a VC, or it may be a single logical bit to express whether a certain fill threshold has been exceeded for a queue. The buffer status (irrespective of VC) may similarly be a number expressing the total amount of space available in the buffer, or a single logical bit triggered when a certain fill threshold has been exceeded. Both VC queues and buffer space may be divided according to a number of priority levels, and status information may be generated separately for each priority. The queue and buffer status information can be carried in the header of data packets (cells), including the headers of idle cells, or it can be transmitted in the payload of designated flow control cells. Flow control cells could be transmitted whenever there are no user data cells to be transmitted, but in the critical high-load situation flow control cells must be inserted at a minimum rate.
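The two forms of queue status described above, a count of available buffer spaces and a single threshold bit, can be sketched for one VC queue as follows. The class `VcQueue` and its method names are hypothetical; priority levels and the shared buffer status are omitted for brevity.

```python
from collections import deque


class VcQueue:
    """Receiver-side queue for one VC with two kinds of status report."""

    def __init__(self, capacity, threshold):
        self.capacity = capacity    # buffer spaces allotted to this VC
        self.threshold = threshold  # fill level that triggers the status bit
        self.cells = deque()

    def enqueue(self, cell):
        """Store a cell; return False if the queue is full (cell would be lost)."""
        if len(self.cells) >= self.capacity:
            return False
        self.cells.append(cell)
        return True

    def dequeue(self):
        """Remove and return the oldest cell, or None if the queue is empty."""
        return self.cells.popleft() if self.cells else None

    def free_space(self):
        """Numeric status: buffer spaces still available to this VC."""
        return self.capacity - len(self.cells)

    def over_threshold(self):
        """Single-bit status: True once the fill threshold has been exceeded."""
        return len(self.cells) > self.threshold


queue = VcQueue(capacity=4, threshold=2)
for cell in ("a", "b", "c"):
    queue.enqueue(cell)
status_before = (queue.free_space(), queue.over_threshold())
queue.dequeue()
status_after = (queue.free_space(), queue.over_threshold())
```

The numeric form conveys more information per report, while the single bit costs less flow control bandwidth per VC.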
When multiple links, for example a ribbon multi-fiber cable, are employed to interconnect nodes, the bandwidth available on the multi-fiber link as a whole is the sum of the bandwidths of the individual links. However, under failure conditions, the aggregate bandwidth available on the multi-fiber link as a whole can be reduced, which may increase the feedback volume and cause buffer overruns and data loss.
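The reduction in aggregate bandwidth under failure conditions can be illustrated with a short calculation. The function name and the link figures below are hypothetical, and the sketch assumes that a failed link contributes no bandwidth at all.

```python
def surviving_bandwidth_bps(link_rates_bps, link_up):
    """Sum the bandwidths of the links that are still working."""
    if len(link_rates_bps) != len(link_up):
        raise ValueError("one status flag is required per link")
    return sum(rate for rate, up in zip(link_rates_bps, link_up) if up)


# Hypothetical bundle: twelve fibers at 2.5 Gb/s each.
rates = [2_500_000_000] * 12
healthy = surviving_bandwidth_bps(rates, [True] * 12)
degraded = surviving_bandwidth_bps(rates, [True] * 11 + [False])
```

Any share of the bundle that was budgeted for flow control signals shrinks along with the total unless the design accounts for the failure.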
Accordingly, there is a need in the industry for further development of means and methods of handling data and back pressure signals over such multiple links under variable conditions.