Multi-stage scaling of interconnection networks has long been used in circuit switching, where contention-free paths through the network are set up and torn down on a per-connection basis, eliminating the need for buffers in the fabric. Although this approach can in principle also be applied to packet switching, as the network capacity grows in terms of ports and link rate, it quickly becomes infeasible to compute and apply contention-free network configurations within the duration of a single packet. Therefore, buffered switch elements are more suitable as a building block for packet-switched networks. The traditional approach is to use switch elements with individual output buffers per output port or with a shared memory which functions as a shared output buffer for multiple output ports.
A well-known, traditional approach to building large packet-switched interconnection networks out of smaller building blocks is to arrange a number of packet switch elements in a multi-stage topology, such as a Banyan, Benes, Clos, Fat Tree, Hypercube, or Torus. FIG. 1, for example, shows a two-level interconnection network of packet switch elements S1 to S6 that are interconnected in a Fat Tree topology, which utilizes bidirectional links L. The end nodes N1 to N8 of the interconnection network are located at the bottom of the tree. Each packet switch element S1 to S6 routes a data packet from an ingress or input port IP to one or more egress or output ports OP. The full network of interconnected packet switch elements S1 to S6 is capable of routing packets from any of the end nodes N1 to N8 to any of the other end nodes N1 to N8.
In general, the two-level Fat Tree topology lets an N-port switch element support a network of ½*N² end nodes, with full bisection bandwidth. In FIG. 1 the two-level packet-switched interconnection network comprises switching elements S1 to S6 with N=4 ports and therefore supports a network with ½*N²=8 end nodes N1 to N8. Other topologies may allow more or fewer end nodes to be interconnected, with more or less bandwidth between end nodes.
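The sizing rule above can be checked with a short sketch. The function name and the split into first-level ("leaf") and second-level ("spine") switches are illustrative assumptions, not terms from this description; the split assumes that each first-level switch uses half of its ports for end nodes and half for uplinks, which is consistent with the six switch elements and eight end nodes of FIG. 1.

```python
def two_level_fat_tree(ports_per_switch: int) -> dict:
    """Size a two-level Fat Tree built from identical N-port switch elements.

    Assumption (not stated in the text): every first-level switch uses N/2
    ports for end nodes and N/2 ports for uplinks, one to each second-level
    switch, so that full bisection bandwidth is preserved.
    """
    n = ports_per_switch
    if n < 2 or n % 2:
        raise ValueError("expected an even port count of at least 2")
    down_ports = n // 2      # end-node ports per first-level switch
    leaf_switches = n        # each second-level switch has one port per leaf
    spine_switches = n // 2  # one second-level switch per leaf uplink
    return {
        "end_nodes": leaf_switches * down_ports,  # N * N/2 = N^2 / 2
        "leaf_switches": leaf_switches,
        "spine_switches": spine_switches,
    }


if __name__ == "__main__":
    print(two_level_fat_tree(4))
```

For N=4 the sketch yields eight end nodes served by four first-level and two second-level switch elements.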
In the packet-switched interconnection network shown in FIG. 1 the basic packet switch elements S1 to S6 can be constructed in a variety of ways, but all packet switch elements S1 to S6 incorporate means for routing packets from ingress ports IP to egress ports OP. For example, in FIG. 2, the packet switch element S1 comprises the routing fabric R1. Furthermore, a multi-stage packet-switched interconnection network comprises links L for interconnecting the packet switch elements S1 to S6 to end nodes N1 to N8 and to other packet switch elements S1 to S6. In addition, the packet switch elements S1 to S6 comprise input packet buffers I(1,1) to I(6,4) and output packet buffers O(1,1) to O(6,4), which are located at the ingress ports IP and the egress ports OP of the packet switch elements S1 to S6 respectively. Additionally, to arrange such a packet switch element S1-S6 in a multi-stage topology and avoid uncontrollable packet loss due to input or output buffer overruns, each packet switch element S1 to S6 comprises some means for controlling the flow of packets between subsequent stages or packet switch elements.
Traditional packet switch elements have buffers at the ingress ports IP as well as at the egress ports OP of the fabric, as shown in FIG. 2. This allows straightforward, point-to-point link-level flow control between the egress buffers O(n,x) of stage n and the ingress buffers I(n+1,y) of stage n+1. Flow control information is transmitted in the direction opposite to the direction of the packet flow. Assuming that all links L are bidirectional, the flow control information can be transmitted in-band, on the same links L used for transmitting the data packets in the reverse direction, i.e., from stage n+1 to stage n. In this context, in-band transmission means that the flow control information is transmitted over the links L and not over the control lines CL.
FIG. 2 furthermore illustrates a local flow control loop FC1 and a remote flow control loop FC2 in the interconnection network comprising input buffers I as well as output buffers O. This type of switch element structure may be termed “Combined Input- and Output-queued” (CIOQ). Typically, the input and output buffers I and O of the same port P physically reside on the same line card.
As an example of how a remote flow control loop operates in such a CIOQ packet switch element, consider the remote flow control loop FC2 between port P4 of packet switch element S1 and port P2 of packet switch element S2. The flow control information is generated by the input buffer I(2,2), e.g., by the release of a credit or assertion of a stop signal because of the crossing of a threshold. The input buffer I(2,2) passes this information internally to the output buffer O(2,2), which inserts the flow control information in the header of a data packet traveling via link L to the input buffer I(1,4), or injects an idle packet if there is no data packet. The input buffer I(1,4) extracts the flow control information from the header of the received data packet and passes it internally to the output buffer O(1,4), which performs the flow control bookkeeping, e.g. incrementing the available credit or (re-)setting an on/off (start/stop) flag.
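The loop FC2 described above can be sketched as follows, as an illustration only. The names `Packet`, `Port`, `release`, `transmit` and `receive` are hypothetical, credits are counted in whole packets, and the sketch assumes the credit variant of the scheme rather than the stop-signal variant.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    payload: Optional[bytes]  # None marks an idle packet
    credit: int = 0           # flow control field carried in the header


class Port:
    """One port with its input and output buffer on the same line card."""

    def __init__(self, remote_credit: int):
        self.tx_queue = deque()         # output buffer: packets awaiting the link
        self.tx_credit = remote_credit  # bookkeeping kept by the output buffer
        self.rx_queue = deque()         # input buffer
        self.credits_to_return = 0      # released by the input buffer

    def release(self) -> bytes:
        """Input buffer forwards a packet into the fabric, freeing one slot."""
        payload = self.rx_queue.popleft()
        self.credits_to_return += 1     # passed internally to the output buffer
        return payload

    def transmit(self) -> Optional[Packet]:
        """Output buffer emits a data packet, an idle packet, or nothing."""
        fc, self.credits_to_return = self.credits_to_return, 0
        if self.tx_queue and self.tx_credit > 0:
            self.tx_credit -= 1
            return Packet(self.tx_queue.popleft(), fc)  # piggybacked credit
        if fc:
            return Packet(None, fc)     # no data to send: inject an idle packet
        return None

    def receive(self, pkt: Packet) -> None:
        """Input buffer extracts the credit and stores any payload."""
        self.tx_credit += pkt.credit    # bookkeeping done by the output buffer
        if pkt.payload is not None:
            self.rx_queue.append(pkt.payload)


if __name__ == "__main__":
    p14, p22 = Port(remote_credit=2), Port(remote_credit=2)  # S1/P4 and S2/P2
    p14.tx_queue.extend([b"a", b"b", b"c"])
    p22.receive(p14.transmit())
    p22.receive(p14.transmit())
    print(p14.transmit())               # None: credit for I(2,2) is exhausted
    p22.release()                       # I(2,2) frees a slot
    p14.receive(p22.transmit())         # idle packet returns the credit
    print(p14.transmit())
```

In the usage part, port P4 of S1 stalls after two packets and resumes only once the idle packet from S2 has returned a credit.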
The expression “available credit” shall be understood as a counter value which indicates how many packets can be received from a transmitter without losing data. That is, the available credit represents the size of the currently free buffer space of the receiving buffer. Therefore, the higher the available credit of a certain receiving buffer is, the more data packets said receiving buffer can receive without data loss. By communicating the available credit, the receiving buffer can inform a transmitter about how much data the transmitter can still send without causing a buffer overflow.
Changes in the available credit of a receiving buffer, e.g., owing to the departure of a packet, can be communicated by absolute credits, indicating the absolute size of the available buffer space, or by incremental credits, which indicate only the change in the size of the available buffer space. An incremental credit comprises a credit count, indicating the magnitude of the change in the available credit, and optionally an identifier indicating the receiving buffer. The transmitter maintains a credit counter that reflects the available credit of the corresponding receiving buffer. Upon receipt of an incremental credit, this counter is incremented according to the indicated value. Upon transmission of data, it is decremented according to the size of the transmitted data.
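A minimal sketch of the transmitter-side bookkeeping for incremental credits follows. The class and field names (`IncrementalCredit`, `Transmitter`, `buffer_id`, `default_buffer`) are hypothetical, and the fallback to a default buffer when the optional identifier is absent is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IncrementalCredit:
    count: int                       # magnitude of the change in available credit
    buffer_id: Optional[str] = None  # optional identifier of the receiving buffer


class Transmitter:
    """Keeps one credit counter per receiving buffer."""

    def __init__(self, initial_credits: dict, default_buffer: str):
        self.credits = dict(initial_credits)
        # Assumption: a credit without identifier applies to this buffer.
        self.default_buffer = default_buffer

    def try_send(self, buffer_id: str, size: int) -> bool:
        """Send only if the receiver has room; decrement by the data size."""
        if size > self.credits[buffer_id]:
            return False             # sending now could overflow the receiver
        self.credits[buffer_id] -= size
        return True

    def on_credit(self, credit: IncrementalCredit) -> None:
        """Increment the matching counter by the indicated value."""
        target = credit.buffer_id or self.default_buffer
        self.credits[target] += credit.count


if __name__ == "__main__":
    tx = Transmitter({"I(2,2)": 4}, default_buffer="I(2,2)")
    print(tx.try_send("I(2,2)", 3))  # accepted, one unit of credit remains
    print(tx.try_send("I(2,2)", 2))  # refused, not enough credit
    tx.on_credit(IncrementalCredit(2))
    print(tx.credits)
```

An absolute credit would overwrite the counter instead of incrementing it; that case is left out here because its correct handling depends on how data still in flight on the link is accounted for.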
The expression “on/off” (also referred to as “start/stop”) flow control denotes a command with which a data packet transmission from an output buffer to an input buffer, or from an input buffer via a packet switch element to an output buffer, is started (on) or stopped (off).
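On/off flow control driven by a buffer threshold can be sketched as below. The class name `OnOffReceiver` and the use of two separate thresholds (`stop_at` and `start_at`) are assumptions for illustration; the description above only states that a stop signal may be asserted when a threshold is crossed.

```python
from typing import Optional


class OnOffReceiver:
    """Receiving buffer that issues 'off' and 'on' commands by fill level."""

    def __init__(self, stop_at: int, start_at: int):
        if not 0 <= start_at < stop_at:
            raise ValueError("need 0 <= start_at < stop_at")
        self.stop_at = stop_at    # fill level at which 'off' is issued
        self.start_at = start_at  # fill level at which 'on' is issued again
        self.fill = 0
        self.on = True            # transmitter is initially allowed to send

    def enqueue(self) -> Optional[str]:
        """A packet arrives; returns a command if the flag flips."""
        self.fill += 1
        return self._update()

    def dequeue(self) -> Optional[str]:
        """A packet departs; returns a command if the flag flips."""
        self.fill -= 1
        return self._update()

    def _update(self) -> Optional[str]:
        if self.on and self.fill >= self.stop_at:
            self.on = False
            return "off"
        if not self.on and self.fill <= self.start_at:
            self.on = True
            return "on"
        return None


if __name__ == "__main__":
    rx = OnOffReceiver(stop_at=3, start_at=1)
    print([rx.enqueue() for _ in range(3)])
    print([rx.dequeue() for _ in range(2)])
```

In a real buffer the stop threshold has to sit below the physical capacity, since packets already in flight on the link still arrive after the stop command has been issued.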
Pertaining to a switch element S under consideration, the expression “local” shall be understood as pertaining to the same switch element S, e.g., in FIG. 2, routing fabric R1, arbiter A1, input buffers I(1,1) . . . I(1,4), and output buffers O(1,1) . . . O(1,4) are all considered local with respect to switch element S1.
Pertaining to a switch element S under consideration, the expression “remote” shall be understood as pertaining to a switch element connected to the switch element S via one or more links L. E.g., in FIG. 2, output buffer O(2,2), as a part of switch element S2, is considered remote with respect to switch element S1.
The above mentioned switch and network designs are described in Chapter 2 of Andrew Tanenbaum, “Computer Networks”, Prentice Hall PTR, 2002, and in Chapter 1 of Jose Duato, Sudhakar Yalamanchili, Lionel Ni, “Interconnection Networks”, Morgan Kaufmann, 2002.