1. Field of the Invention
The present invention relates to the field of packet switching, and more particularly to the field of input-queued packet-switch architectures, with particular applicability to computer interconnection networks.
2. Description of the Related Art
Advances in transmission technologies and parallelism in communications and computing are constantly pushing the envelope of bandwidth available to transfer information. For instance, advances such as wavelength-division multiplexing (WDM) and dense WDM (DWDM) greatly increase available bandwidth by multiplexing large numbers of channels onto a single fiber. Each individual channel operates at one of the Optical Carrier (OC-x) rates, such as OC-48 (2.5 Gb/s), OC-192 (10 Gb/s), or OC-768 (40 Gb/s). Using state-of-the-art DWDM techniques, a single fiber can carry over 5 terabits of data per second.
At the same time, the gap is widening between the increasingly high speeds provided by such advances and the speeds at which available switches are capable of switching. While optical switches provide such theoretical advantages as routing through free space, minimal signal attenuation over large distances, and the elimination of conversion from the optical domain to the electrical domain and back again, current all-optical switches are relatively slow or prohibitively expensive. In addition, optical storage of information is very cumbersome and often impractical. Until the shortcomings of optical switching are overcome, electronic switches will continue to play a dominant role in packet switching schemes.
Typically, a backplane switch, or more generally a routing fabric, is used to interconnect boards. In networking systems, these boards are called line cards, and in computing and storage, they are often called adapters or blades. An increasingly broad set of systems uses backplanes to connect boards, such as telecommunication switches, multiservice provisioning platforms, add/drop multiplexers, digital cross connects, storage switches, routers, large enterprise scale switches, embedded platforms, multiprocessor systems and blade servers.
When information is transmitted from a source to a destination through an interconnect system, it often is first segmented into data packets. Each data packet typically includes header, payload and tail sections, and is further segmented into smaller units. A data packet is switched through a routing fabric simultaneously with other data packets originating from other sources. Many current packet switch systems, including those employed in interconnects for parallel computers, Internet routers, S(t)AN networks, Asynchronous Transfer Mode (ATM) networks, and especially in optical networks, use an input-queuing arrangement that includes queues sorted per output at every line card, a crossbar routing fabric, and a centralized scheduler (e.g., an arbiter or arbitration unit) that allocates switching resources and arbitrates among the queues. Such per-output queuing, often called virtual output queuing (VOQ), eliminates the head-of-line blocking inherent to the use of FIFO queues.
FIG. 1 shows a conventional switching arrangement utilizing VOQ architecture. In the FIG. 1 arrangement, data packets (e.g., cells, frames or datagrams) are received from each of N data links 2a1 to 2aN of respective line cards 102. The data packets are sorted per output 31 to 3N of a routing fabric 106 (shown as an N×N crossbar) into one of N buffers of N buffer groups 121 via a multiplexer 105a. That is, at each input line card 102, a separate queue is maintained for each output 31 to 3N, resulting in N² VOQs at the input side of the routing fabric 106. An arbiter 107 is provided to manage contention between data packets pursuing a same output of the routing fabric 106 and to match the inputs to the outputs. The arbiter 107 communicates with each of the line cards along control paths 108, 109 and provides a switching configuration to the routing fabric 106 along control path 112. Typically, the arbiter 107 is located physically close to the routing fabric 106.
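The per-output queuing described above can be illustrated with a minimal sketch. The class name LineCard, its methods, and the packet labels are hypothetical and are not taken from FIG. 1; the sketch only shows that each of N line cards keeps N queues, giving N² VOQs in total, and that a packet bound for one output does not wait behind a packet bound for another.

```python
from collections import deque


class LineCard:
    """Hypothetical model of one input line card holding one VOQ per output."""

    def __init__(self, num_outputs):
        # One FIFO queue per output of the routing fabric.
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, packet, output):
        # Sort the arriving packet into the queue for its destination output.
        self.voqs[output].append(packet)

    def requests(self):
        # Outputs for which at least one packet is currently waiting.
        return [out for out, queue in enumerate(self.voqs) if queue]

    def dequeue(self, output):
        # Release the head-of-line packet of the queue for the given output.
        return self.voqs[output].popleft()


N = 4  # assumed switch size for illustration
cards = [LineCard(N) for _ in range(N)]
total_voqs = sum(len(card.voqs) for card in cards)  # N * N queues overall

cards[0].enqueue("pkt-A", 2)
cards[0].enqueue("pkt-B", 1)
# pkt-B arrived after pkt-A but sits at the head of its own queue,
# so it can be served first if output 1 is granted before output 2.
```

With a single FIFO per line card, pkt-B would have to wait until pkt-A had been served, which is the head-of-line blocking that VOQ avoids.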
The arbiter 107 performs processes involving allocating input ports and output ports in a non-conflicting manner to packets waiting in the buffers of the buffer groups 121. These processes include allocation and arbitration. Allocation determines a matching between the inputs 2b1 to 2bN and the outputs 31 to 3N of the routing fabric 106 such that at most one packet from each buffer group 121 is selected for output to at most one output resource. Arbitration resolves multiple requests for a single output resource 31 to 3N and assigns a single one of the outputs to one of a group of requesters. In the conventional arrangement of FIG. 1, the arbiter 107 receives requests for switch access from the line cards 102 on the control path 108. The arbiter 107 computes a match, based on received requests and a suitable matching algorithm, to determine which of the inputs 2b1 to 2bN is allowed to forward a data packet to which output in each of a number of time slots. Each line card 102 winning access (i.e., granted access) is sent a control message along the control path 109 to inform the line card 102 that it is permitted to transmit a data unit, such as a packet, in a particular time slot or switching cycle to a specified output. During the time slot, the arbiter 107 transmits the computed switching configuration to the routing fabric 106, and each winning line card 102 releases a unit of a data packet from a queue in its buffer group 121 through demultiplexer 105b and transmits the data unit along its corresponding input 2b1, . . . , or 2bN to the routing fabric 106. Each data packet is then transmitted through the routing fabric 106 along the path configured by the arbiter 107 to the requested one of the outputs 31, . . . , or 3N.
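The request, match and grant cycle described above can be sketched as follows. The matching policy shown here, a single pass with one round-robin pointer per output, is an assumption chosen for brevity; the description above does not specify which matching algorithm the arbiter uses. The function name arbitrate and its arguments are hypothetical.

```python
def arbitrate(requests, pointers):
    """Compute one time slot's matching between inputs and outputs.

    requests[i] is the set of outputs requested by input i.
    pointers[j] is the input at which output j starts its round-robin scan
    (an assumed policy, not one prescribed by the text).
    Returns {input: output}, using each input and each output at most once.
    """
    n = len(requests)
    match = {}
    for out in range(n):
        # Scan the inputs in round-robin order for this output.
        for k in range(n):
            inp = (pointers[out] + k) % n
            if inp not in match and out in requests[inp]:
                match[inp] = out                  # grant: inp may send to out
                pointers[out] = (inp + 1) % n     # advance past the winner
                break
    return match


# Inputs 0 and 1 both request output 0; input 0 also requests output 1.
requests = [{0, 1}, {0}, {2}]
pointers = [0, 0, 0]
first_slot = arbitrate(requests, pointers)
second_slot = arbitrate(requests, pointers)
```

In the first slot, inputs 0 and 1 contend for output 0 and only one of them is granted. In the second slot the round-robin pointer for output 0 has moved on, so the losing input is served. The returned dictionary plays the role of the switching configuration sent to the routing fabric.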
As can be seen, there are two basic paths of communication in such input-queuing systems: control paths, which include flow of control information from the line cards to the arbiter (e.g. requests) and back to the line cards (e.g. grants), and data paths, which include flow of data packets from the input line cards through the crossbar and to the output line cards.
While the conventional packet switching arrangement illustrated in FIG. 1 shows a routing fabric 106 having only one-way communication paths, it will be appreciated that this general concept includes bi-directional data and control paths. For instance, each data link 2b1 to 2bN and respective output 31 to 3N shown in FIG. 1 can be represented as a bi-directional link, such that a line card associated with each buffer group 121 also includes both ingress and egress buffers. In such a case, line cards 102 can be viewed as both source locations and destination locations for transmitting data packets by way of the routing fabric. Similarly, the request and grant paths and the links 2a1 to 2aN also can be represented as bi-directional links.
With increasing capacity, the physical size of packet switches also is growing. At the same time, the duration of a single packet or cell (T=L/B, where L is the length of a packet in bits, and B is the link rate in bits per second) is shrinking because, although the line rate increases, packet sizes remain substantially constant. These trends directly imply a significant jump in the switch-internal round trip (RT) measured in packet times. This effect hits centrally-arbitrated input-queued switches doubly hard, because the minimum transit latency in such a switch is composed of two latencies: (1) the latency of submitting a request to the arbiter and waiting until the corresponding grant arrives, which includes the time-of-flight to and from the arbiter and the time to arbitrate; and (2) the latency of serialization/deserialization (SerDes), transmission, and time-of-flight to send the packet through the switch. Roughly speaking, these latencies amount to a minimum latency of 2 RT, measured in packet times, which is double that of a similar switch having a buffered routing fabric.
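The relationship T=L/B and its effect on the round trip measured in packet times can be illustrated numerically. In the sketch below, the 64-byte cell length and the 256 ns physical round trip are assumed values chosen only for illustration; the link rates are the nominal OC-48, OC-192 and OC-768 rates mentioned earlier. The function names are hypothetical.

```python
def packet_time(length_bits, link_rate_bps):
    # T = L / B: duration of one packet on the link, in seconds.
    return length_bits / link_rate_bps


def round_trip_in_packet_times(rt_seconds, length_bits, link_rate_bps):
    # The same physical round trip spans more packet times as T shrinks.
    return rt_seconds / packet_time(length_bits, link_rate_bps)


L_BITS = 64 * 8        # assumed 64-byte cell
RT_SECONDS = 256e-9    # assumed physical round trip, held constant

for rate in (2.5e9, 10e9, 40e9):
    t_ns = packet_time(L_BITS, rate) * 1e9
    rt = round_trip_in_packet_times(RT_SECONDS, L_BITS, rate)
    # Rough minimum latency: about 2 RT for a centrally-arbitrated
    # input-queued switch versus about 1 RT for a buffered routing fabric.
    print(f"{rate / 1e9:g} Gb/s: T = {t_ns:.1f} ns, "
          f"RT = {rt:.2f} packet times, 2 RT = {2 * rt:.2f} packet times")
```

Under these assumptions, each fourfold increase in line rate makes the packet time four times shorter, so the unchanged physical round trip covers four times as many packet times.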
Because these latencies have become a relevant issue only recently, they have received very little attention. In practice, a preferred solution has been to physically locate boards, such as line cards having input queues (typically organized in a VOQ fashion), close to the routing fabric (e.g., a switch core including a crossbar and arbiter). However, current packaging and power constraints prohibit placing a large number of line cards close to the switch core. As a result, such conventional arrangements cannot address ever-increasing demands for more bandwidth by simply increasing the number of line cards located at the routing fabric.
U.S. Pat. No. 6,647,019 to McKeown et al. attempts to increase the number of line cards, and thus the aggregate system bandwidth, by physically separating the line cards from the routing fabric. The bulk of buffering and processing is implemented on the physically remote line cards. FIG. 2 illustrates a system according to this approach.
As shown in FIG. 2, the system includes a switch core 210 and a plurality of line cards 202 physically located away from the switch core 210. Each line card 202 includes an ingress VOQ buffer group (queue) 221 for storing packets being sent in the forward path and an egress buffer 222 for storing packets in the return path. The switch core 210 includes a plurality of port modules 280 (i.e., “switch ports”), a parallel sliced self-routing crossbar-type fabric module 206 and a centralized arbiter module 207. Data packets are transmitted and received along data links 231 between the line cards 202 and switch ports 280, and along data links 203 between the switch ports 280 and the crossbar-type routing fabric 206. Control messages are sent and received along control paths 232 between the line cards 202 and switch ports 280, and along control links 204 between the switch ports 280 and the arbiter module 207. The arbiter 207 determines a suitable configuration for each time slot and provides the configuration to the routing fabric 206 along configuration link 212. For each ingress port of the switch core 210, a small buffer 281 having VOQs is located close to the switch core 210 to minimize the RT between the VOQs 221 and the arbiter 207. This is achieved by way of a line card-to-switch (LCS) protocol, which enables lossless communication between the line cards 202 and the switch ports 280.
The main drawback of the McKeown et al. approach is that both the line cards 202 and the switch ports 280 contain buffers, even though only a small amount of buffering, namely, enough packets to cover one RT, is required in the switch ports 280. These buffered switch ports add cost, card space, and electrical power, as well as latency (e.g., additional SerDes and buffering). They also duplicate functionality already present in the line cards 202.
Even using the approach described in the McKeown et al. patent, it would be difficult in practice to achieve a round-trip time between the switch ports 280 and the arbiter module 207 that is shorter than one cell time. Moreover, in the specific case of a switch fabric that comprises optical links from line cards to switch core (to cover the long distance from line cards to switch core) and an optical routing fabric, the switch ports 280 would require additional electrical-to-optical and optical-to-electrical conversions for buffering in electrical/CMOS chips because optical buffers are currently not practically or economically feasible. Such added conversion circuitry would significantly increase the cost and complexity of the system.
Another approach to reducing interconnection network latency, presented in W. J. Dally et al., “Principles and Practices of Interconnection Networks,” Morgan Kaufmann, 2004, pages 316-318, involves “speculation with lookahead.” As described in Dally et al., a router's matching arbiter uses speculation with lookahead to look deeper into an ingress VOQ queue than the first member (head of line, or HoL) and allocate ahead of time some switch resources with the expectation (hope) that grants will be offered for those subsequent packets. This approach attempts to reduce the pipeline to as few stages as possible by enabling the router to perform some matching and setup tasks in parallel. While speculation with lookahead benefits queued packets that dwell in the ingress VOQ, and packets that have already made transmission requests that cannot be served immediately by the arbiter, it does not speed the transmission of packets whose transmission requests have not yet been received by the arbiter for consideration and/or packets that have just arrived at the ingress VOQ.
Further, speculation with lookahead addresses latency mostly in the arbiter algorithms and does not address the usually larger transmission time latency from the transmitter to the switch fabric. Earlier concepts of double and even triple speculation (e.g., see page 317 of Dally et al.), which rely on internal switch speedup and light switch loading to speculatively allocate even more of the switch's resources, fail in most applications. In many conventional strictly non-blocking switch fabrics, the fabric is internally partitioned into several successive switching stages. In double and triple speculation, those stages are incrementally set (allocated) for the speculative load. Only when the speedup is extreme or the load is light do speculations in these schemes regularly succeed in granting transmission through the entire multistage fabric. As the load increases, this approach to speculative allocation can hurt performance because it wastefully reserves resources that would be better allocated to successfully arbitrated requests.
Additionally, all of the above-described systems still suffer from the first RT latency in which the line card must wait until a grant arrives after submitting a request for an output resource.
Thus, there remains a need in the art for more efficient, less complex and lower cost ways to reduce latencies associated with routing fabrics in interconnect systems.