In general, the present invention concerns methods and apparatus for arbitrating contention for an output port of a switch (for switching ATM cells for example) or router (for routing TCP/IP packets for example).
The present invention concerns arbitrating port contention which often occurs when data is directed through a network or internetwork via switches or routers. Before addressing the arbitration techniques and apparatus of the present invention, a brief description of the emergence of packet switching is provided in §1.2.1 below. Popular data structures used when communicating data are described in §§1.2.1.1.1 and 1.2.1.2.1 below. The basic elements and operations of switches or routers, which are used to direct data through a network or internetwork, are described in §§1.2.1.1.2 and 1.2.1.2.2 below. The idea of prioritizing data communicated over a network or internetwork is introduced in §1.2.2 below. Finally, with all of the foregoing background in mind, the problem of arbitrating port contention in switches and routers, as well as shortcomings of known arbitration techniques, are described in §1.2.3 below.
§1.2.1 The Growth of Network and Internetwork Communications
Communications networks permit remote people or machines to communicate voice or data (also referred to as “traffic” or “network traffic”). These networks continue to evolve to meet new demands placed upon them. A brief history of communications networks, and the emergence of packet switching, is now presented.
The public switched telephone network (or “PSTN”) was developed to carry voice communications to permit geographically remote people to communicate with one another. Modems were then introduced, permitting computers to communicate data over the PSTN. Voice and modem communications over the PSTN use “circuit switching”. Circuit switching inherently involves maintaining a continuous real time communication channel at the full channel bandwidth between two points to continuously permit the transport of information throughout the duration of the call. Unfortunately, due to this inherent characteristic of circuit switching, it is inefficient for carrying “bursty” data traffic. Specifically, many services have relatively low information transfer rates; information transfer occurs as periodic bursts. Bursty communications do not require full channel bandwidth at all times during the duration of the call. Thus, when a circuit switched connection is used to carry bursty traffic, available communication bandwidth occurring between successive bursts is simply wasted.
Moreover, circuit switching is inflexible because the channel width is always the same. Thus, for example, a wide (e.g., 140 Mbit/second) channel would be used for all transmissions, even those requiring a very narrow bandwidth (e.g., 1 Kbit/second). In an attempt to solve the problem of wasted bandwidth occurring in circuit switching, multi-rate circuit switching was proposed. With multi-rate circuit switching, connections can have a bandwidth of a multiple of a basic channel rate (e.g., 1 Kbit/second). Although multi-rate circuit switching solves the problem of wasted bandwidth for services requiring only a narrow bandwidth, for services requiring a wide bandwidth, a number of multiple basic rate channels must be synchronized. Such synchronization becomes extremely difficult for wide bandwidth services. For example, a 140 Mbit/second channel would require synchronizing 140,000 1 Kbit/second channels. Moreover, multi-rate circuit switching includes the inherent inefficiencies of a circuit switch, discussed above, when bursty data is involved.
Multi-rate circuit switching having multiple “basic rates” has also been proposed. Unfortunately, the switch for multi-rate circuit switching is complex. Furthermore, the channel bandwidths are inflexible to meet new transmission rates. Moreover, much of the bandwidth might be idle when it is needed. Lastly, multiple basic rate circuit switching includes the inherent inefficiencies of a circuit switch, discussed above, when bursty data is involved.
In view of the above described problems with circuit switching, packet switched communications have become prevalent and are expected to be used extensively in the future. Two (2) communications protocols, TCP/IP and ATM, are discussed in §§1.2.1.1 and 1.2.1.2 below.
§1.2.1.1 Internets
In recent decades, and in the past five to ten years in particular, computers have become interconnected by networks to an ever increasing extent; initially, via local area networks (or “LANs”), and more recently via LANs, wide area networks (or “WANs”) and the Internet. In 1969, the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense (DoD) deployed Arpanet as a way to explore packet-switching technology and protocols that could be used for cooperative, distributed, computing. Early on, Arpanet was used by the TELNET application which permitted a single terminal to work with different types of computers, and by the file transfer protocol (or “FTP”) which permitted different types of computers to transfer files from one another. In the early 1970s, electronic mail became the most popular application which used Arpanet.
This packet switching technology was so successful that ARPA applied it to tactical radio communications (Packet Radio) and to satellite communications (SATNET). However, since these networks operated in very different communications environments, certain parameters, such as maximum packet size for example, were different in each case. Thus, methods and protocols were developed for “internetworking” these different packet switched networks. This work led to the transmission control protocol (or “TCP”) and the internet protocol (or “IP”) which became the TCP/IP protocol suite. Although the TCP/IP protocol suite, which is the foundation of the Internet, is known to those skilled in the art, it is briefly described in §1.2.1.1.1 below for the reader's convenience.
§1.2.1.1.1 The TCP/IP Protocol Stack
The communications task for TCP/IP can be organized into five (5) relatively independent layers, namely, (i) an application layer, (ii) a host-to-host layer, (iii) an Internet layer, (iv) a network access layer, and (v) a physical layer. The physical layer defines the interface between a data transmission device (e.g., a computer) and a transmission medium (e.g., twisted pair copper wires, optical fiber, etc.). It specifies the characteristics of the transmission medium and the nature of the signals, the data rate, etc. The network access layer defines the interface between an end system and the network to which it is attached. It concerns access to, and routing data across, a network. Frame Relay is an example of a network access layer. The internet layer (e.g., IP) defines interfaces between networks and provides routing information across multiple networks. The host-to-host layer (e.g., TCP) concerns assuring the reliability of the communication. Finally, the application layer provides an interface to support various types of end user applications (e.g., the simple mail transfer protocol (or “SMTP”) for e-mail, the file transfer protocol (or “FTP”), etc.).
Basically, each of the layers encapsulates, or converts, data from the next higher level layer. For example, referring to FIG. 1, user data 100 as a byte stream is provided with a TCP header 102 to form a TCP segment 110. The TCP segment 110 is provided with an IP header 112 to form an IP datagram 120. The IP datagram 120 is provided with a network header 122 to define a network-level packet 130. The physical layer converts the network-level packet to radio, electrical, optical (or other) signals sent over the transmission medium at a specified rate with a specified type of modulation.
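By way of illustration only, the nesting of FIG. 1 may be sketched as successive prepending of headers. The function name and the placeholder header contents below are hypothetical; real headers carry the fields described in the paragraphs that follow.

```python
def encapsulate(user_data: bytes, tcp_header: bytes, ip_header: bytes,
                network_header: bytes) -> bytes:
    """Wrap user data as in FIG. 1 (illustrative placeholder headers only)."""
    tcp_segment = tcp_header + user_data           # TCP segment 110
    ip_datagram = ip_header + tcp_segment          # IP datagram 120
    network_packet = network_header + ip_datagram  # network-level packet 130
    return network_packet


# Placeholder headers: 20 octets for TCP and IP, 4 octets for the network header.
packet = encapsulate(b"data", b"T" * 20, b"I" * 20, b"N" * 4)
```

The outermost header is the one added last, so a receiver removes the headers in the reverse order.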
The TCP header 102, as illustrated in FIG. 2, includes at least twenty (20) octets (i.e., 160 bits). Fields 202 and 204 identify ports at the source and destination systems, respectively, that are using the connection. Values in the sequence number 206, acknowledgement number 208 and window 216 fields are used to provide flow and error control. The value in the checksum field 218 is used to detect errors in the TCP segment 110.
FIGS. 3A and 3B illustrate two (2) alternative IP headers 112 and 112′, respectively. Basically, FIG. 3A depicts the IP protocol (Version 4) which has been used. FIG. 3B depicts a next generation IP protocol (Version 6) which, among other things, provides for more source and destination addresses.
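As a minimal sketch, the fixed twenty octet portion of such a header might be unpacked as follows. The text above names only the port, sequence number, acknowledgement number, window and checksum fields; the data offset, control bits and urgent pointer are taken from the standard TCP header layout, and the type and function names are hypothetical.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    source_port: int              # field 202
    destination_port: int         # field 204
    sequence_number: int          # field 206
    acknowledgement_number: int   # field 208
    data_offset_words: int        # header length in 32-bit words (standard layout)
    control_bits: int             # six control bits of the original TCP layout
    window: int                   # field 216
    checksum: int                 # field 218
    urgent_pointer: int           # standard layout, not discussed in the text


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Unpack the first 20 octets of a TCP segment (network byte order)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 octets long")
    (src, dst, seq, ack, offset_and_bits,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return TcpHeader(src, dst, seq, ack,
                     offset_and_bits >> 12, offset_and_bits & 0x3F,
                     window, checksum, urgent)


# Hypothetical sample header: offset of 5 words, ACK bit (0x10) set.
sample_tcp = struct.pack("!HHIIHHHH", 1234, 80, 1000, 2000,
                         (5 << 12) | 0x10, 4096, 0xBEEF, 0)
tcp = parse_tcp_header(sample_tcp)
```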
More specifically, referring to FIG. 3A, the four (4) bit version field 302 indicates the version number of the IP, in this case, version 4. The four (4) bit Internet header length field 304 identifies the length of the header 112 in 32-bit words. The eight (8) bit type of service field 306 indicates the service level that the IP datagram 120 should be given. The sixteen (16) bit total length field 308 identifies the total length of the IP datagram 120 in octets. The sixteen (16) bit identification field 310 is used to help reassemble fragmented user data carried in multiple packets. The three (3) bit flags field 312 is used to control fragmentation. The thirteen (13) bit fragment offset field 314 is used to reassemble a datagram 120 that has become fragmented. The eight (8) bit time to live field 316 defines a maximum time that the datagram is allowed to exist within the network it travels over. The eight (8) bit protocol field 318 defines the higher-level protocol to which the data portion of the datagram 120 belongs. The sixteen (16) bit header checksum field 320 permits the integrity of the IP header 112 to be checked. The 32 bit source address field 322 contains the IP address of the sender of the IP datagram 120 and the 32 bit destination address field 324 contains the IP address of the host to which the IP datagram 120 is being sent. Options and padding 326 may be used to describe special packet processing and/or to ensure that the header 112 takes up a complete set of 32 bit words.
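The field widths recited above add up to twenty octets when no options are present, which may be sketched as follows. The function name, dictionary keys and sample addresses are hypothetical, and options and padding 326 are ignored.

```python
import ipaddress
import struct


def parse_ipv4_header(datagram: bytes) -> dict:
    """Unpack the fixed 20 octets of the version 4 header of FIG. 3A."""
    if len(datagram) < 20:
        raise ValueError("an IPv4 header is at least 20 octets long")
    (version_ihl, type_of_service, total_length, identification,
     flags_fragment, time_to_live, protocol, header_checksum,
     source, destination) = struct.unpack("!BBHHHBBH4s4s", datagram[:20])
    return {
        "version": version_ihl >> 4,                 # field 302 (4 bits)
        "header_length_words": version_ihl & 0x0F,   # field 304 (4 bits)
        "type_of_service": type_of_service,          # field 306
        "total_length": total_length,                # field 308
        "identification": identification,            # field 310
        "flags": flags_fragment >> 13,               # field 312 (3 bits)
        "fragment_offset": flags_fragment & 0x1FFF,  # field 314 (13 bits)
        "time_to_live": time_to_live,                # field 316
        "protocol": protocol,                        # field 318
        "header_checksum": header_checksum,          # field 320
        "source_address": str(ipaddress.IPv4Address(source)),            # field 322
        "destination_address": str(ipaddress.IPv4Address(destination)),  # field 324
    }


# Hypothetical sample: version 4, five-word header, protocol 6 (TCP).
sample_ipv4 = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40, 7, 0x4000, 64, 6, 0,
    ipaddress.IPv4Address("192.0.2.1").packed,
    ipaddress.IPv4Address("198.51.100.7").packed)
ipv4 = parse_ipv4_header(sample_ipv4)
```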
Referring to FIG. 3B, the four (4) bit version field 302 indicates the version number of the IP, in this case, version 6. The four (4) bit priority field 328 enables a sender to prioritize packets sent by it. The 24 bit flow label field 330 is used by a source to label packets for which special handling is requested. The sixteen (16) bit payload length field 332 identifies the size of the data carried in the packet. The eight (8) bit next header field 334 is used to indicate whether another header is present and if so, to identify it. The eight (8) bit hop limit field 336 serves to discard the IP datagram 120 if a hop limit (i.e., the number of times the packet is routed) is exceeded. Also provided are 128 bit source and destination address fields 322′ and 324′, respectively.
Having described the TCP/IP protocol suite, the routing of a TCP/IP packet is now described in §1.2.1.1.2 below.
§1.2.1.1.2 Routing TCP/IP Packets
A TCP/IP packet is communicated over the Internet (or any internet or intranet) via routers. Basically, routers in the Internet use destination address information (recall fields 324 and 324′) to forward packets towards their destination. Routers interconnect different networks. More specifically, routers accept incoming packets from various connected networks, use a look-up table to determine a network upon which the packet should be placed, and route the packet to the determined network. The router may buffer incoming packets if the networks are providing packets faster than it can route them. Similarly, the router may buffer outgoing packets if the router provides outgoing packets faster than the determined networks can accept them. The router may also arbitrate output port contention, which may be performed by the arbitration technique of the present invention. In some high-speed routers, packets are segmented into cells having a fixed data length before they are routed.
FIG. 4, which includes FIGS. 4A through 4C, illustrates the communication of data from a sender, to a receiver, using the TCP/IP protocol suite. Referring first to FIG. 4A, an application protocol 402 prepares a block of data (e.g., an e-mail message (SMTP), a file (FTP), user input (TELNET), etc.) 100 for transmission. Before the data 100 are sent, the sending and receiving applications agree on a format and encoding and agree to exchange data. If necessary, the data are converted (character code, compression, encryption, etc.) to a form expected by the destination.
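A minimal sketch of unpacking the forty octet header as laid out in FIG. 3B follows, with the version, priority and flow label sharing the first 32-bit word. The function name, dictionary keys and sample values are hypothetical. Note that later revisions of the IPv6 specification divide the 28 bits after the version field differently; the split below follows the field widths given in the text.

```python
import ipaddress
import struct


def parse_ipv6_header(datagram: bytes) -> dict:
    """Unpack the 40-octet version 6 header using the widths recited for FIG. 3B."""
    if len(datagram) < 40:
        raise ValueError("an IPv6 header is 40 octets long")
    (first_word, payload_length, next_header, hop_limit,
     source, destination) = struct.unpack("!IHBB16s16s", datagram[:40])
    return {
        "version": first_word >> 28,             # field 302 (4 bits)
        "priority": (first_word >> 24) & 0x0F,   # field 328 (4 bits)
        "flow_label": first_word & 0xFFFFFF,     # field 330 (24 bits)
        "payload_length": payload_length,        # field 332
        "next_header": next_header,              # field 334
        "hop_limit": hop_limit,                  # field 336
        "source_address": str(ipaddress.IPv6Address(source)),
        "destination_address": str(ipaddress.IPv6Address(destination)),
    }


# Hypothetical sample values chosen only to exercise each field.
sample_ipv6 = struct.pack(
    "!IHBB16s16s", (6 << 28) | (7 << 24) | 0xABCDEF, 1280, 6, 64,
    ipaddress.IPv6Address("::1").packed,
    ipaddress.IPv6Address("2001:db8::1").packed)
ipv6 = parse_ipv6_header(sample_ipv6)
```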
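The look-up step may be sketched as follows. The text does not say how the table is searched; longest-prefix matching is assumed here because it is the common practice in IP routers, and the table contents, network names and function name are hypothetical.

```python
import ipaddress
from typing import List, Optional, Tuple

# Hypothetical look-up table: (destination prefix, outgoing network).
ROUTES: List[Tuple[ipaddress.IPv4Network, str]] = [
    (ipaddress.ip_network("10.0.0.0/8"), "net-A"),
    (ipaddress.ip_network("10.1.0.0/16"), "net-B"),
    (ipaddress.ip_network("0.0.0.0/0"), "default"),
]


def lookup(destination: str, routes=ROUTES) -> Optional[str]:
    """Return the outgoing network for a destination address.

    Assumption: the most specific (longest) matching prefix wins.
    """
    address = ipaddress.ip_address(destination)
    best = None
    for network, outgoing in routes:
        if address in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, outgoing)
    return best[1] if best is not None else None
```

For example, `lookup("10.1.2.3")` matches all three prefixes and selects the /16 entry, while an address outside 10.0.0.0/8 falls through to the default entry.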
The TCP layer 404 may segment the data block 100, keeping track of the sequence of the blocks. Each TCP segment 110 includes a header 102 containing a sequence number (recall field 206) and a frame check sequence to detect errors. A copy of each TCP segment is made so that, if a segment is lost or damaged, it can be retransmitted. When an acknowledgement of safe receipt is received from the receiver, the copy of the segment is erased.
The IP layer 406 may break a TCP segment into a number of datagrams 120 to meet size requirements of networks over which the data will be communicated. Each datagram includes the IP header 112.
A network layer 408, such as frame relay for example, may apply a header and trailer 122 to frame the datagram 120. The header may include a connection identifier and the trailer may contain a frame check sequence for example. Each frame 130 is then transmitted, by the physical layer 410, over the transmission medium as a sequence of bits.
FIG. 4B illustrates the operation of TCP/IP at a router in the network. The physical layer 412 receives the incoming signal 130 from the transmission medium and interprets it as a frame of bits. The network (e.g., frame relay) layer 414 removes the header and trailer 122 and processes them. A frame check sequence may be used for error detection. A connection number may be used to identify the source. The network layer 414 then passes the IP datagram 120 to the IP layer 418.
The IP layer examines the IP header 112 and makes a routing decision (recall the destination address 324, 324′). A local line control (or “LLC”) layer 420 uses a simple network management protocol (or “SNMP”) and adds a header 450 which contains a sequence number and address information. Another network layer 422 (e.g., media access control (or “MAC”)) adds a header and trailer 460. The header may contain address information and the trailer may contain a frame check sequence. The physical layer 424 then transmits the frame 150 over another transmission medium.
FIG. 4C illustrates the operation of TCP/IP at a receiver. The physical layer 432 receives the signal from the transmission medium and interprets it as a frame of bits. The network layer 434 removes the header and trailer 460 and processes them. For example, the frame check sequence in the trailer may be used for error detection. The resulting packet 140 is passed to the transport layer 436 which processes the header 450 for flow and error control. The resulting IP datagram 120 is passed to the IP layer 438 which removes the header 112. Frame check sequence and other control information may be processed at this point.
The TCP segment 110 is then passed to the TCP layer 440 which removes the header 102 and may check the frame check sequence (in the event of a match, the match is acknowledged and in the event of a mismatch, the packet is discarded). The TCP layer 440 then passes the data 100 to the application layer 442. If the user data was segmented (or fragmented), the TCP layer 440 reassembles it. Finally, the application layer 442 performs any necessary transformations, such as decompression and decryption for example, and directs the data to an appropriate area of the receiver, for use by the receiving application.
§1.2.1.2 High Speed Networks
As discussed in §1.2.1 above, there has been a trend from circuit switched networks towards packet switched networks. For example, packet switched communications presently appear to be the preferred mode of communication over a Broadband-Integrated Services Digital Network (or “B-ISDN”) service. Packet switching includes normal packet switching (e.g., X.25) and fast packet switching (e.g., Asynchronous Transfer Mode or “ATM”). Normal packet switching assumes certain errors at each data link are probable enough to require complex protocols so that such errors can be controlled at each link. Link errors were a valid assumption and concern at one time. However, today data links are very reliable such that the probability of errors being introduced by data links is no longer of any great concern. Hence, fast packet switching is becoming more prominent. The ATM protocol is discussed in §1.2.1.2.1 below.
§1.2.1.2.1 The Asynchronous Transfer Mode (ATM) Protocol
Since data links are very reliable and the probability of errors being introduced by data links is no longer of any great concern, ATM fast packet switching does not correct errors or control flow within the network (i.e., on a link-by-link basis). Instead, ATM is only concerned with three types of errors; namely bit errors, packet loss, and packet insertion. Bit errors are detected and/or corrected using end-to-end protocols. Regarding packet loss and insertion errors, ATM only uses prophylactic actions when allocating resources during connection set-up. That is, ATM operates in a connection-oriented mode such that when a connection is requested, a line terminal first checks whether sufficient resources (i.e., whether sufficient bandwidth and buffer area) are available. When the transfer of information is complete, the resources are “released” (i.e., are made available) by the line terminal. In this way, ATM reduces the number of overhead bits required with each cell, thereby permitting ATM to operate at high data rates.
The ATM protocol transfers data in discrete sized chunks called “cells”. The use of fixed sized cells simplifies the processing required at each network node (e.g., switch) thereby permitting ATM to operate at high data rates. The structure of ATM cells is described in more detail below.
Finally, the ATM protocol permits multiple logical (or “virtual”) connections to be multiplexed over a single physical interface. As shown in FIG. 5, logical connections in ATM are referred to as virtual channel connections (or “VCCs”) 510. A VCC 510 is the basic unit of switching in an ATM network. A VCC 510 is established between two end users, through the network. A variable-rate, full-duplex flow of ATM cells may be exchanged over the VCC 510. VCCs 510 may also be used for control signaling, network management and routing.
A virtual path connection (or “VPC”) 520 is a bundle of VCCs 510 that have the same end points. Accordingly, all of the cells flowing over all VCCs 510 in a single VPC 520 may be switched along the same path through the ATM network. In this way, the VPC 520 helps contain network control costs by grouping connections sharing common paths through the network. That is, network management actions can be applied to a small number of virtual paths 520 rather than a large number of individual virtual channels 510.
Finally, FIG. 5 illustrates that multiple virtual paths 520 and virtual channels 510 (i.e., logical connections) may be multiplexed over a single physical transmission path 530.
FIG. 6 illustrates the basic architecture for an interface between a user and a network using the ATM protocol. The physical layer 610 specifies a transmission medium and a signal-encoding (e.g., data rate and modulation) scheme. Data rates specified at the physical layer 610 may be 155.52 Mbps or 622.08 Mbps, for example. The ATM layer 620 defines the transmission of data in fixed sized cells and also defines the use of logical connections, both introduced above. The ATM adaptation layer 630 supports information transfer protocols not based on ATM. It maps information between a high layer 640 and ATM cells.
Recall that the ATM layer 620 places data in fixed sized cells (also referred to as packets). An ATM packet includes a header field (generally five (5) bytes) and a payload (or information) field (generally 48 bytes). The main function of the header is to identify a virtual connection to guarantee that the ATM packet is properly routed through the network. Switching and/or multiplexing is first performed on virtual paths and then on virtual channels. The relatively short length of the payload or information field reduces the size required for internal buffers at switching nodes thereby reducing delay and delay jitter.
More specifically, FIG. 7A illustrates an ATM cell 700 having a header 710 as formatted at a user-network interface, while FIG. 7B illustrates the ATM cell 700′ having a header 710′ as formatted internal to the network. Referring first to the header 710 as formatted at the user-network interface, a four (4) bit generic flow control field 712 may be used to assist an end user in controlling the flow of traffic for different qualities of service. The eight (8) bit virtual path identifier field 714 contains routing information for the network. Note that this field 714′ is expanded to twelve (12) bits in header 710′ as formatted in the network. In both headers 710 and 710′, a sixteen (16) bit virtual channel identifier field 716 contains information for routing the cell to and from the end users. A three (3) bit payload type field 718 indicates the type of information in the 48 octet payload portion 750 of the packet. (The coding of this field is not particularly relevant for purposes of the present invention.) A one (1) bit cell loss priority field 720 contains information to let the network know what to do with the cell in the event of congestion. A value of 0 in this field 720 indicates that the cell is of relatively high priority and should not be discarded unless absolutely necessary. A value of 1 in this field indicates that the network may discard the cell. Finally, an eight (8) bit header error control field 722 contains information used for error detection and possibly error correction as well. The remaining 48 octets 750 define an information field.
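The two header formats differ only in how the first twelve bits are divided, which the following sketch makes explicit by treating the five header octets as one 40-bit integer. The function names, dictionary keys and sample values are hypothetical, and the header error control value is carried through without being verified.

```python
def parse_uni_header(header: bytes) -> dict:
    """Split the five-octet header 710 (user-network interface, FIG. 7A)."""
    if len(header) != 5:
        raise ValueError("an ATM cell header is five octets long")
    value = int.from_bytes(header, "big")
    return {
        "generic_flow_control": (value >> 36) & 0xF,  # field 712 (4 bits)
        "vpi": (value >> 28) & 0xFF,                  # field 714 (8 bits)
        "vci": (value >> 12) & 0xFFFF,                # field 716 (16 bits)
        "payload_type": (value >> 9) & 0x7,           # field 718 (3 bits)
        "cell_loss_priority": (value >> 8) & 0x1,     # field 720 (1 bit)
        "header_error_control": value & 0xFF,         # field 722 (8 bits)
    }


def parse_nni_header(header: bytes) -> dict:
    """Split the header 710' (internal to the network, FIG. 7B)."""
    fields = parse_uni_header(header)
    value = int.from_bytes(header, "big")
    del fields["generic_flow_control"]    # no such field inside the network
    fields["vpi"] = (value >> 28) & 0xFFF  # field 714' (12 bits)
    return fields


# Hypothetical sample: GFC 1, VPI 42, VCI 1000, payload type 0, CLP 1, HEC 0x55.
sample_header = ((1 << 36) | (42 << 28) | (1000 << 12) | (1 << 8) | 0x55).to_bytes(5, "big")
uni = parse_uni_header(sample_header)
nni = parse_nni_header(sample_header)
```

Reading the same five octets with the network-internal format folds the four generic flow control bits into the virtual path identifier, so the sample's VPI becomes 256 + 42 = 298.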
Fast packet switching, such as ATM switching, has three main advantages. First, ATM switching is flexible and can therefore accommodate future transfer rates. Second, no resources are specialized and consequently, all resources may be optimally shared. Finally, ATM switches permit economies of scale for such a universal network.
§1.2.1.2.2 Switches
ATM cells are directed through a network by means of a series of ATM switches. An ATM switch must perform three basic functions for point-to-point switching; namely, (i) routing the ATM cell, (ii) updating the virtual channel identifier (VCI) and virtual path identifier (VPI) in the ATM cell header (recall fields 714, 714′ and 716′), and (iii) resolving output port contention. The first two functions, namely routing and updating, are performed by a translation table belonging to the ATM switch. The translation table converts an incoming link (input port) and VCI/VPI to an outgoing link (output port) and VCI/VPI. Resolving output port contention (which may be performed by the arbitration technique of the present invention) is discussed in §1.2.3 below.
Thus, conceptually, referring to FIG. 8, an ATM switch 800 may include input port controllers 810 for accepting ATM cells from various physical (or logical) links (recall FIG. 5), a switching fabric 820 for forwarding cells to another link towards their destination, and output port controllers 830 for buffering ATM cells to be accepted by various physical (or logical) links. A control unit 840 may be used to coordinate the operations of the input port controllers 810, the output port controllers 830 and the switching fabric 820. An exemplary, scalable, ATM switch is disclosed in U.S. Pat. Nos. 5,724,351 and 5,790,539 (each of which is incorporated herein by reference).
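The routing and updating functions may be sketched as a single table look-up keyed on the incoming link and identifiers. The table entries, type aliases and function name are hypothetical, and contention among cells mapped to the same output port is deliberately not handled here.

```python
from typing import Dict, Optional, Tuple

Incoming = Tuple[int, int, int]  # (input port, VPI, VCI)
Outgoing = Tuple[int, int, int]  # (output port, new VPI, new VCI)

# Hypothetical translation table populated at connection set-up.
TRANSLATION_TABLE: Dict[Incoming, Outgoing] = {
    (0, 5, 100): (3, 9, 200),
    (1, 5, 100): (3, 9, 201),
    (2, 7, 33): (0, 1, 44),
}


def translate(input_port: int, vpi: int, vci: int,
              table: Dict[Incoming, Outgoing] = TRANSLATION_TABLE) -> Optional[Outgoing]:
    """Return (output port, new VPI, new VCI), or None for an unknown connection."""
    return table.get((input_port, vpi, vci))
```

In this sample table the first two entries map cells arriving on different input ports to the same output port 3, which is exactly the situation in which the third function, contention resolution, is needed.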
§1.2.2 The Need to Consider Different Types of Traffic: Priority
Different applications place different demands on communications networks. In particular, a certain application may require that its traffic be communicated (i) with minimum delay, (ii) at a fast rate, (iii) with maximum reliability, and/or (iv) to minimize communications (service) cost. For example, people would not tolerate much delay in their voice communications during a telephone call. High definition video requires a fast rate, or a high bandwidth, as well as low jitter, or delay variations. However, video communications may be able to tolerate some data corruption or loss to the extent that such losses are imperceptible or not annoying to people. The communications of important data, on the other hand, may tolerate delay, but might not tolerate data loss or corruption. Finally, an application may request that low priority data be communicated at a minimum cost. To the extent that the network traffic of an application does not have “special” requirements, it should be communicated with normal service.
Thus, many applications require a guaranteed quality of service (or “QoS”) from a network provider. The network provider, in turn, may see guaranteeing QoS as a way to add value to their network and increase revenues. TCP/IP based internetworks and ATM based networks are envisioned as carrying many different types of data for many different applications which have different needs. (Recall the “Type of Service” field 306 of the internet protocol packet (version 4), the “priority” field 328 of the internet protocol packet (version 6), and “generic flow control” field 712 of the ATM cell.)
§1.2.3 Contention
As introduced above with reference to FIG. 8, a packet switch includes input and output ports interconnected by a switch fabric. The switch fabric can use shared-medium (e.g., bus), shared-memory, and space-division (e.g., crossbar) architecture. (See, e.g., the article, F. A. Tobagi, “Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks”, Proceedings of the IEEE, Vol. 78, No. 1, pp. 133-167 (January 1990).) The function of a packet switch is to transfer packets from the input ports to the appropriate output ports based on the addresses contained within the packet headers. In practice, the variable length packets are usually broken into fixed sized cells (not necessarily 53 bytes) before being transmitted across the switch fabric. The cells are then reassembled at the output of the switch. (See, e.g., the article, T. Anderson, et al., “High Speed Switch Scheduling for Local Area Networks”, ACM Trans. Computer Systems, pp. 319-352 (November 1993); hereafter referred to as “the Anderson article”.) Since multiple packets from different input ports could be destined for the same output port at the same time (referred to as “output port contention” or simply “contention”), a switch arbitration or scheduling algorithm is needed to choose from among the contending packets, the one packet preferred at that time slot, provide a grant to the input port corresponding to the preferred packet, and configure the switch fabric to transfer the packet.
An arbiter is used to resolve output port contention among two or more packets or cells destined for the same output port. The arbiter chooses a packet or cell which “wins” contention (i.e., which is applied to the output port). Other packets or cells contending for the output port “lose” contention (i.e., they must wait before being applied to the output port).
Reducing the arbitration time can significantly reduce the packet delay across a switch, thus enabling high speed implementation.
§1.2.3.1 Buffering to Alleviate Contention
To prevent the packets or cells losing contention for the output port from being lost, buffering is required. There are three basic buffering strategies; namely, pure input queuing, pure output queuing and central queuing. These buffering techniques and their relative advantages and disadvantages are described below.
§1.2.3.1.1 Input Port Buffering
Pure input queuing provides a dedicated buffer at each input port. Arbitration logic is used to decide which input port buffer will be next served. The arbitration logic may be simple (e.g., round robin in which the inlet buffers are served in order, or random in which the inlet buffers are served randomly) or complex (e.g., state dependent in which the most filled buffer is served next, or delay dependent in which the globally oldest cell is served next).
Unfortunately, with input queuing, a packet or cell in the front of the queue waiting for an occupied output channel to become available may block other packets or cells behind it which do not need to wait. This is known as head-of-line (or “HOL”) blocking. A post office metaphor has been used to illustrate head-of-line (HOL) blocking in the book, M. dePrycker, Asynchronous Transfer Mode: Solution for Broadband ISDN, pp. 133-137 (Ellis Horwood Ltd., 1991). In the post office metaphor, people (representing cells) are waiting in a line (representing an input buffer) for either a stamp window (a first output port) or an airmail window (a second output port). Assume that someone (a cell) is already at the stamp window (the first output port) and that the first person in the line (the HOL of the input buffer) needs to go to the stamp window (the first output port). Assume further that no one is presently at the airmail window (the second output port) and that the second and third people in line (cells behind the HOL cell in the input queue) want to go to the airmail window (the second output port). Although the airmail window (second output port) is available, the second and third people (cells behind the HOL cell) must wait for the first person (the HOL cell) who is waiting for the stamp window (the first output port) to become free. Therefore, as the post office metaphor illustrates, the head-of-line (HOL) cell waiting for an output port to become free often blocks cells behind it which would otherwise not have to wait. Simulations have shown that such head-of-line (HOL) blocking decreases switch throughput.
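The post office metaphor may be reproduced with a small time-slot simulation. The sketch below is illustrative only: the function names are hypothetical, each window is assumed to serve one person per slot, the stamp window is assumed to be occupied for the first three slots, and the second function is a hypothetical reference case in which any waiting person whose window is free may step forward (it is not a scheme described in the text).

```python
from collections import deque


def fifo_input_queue(destinations, busy_until):
    """Slot at which each cell is served when only the head of the line may move."""
    queue = deque(enumerate(destinations))
    free_at = dict(busy_until)  # output port -> first slot at which it is free
    served = {}
    slot = 0
    while queue:
        index, port = queue[0]
        if free_at.get(port, 0) <= slot:
            queue.popleft()
            served[index] = slot
            free_at[port] = slot + 1
        slot += 1
    return [served[i] for i in range(len(destinations))]


def without_hol_blocking(destinations, busy_until):
    """Reference case: the first waiting cell whose output is free is served."""
    pending = list(enumerate(destinations))
    free_at = dict(busy_until)
    served = {}
    slot = 0
    while pending:
        for position, (index, port) in enumerate(pending):
            if free_at.get(port, 0) <= slot:
                served[index] = slot
                free_at[port] = slot + 1
                del pending[position]
                break  # at most one cell leaves this input per slot
        slot += 1
    return [served[i] for i in range(len(destinations))]


line = ["stamp", "airmail", "airmail"]
blocked = fifo_input_queue(line, {"stamp": 3})        # [3, 4, 5]
unblocked = without_hol_blocking(line, {"stamp": 3})  # [3, 0, 1]
```

With the strict first-in, first-out line, the two airmail customers are not served until slots 4 and 5 even though their window was idle from slot 0; in the reference case they are served in slots 0 and 1.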
When input buffering is used, a simple round robin scheme is generally adopted in an arbiter to ensure a fair arbitration among the inputs. Imagine there is a token circulating among the inputs in a certain ordering. The input that is granted by the arbiter is said to grasp the token, which represents the grant signal. The arbiter is responsible for moving the token among the inputs that have request signals. The traditional arbiters handle all inputs together and the arbitration time is proportional to the number of inputs. As a result, the switch size or capacity is limited given a fixed amount of arbitration time.
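A minimal sketch of such a token-passing round robin arbiter for a single output port follows. The class and method names are hypothetical. The linear scan over the inputs mirrors the observation above that the arbitration time of a traditional arbiter grows in proportion to the number of inputs.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Round robin arbiter for one output port (illustrative sketch)."""

    def __init__(self, num_inputs: int) -> None:
        self.num_inputs = num_inputs
        self.token = 0  # input that is examined first in the next time slot

    def arbitrate(self, requests: Sequence[bool]) -> Optional[int]:
        """Grant one requesting input, or return None if no input requests."""
        for offset in range(self.num_inputs):  # up to N steps per arbitration
            candidate = (self.token + offset) % self.num_inputs
            if requests[candidate]:
                # The granted input grasps the token; the next search starts
                # just past it, so every requester is eventually served.
                self.token = (candidate + 1) % self.num_inputs
                return candidate
        return None


arbiter = RoundRobinArbiter(4)
requests = [True, False, True, True]
grants = [arbiter.arbitrate(requests) for _ in range(4)]  # [0, 2, 3, 0]
```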
An input-buffered crossbar switch with centralized contention resolution does not scale well for a large number of switch ports due to the centralized nature of its arbiter. Although distributed output contention resolution in a multicast packet switch may be achieved by using an arbiter for each output port, traditional arbiters handle all inputs together and the arbitration time is proportional to the number of inputs. As a result, the switch size or capacity is limited given a fixed amount of arbitration time. A crossbar switch architecture with internal speedup and distributed contention resolution was proposed recently in the article, K. Genda et al, xe2x80x9cTORUS: Terabit-per-second ATM Switching System Architecture on Distributed Internal Speed-Up ATM Switch,xe2x80x9d IEEE J. Select Areas Commun., Vol. 15, No. 5, pp. 817-29 (Jun. 5, 1997) to achieve a capacity of Terabit per second, but its contention resolution algorithm favors some of the connections and is thus unfair.
§1.2.3.1.2 Output Port Buffering
Pure output buffering solves the head-of-line (HOL) blocking problems of pure input buffering by providing only the output ports with buffers. Since the packets or cells buffered at an output port are output in sequence (i.e., first in, first out, or “FIFO”), no arbitration logic is required. In the post office metaphor, the stamp window (first output port) has its own line (first output buffer) and the airmail window (second output port) has its own line (second output buffer). Since no arbitration logic is required, the delay through the switch is said to have an absolute bound.
Although pure output buffering clearly avoids the HOL blocking that may occur in pure input port buffering, it does have some disadvantages. Specifically, to avoid potential cell loss, assuming N input ports, the system must be able to write N ATM cells into any one of the queues (or output buffers) during one cell time (i.e., within approximately 2.73 microseconds, which is (53 bytes*8 bits/byte)/155.52 Mbit/second). Such a high memory write rate is necessary because it is possible that each of the ATM cells arriving at each of the input ports will require the same output port. This requirement on the memory speed of the output buffer becomes a problem as the size of the switch (i.e., as N) increases. Accordingly, for a 1024-by-1024 switch (i.e., a switch having 1024 inputs and 1024 outputs), pure output buffering is not feasible because the output port buffers would have to be fast enough to handle 1024 cells during each time slot.
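The memory-speed requirement can be made concrete by evaluating the cell-time expression given above and dividing it by the number of inputs. The sketch below is illustrative only; the function names are hypothetical, and the only inputs are the 53-byte cell size and the 155.52 Mbit/second line rate taken from the text.

```python
CELL_BYTES = 53            # ATM cell size used in the text
LINE_RATE_BPS = 155.52e6   # line rate used in the text, in bits per second


def cell_time_seconds(cell_bytes=CELL_BYTES, line_rate_bps=LINE_RATE_BPS):
    """Time to receive one cell on one input link (one time slot)."""
    return cell_bytes * 8 / line_rate_bps


def worst_case_write_time_seconds(num_inputs):
    """Time available per memory write when every input targets one output.

    In the worst case an output buffer must absorb num_inputs cells
    within a single cell time.
    """
    return cell_time_seconds() / num_inputs


if __name__ == "__main__":
    print(f"cell time: {cell_time_seconds() * 1e6:.3f} microseconds")
    budget_ns = worst_case_write_time_seconds(1024) * 1e9
    print(f"per-cell write budget for N=1024: {budget_ns:.2f} ns")
```

For N=1024 the budget for writing one whole 53-byte cell falls to a few nanoseconds, which illustrates why the output buffer memory speed becomes the limiting factor as N grows.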
Speedup (c) of the switch fabric is defined as the ratio of the switch fabric bandwidth to the bandwidth of the input links. (Unless otherwise stated, it will be assumed that every input/output link has the same capacity.) An output queued switch is one in which the speedup is greater than or equal to the number of input ports (c ≥ n). Since each output port can receive n incoming packets in a time slot, there is no output contention as discussed above. The switch desirably has zero input queuing delay, without considering store-and-forward implementation. Unfortunately, an output queued switch is limited because the output port memory speed may prevent it from buffering all possible input packets, particularly when the number of input ports is relatively large.
§1.2.3.1.3 Central Queuing
Central queuing includes a queue not assigned to any inlet (input port) or outlet (output port). Each outlet will select cells destined for it in a first in, first out (FIFO) manner. However, the outlets must be able to know which cells are destined for them. Moreover, the read and write discipline of the central queue cannot be a simple FIFO because ATM cells destined for different outlets are all merged into a single queue. Turning again to the post office metaphor, a single line (central queue) of people (ATM cells) is waiting to visit the stamp window (a first output port) or the airmail window (a second output port). As a window opens up (i.e., as an output port becomes available), a server searches the line (central queue) for the next person (ATM cell) needing the available window (requiring the available output port). The server brings that person (ATM cell) to the open window (available output port) regardless of whether the person (the ATM cell) is at the front of the line (HOL). As the post office metaphor illustrates, the central queue requires a complex memory management system given the random accessibility required. Of course, the memory management system becomes more complex and cumbersome when the number of output ports (i.e., the size of the switch) increases.
§1.2.3.1.4 Input and Output Port Buffering
An input-output queued switch results when an input queued switch uses a speedup greater than one (c > 1). A recent study shows that it is possible to achieve 100% switch throughput with a moderate speedup of c=2. (See, e.g., the technical publication, R. Guerin, et al., “Delay and Throughput Performance of Speed-Up Input-Queuing Packet Switches”, IBM Research Report RC 20892, (June 1997).) Since each output port can receive up to c cells in a time slot (and each input port can send up to c cells during the same time), the requirement on the number of input-output matches found in each arbitration cycle (there being c cycles in a time slot) may possibly be relaxed, enabling simpler arbitration schemes. On the other hand, the time available for each arbitration is reduced by a factor of c, making the time constraint for arbitration more stringent.
An input queued switch has no speedup (i.e., the incoming lines, switching fabric, and outgoing lines operate at the same rate) and thus is relatively simple to implement. However, as described above, it suffers the well-known problem of head-of-line (HOL) blocking (See, e.g., the article, M. Karol, et al., “Input Versus Output Queuing on a Space Division Switch”, IEEE Trans. Comm., Vol. 35, No. 12, pp. 1347-1356 (1987).), which could limit its maximum throughput to about 58% when it uses first-in-first-out (FIFO) at each input port and operates under uniform traffic (i.e., the output address of each packet is independently and uniformly distributed over the outputs). Many techniques have been suggested to reduce the HOL blocking, for example, by considering the first K cells in the FIFO, where K > 1. (See, e.g., the article, M. Karol, et al., “Queuing in High-Performance Packet-Switching”, IEEE J. Select. Area in Comm., Vol. 6, pp. 1587-1597 (December 1988).) The HOL blocking can be eliminated entirely by using virtual output queuing (VOQ), where each input maintains a separate queue for each output. (See, e.g., the article, Y. Tamir, et al., “High Performance Multi-Queue Buffers for VLSI Communication Switches”, Proc. of 15th Ann. Symp. on Comp. Arch., pp. 343-354 (June 1988).) Referring to FIG. 9 for example, each input queue 910 maintains a separate queue 912 for each output port 930.
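The difference between FIFO input queuing and virtual output queuing can be illustrated with the post office scenario described earlier: one input holds three cells destined for outputs 0, 1 and 1, and output 0 is busy. The sketch below is a simplified illustration for a single input and a single time slot; the function names are hypothetical and it does not model arbitration among several inputs.

```python
from collections import deque


def build_voqs(destinations, num_outputs):
    """Sort one input's cells into one queue per output (virtual output queues)."""
    voqs = [deque() for _ in range(num_outputs)]
    for cell_id, dest in enumerate(destinations):
        voqs[dest].append(cell_id)
    return voqs


def fifo_eligible_output(fifo_destinations, busy_outputs):
    """FIFO input queuing: only the head-of-line cell may be sent."""
    if fifo_destinations and fifo_destinations[0] not in busy_outputs:
        return fifo_destinations[0]
    return None  # the HOL cell is blocked, and so is everything behind it


def voq_eligible_outputs(voqs, busy_outputs):
    """VOQ: the head of every non-empty queue whose output is free is eligible."""
    return [out for out, queue in enumerate(voqs)
            if queue and out not in busy_outputs]
```

With destinations `[0, 1, 1]` and output 0 busy, the FIFO input has nothing it can send, while the VOQ input can still offer a cell to output 1.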
To achieve 100% throughput in an input-queued switch with virtual output queues, sophisticated arbitration is used to schedule packets between various inputs and outputs. This may be accomplished by applying bipartite graph matching (See, e.g., the Anderson article.) in which each output must be paired with at most one input that has a cell destined for that output, a complex procedure to implement in hardware. It has been shown that an input buffered switch with virtual output queues can provide asymptotic 100% throughput using a maximum matching algorithm. (A maximum matching is a match that pairs the maximum number of inputs and outputs together; there is no other pairing that matches more inputs and outputs. See, e.g., the Anderson article.) (See, e.g., the article, N. McKeown et al., “Achieving 100% Throughput in an Input-Queued Switch”, Proc. IEEE INFOCOM, pp. 296-302 (1996).) However, the best known maximum matching algorithm has a complexity of O(n^2.5) (See, e.g., the technical publication, R. Tarjan, Data Structures and Network Algorithms, Bell Labs (1983).), which is too high for high speed implementation for relatively large n.
In practice, a number of maximal matching algorithms for matching input and output nodes have been proposed, such as parallel iterative matching (PIM) (See, e.g., the Anderson article.) and iterative round robin matching (iSLIP) (See, e.g., the McKeown article.). (A maximal matching is a match to which pairings cannot be trivially added; each node is either matched or has no edge to an unmatched node. See, e.g., the Anderson article.) For example, in the technique discussed in the McKeown article, each input port sends multiple requests to different output ports, one for the head-of-line cell in each of the virtual output queues. Then, at each output port, an arbiter chooses an input port which wins contention and sends a grant signal to the corresponding input. Since an input port may receive more than one grant signal, an arbiter at the input port chooses one and sends an acceptance signal to the corresponding output port. Although the iSLIP technique disclosed in the McKeown article is advantageous in that the arbiters become desynchronized, it does require a lot of communication between the input and output ports. Moreover, each of the arbitrations is on the order of the number of output ports N.
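The request, grant and accept exchange described above can be sketched for a single iteration as follows. This is a simplified illustration rather than a faithful reproduction of the technique in the McKeown article: the function names are hypothetical, only one iteration is performed, and the rule that pointers advance only when a grant is accepted is stated here as an assumption of the sketch.

```python
def round_robin_pick(candidates, pointer, size):
    """Return the first index at or after `pointer` (cyclically) in `candidates`."""
    for offset in range(size):
        idx = (pointer + offset) % size
        if idx in candidates:
            return idx
    return None


def request_grant_accept(voq_occupancy, grant_ptrs, accept_ptrs):
    """One request-grant-accept iteration for an n-by-n switch.

    voq_occupancy[i][j] is True if input i holds a cell for output j.
    grant_ptrs (one per output) and accept_ptrs (one per input) are lists
    that are updated in place. Returns a dict {input: output}.
    """
    n = len(voq_occupancy)

    # Request + grant: each output arbiter picks one requesting input.
    grants = {}  # input -> set of outputs that granted it
    for out in range(n):
        requesters = {i for i in range(n) if voq_occupancy[i][out]}
        chosen = round_robin_pick(requesters, grant_ptrs[out], n)
        if chosen is not None:
            grants.setdefault(chosen, set()).add(out)

    # Accept: each input arbiter picks one of the grants it received.
    matches = {}
    for inp, granting_outputs in grants.items():
        out = round_robin_pick(granting_outputs, accept_ptrs[inp], n)
        matches[inp] = out
        # Assumption: pointers move one past the matched port only on accept.
        grant_ptrs[out] = (inp + 1) % n
        accept_ptrs[inp] = (out + 1) % n
    return matches
```

In a 2-by-2 example where input 0 has cells for both outputs and input 1 has a cell only for output 0, the first iteration matches only input 0 to output 0. On the next time slot the updated pointers steer output 0 toward input 1, and both inputs are matched, which illustrates how the arbiters drift apart (desynchronize). Each arbiter still scans up to N ports, and three signals cross between inputs and outputs per iteration.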
Thus, better arbitration methods, and apparatus for implementing such methods, are needed.
§1.2.4 Needs not Met by Known Contention Resolution Schemes
As stated above, there are several methods which perfectly emulate pure output queuing under a moderate speedup factor (2-4) so that ideal packet scheduling can be realized at the outputs. These methods use the state of output packet scheduling as the arbitration priority, and iterative stable matching is needed to ensure perfect emulation. While these methods might be the future choice for perfect scheduling and providing delay bounds, their time complexity, which is at least on the order of N matching iterations (where N is the number of output ports), is infeasible with existing electronic technology for a terabit-per-second switch. Together with the sorting time required to emulate the desired fair queuing, the total time budget can be as large as that of N simple arbitrations. The enormous state maintenance and the large amount of state information exchanged between inputs and outputs also make it impractical to implement perfect emulation of fair queuing with stable matching.
In the present invention, the arbitration may be separated from the output packet scheduling to keep the implementation and time complexities reasonable. Although no absolute delay bounds can be obtained when the arbitration is separated from the output scheduling, and perfect emulation of output queuing cannot be realized, delay bounds are still attainable in the statistical sense. A delay bound is said to be statistical if the portion of packets with an undesired delay is bounded by an acceptable probability. Relaxing the delay bound requirement from absolute bounds to statistical bounds should not cause a significant performance degradation because, even if the delay bound is absolutely guaranteed, some cells may still be lost due to buffer overflow and other reasons. The statistical delay bound can be achieved, and the probability of exceeding it can be controlled to be as small as the packet loss rate, under some speedup factors and certain traffic circumstances.
The present invention may use a novel dual round robin (DRR) arbitration scheme in which input selection and output contention resolution are separately handled by two independent sets of round-robin arbiters. Among the virtual output queues (VOQs) maintained at each input, a cell is selected in a round-robin manner to be the request for output contention resolution. The selected cell keeps contending until it wins a token, and then the next cell is selected. Compared with first-in-first-out (FIFO) input queuing, the novel dual round robin arbitration scheme reduces the destination correlation of the cell arrival sequence for output contention resolution and thus significantly improves the delay performance of bursty traffic.
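The two-stage structure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the arbiter design of the invention: the class name `DualRoundRobin`, the pointer update rules, and the assumption that the caller keeps a selected cell in its VOQ until it is matched are all hypothetical choices made for the example.

```python
class DualRoundRobin:
    """Hypothetical sketch: input selection, then output contention resolution."""

    def __init__(self, n):
        self.n = n
        self.input_ptrs = [0] * n    # per-input round-robin pointer over its VOQs
        self.output_ptrs = [0] * n   # per-output round-robin pointer over inputs
        self.pending = [None] * n    # output each input is currently contending for

    def schedule(self, voq_occupancy):
        """Run one arbitration; voq_occupancy[i][j] is True if input i has a
        cell for output j. Returns a dict {input: output} of winners."""
        n = self.n
        # Stage 1 (input selection): an input with no outstanding request
        # selects one non-empty VOQ in round-robin order.
        for i in range(n):
            if self.pending[i] is None:
                for offset in range(n):
                    j = (self.input_ptrs[i] + offset) % n
                    if voq_occupancy[i][j]:
                        self.pending[i] = j
                        self.input_ptrs[i] = (j + 1) % n
                        break
        # Stage 2 (output contention resolution): each output grants one of
        # the inputs requesting it, again in round-robin order.
        matches = {}
        for j in range(n):
            for offset in range(n):
                i = (self.output_ptrs[j] + offset) % n
                if self.pending[i] == j:
                    matches[i] = j
                    self.output_ptrs[j] = (i + 1) % n
                    self.pending[i] = None  # winner selects a new cell next time
                    break
        return matches
```

A losing input keeps its request pending, so the selected cell keeps contending until it wins, while a winning input moves on to its next VOQ. Unlike the request, grant and accept exchange, each input presents a single request per arbitration, so no accept step is needed.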
The present invention may meet stringent arbitration time constraints to resolve output port contention by using a novel token tunneling arbitration scheme for output port contention resolution. This scheme is a variation of the ring reservation method proposed in the article, B. Bingham et al., “Reservation-Based Contention Resolution Mechanism for Batcher-Banyan Packet Switches”, Electronics Letters, Vol. 24, No. 13, pp. 772-3 (June 1988), and is fair. The arbitration time of the ring reservation method is proportional to the number of switch ports. With token tunneling arbitration, it is possible to reduce the arbitration time to the order of the square root of the number of ports. The ring reservation method proposed in the Bingham article is implemented using sequential logic. On the other hand, the token tunneling arbitration scheme of the present invention is implemented with combinational logic, which makes it even faster. Thus, the delay in the basic arbitration unit of the present invention is comparable to that of the bi-directional arbiter described in the article, K. Genda et al., “A 160 Gb/s ATM Switching System Using an Internal Speed-Up Crossbar Switch”, Proc. GLOBECOM '94, pp. 123-33 (November 1994). However, the overall arbitration delay is much smaller with the present invention because of the token tunneling method. Furthermore, the present invention may be implemented with only two pins per output port, compared to six in the switch discussed in the Genda article. Crossbar chips are generally pad-limited and therefore the number of pins required per port determines the number of ports that can be accommodated in a single chip.