§1.1 Field of the Invention
In general, the present invention concerns congestion control and traffic management in networks and inter-networks operating at relatively high data rates and carrying information which may have differing quality of service (or “QoS”) requirements. In particular, the present invention concerns methods and apparatus for fairly servicing queues at an output port of a switch (for switching ATM packets for example) or router (for routing TCP/IP packets for example).
§1.2 Related Art
§1.2.1 The Growth of Network and Internetwork Communications
Communications networks permit remote people or machines to communicate voice or data (also referred to as “traffic” or “network traffic”). These networks continue to evolve to meet new demands placed upon them. Different applications place different demands, often on the same network. In particular, a certain application may require that its traffic be communicated (i) with minimum delay, (ii) at a fast rate, (iii) with maximum reliability, and/or (iv) to minimize communications (service) cost. For example, people would not tolerate much delay in their voice communications during a telephone call. High definition video requires a fast rate, or a high bandwidth, as well as low jitter, or delay variations. However, video communications may be able to tolerate some data corruption or loss to the extent that such losses are imperceptible or not annoying to people. The communications of important data, on the other hand, may tolerate delay, but might not tolerate data loss or corruption. Finally, an application may request that low priority data be communicated at a minimum cost. To the extent that the network traffic of an application does not have “special” requirements, it should be communicated with normal service.
Having introduced the fact that different applications may place different requirements on a communications network, a brief history of communications networks, and the emergence of packet switching, is now presented.
The public switched telephone network (or “PSTN”) was developed to carry voice communications to permit geographically remote people to communicate. Modems then came along, permitting computers to communicate data over the PSTN. Voice and modem communications over the PSTN use “circuit switching”. Circuit switching inherently involves maintaining a continuous real time communication channel at the full channel bandwidth between two points to continuously permit the transport of information throughout the duration of the call. Unfortunately, due to this inherent characteristic of circuit switching, it is inefficient for carrying “bursty” data traffic. Specifically, many services have relatively low information transfer rates; information transfer occurs as periodic bursts. Bursty communications do not require full channel bandwidth at all times during the duration of the call. Thus, when a circuit switched connection is used to carry bursty traffic, available communication bandwidth occurring between successive bursts is simply wasted.
Moreover, circuit switching is inflexible because the channel width is always the same. Thus, for example, a wide (e.g., 140 Mbit/second) channel would be used for all transmissions, even those requiring a very narrow bandwidth (e.g., 1 Kbit/second). In an attempt to solve the problem of wasted bandwidth occurring in circuit switching, multi-rate circuit switching was proposed. With multi-rate circuit switching, connections can have a bandwidth of a multiple of a basic channel rate (e.g., 1 Kbit/second). Although multi-rate circuit switching solves the problem of wasted bandwidth for services requiring only a narrow bandwidth, for services requiring a wide bandwidth, a number of multiple basic rate channels must be synchronized. Such synchronization becomes extremely difficult for wide bandwidth services. For example, a 140 Mbit/second channel would require synchronizing 140,000 1 Kbit/second channels. Moreover, multi-rate circuit switching includes the inherent inefficiencies of a circuit switch, discussed above, when bursty data is involved.
Multi-rate circuit switching having multiple “basic rates” has also been proposed. Unfortunately, the switch for multi-rate circuit switching is complex. Furthermore, the channel bandwidths are inflexible to meet new transmission rates. Moreover, much of the bandwidth might be idle when it is needed. Lastly, multiple basic rate circuit switching includes the inherent inefficiencies of a circuit switch, discussed above, when bursty data is involved.
In view of the above described problems with circuit switching, packet switched communications have become prevalent and are expected to be used extensively in the future. Two communications protocols, TCP/IP and ATM, are discussed in §§1.2.1.1 and 1.2.1.2 below.
§1.2.1.1 Internets
In recent decades, and in the past five to ten years in particular, computers have become interconnected by networks to an ever increasing extent; initially, via local area networks (or “LANs”), and more recently via LANs, wide area networks (or “WANs”) and the Internet. In 1969, the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense (DoD) deployed Arpanet as a way to explore packet-switching technology and protocols that could be used for cooperative, distributed, computing. Early on, Arpanet was used by the TELNET application which permitted a single terminal to work with different types of computers, and by the file transfer protocol (or “FTP”) which permitted different types of computers to transfer files from one another. In the early 1970s, electronic mail became the most popular application which used Arpanet.
This packet switching technology was so successful that the ARPA applied it to tactical radio communications (Packet Radio) and to satellite communications (SATNET). However, since these networks operated in very different communications environments, certain parameters, such as maximum packet size, were different in each case. Thus, methods and protocols were developed for “internetworking” these different packet switched networks. This work led to the transmission control protocol (or “TCP”) and the internet protocol (or “IP”) which became the TCP/IP protocol suite. Although the TCP/IP protocol suite, which is the foundation of the Internet, is known to those skilled in the art, it is briefly described in §1.2.1.1.1 below for the reader's convenience.
§1.2.1.1.1 The TCP/IP Protocol Stack
The communications task for TCP/IP can be organized into five (5) relatively independent layers; namely, (i) an application layer, (ii) a host-to-host layer, (iii) an Internet layer, (iv) a network access layer, and (v) a physical layer. The physical layer defines the interface between a data transmission device (e.g., a computer) and a transmission medium (e.g., twisted pair copper wires, optical fiber, etc.). It specifies the characteristics of the transmission medium and the nature of the signals, the data rate, etc. The network access layer defines the interface between an end system and the network to which it is attached. It concerns access to, and routing data across, a network. Frame Relay is an example of a network access layer. The internet layer (e.g., IP) defines interfaces between networks and provides routing information across multiple networks. The host-to-host layer (e.g., TCP) concerns assuring the reliability of the communication. Finally, the application layer provides an interface to support various types of end user applications (e.g., the simple mail transfer protocol (or “SMTP”) for e-mail, the file transfer protocol (or “FTP”), etc.).
Basically, each of the layers encapsulates, or converts, data from the next higher level layer. For example, referring to FIG. 1, user data 100 as a byte stream is provided with a TCP header 102 to form a TCP segment 110. The TCP segment 110 is provided with an IP header 112 to form an IP datagram 120. The IP datagram 120 is provided with a network header 122 to define a network-level packet 130. The physical layer converts the network-level packet to radio, electrical, optical (or other) signals sent over the transmission medium at a specified rate with a specified type of modulation.
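By way of illustration only, the nesting of headers described above may be sketched as follows. The header contents are placeholders, the four octet network header length is an assumption, and the function name `encapsulate` is hypothetical; the sketch shows only the order in which the headers are prepended.

```python
def encapsulate(user_data: bytes) -> bytes:
    """Wrap user data 100 in placeholder TCP, IP and network headers."""
    tcp_header = b"T" * 20   # stands in for TCP header 102 (20 octets minimum)
    ip_header = b"I" * 20    # stands in for IP header 112 (IPv4 without options)
    net_header = b"N" * 4    # stands in for network header 122 (length assumed)

    tcp_segment = tcp_header + user_data       # TCP segment 110
    ip_datagram = ip_header + tcp_segment      # IP datagram 120
    network_packet = net_header + ip_datagram  # network-level packet 130
    return network_packet


packet = encapsulate(b"hello")
print(len(packet))
```

The outermost header is the one applied last, so a receiver strips the headers in the reverse order.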
The TCP header 102, as illustrated in FIG. 2, includes at least twenty (20) octets (i.e., 160 bits). Fields 202 and 204 identify ports at the source and destination systems, respectively, that are using the connection. Values in the sequence number 206, acknowledgement number 208 and window 216 fields are used to provide flow and error control. The value in the checksum field 218 is used to detect errors in the TCP segment 110.
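As a minimal sketch, the fixed twenty octet portion of such a TCP header might be unpacked as shown below. The function name and the dictionary keys are hypothetical; the field order follows the standard TCP header layout, which is assumed to match FIG. 2.

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed 20-octet part of a TCP header (network byte order)."""
    if len(segment) < 20:
        raise ValueError("a TCP header has at least 20 octets")
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "source_port": src_port,                     # field 202
        "destination_port": dst_port,                # field 204
        "sequence_number": seq,                      # field 206
        "acknowledgement_number": ack,               # field 208
        "header_octets": (offset_flags >> 12) * 4,   # data offset, in octets
        "window": window,                            # field 216
        "checksum": checksum,                        # field 218
        "urgent_pointer": urgent,
    }
```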
FIGS. 3A and 3B illustrate two (2) alternative IP headers 112 and 112′, respectively. Basically, FIG. 3A depicts the IP protocol (Version 4) which has been used. FIG. 3B depicts a next generation IP protocol (Version 6) which, among other things, provides for more source and destination addresses.
More specifically, referring to FIG. 3A, the four (4) bit version field 302 indicates the version number of the IP, in this case, version 4. The four (4) bit Internet header length field 304 identifies the length of the header 112 in 32-bit words. The eight (8) bit type of service field 306 indicates the service level that the IP datagram 120 should be given. The type of service (or “TOS”) field 306 will be discussed in more detail in §1.2.2.1.1 below. The sixteen (16) bit total length field 308 identifies the total length of the IP datagram 120 in octets. The sixteen (16) bit identification field 310 is used to help reassemble fragmented user data carried in multiple packets. The three (3) bit flags field 312 is used to control fragmentation. The thirteen (13) bit fragment offset field 314 is used to reassemble a datagram 120 that has become fragmented. The eight (8) bit time to live field 316 defines a maximum time that the datagram is allowed to exist within the network it travels over. The eight (8) bit protocol field 318 defines the higher-level protocol to which the data portion of the datagram 120 belongs. The sixteen (16) bit header checksum field 320 permits the integrity of the IP header 112 to be checked. The 32 bit source address field 322 contains the IP address of the sender of the IP datagram 120 and the 32 bit destination address field 324 contains the IP address of the host to which the IP datagram 120 is being sent. Options and padding 326 may be used to describe special packet processing and/or to ensure that the header 112 takes up a complete set of 32 bit words.
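The field widths listed above can be illustrated with a short parsing sketch. It assumes the fields appear in the order in which they are described, handles only the first twenty octets (options and padding 326 are ignored), and uses hypothetical names throughout.

```python
import ipaddress
import struct


def parse_ipv4_header(datagram: bytes) -> dict:
    """Unpack the first 20 octets of an IP (Version 4) header."""
    if len(datagram) < 20:
        raise ValueError("an IPv4 header has at least 20 octets")
    (ver_ihl, tos, total_length, identification, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", datagram[:20])
    return {
        "version": ver_ihl >> 4,                  # field 302 (4 bits)
        "header_words": ver_ihl & 0x0F,           # field 304, in 32-bit words
        "type_of_service": tos,                   # field 306 (8 bits)
        "total_length": total_length,             # field 308, in octets
        "identification": identification,         # field 310
        "flags": flags_frag >> 13,                # field 312 (3 bits)
        "fragment_offset": flags_frag & 0x1FFF,   # field 314 (13 bits)
        "time_to_live": ttl,                      # field 316
        "protocol": protocol,                     # field 318
        "header_checksum": checksum,              # field 320
        "source": str(ipaddress.IPv4Address(src)),       # field 322
        "destination": str(ipaddress.IPv4Address(dst)),  # field 324
    }
```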
Referring to FIG. 3B, the four (4) bit version field 302 indicates the version number of the IP, in this case, version 6. The four (4) bit priority field 328 enables a sender to prioritize packets sent by it. The 24 bit flow label field 330 is used by a source to label packets for which special handling is requested. The sixteen (16) bit payload length field 332 identifies the size of the data carried in the packet. The eight (8) bit next header field 334 is used to indicate whether another header is present and if so, to identify it. The eight (8) bit hop limit field 336 serves to discard the IP datagram 120 if a hop limit (i.e., the number of times the packet is routed) is exceeded. Also provided are 128 bit source and destination address fields 322′ and 324′, respectively.
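A corresponding sketch for the Version 6 header is given below. It follows the bit widths exactly as stated above (a four bit priority field and a 24 bit flow label); later revisions of the IPv6 specification divide those 28 bits differently, so the sketch should not be read as a parser for current IPv6 traffic. All names are hypothetical.

```python
import ipaddress
import struct


def parse_ipv6_header(datagram: bytes) -> dict:
    """Unpack a 40-octet IP (Version 6) header using the widths in the text."""
    if len(datagram) < 40:
        raise ValueError("an IPv6 header has 40 octets")
    first_word, payload_length, next_header, hop_limit, src, dst = struct.unpack(
        "!IHBB16s16s", datagram[:40])
    return {
        "version": first_word >> 28,               # field 302 (4 bits)
        "priority": (first_word >> 24) & 0x0F,     # field 328 (4 bits)
        "flow_label": first_word & 0xFFFFFF,       # field 330 (24 bits)
        "payload_length": payload_length,          # field 332
        "next_header": next_header,                # field 334
        "hop_limit": hop_limit,                    # field 336
        "source": str(ipaddress.IPv6Address(src)),       # 128-bit source address
        "destination": str(ipaddress.IPv6Address(dst)),  # 128-bit destination address
    }
```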
Having described the TCP/IP protocol suite, the routing of a TCP/IP packet is now described in §1.2.1.1.2 below.
§1.2.1.1.2 Routing TCP/IP Packets
A TCP/IP packet is communicated over the Internet (or any internet or intranet) via routers. Basically, routers in the Internet use destination address information (Recall fields 324 and 324′.) to forward packets towards their destination. Routers interconnect different networks. More specifically, routers accept incoming packets from various connected networks, use a look-up table to determine a network upon which the packet should be placed, and route the packet to the determined network. The router may buffer incoming packets if the networks are providing packets faster than it can route them. Similarly, the router may buffer outgoing packets if the router provides outgoing packets faster than the determined networks can accept them.
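The look-up step may be sketched as follows. The table contents, the network names and the use of longest-prefix matching to choose among several matching entries are all assumptions made for illustration; the text above states only that a look-up table maps a destination address to a network.

```python
import ipaddress

# Hypothetical look-up table: (destination prefix, attached network).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "net-A"),
    (ipaddress.ip_network("10.1.0.0/16"), "net-B"),
    (ipaddress.ip_network("0.0.0.0/0"), "default"),
]


def next_network(destination: str) -> str:
    """Return the network on which a packet for `destination` is placed."""
    address = ipaddress.ip_address(destination)
    # Collect every matching entry, then prefer the most specific prefix.
    matches = [(net.prefixlen, name) for net, name in ROUTES if address in net]
    return max(matches)[1]
```

Because the table contains a default entry that matches every address, `matches` is never empty in this sketch.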
FIG. 4, which includes FIGS. 4A through 4C, illustrates the communication of data from a sender, to a receiver, using the TCP/IP protocol suite. Referring first to FIG. 4A, an application protocol 402 prepares a block of data (e.g., an e-mail message (SMTP), a file (FTP), user input (TELNET), etc.) 100 for transmission. Before the data 100 are sent, the sending and receiving applications agree on a format and encoding and agree to exchange data. If necessary, the data are converted (character code, compression, encryption, etc.) to a form expected by the destination.
The TCP layer 404 may segment the data block 100, keeping track of the sequence of the blocks. Each TCP segment 110 includes a header 102 containing a sequence number (recall field 206) and a frame check sequence to detect errors. A copy of each TCP segment is made so that, in the event of segment loss or damage, it can be retransmitted. When an acknowledgement of safe receipt is received from the receiver, the copy of the segment is erased.
The IP layer 406 may break a TCP segment into a number of datagrams 120 to meet size requirements of networks over which the data will be communicated. Each datagram includes the IP header 112.
A network layer 408, such as frame relay for example, may apply a header and trailer 122 to frame the datagram 120. The header may include a connection identifier and the trailer may contain a frame check sequence for example. Each frame 130 is then transmitted, by the physical layer 410, over the transmission medium as a sequence of bits.
FIG. 4B illustrates the operation of TCP/IP at a router in the network. The physical layer 412 receives the incoming signal 130 from the transmission medium and interprets it as a frame of bits. The network (e.g., frame relay) layer 414 removes the header and trailer 122 and processes them. A frame check sequence may be used for error detection. A connection number may be used to identify the source. The network layer 414 then passes the IP datagram 120 to the IP layer 418.
The IP layer examines the IP header 112 and makes a routing decision (Recall the destination address 324, 324′.). A local line control (or “LLC”) layer 420, using a simple network management protocol (or “SNMP”), adds a header 450 which contains a sequence number and address information. Another network layer 422 (e.g., media access control (or “MAC”)) adds a header and trailer 460. The header may contain address information and the trailer may contain a frame check sequence. The physical layer 424 then transmits the frame 150 over another transmission medium.
FIG. 4C illustrates the operation of TCP/IP at a receiver. The physical layer 432 receives the signal from the transmission medium and interprets it as a frame of bits. The network layer 434 removes the header and trailer 460 and processes them. For example, the frame check sequence in the trailer may be used for error detection. The resulting packet 140 is passed to the transport layer 436 which processes the header 450 for flow and error control. The resulting IP datagram 120 is passed to the IP layer 438 which removes the header 112. Frame check sequence and other control information may be processed at this point.
The TCP segment 110 is then passed to the TCP layer 440 which removes the header 102 and may check the frame check sequence (in the event of a match, the match is acknowledged and in the event of a mismatch, the packet is discarded). The TCP layer 440 then passes the data 100 to the application layer 442. If the user data was segmented (or fragmented), the TCP layer 440 reassembles it. Finally, the application layer 442 performs any needed transformations, such as decompression and decryption for example, and directs the data to an appropriate area of the receiver, for use by the receiving application.
§1.2.1.2 High Speed Networks
As discussed in §1.2.1 above, there has been a trend from circuit switched networks towards packet switched networks. For example, packet switched communications presently appear to be the preferred mode of communication over a Broadband-Integrated Services Digital Network (or “B-ISDN”) service. Packet switching includes normal packet switching (e.g., X.25) and fast packet switching (e.g., Asynchronous Transfer Mode or “ATM”). Normal packet switching assumes certain errors at each data link are probable enough to require complex protocols so that such errors can be controlled at each link. Link errors were a valid assumption and concern at one time. However, today data links are very reliable such that the probability of errors being introduced by data links is no longer of any concern. Hence, fast packet switching is becoming more prominent. ATM is discussed in §1.2.1.2.1 below.
§1.2.1.2.1 The Asynchronous Transfer Mode (ATM) Protocol
Since data links are very reliable and the probability of errors being introduced by data links is no longer of any great concern, ATM fast packet switching does not correct errors or control flow within the network (i.e., on a link-by-link basis). Instead, ATM is only concerned with three types of errors; namely bit errors, packet loss, and packet insertion. Bit errors are detected and/or corrected using end-to-end protocols. Regarding packet loss and insertion errors, ATM only uses prophylactic actions when allocating resources during connection set-up. That is, ATM operates in a connection-oriented mode such that when a connection is requested, a line terminal first checks whether sufficient resources (i.e., whether sufficient bandwidth and buffer area) are available. When the transfer of information is complete, the resources are “released” (i.e., are made available) by the line terminal. In this way, ATM reduces the number of overhead bits required with each cell, thereby permitting ATM to operate at high data rates.
The ATM protocol transfers data in discrete sized chunks called “cells”. The use of fixed sized cells simplifies the processing required at each network node (e.g., switch) thereby permitting ATM to operate at high data rates. The structure of ATM cells is described in more detail below.
Finally, the ATM protocol permits multiple logical (or “virtual”) connections to be multiplexed over a single physical interface. As shown in FIG. 5, logical connections in ATM are referred to as virtual channel connections (or “VCCs”) 510. A VCC 510 is the basic unit of switching in an ATM network. A VCC 510 is established between two end users, through the network. A variable-rate, full-duplex flow of ATM cells may be exchanged over the VCC 510. VCCs 510 may also be used for control signaling, network management and routing.
A virtual path connection (or VPC) 520 is a bundle of VCCs 510 that have the same end points. Accordingly, all of the cells flowing over all VCCs 510 in a single VPC 520 may be switched along the same path through the ATM network. In this way, the VPC 520 helps contain network control costs by grouping connections sharing common paths through the network. That is, network management actions can be applied to a small number of virtual paths 520 rather than a large number of individual virtual channels 510.
Finally, FIG. 5 illustrates that multiple virtual paths 520 and virtual channels 510 (i.e., logical connections) may be multiplexed over a single physical transmission path 530.
FIG. 6 illustrates the basic architecture for an interface between a user and a network using the ATM protocol. The physical layer 610 specifies a transmission medium and a signal-encoding (e.g., data rate and modulation) scheme. Data rates specified at the physical layer 610 may be 155.52 Mbps or 622.08 Mbps, for example. The ATM layer 620 defines the transmission of data in fixed sized cells and also defines the use of logical connections, both introduced above. The ATM adaptation layer 630 supports information transfer protocols not based on ATM. It maps information between a high layer 640 and ATM cells.
Recall that the ATM layer 620 places data in fixed sized cells (also referred to as a packet). An ATM packet includes a header field (generally five (5) bytes) and a payload (or information) field (generally 48 bytes). The main function of the header is to identify a virtual connection to guarantee that the ATM packet is properly routed through the network. Switching and/or multiplexing is first performed on virtual paths and then on virtual channels. The relatively short length of the payload or information field reduces the size required for internal buffers at switching nodes thereby reducing delay and delay jitter.
More specifically, FIG. 7A illustrates an ATM cell 700 having a header 710 as formatted at a user-network interface, while FIG. 7B illustrates the ATM cell 700′ having a header 710′ as formatted internal to the network. Referring first to the header 710 as formatted at the user-network interface, a four (4) bit generic flow control field 712 may be used to assist an end user in controlling the flow of traffic for different qualities of service. The eight (8) bit virtual path identifier field 714 contains routing information for the network. Note that this field 714′ is expanded to twelve (12) bits in header 710′ as formatted in the network. In both headers 710 and 710′, a sixteen (16) bit virtual channel identifier field 716 contains information for routing the cell to and from the end users. A three (3) bit payload type field 718 indicates the type of information in the 48 octet payload portion 750 of the packet. (The coding of this field is not particularly relevant for purposes of the present invention.) A one (1) bit cell loss priority field 720 contains information to let the network know what to do with the cell in the event of congestion. A value of 0 in this field 720 indicates that the cell is of relatively high priority and should not be discarded unless absolutely necessary. A value of 1 in this field indicates that the network may discard the cell. Finally, an eight (8) bit header error control field 722 contains information used for error detection and possibly error correction as well. The remaining 48 octets 750 define an information field.
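The two header formats can be illustrated with the following sketch, which assumes the fields are packed most significant bit first in the order in which they are described. The function name and the `uni` flag are hypothetical.

```python
def parse_atm_header(header: bytes, uni: bool = True) -> dict:
    """Split a 5-octet ATM cell header into its fields.

    uni=True  : user-network interface format (header 710)
    uni=False : format internal to the network (12-bit virtual path identifier)
    """
    if len(header) != 5:
        raise ValueError("an ATM cell header has 5 octets")
    word = int.from_bytes(header[:4], "big")   # first 32 bits of the header
    fields = {
        "vci": (word >> 4) & 0xFFFF,           # field 716 (16 bits)
        "payload_type": (word >> 1) & 0x7,     # field 718 (3 bits)
        "cell_loss_priority": word & 0x1,      # field 720 (1 bit)
        "header_error_control": header[4],     # field 722 (8 bits)
    }
    if uni:
        fields["generic_flow_control"] = word >> 28   # field 712 (4 bits)
        fields["vpi"] = (word >> 20) & 0xFF           # field 714 (8 bits)
    else:
        fields["vpi"] = word >> 20                    # expanded VPI (12 bits)
    return fields
```

A cell whose `cell_loss_priority` is 1 is one that the network may discard under congestion.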
Fast packet switching, such as ATM switching, has three main advantages. First, ATM switching is flexible and is therefore safe for future transfer rates. Second, no resources are specialized and consequently, all resources may be optimally shared. Finally, ATM switches permit economies of scale for such a universal network.
§1.2.1.2.2 Switches
ATM packets (cells) are routed through a network by means of a series of ATM switches. An ATM switch must perform three basic functions for point-to-point switching; namely, (i) routing the ATM cell, (ii) updating the virtual channel identifier (VCI) and virtual path identifier (VPI) in the ATM cell header (Recall fields 714, 714′ and 716′.), and (iii) resolving output port contention. The first two functions, namely routing and updating, are performed by a translation table belonging to the ATM switch. The translation table converts an incoming link (input port) and VCI/VPI to an outgoing link (output port) and VCI/VPI. An arbiter is used to resolve output port contention among two or more ATM cells destined for the same output port. The arbiter chooses an ATM cell which “wins” contention (i.e., which is applied to the output port). Other ATM cells contending for the output port “lose” contention (i.e., they must wait before being applied to the output port).
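The three functions may be sketched together as shown below. The table entries are invented for illustration, and the arbitration policy (the cell from the lowest numbered input port wins) is a placeholder assumption rather than a policy taken from the text.

```python
from collections import defaultdict

# Hypothetical translation table:
# (input port, VPI, VCI) -> (output port, new VPI, new VCI)
TRANSLATION_TABLE = {
    (0, 5, 33): (2, 7, 41),
    (1, 5, 33): (2, 7, 42),
    (3, 1, 10): (0, 9, 10),
}


def route_and_arbitrate(arrivals):
    """Route one time slot's arrivals and resolve output port contention.

    arrivals: list of (input_port, vpi, vci) tuples.
    Returns (winners, losers); winners maps each output port to the
    (input_port, new_vpi, new_vci) applied to it, and losers must wait.
    """
    contenders = defaultdict(list)
    for in_port, vpi, vci in arrivals:
        out_port, new_vpi, new_vci = TRANSLATION_TABLE[(in_port, vpi, vci)]
        contenders[out_port].append((in_port, new_vpi, new_vci))

    winners, losers = {}, []
    for out_port, cells in contenders.items():
        cells.sort()                     # placeholder policy: lowest input port first
        winners[out_port] = cells[0]
        losers.extend(cells[1:])
    return winners, losers
```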
To prevent the ATM cells not winning contention for the output port from being lost, buffering is required. There are three basic buffering strategies; namely, pure input queuing, pure output queuing and central queuing. Pure input queuing provides a dedicated buffer at each input port. Arbitration logic is used to decide which inlet buffer will be next served. The arbitration logic may be simple (e.g., round robin in which the inlet buffers are served in order, or random in which the inlet buffers are served randomly) or complex (e.g., state dependent in which the most filled buffer is served next, or delay dependent in which the globally oldest cell is served next).
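Two of the arbitration disciplines mentioned above, round robin and state dependent (most filled buffer first), might be sketched as follows. The representation of each inlet buffer as a Python list and the function names are assumptions.

```python
def round_robin(buffers, last_served):
    """Return the index of the next non-empty inlet buffer after last_served."""
    count = len(buffers)
    for step in range(1, count + 1):
        candidate = (last_served + step) % count
        if buffers[candidate]:
            return candidate
    return None  # every inlet buffer is empty


def most_filled(buffers):
    """Return the index of the inlet buffer holding the most cells."""
    occupied = [index for index, cells in enumerate(buffers) if cells]
    if not occupied:
        return None
    return max(occupied, key=lambda index: len(buffers[index]))
```

The round robin choice depends only on the serving order, while the state dependent choice requires inspecting the occupancy of every buffer, which is one reason it is described as the more complex option.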
Unfortunately, with input queuing, an ATM cell in the front of the queue waiting for an occupied output channel to become available may block other ATM cells behind it which do not need to wait. This is known as head-of-line (HOL) blocking. A post office metaphor has been used to illustrate head-of-line (HOL) blocking in the book, M. dePrycker, Asynchronous Transfer Mode: Solution for Broadband ISDN, pp. 133-137 (Ellis Horwood Ltd., 1991). In the post office metaphor, people (representing ATM cells) are waiting in a line (representing an input buffer) for either a stamp window (a first output port) or an airmail window (a second output port). Assume that someone (an ATM cell) is already at the stamp window (the first output port) and that the first person in the line (the HOL of the input buffer) needs to go to the stamp window (the first output port). Assume further that no one is presently at the airmail window (the second output port) and that the second and third people in line (ATM cells behind the HOL cell in the input queue) want to go to the airmail window (the second output port). Although the airmail window (second output port) is available, the second and third people (ATM cells behind the HOL cell) must wait for the first person (the HOL cell) who is waiting for the stamp window (the first output port) to become free. Therefore, as the post office metaphor illustrates, the head-of-line (HOL) cell waiting for an output port to become free often blocks ATM cells behind it which would otherwise not have to wait. Simulations have shown that such head-of-line (HOL) blocking decreases switch throughput.
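The post office example can be reproduced with a small slot-by-slot sketch of a single input buffer. It assumes one cell may leave the buffer per time slot and that an output port is occupied for one slot per cell; the function name and data layout are hypothetical.

```python
from collections import deque


def serve_input_queue(wanted_ports, busy_until):
    """Serve one FIFO input buffer, head-of-line cell first.

    wanted_ports: output port wanted by each cell, HOL cell first.
    busy_until:   maps an output port to the first slot at which it is free.
    Returns a list of (cell_index, output_port, slot_served).
    """
    pending = deque(enumerate(wanted_ports))
    free_at = dict(busy_until)
    served = []
    slot = 0
    while pending:
        index, port = pending[0]          # only the HOL cell is considered
        if free_at.get(port, 0) <= slot:
            pending.popleft()
            served.append((index, port, slot))
            free_at[port] = slot + 1      # the port is occupied for one slot
        slot += 1
    return served


# The stamp window is occupied until slot 3; the airmail window is free.
print(serve_input_queue(["stamp", "airmail", "airmail"], {"stamp": 3}))
```

In this run the two airmail cells are not served until slots 4 and 5, even though the airmail window was free from slot 0 onward.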
Pure output buffering solves the head-of-line (HOL) blocking problems of pure input buffering by providing only the output ports with buffers. Since the ATM cells buffered at an output port are output in sequence (i.e., first in, first out, or “FIFO”), no arbitration logic is required. In the post office metaphor, the stamp window (first output port) has its own line (first output buffer) and the airmail window (second output port) has its own line (second output buffer).
Although pure output buffering clearly avoids HOL blocking that may occur in pure input port buffering, it does have some disadvantages. Specifically, to avoid cell loss, assuming N input ports, the system must be able to write N ATM cells into any one of the queues (or output buffers) during one cell time (i.e., within 2.8 microseconds, where 2.8 microseconds is (53 bytes * 8 bits/byte)/155.52 Mbit/second). Such a high memory write rate is necessary because it is possible that each of the ATM cells arriving at each of the input ports will require the same output port. This requirement on the memory speed of the output buffer becomes a problem as the size of the switch (i.e., as N) increases. Accordingly, for a 1024 by 1024 switch (i.e., a switch having 1024 inputs and 1024 outputs), pure output buffering is not feasible because the speed of the output port buffers would have to be fast enough to handle 1024 ATM cells in one cell time. This problem is discussed in more detail in §1.2.2.3.1 below.
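The memory speed requirement can be made concrete with a short calculation. Evaluated directly, the cell time expression gives roughly 2.7 microseconds, in line with the figure of about 2.8 microseconds quoted above; the per-write budget for a given N follows by division. The function names are hypothetical.

```python
CELL_BITS = 53 * 8            # one ATM cell: 53 octets
LINE_RATE_BPS = 155.52e6      # 155.52 Mbit/second


def cell_time_seconds() -> float:
    """Time taken to transmit one cell at the line rate."""
    return CELL_BITS / LINE_RATE_BPS


def write_time_budget_seconds(n_inputs: int) -> float:
    """Time available per cell write if N cells must be stored in one cell time."""
    return cell_time_seconds() / n_inputs


print(f"cell time: {cell_time_seconds() * 1e6:.3f} microseconds")
print(f"budget per write, N=1024: {write_time_budget_seconds(1024) * 1e9:.2f} ns")
```

For N = 1024 the budget is under three nanoseconds per cell write, which illustrates why pure output buffering does not scale to large switches.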
Central queuing includes a queue not assigned to any inlet (input port) or outlet (output port). Each outlet will select ATM cells destined for it in a first in, first out (FIFO) manner. However, the outlets must be able to know which cells are destined for them. Moreover, the read and write discipline of the central queue cannot be a simple FIFO because ATM cells destined for different outlets are all merged into a single queue. Turning again to the post office metaphor, a single line (central queue) of people (ATM cells) is waiting to visit the stamp window (a first output port) or the airmail window (a second output port). As a window opens up (i.e., as an output port becomes available), a server searches the line (central queue) for the next person (ATM cell) needing the available window (requiring the available output port). The server brings that person (ATM cell) to the open window (available output port) regardless of whether the person (the ATM cell) is at the front of the line (HOL). As the post office metaphor illustrates, the central queue requires a complex memory management system given the random accessibility required. Of course, the memory management system becomes more complex and cumbersome when the number of output ports (i.e., the size of the switch) increases.
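The search performed by the server in the metaphor may be sketched as follows. The central queue is modeled, by assumption, as a Python list of (cell, output port) pairs in arrival order, and the function name is hypothetical.

```python
def take_next_for(central_queue, output_port):
    """Remove and return the oldest cell destined for output_port.

    central_queue: list of (cell, wanted_port) pairs, oldest first.
    Returns None when no queued cell wants that port.
    """
    for position, (cell, wanted_port) in enumerate(central_queue):
        if wanted_port == output_port:
            central_queue.pop(position)   # removal from the middle of the queue
            return cell
    return None
```

Each outlet still sees its own cells in FIFO order, but removal may occur anywhere in the shared queue, which is the random access requirement noted above.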
Thus, conceptually, an ATM switch may include input port controllers for accepting ATM cells from various physical (or logical) links (Recall FIG. 5.), a switching fabric for forwarding cells to another link towards their destination, and output port controllers for buffering ATM cells to be accepted by various physical (or logical) links. An exemplary, scalable, ATM switch is disclosed in U.S. Pat. Nos. 5,724,351 and 5,790,539 (incorporated herein by reference).
xc2xa71.2.2 The Need to Consider Different Types of Traffic
As discussed in xc2xa71.2.1 above, different applications place different demands on communications networks. In particular, a certain application may require that its traffic be communicated (i) with minimum delay, (ii) at a fast rate, (iii) with maximum reliability, and/or (iv) to minimize communications (service) cost. For example, people would not tolerate much delay in their voice communications during a telephone call. High definition video requires a fast rate, or a high bandwidth, as well as low jitter, or delay variations. However, video communications may be able to tolerate some data corruption or loss to the extent that such losses are imperceptible or not annoying to people. The communications of important data, on the other hand, may tolerate delay, but might not tolerate data loss or corruption. Finally, an application may request that low priority data be communicated at a minimum cost. To the extent that the network traffic of an application does not have xe2x80x9cspecialxe2x80x9d requirements, it should be communicated with normal service.
Thus, many applications require a guaranteed quality of service (or xe2x80x9cQoSxe2x80x9d) from a network provider. The network provider, in turn, may see guaranteeing QoS as a means to add value to their network and increase revenues. Although quality of service issues are important, at least to some extent, in all communications networks, the invention will be described in the context of packet switched networks in general, and TCP/IP and ATM networks in particular. This is because TCP/IP and ATM networks are envisioned as carrying many different types of data for many different applications which have different needs.
The ways in which the TCP/IP and ATM protocols permit supporting networks to help guarantee quality of service are introduced in xc2xa7xc2xa71.2.2.1 and 1.2.2.2, respectively, below. Then, the ways in which output port queues of TCP/IP routers or ATM switches may be managed to manage traffic and meet congestion goals are discussed in xc2xa71.2.2.3 below. The challenges to scheduling and managing packets in the output port queues, which the present invention addresses, will also be discussed in that section.
§1.2.2.1 Internet Protocol
The fourth and sixth versions of the internet protocol (“IP”), discussed in §1.2.1.1.1 above, include fields which may be used to manage traffic over an internetwork. Although these fields are known to those skilled in the art, they are described in §§1.2.2.1.1 and 1.2.2.1.2 below for the reader's convenience.
§1.2.2.1.1 Type of Service Field
Recall from FIG. 3A above that version 4 of the internet protocol includes an eight (8) bit type of service field 306. As shown in FIG. 8, this field 306 includes a three (3) bit precedence sub-field 810 and a four (4) bit type of service sub-field 820. The type of service sub-field 820 guides an IP entity, in a source or a router, in selecting a next hop for the IP datagram. The precedence sub-field 810 guides the relative allocation of router resources for the datagram.
The eight (8) precedence levels encoded by the three (3) bit sub-field 810, in order of decreasing importance, are:
111 Network control;
110 Internetwork control;
101 Critical;
100 Flash override;
011 Flash;
010 Immediate;
001 Priority; and
000 Routine.
Routers may ignore this sub-field 810. If, on the other hand, a router supports the precedence sub-field 810, it may base route selection, subnetwork service, and/or queuing discipline on this sub-field 810. The present invention concerns the transmission of packets in the output port queues.
At present, the five (5) types of services encoded by the four (4) bit sub-field 820 are:
1000 Minimize delay;
0100 Maximize throughput;
0010 Maximize reliability;
0001 Minimize network charge costs; and
0000 Normal service.
As was the case with the precedence sub-field 810, routers may ignore this sub-field 820. If, on the other hand, a router supports the type of service sub-field 820, it may base route selection, subnetwork service, and/or queuing discipline on this sub-field 820. To reiterate, the present invention is concerned with queuing discipline. For example, a router may give preferential queue treatment to datagrams requesting minimized delay, and may attempt to avoid discarding (or dropping) datagrams requesting maximized reliability.
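The split of field 306 into its two sub-fields can be sketched as follows. This is a minimal illustration rather than a definitive decoder: the function name is hypothetical, and the bit positions (precedence in the three most significant bits, type of service in the next four, the last bit unused) are an assumption that the text above does not state.

```python
# Sketch only: the bit positions below are an assumption, not taken from the text.
TYPE_OF_SERVICE_NAMES = {
    0b1000: "Minimize delay",
    0b0100: "Maximize throughput",
    0b0010: "Maximize reliability",
    0b0001: "Minimize network charge costs",
    0b0000: "Normal service",
}


def parse_type_of_service_field(octet):
    """Split the 8-bit field 306 into (precedence, type_of_service) integers."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("field 306 is a single octet")
    precedence = (octet >> 5) & 0b111        # assumed: three most significant bits
    type_of_service = (octet >> 1) & 0b1111  # assumed: next four bits
    return precedence, type_of_service


precedence, tos = parse_type_of_service_field(0b10110000)
print(precedence, TYPE_OF_SERVICE_NAMES.get(tos, "unlisted combination"))
```

A router that supports these sub-fields could use the two returned values to pick a queue; a router that ignores them would simply skip this step.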
§1.2.2.1.2 Priority Field
Recall from FIG. 3B above that version 6 of the internet protocol includes a four (4) bit priority field 328. This field 328 allows a source to identify desired transmit and delivery priorities of a packet relative to other packets from the same source. First, packets are classified as being part of traffic with the source either providing or not providing congestion control. Second, packets are assigned to one (1) of eight (8) levels of relative priority within each classification.
Congestion controlled traffic can, to differing extents, be delayed or be received out of order. Thus, the source can slow its transmission of congestion controlled traffic in response to network congestion. Version 6 of the internet protocol defines eight (8) categories of congestion controlled traffic. They are, in order of increasing priority:
0 Uncharacterized traffic;
1 “Filler” traffic;
2 Unattended data transfer (e.g., e-mail);
3 (Reserved);
4 Attended bulk transfer (e.g., FTP, HTTP);
5 (Reserved);
6 Interactive traffic (e.g., TELNET); and
7 Internet control traffic.
Non-congestion controlled traffic is traffic for which a constant (or at least relatively smooth) data rate and a constant (or at least relatively smooth) delivery delay are desired. For example, real time audio or video may be characterized as non-congestion controlled traffic. However, some packet loss (dropped packets) is acceptable. This traffic has eight (8) levels of priority, from the lowest priority (8) to the highest priority (15). For example, high definition video has a fair amount of redundancy and the loss of a few packets would likely be imperceptible, while with low fidelity audio, the loss of a few packets would be readily perceived as annoying clicks and buzzes. Thus, low fidelity audio would have a higher priority than high definition video.
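The two-step classification of the four (4) bit priority field 328 described above can be sketched as below. The function name and the returned labels are hypothetical; only the value ranges (0 through 7 and 8 through 15) come from the text.

```python
def classify_priority(priority):
    """Return (traffic class, relative level 0..7) for a 4-bit priority value."""
    if not 0 <= priority <= 15:
        raise ValueError("priority field 328 holds four bits")
    if priority <= 7:
        # Values 0..7: the source provides congestion control.
        return "congestion-controlled", priority
    # Values 8..15: the source does not provide congestion control.
    return "non-congestion-controlled", priority - 8


print(classify_priority(6))
print(classify_priority(15))
```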
§1.2.2.1.3 Integrated Services Architecture (“ISA”)
Historically, internets based on the IP protocol provided a simple “best effort” delivery service. The fields discussed above concerning priority and type of service have generally been ignored by routers. Basically, the routers merely (i) used routing algorithms to select routes to minimize delay, and (ii) discarded most recently received packets in the event of a buffer overflow. These mechanisms are quickly becoming unsatisfactory. Given the need to support a variety of traffic having a variety of quality of service (“QoS”) requirements within TCP/IP networks, the Integrated Services Architecture (or “ISA”) was developed to provide QoS transport over IP-based internets. Basically, the ISA decides how to share available network capacity in times of congestion.
Basically, the ISA manages congestion and provides QoS transport via (i) admission control, which requires that a reservation be made for new flows (Recall fields 310 and 330 of FIGS. 3A and 3B, respectively.), (ii) routing algorithms which consider QoS parameters, (iii) queuing policies which consider QoS parameters, and (iv) a discard policy based on QoS parameters.
FIG. 9 is a high level block diagram of the ISA architecture 900. The routing protocol(s) 914 maintains a routing database 912 that provides a “next hop” to be taken for each destination address and each flow. The classifier and route selection means 910 determines the next hop address for a packet, based on the packet's class and destination address (recall field 324 or 324′). A class corresponds to flow(s) having the same QoS requirements.
A reservation protocol is used, among routers and between routers and end users, to reserve resources for a new flow at a given level of QoS. It updates the traffic control database 922 used by the packet scheduler 920 to determine the service provided for the packets of each flow. An admission control means 926 determines if sufficient resources are available for a flow requesting a reservation at a given QoS, and is invoked by the reservation protocol 924. The management agent 928 can modify the traffic control database 922 and set admission control policies in the admission control means 926. The packet scheduler 920 manages one or more queues (930, 940) for each output port of a router. More specifically, it determines the order in which queued packets are transmitted and, if necessary, which packets to discard (or drop). To reiterate, the present invention concerns the transmission of packets from output port queues.
In this way, ISA provides three (3) categories of service; namely, guaranteed, controlled load, and best effort. Guaranteed service (i) assures capacity level or data rate, (ii) bounds queuing delays through the network, and (iii) eliminates queuing losses. Controlled load service (i) approximates best effort service under unloaded conditions, (ii) does not bound queuing delays (though a very high percentage of packets do not experience delays), and (iii) has almost no queuing losses. Best effort service is just as its name suggests, with no special priorities.
§1.2.2.2 ATM Protocol
ATM networks also have the challenge of providing various qualities of service to various types of traffic. Basically, ATM networks need a control scheme for delay sensitive traffic, such as real time voice and video, and for bursty traffic (i.e., irregular traffic having intermittent “bursts” of transmitted data). The aspects of ATM that provided the benefits discussed in §1.2.1.2.1 above, present challenges when it comes to controlling traffic. For example, traffic not amenable to flow control, such as voice and video sources, will continue transmitting even when the network is congested. Further, their high speed switching and transmission make ATM networks more volatile in terms of congestion and traffic control. That is, transmission and switching are so fast that, during the time between the detection of a problem (e.g., a dropped cell) and its indication at the transmission source, a lot of unnecessary data will have already been transmitted. In other words, feedback is slow relative to propagation delays of the network.
The ATM forum has defined five (5) service categories; namely, (i) constant bit rate (or “CBR”), (ii) real-time variable bit rate (or “rt-VBR”), (iii) non-real-time variable bit rate (or “nrt-VBR”), (iv) unspecified bit rate (or “UBR”) and (v) available bit rate (or “ABR”). Constant bit rate (CBR) service requires that the network support a fixed data rate. Real-time variable bit rate (rt-VBR) is defined in terms of a sustained rate for normal use and a faster burst rate for occasional use at peak periods. The faster rate is guaranteed but the user will not continuously require this rate. Bounds on cell transfer delay and delay variation are also specified. Non-real-time variable bit rate (nrt-VBR) is similar to rt-VBR except there is no delay variation bound specified. Further, a certain low cell loss ratio is allowed. Unspecified bit rate (UBR) is a best effort service. That is, no amount of capacity is guaranteed and any cells may be discarded. ABR provides a user with a guaranteed minimum capacity. When additional capacity is available, the user may burst above the minimum rate, though with a minimized risk of cell loss.
The service categories defined by the ATM forum are characterized by a number of ATM attributes. These attributes fall into three (3) basic categories; namely, (i) traffic descriptors, (ii) QoS parameters, and (iii) other. Traffic descriptors characterize the traffic pattern of a flow of cells over an ATM connection. Such a traffic pattern is defined by (i) source traffic descriptors and (ii) connection traffic descriptors. Source traffic descriptors include (i) peak cell rate (or “PCR”), (ii) sustainable cell rate (or “SCR”), (iii) maximum burst size (or “MBS”), and (iv) minimum cell rate (or “MCR”). Connection traffic descriptors include (i) cell delay variation tolerance (or “CDVT”) and (ii) a conformance definition. Quality of service (“QoS”) parameters may include (i) peak-to-peak cell delay variation, (ii) maximum cell transfer delay, (iii) cell loss ratio, (iv) cell error ratio, (v) severely errored cell block ratio, (vi) cell misinsertion rate, and (vii) cell transfer delay. FIG. 10 is a plot of probability versus cell transfer delay and illustrates peak-to-peak cell delay variation and maximum cell transfer delay.
An ATM network may control traffic via (i) resource management using virtual paths, (ii) connection admission control, (iii) usage parameter control, (iv) traffic shaping, (v) selective cell discard, and (vi) cell scheduling. Selective cell discard and cell scheduling are performed at output ports of switches. The present invention concerns cell (in the context of ATM switches for example) or packet (in the context of routers for example) scheduling.
§1.2.2.3 Servicing Output Port Queues to Aid Traffic Management and Congestion Goals
As mentioned above, TCP/IP internets and ATM networks (as well as other types of networks) may manage queues at output ports of routers or switches to facilitate QoS goals. Although various queuing disciplines are known to those skilled in the art, they are described here for the reader's convenience.
§1.2.2.3.1 FIFO Queue
Routers and switches have traditionally used first-in, first-out (or “FIFO”) output port queues. FIG. 11 illustrates a FIFO queue 1110 which services a number of flows 1130 destined for the same transmission medium server 1120. However, FIFO queues have some disadvantages (such as those introduced in §1.2.1.2.2 above). First, packets from higher priority flows or flows which are more delay sensitive receive no special treatment. Second, a “greedy” transmission source (i.e., one that does not back off when network congestion exists), can crowd out other connections. Finally, in the context of TCP/IP, shorter packets can become “stuck” behind longer packets. (Recall that in ATM, all packets are fixed sized (53 octets) cells.) Accordingly, a better queuing discipline is needed.
§1.2.2.3.2 Queues For Each “Flow”
Rather than providing a single queue 1110 for all flows 1130, a separate queue 1210 may be provided for each flow 1130, as shown in FIG. 12. Various ways of servicing these queues 1210 are discussed in §§1.2.2.3.2.1 through 1.2.2.3.2.4 below. The first two (2) methods of servicing queues (i.e., fair queuing and processor sharing) do not consider QoS parameters. Only the third and fourth methods (i.e., generalized processor sharing and weighted fair queuing) consider such QoS parameters.
§1.2.2.3.2.1 Fair Queuing
In the “fair queuing” technique, multiple queues 1210 (e.g., one per source or flow) are provided at each output port as shown in FIG. 12. These queues are serviced in a round robin manner. Thus, with the fair queuing technique, the problem of “greedy” connections crowding out other connections is solved. However, in the context of TCP/IP (or any other protocol that does not fix the size of packets), shorter packets are penalized. That is, in terms of the amount of data transmitted, flows having large packets will have much more data transmitted than flows having smaller packets. To reiterate, this method does not consider QoS parameters.
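The penalty on short packets can be seen in a small simulation. The sketch below assumes one whole packet is served from each non-empty queue per round; the function name and the packet lengths are hypothetical illustration values.

```python
from collections import deque


def round_robin_bytes(queues, rounds):
    """Serve one packet per non-empty queue per round; return bytes sent per flow."""
    pending = {flow: deque(lengths) for flow, lengths in queues.items()}
    sent = {flow: 0 for flow in queues}
    for _ in range(rounds):
        for flow, queue in pending.items():
            if queue:
                sent[flow] += queue.popleft()
    return sent


# Flow "A" sends 1500-byte packets, flow "B" sends 100-byte packets.
sent = round_robin_bytes({"A": [1500] * 10, "B": [100] * 10}, rounds=5)
print(sent)
```

Both flows get five packets served, yet flow "A" has fifteen times as many bytes transmitted as flow "B".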
§1.2.2.3.2.2 Processor Sharing
Like the fair queuing method discussed in §1.2.2.3.2.1 above, bit round robin fair queuing (or “BRFQ”) considers flow ID (recall, e.g., fields 310 and 330) when queuing packets. However, BRFQ also considers packet length. In the ideal case, referred to as processor sharing, multiple queues would be serviced, round robin, where only one bit is taken from each queue per round. Naturally, since packets may have various sizes in the TCP/IP protocol, this ideal case cannot be performed. The BRFQ method approximates processor sharing by determining a virtual time, which records the rate of service seen by a packet at the head of a queue. The virtual time $v(t)$ is defined as the number of rounds that have occurred up to time $t$, normalized to the output data rate. The rate of the virtual time, $v'(t)$, may be expressed as:
$$v'(t) = \frac{\partial v(t)}{\partial t} = \frac{1}{\max[1, N(t)]} \tag{1}$$
where $N(t)$ ≡ the number of non-empty queues at time $t$.
When a kth packet arrives at a queue for flow i at time $a_i^k$, it is stamped with a “virtual finish time” or “time stamp” ($F_i^k$), which may be expressed as:
$$F_i^k = S_i^k + P_i^k \tag{2}$$
where $S_i^k$ is referred to as the “virtual starting time” or “starting potential”; and $P_i^k$ ≡ the transmission time for the kth packet in queue i, normalized to the output data rate.
The “virtual starting time” or “starting potential” $S_i^k$ may be expressed as:
$$S_i^k = \max[F_i^{k-1}, v(a_i^k)] \tag{3}$$
where $a_i^k$ ≡ the arrival time of the kth packet in queue i.
Using the foregoing equations, a packet's virtual finishing time (or time stamp) can be determined the moment it arrives at a queue i. However, in practice, a packet's virtual finishing time (or time stamp) is determined when the packet becomes a head-of-line packet. Under the BRFQ method, whenever a packet finishes transmission, the next packet sent is the one with the smallest value of $F_i^k$ (or time stamp or virtual finish time). It has been proven that throughput and average delay of each flow under BRFQ converges to processor sharing as time increases.
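Equations (2) and (3) and the selection rule can be sketched as follows. The function names are hypothetical, and the virtual time at arrival is simply passed in as a number; computing it from equation (1) is outside this sketch.

```python
def stamp_packet(previous_finish, virtual_time_at_arrival, transmission_time):
    """Return (starting potential, time stamp) for a newly arrived packet."""
    start = max(previous_finish, virtual_time_at_arrival)  # equation (3)
    finish = start + transmission_time                     # equation (2)
    return start, finish


def next_flow(head_of_line_stamps):
    """Pick the flow whose head-of-line packet has the smallest time stamp."""
    return min(head_of_line_stamps, key=head_of_line_stamps.get)


# Flow "A": first packet stamped at virtual time 0, second one queued behind it.
a1 = stamp_packet(previous_finish=0, virtual_time_at_arrival=0, transmission_time=5)
a2 = stamp_packet(previous_finish=a1[1], virtual_time_at_arrival=2, transmission_time=3)
# Flow "B": a short packet arriving to an empty queue at virtual time 2.
b1 = stamp_packet(previous_finish=0, virtual_time_at_arrival=2, transmission_time=1)

print(a1, a2, b1)
print(next_flow({"A": a1[1], "B": b1[1]}))
```

The short packet of flow "B" receives a smaller time stamp than the earlier, longer packet of flow "A", so it is sent first.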
§1.2.2.3.2.3 Generalized Processor Sharing
Recall the fair queuing and bit round robin fair queuing methods do not provide different amounts of capacity to different flow types. The generalized processor sharing method is generalized to bits, and does not consider the various packet sizes that may be present on a TCP/IP internetwork (or other networks supporting packets of various sizes). In the generalized processor sharing method, each flow i has a weight $\phi_i$ that determines a number of bits to be transmitted from the queue i during each round. Thus, equation 2 above becomes:
$$F_i^k = S_i^k + \frac{P_i^k}{\phi_i} \tag{4}$$
$S_i^k$ is determined as set forth in equation 3 above. A service rate $g_i$ for non-empty flow i can be defined as:
$$g_i = \frac{C\,\phi_i}{\sum_j \phi_j} \tag{5}$$
where $C$ ≡ the data rate of the outgoing link.
The generalized processor sharing method provides a way to guarantee that delays for a well behaved flow do not exceed a bound. In the “leaky bucket” traffic shaping model (discussed in §1.2.2.3.2.5 below), if the weight assigned to each flow is the token rate ($\phi_i = R_i$), then the maximum delay $D_i$ experienced by flow i is bounded by (i.e., is less than or equal to) $B_i/R_i$, where $B_i$ is the bucket size for flow i and $R_i$ is the token rate for flow i.
To summarize, the generalized processor sharing method permits different capacity to be assigned to different flows. However, it is generalized to bits, and does not consider packets which may have differing lengths.
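A worked instance of the service rate of equation (5) and of the bucket-size-over-token-rate delay bound is sketched below. The function names and numeric values are hypothetical, and the sum in equation (5) is assumed to run over the flows passed in (taken here to be the non-empty ones).

```python
def gps_service_rates(link_rate, weights):
    """Equation (5): share the link rate C in proportion to each flow's weight."""
    total_weight = sum(weights.values())
    return {flow: link_rate * w / total_weight for flow, w in weights.items()}


def gps_delay_bound(bucket_size, token_rate):
    """Delay bound B_i / R_i when the weight of flow i equals its token rate."""
    return bucket_size / token_rate


rates = gps_service_rates(link_rate=100.0, weights={"a": 1, "b": 3})
print(rates)
print(gps_delay_bound(bucket_size=8000, token_rate=2000))
```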
§1.2.2.3.2.4 Weighted Fair Queuing
Weighted fair queuing emulates the bit-by-bit generalized processor sharing (just as bit round robin fair queuing emulates processor sharing), but considers packets rather than bits. Under the weighted fair queuing method, whenever the transmission of a packet is finished, the next packet transmitted is the one with the smallest $F_i^k$ (or time stamp or virtual finish time). The weighted fair queuing method allows a router to set parameters to guarantee a given rate of service. The delay bound can be expressed as:
$$D_i = \frac{B_i}{R_i} + \frac{(K_i - 1)\,L_i}{R_i} + \sum_{m=1}^{K_i} \frac{L_{max}}{C_m} \tag{6}$$
where $K_i$ ≡ the number of nodes in the path through the internet for flow i;
$L_i$ ≡ the maximum packet size for flow i;
$L_{max}$ ≡ the maximum packet length for all flows through all nodes of the path of flow i; and
$C_m$ ≡ the data rate of the outgoing link at node m.
Thus, the weighted fair queuing method permits different capacity to be assigned to different flows and considers packets which may have differing lengths.
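The delay bound of equation (6) can be evaluated directly. The sketch below uses a hypothetical function name and made-up values; the number of nodes $K_i$ is taken to be the number of link rates supplied.

```python
def wfq_delay_bound(bucket_size, token_rate, max_packet_flow, max_packet_any, link_rates):
    """Evaluate equation (6) for one flow; link_rates holds C_m for each node m."""
    nodes = len(link_rates)  # K_i
    return (
        bucket_size / token_rate
        + (nodes - 1) * max_packet_flow / token_rate
        + sum(max_packet_any / rate for rate in link_rates)
    )


bound = wfq_delay_bound(
    bucket_size=8000,      # B_i, bits
    token_rate=1000,       # R_i, bits per second
    max_packet_flow=1000,  # L_i, bits
    max_packet_any=2000,   # L_max, bits
    link_rates=[4000, 8000],
)
print(bound)
```

With these values the three terms are 8.0, 1.0 and 0.75 seconds.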
§1.2.2.3.2.5 Challenges
To reiterate, using the weighted fair queuing method, different capacity can be assigned to different flows which may have packets of different lengths. A global function is used to compute a “virtual finishing time” or “time stamp” (that is, $F_i^k$) for each packet or each head-of-line packet for each queued flow. Basically, the “virtual finishing time” is the sum of its “virtual starting time” or “starting potential” (that is, $S_i^k$) and the time needed to transmit the packet at its reserved bandwidth. The queued head-of-line packets are served in the order of their time stamps.
Referring back to FIG. 12, when the number N of queued flows is relatively small, and/or the data rate is relatively low, known sorting or searching methods may be used to determine the head-of-line packet with the lowest time stamp. However, as the number N of queued flows increases (and higher data rates are used), these known methods become unsatisfactory. That is, one packet is selected per time interval, and as the line rate increases, the time interval decreases. For example, at a line rate of 155 Mbps, a 53 byte ATM cell occupies a 2.8 μs time slot.
Recently a worst-case fairness index (or “WFI”) has been introduced to measure how closely a packet-by-packet scheduler emulates the generalized processor sharing method. Shaper schedulers have been proposed to minimize WFI. In the shaper schedulers, all arriving packets are first linked in a shaper queue based on their starting potentials. Only packets whose starting potentials are less than or equal to the virtual time or system potential are deemed “eligible” to join the scheduler. In the schedulers, packets are transmitted as usual, by increasing order of their time stamps.
Basically, traffic shaping is used to smooth traffic flow, thereby reducing packet or cell “clumping”. Shaping may be implemented with a token bucket algorithm to control flow of cells. FIG. 13 illustrates a shaper 1300 employing the token bucket algorithm. Arriving packets are queued at packet queue 1320 having a capacity K. The server 1310 will accept the next packet only if a token is available from the token bucket 1330 (i.e., if the token bucket is not empty). The token bucket has a capacity B and is filled with tokens from a token generator 1340 at a predetermined rate. Thus, if a burst of packets arrives at a rate faster than the rate at which tokens are generated, once the token bucket is emptied, the packets at queue 1320 will be served by server 1310 at the predetermined rate at which tokens are generated by the token generator 1340.
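The behavior of the shaper of FIG. 13 can be sketched with a discrete-time model. The class and method names are hypothetical, and several details are assumptions made for illustration: one token is consumed per packet, the bucket starts full, time advances in ticks, and packets arriving at a full queue are rejected.

```python
from collections import deque


class TokenBucketShaper:
    """Discrete-time sketch of a token bucket shaper (one token per packet)."""

    def __init__(self, bucket_capacity, queue_capacity, tokens_per_tick=1):
        self.bucket_capacity = bucket_capacity  # B
        self.queue_capacity = queue_capacity    # K
        self.tokens_per_tick = tokens_per_tick  # token generator rate
        self.tokens = bucket_capacity           # assumption: bucket starts full
        self.queue = deque()

    def arrive(self, packet):
        """Queue an arriving packet; return False if the queue is full."""
        if len(self.queue) >= self.queue_capacity:
            return False
        self.queue.append(packet)
        return True

    def tick(self):
        """Serve packets while tokens remain, then add newly generated tokens."""
        served = []
        while self.queue and self.tokens > 0:
            self.tokens -= 1
            served.append(self.queue.popleft())
        self.tokens = min(self.bucket_capacity, self.tokens + self.tokens_per_tick)
        return served


shaper = TokenBucketShaper(bucket_capacity=3, queue_capacity=10)
for packet in range(6):  # a burst of six packets
    shaper.arrive(packet)
schedule = [shaper.tick() for _ in range(4)]
print(schedule)
```

The first three packets of the burst leave at once, using the stored tokens; the rest leave one per tick, at the token generation rate.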
Returning now to the challenge of scheduling the service of queued packets, a binary tree of comparators is a straightforward way to determine the next packet to be transmitted. Such a tree would have $\log_2 N$ levels, where N is the number of queued flows at the output port. Unfortunately, as alluded to above, such a search engine would be expensive to implement.
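A software analogue of such a comparator tree is sketched below, purely to illustrate the level count. The function name is hypothetical; each pass of the loop stands for one level of two-input comparators.

```python
def comparator_tree_min(time_stamps):
    """Return (index of the smallest time stamp, number of comparator levels used)."""
    if not time_stamps:
        raise ValueError("at least one queued flow is required")
    survivors = list(enumerate(time_stamps))  # (flow index, time stamp) pairs
    levels = 0
    while len(survivors) > 1:
        # One level of the tree: compare neighbours pairwise, keep the smaller.
        survivors = [
            min(survivors[i:i + 2], key=lambda entry: entry[1])
            for i in range(0, len(survivors), 2)
        ]
        levels += 1
    return survivors[0][0], levels


winner, levels = comparator_tree_min([7, 3, 9, 4, 8, 6, 5, 2])
print(winner, levels)
```

For eight flows the tree needs three levels; the hardware cost arises because the first level alone needs N/2 comparators.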
The article: H. J. Chao et al., “A VLSI Sequencer Chip for ATM Traffic Shaper and Queue Manager,” IEEE J. Solid State Circuits, Vol. 27, No. 11, pp. 1634-43 (November 1992) discusses an application specific integrated circuit (or “ASIC”) sequencer chip which facilitates a priority queue with a constant time complexity (that is, independent of the number of queued flows N at the output port). This sequencer is disclosed in U.S. Pat. No. 5,278,828 (incorporated herein by reference). However, each of these chips can only handle 256 sessions (or flows). For a practical application, there could be thousands of flows. In such applications, the number of required sequencer chips would simply be too large to be cost effective.
The article: A. Lyengar et al., “Switched Prioritized Packets,” Proc. IEEE GLOBECOM, pp. 1181-6 (November 1989) discusses a searching method where a number of timing queues are maintained for distinct time stamp values, thus defining a “calendar queue”. More specifically, the head-of-line packets from different queued flows that have the same time stamp value are linked together forming a timing queue. A priority queue then selects a packet with the smallest time stamp. Unfortunately, this method can become too slow when the number of distinct time stamp values is large.
For example, FIG. 14 is a block diagram which illustrates a “calendar queue” method. A number N (e.g., 1024) of packet queues 1410 are provided, one for each flow. The time stamps of the head-of-line packets are shown. In this example, the time stamps range from 1 to 16,000. A storage means 1420 is provided with a number of locations, one for each of the time slots. Each of the locations 1424 has a validity bit. If any head-of-line packet has the corresponding time stamp, the validity bit will be “1”; otherwise it will be “0”. Validity bits of “1” point to a linked list of flow queues having a head-of-line packet with a corresponding time stamp. For example, since the head-of-line packets of the flows at the 14th and 200th queues each have a time stamp of 10, a validity bit is set to “1” at the 10th location of the storage means 1420 and points to a linked list. Thus, the calendar queue searches through the validity bits 1422 of storage means 1420 for the first valid bit. The head-of-line packet at the flow queue associated with the first stored queue identifier of the linked list pointed to by the first valid bit will then be serviced. As mentioned above, the worst case search time is equal to the number of time stamps, which is 16,000 in this case.
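The calendar queue of FIG. 14 and its linear search can be sketched as follows. The function names are hypothetical, Python lists stand in for the linked lists, and index 0 of the storage is left unused so that a location number equals a time stamp.

```python
def build_calendar(head_time_stamps, max_time_stamp):
    """Return (validity bits, per-slot flow lists) indexed by time stamp."""
    validity = [0] * (max_time_stamp + 1)  # index 0 unused: stamps start at 1
    flow_lists = [[] for _ in range(max_time_stamp + 1)]
    for flow, stamp in head_time_stamps.items():
        validity[stamp] = 1
        flow_lists[stamp].append(flow)
    return validity, flow_lists


def first_valid_slot(validity):
    """Linear scan for the first validity bit set to 1; worst case visits every slot."""
    for slot, bit in enumerate(validity):
        if bit:
            return slot
    return None


# Flows 14 and 200 both have a head-of-line time stamp of 10; flow 3 has 42.
validity, flow_lists = build_calendar({14: 10, 200: 10, 3: 42}, max_time_stamp=16000)
slot = first_valid_slot(validity)
print(slot, flow_lists[slot])
```

The scan stops at location 10 here, but a single valid bit near location 16,000 would force almost every location to be examined.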
The article: H. J. Chao, et al., “Design of a Generalized Priority Queue Manager for ATM Switches,” IEEE J. Select. Areas in Commun., Vol. 14, No. 5, pp. 867-880 (June 1997) discusses a priority content addressable memory (or “PCAM”) ASIC which can search for the minimum time stamp at a very high speed, can accommodate any number of sessions and any size buffer, and resolves time stamp overflow (discussed below). However, a sizable on-chip memory requirement makes the PCAM too expensive to implement. It would be desirable to use off-chip memory.
Thus, a scheduling method(s) and apparatus are needed. They should (i) have a total time complexity independent of the number of sessions in the system and (ii) use commercial (off-chip) memory. The scheduling method(s) and apparatus may also advantageously perform traffic shaping (i.e., they should achieve minimum WFI). Finally, the scheduling/shaping methods and apparatus should handle potential overflows of values represented by a finite number of bits. The present invention provides methods and apparatus to meet these goals.
The present invention uses a hierarchical searching technique to find the first memory location of a calendar queue with a validity bit of “1” (that is, the lowest time stamp). A number M of bits at the lowest level of the hierarchy correspond to an array of validity bits. M is the largest time stamp. The M bits are grouped into groups of $g_{L-1}$ bits (where L is the number of levels in the hierarchy). The validity bits in each group are logically ORed, and the results are then concatenated to define a next level $L-2$ of bits. That next level of $M/g_{L-1}$ bits is further grouped into groups of $g_{L-2}$ bits. The process of grouping bits, ORing bits of a group, and concatenating the results is repeated until a resulting string of bits having a predetermined number of bits (e.g., a number of bits that can be placed in a register) is obtained. The number of bits in each group may be the same at each level, or may differ.
The number of bits at any level (l) can therefore be expressed as:
$$M_l = g_l \times M_{l-1}, \quad \text{where } M_{L-1} = M \tag{7}$$
The $M_l$ bit string at level l may be denoted as:
$$\langle b_0^l\, b_1^l \ldots b_{M_l-1}^l \rangle, \quad \text{where } b_i^l \in \{0,1\},\ i = 0, 1, \ldots, M_l - 1 \tag{8}$$
The $g_l$ bit string of the kth group may be denoted as:
$$\langle b_{k g_l}^l\, b_{k g_l+1}^l \ldots b_{(k+1) g_l-1}^l \rangle \tag{9}$$
Thus, with ⊕ denoting the logical OR of the bits of a group:
$$b_k^{l-1} = b_{k g_l}^l \oplus b_{k g_l+1}^l \oplus \cdots \oplus b_{(k+1) g_l-1}^l, \quad \text{where } k = 0, 1, \ldots, M_{l-1} - 1 \tag{10}$$
The bit string at any level l ($l \neq 0$) can be stored in a RAM of size $g_l \times M_{l-1}$. The string at the highest level in the hierarchy ($l = 0$) can be stored in an $M_0$ bit register. If $m = \log_2 M$, then an m-bit address to the M time stamps may be denoted as $\langle a_0 a_1 \ldots a_{m-1} \rangle$. Further, the address to locate any of the $M_l$ bits at level l may be denoted as $\langle a_0 a_1 \ldots a_{m_l-1} \rangle$, where $m_l = \log_2 M_l$. Thus, the number of address bits needed to address any bit at a level l may be expressed as:
$$m_l = \log_2 M_l = m_{l-1} + \log_2 g_l = \sum_{i=0}^{l} \log_2 g_i, \quad \text{where } g_0 = M_0. \tag{11}$$
Equation (11) illustrates a method of the present invention for addressing in a hierarchical search. That is, the $m_0$ most significant bits of the time stamp address should be used at level 0. Then, at level l, the complete address used at the upper level ($l-1$) will be used to locate the proper $g_l$ bit word in its $g_l \times M_{l-1}$ memory. A further $\log_2 g_l$ bits, following the previous $m_{l-1}$ bits, are extracted from the time stamp address and used to locate the proper bit in the $g_l$ bit word that has just been identified. In this way, the search time depends on the number L of levels. Thus, a scheduler based on the present invention can schedule large numbers of flows to be placed on a high speed data link (i.e., with a small time slot).
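The grouping, ORing and top-down addressing described above can be sketched in software as follows. This is only an illustration under stated assumptions: the function names are hypothetical, Python lists stand in for the register and RAMs, and `group_sizes` lists the group size of each level below the top, from level 1 down to the lowest level.

```python
def build_levels(validity_bits, group_sizes):
    """Return the bit strings of all levels, top level first, lowest level last."""
    levels = [list(validity_bits)]
    for g in reversed(group_sizes):  # start grouping at the lowest level
        lower = levels[0]
        if len(lower) % g:
            raise ValueError("level length must be a multiple of its group size")
        # OR the bits of each group and concatenate the results.
        upper = [int(any(lower[k * g:(k + 1) * g])) for k in range(len(lower) // g)]
        levels.insert(0, upper)
    return levels


def find_first_valid(levels, group_sizes):
    """Locate the first validity bit set to 1 using one word lookup per level."""
    top = levels[0]
    if 1 not in top:
        return None
    address = top.index(1)  # position within the top-level register
    for level, g in zip(levels[1:], group_sizes):
        word = level[address * g:(address + 1) * g]  # word selected by upper address
        address = address * g + word.index(1)        # append the next address bits
    return address


bits = [0] * 16
bits[9] = bits[13] = 1
levels = build_levels(bits, group_sizes=[2, 4])
print(levels[0], levels[1])
print(find_first_valid(levels, group_sizes=[2, 4]))
```

In this example sixteen validity bits are searched with three word lookups (a 2-bit register, a 2-bit word, and a 4-bit word) rather than a scan of up to sixteen locations.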
The present invention may also provide a shaper to minimize the worst-case fairness index (or “WFI”). More specifically, a shaper minimizes the burstiness of the output packet stream from the scheduler. In the shaper-schedulers, all arriving packets are first linked in a shaper queue based on their starting potentials. Only packets whose starting potentials (S) are less than or equal to a system potential (v(t)) are deemed “eligible” to join the scheduler. That is, a packet is eligible if:
$$S_i^k \le v(t) \qquad (12)$$
In the schedulers, packets are transmitted as usual, in increasing order of their time stamps.
To alleviate the complexity of transferring multiple eligible packets from a shaper queue to a scheduler in a short period of time, the shaper queue is implemented as a multitude of priority lists. Each priority list is associated with a distinct value of starting potential S common to all queued packets in the list. Thus, a two-dimensional calendar queue can be constructed based on the starting potential S of the queued packets. W is the maximum value of S. In the calendar queue, all packets with the same starting potential are placed in the same column addressed by the value of S. Further, in each of the columns, the packets are sorted according to their time stamps F. As with the calendar queue of the scheduler of the present invention, if the validity bit is “1”, a linked list of flow queues having head-of-line packets with virtual time stamps corresponding to the virtual finish time F (and the same starting potential) is present.
Every validity bit, or V-bit, in a column can be located by its unique address (S,F). However, it has not proven feasible to implement a large number of priority lists (large W). The hierarchical searching method and RAM-based architecture of the present invention are therefore extended to the shaper queue.
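The two-dimensional organization of the shaper queue may be sketched in Python as follows. The sketch keeps one priority list per starting potential S, ordered by time stamp F, and selects the eligible packet (S ≤ v(t)) having the smallest F. The class name `ShaperQueue`, its method names, and the flow identifiers are illustrative assumptions. In particular, the linear scan over columns merely stands in for the hierarchical, RAM-based search and is not the architecture of the invention.

```python
import heapq


class ShaperQueue:
    """Hypothetical two-dimensional calendar queue addressed by (S, F)."""

    def __init__(self):
        # One column per starting potential S; each column is a heap of
        # (F, flow_id) pairs, so the smallest time stamp F is at the front.
        self.columns = {}

    def enqueue(self, start, finish, flow_id):
        heapq.heappush(self.columns.setdefault(start, []), (finish, flow_id))

    def pop_next(self, system_potential):
        """Remove and return (S, F, flow_id) for the eligible packet with the
        smallest F, or None if no column satisfies S <= v(t)."""
        best_start = None
        for start, column in self.columns.items():
            if start > system_potential:
                continue                       # column not yet eligible
            if best_start is None or column[0][0] < self.columns[best_start][0][0]:
                best_start = start
        if best_start is None:
            return None
        finish, flow_id = heapq.heappop(self.columns[best_start])
        if not self.columns[best_start]:
            del self.columns[best_start]       # clear the emptied column
        return best_start, finish, flow_id


queue = ShaperQueue()
queue.enqueue(start=2, finish=9, flow_id="a")
queue.enqueue(start=5, finish=6, flow_id="b")
queue.enqueue(start=2, finish=7, flow_id="c")
first = queue.pop_next(system_potential=3)     # only the S = 2 column is eligible
```

In this example the packet of flow "b" has the smallest time stamp but is not served while v(t)=3, because its starting potential S=5 makes it ineligible under condition (12).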
Finally, the present invention provides techniques for addressing a time stamp aging problem. In any scheduler, when a kth packet of session i is served (i.e., transmitted), the time stamp F_i^k may be stored in a look-up table for later use (as F_i^{k−1}). The look-up table can be placed in memory for supporting a large number (N) of sessions (or flows), with the entry of F_i^k addressed by i (where i=0,1, . . . ,N−1). Besides the time stamp F_i^k, other information related to session (or flow) i can also be stored at (or pointed to from) the same location. Later, when a new packet k of the session (or flow) i arrives at the head of the session queue, and thus becomes the head-of-line (or “HOL”) packet, the stored time stamp F_i^{k−1} is needed so that it may be compared with the system potential v(a_i^k) for determining a new starting potential S_i^k for the kth packet as discussed above.
A potential time stamp aging problem exists when updating the starting potential S_i^k. Recall from equation (3) that a component of the starting potential S_i^k is the larger of the virtual finish time (or time stamp) of the last sent packet (F_i^{k−1}) and the system potential v(a_i^k). Since the system potential v(a_i^k) is represented by a finite number of bits in practice, it can “overflow”. Given the possibility of system potential “overflow”, it is impossible to decide, with certainty, which of the finish time potential (or time stamp) of the previous (k−1)th packet F_i^{k−1} or the system potential v(a_i^k) is greater without any previous history or certain constraints.
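The ambiguity may be illustrated with a short Python sketch. The 8-bit width and the numeric values are assumptions chosen only to make the overflow visible, and the function names are hypothetical.

```python
def stored_pair(finish, potential, bits=8):
    """Return (F, v) as they would appear when held in `bits`-bit fields."""
    mask = (1 << bits) - 1
    return finish & mask, potential & mask


def potential_exceeds(finish, potential):
    """True ordering, computed on the unbounded values."""
    return potential > finish


# Case 1: the system potential has advanced past the time stamp and overflowed.
wrapped_case = (250, 260)
# Case 2: the system potential is still well behind the time stamp.
plain_case = (250, 4)

same_stored_bits = stored_pair(*wrapped_case) == stored_pair(*plain_case)
orders_differ = potential_exceeds(*wrapped_case) != potential_exceeds(*plain_case)
```

Both cases leave the same bit patterns (250 and 4) in the stored fields, yet the true ordering of the time stamp and the system potential is opposite in the two cases. The stored values alone therefore cannot settle the comparison.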
In accordance with the present invention, a previous time stamp F_i^{k−1} may be considered to be obsolete if the system potential v(a_i^k) exceeds it. That is, once the system potential v(a_i^k) is larger than F_i^{k−1}, it will remain so. (Naturally, updating will occur when the next packet of the ith session or flow is served.) In the present invention, a number of bits can be used to record (i) a number of overflow events of the system potential v(a_i^k), and (ii) a time zone where the system potential v(a_i^k) and the stored finish potential F_i^{k−1}, respectively, belong. A purging means may be used to purge all stored time stamps F_i^{k−1} that have become obsolete. The purging means should run fast enough to check each of the stored time stamps and purge all obsolete ones before the history of the system potential v(a_i^k) overflows due to its representation by a finite number of bits.
Each purging operation requires one, and perhaps two, memory accesses. The first is to read the time stamp F_i^{k−1} of the last departed packet. If that time stamp F_i^{k−1} is obsolete (i.e., less than the current system potential v(a_i^k)), the second memory access is a write operation to mark the time stamp as obsolete. Due to the limited speed of memory accesses, it might not be possible to complete all N purging operations during a time slot, particularly when N is large (i.e., it might take a number of time slots to perform all N purging operations). Accordingly, the present invention may track any time stamp or system potential overflow while all purging operations are performed. For example, in the present invention, a first counter variable C_v(t) may be used to track system potential overflow, while another counter variable C_i may be used to track time stamp (or virtual finish time) overflow.
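The read-then-conditionally-write purging operation, spread over several time slots, may be sketched in Python as follows. The class name `Purger`, the per-slot budget `per_slot`, and the use of `None` as the obsolete mark are assumptions made for illustration. The sketch compares unbounded integers and deliberately does not model the overflow counters C_v(t) and C_i.

```python
class Purger:
    """Hypothetical round-robin purger over the N stored time stamps."""

    def __init__(self, timestamps, per_slot):
        self.timestamps = list(timestamps)  # entry i holds F_i^{k-1}, or None if obsolete
        self.per_slot = per_slot            # purging operations that fit in one time slot
        self.next_index = 0                 # session at which the next slot resumes
        self.accesses = 0                   # memory accesses performed so far

    def run_slot(self, system_potential):
        """Perform up to per_slot purging operations during one time slot."""
        n = len(self.timestamps)
        for _ in range(min(self.per_slot, n)):
            i = self.next_index
            stamp = self.timestamps[i]      # first access: read the time stamp
            self.accesses += 1
            if stamp is not None and stamp < system_potential:
                self.timestamps[i] = None   # second access: mark it obsolete
                self.accesses += 1
            self.next_index = (i + 1) % n


purger = Purger(timestamps=[3, 10, 5, 8], per_slot=2)
purger.run_slot(system_potential=6)         # visits sessions 0 and 1
purger.run_slot(system_potential=9)         # visits sessions 2 and 3
```

With N=4 and two operations per slot, a full pass over the table takes two time slots, during which the system potential keeps advancing. This is why overflow must be tracked while the purging operations are in progress.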