The invention generally relates to a method and system for buffering data packets at a queuing point in a digital communications device such as a network node. More particularly the invention relates to a system for achieving a fair distribution of buffer space between adaptive flows of traffic, the sources of which decrease their transmission rate in response to congestion notification, and non-adaptive flows of traffic, the sources of which do not alter their transmission rate in response to congestion notification.
In order to effect statistical multiplexing in a store and forward digital communications device, such devices will typically queue data packets for subsequent processing or transmission in a common storage resource such as a memory buffer. At such a gateway or queuing point, the common storage resource may be shared by traffic flows associated with various classes of service, interface ports, or some other common attributes which define an aggregation of the most granular traffic flows. With traffic of such a multi-faceted nature, sophisticated communication devices need some type of congestion control system in order to ensure that the common storage resource is “fairly” allocated amongst the various traffic flows.
For example, in an Internet router, the transport level protocol may be some form of TCP (Transmission Control Protocol) or UDP (User Datagram Protocol). The datagrams or packets of such protocols are somewhat different and hence may be used to define different traffic flows. Within each of these protocols the packets may be associated with one of several possible classes or qualities of service which may further define the traffic flows at a hierarchically lower level of aggregation or higher level of granularity. (A number of quality of service schemes for the Internet are currently being proposed by various standard-setting bodies and other organizations, including the Integrated Services/RSVP model, the Differentiated Services (DS) model, and Multi-Protocol Label Switching (MPLS), and the reader is referred to Xiao and Ni, Internet QoS: A Big Picture, Department of Computer Science, Michigan State University, <http://www.cse.msu.edu/~xiaoxipe/researchLink.html>, Sep. 9, 1999, for an overview of these schemes.) Still more granular traffic flows may be defined by packets which share some common attributes such as originating from a particular source and/or addressed to a particular destination, including at the most granular levels packets associated with a particular application transmitted between two end-users.
In an IP router the memory buffer at any given gateway or queuing point may be organized into a plural number of queues which may, for example, hold packets in aggregate for one of the classes of service. Alternatively, each queue may be dedicated to a more granular traffic flow. Regardless of the queuing structure, when the memory buffer becomes congested, it is often desirable to apportion its use amongst traffic flows in order to ensure the fair distribution of the buffer space. The distribution may be desired to be effected at one or more different levels of aggregation, such as memory partitionment between interface ports, and between classes of service associated with any given interface port.
One commonly implemented buffer management scheme designed to minimize buffer congestion of TCP/IP flows is the Random Early Detection (RED) algorithm. Under RED, packets are randomly dropped in order to cause different traffic flow sources to reduce their transmission rates at different times. This prevents buffers from overflowing and causing packets to be dropped simultaneously from multiple sources. Such behaviour, if unchecked, leads to multiple TCP sources simultaneously lowering and then increasing their transmission rates, which can cause serious oscillations in the utilization of the network and significantly impede its performance. RED also avoids a bias against bursty traffic since, during congestion, the probability of dropping a packet for a particular flow is roughly proportional to that flow's share of the bandwidth. For further details concerning RED, see Floyd and Jacobson, Random Early Detection Gateways for Congestion Avoidance, 1993 IEEE/ACM Transactions on Networking.
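As a rough illustration of the RED behaviour described above, the following Python sketch derives a drop decision from an exponentially weighted average of the queue length, following the general form given by Floyd and Jacobson. The class name, the parameter names (`min_th`, `max_th`, `max_p`, `weight`) and their default values are illustrative assumptions, and refinements from the paper such as spacing drops by a packet count and adjusting the average after idle periods are omitted.

```python
import random


class RedQueue:
    """Simplified RED drop decision (illustrative parameters, not tuned values)."""

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1, weight=0.002, rng=None):
        self.min_th = min_th      # below this average, never drop
        self.max_th = max_th      # at or above this average, always drop
        self.max_p = max_p        # drop probability reached just below max_th
        self.weight = weight      # EWMA weight applied to the newest sample
        self.avg = 0.0            # running average queue length
        self.rng = rng or random.Random()

    def update_average(self, queue_len):
        # Exponentially weighted moving average of the instantaneous queue length.
        self.avg = (1.0 - self.weight) * self.avg + self.weight * queue_len
        return self.avg

    def drop_probability(self):
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        # Linear ramp from 0 to max_p between the two thresholds.
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def should_drop(self, queue_len):
        self.update_average(queue_len)
        return self.rng.random() < self.drop_probability()
```

Because the decision depends on the averaged rather than the instantaneous queue length, short bursts pass through while sustained congestion raises the drop probability gradually.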
However, it has been shown that RED does not always fairly allocate buffer space or bandwidth amongst traffic flows. This is caused by the fact that at any given time RED imposes the same loss rate on all flows, regardless of their bandwidths. Thus, RED may accidentally drop packets from the same connection, causing temporary non-uniform dropping among identical flows. In addition, RED does not fairly allocate bandwidth when a mixture of non-adaptive and adaptive flows such as UDP and TCP flows share link resources. TCP is an adaptive flow because the packet transmission rate for any given flow depends on its congestion window size which in turn varies markedly with packet loss (as identified by non-receipt of a corresponding acknowledgement within a predetermined time-out period). UDP flows are non-adaptive because their packet transmission rates are independent of loss rate. Thus, unless UDP sources are controlled through a fair discard mechanism, they compete unfairly with TCP sources for buffer space and bandwidth. See more particularly Lin and Morris, Dynamics of Random Early Detection, Proceedings of SIGCOMM '97.
A variant of the RED algorithm that has been proposed to overcome these problems is the Flow Random Early Drop (FRED) algorithm introduced by Lin and Morris, supra. However, one drawback of FRED is the large number of state variables that need to be maintained in order to provide isolation between adaptive and non-adaptive flows. This can prove problematic for high capacity, high speed routers, and better solutions are sought.
In an asynchronous transfer mode (ATM) communication system, the most granular traffic flow (from the ATM perspective) is a virtual connection (VC) which may belong to one of a number of different types of quality of service categories. The ATM Forum Traffic Management working group has defined five (5) traffic classes or service categories, which are distinguished by the parameter sets which describe source behaviour and quality of service (QoS) guarantees. These categories include constant bit rate (CBR), real time variable bit rate (rtVBR), non-real time variable bit rate (nrtVBR), available bit rate (ABR), and unspecified bit rate (UBR) service categories. The ABR and UBR service categories are intended to carry data traffic which has no specific cell loss or delay guarantees. UBR service does not specify traffic related guarantees while ABR service attempts to provide a minimum useable bandwidth, designated as a minimum cell rate (MCR). The ATM Forum Traffic Management working group and International Telecommunications Union (ITU) have also proposed a new service category, referred to as guaranteed frame rate (GFR). GFR is intended to provide service similar to UBR but with a guaranteed minimum useable bandwidth at the frame or AAL packet level, which is mapped to the cell level by an MCR guarantee.
In an ATM device such as a network switch the memory buffer at any given queuing point may be organized into a plural number of queues which may hold data packets in aggregate for VCs associated with one of the service categories. Alternatively, each queue may be dedicated to a particular VC. Regardless of the queuing structure, each VC can be considered as a traffic flow and groups of VCs, spanning one or more queues, can also be considered as a traffic flow defined at a hierarchically higher level of aggregation or lower level of granularity. For instance, a group of VCs associated with a particular service class or input/output port may define a traffic flow. When the memory buffer becomes congested, it may be desirable to apportion its use amongst service categories, and amongst various traffic flows thereof at various levels of granularity. For instance, in a network where GFR and ABR connections are contending for buffer space, it may be desired to achieve a fair distribution of the memory buffer between these service categories and between the individual VCs thereof.
The problem of providing fair allocation of buffer space to adaptive and non-adaptive flows also exists in ATM systems. With the introduction of IP over ATM, VCs may carry one or more IP flows, where each IP flow can be adaptive or non-adaptive. Thus, some VCs may be adaptive in nature, others may be non-adaptive in nature, while still others may be mixed. A fair allocation of buffer space between such VCs is desired.
A number of prior art fair buffer allocation (FBA) schemes configured for ATM systems are known. One such scheme is to selectively discard packets based on policing. As an example of this scheme in an ATM environment, a packet (or more particularly, a “cell”, as a data packet is commonly referred to at the ATM layer) is tagged (i.e., its cell loss priority (CLP) field is set to one) if the corresponding connection exceeds its MCR, and when congestion occurs, packets having a CLP field set to one are discarded in preference to packets having a CLP field set to zero. See ATM Forum Technical Committee, “Traffic Management working group living list”, ATM Forum, btd-tm-01.02, July 1998. This scheme, however, fails to fairly distribute unused buffer space between connections.
Another known scheme is based on multiple buffer fill level thresholds where a shared buffer is partitioned with these thresholds. In this scheme, packet discard occurs when the queue occupancy crosses one of the thresholds and the connection has exceeded its fair share of the buffer. The fair buffer share of a connection is calculated based on the MCR value of the connection and the sum of the MCRs of all active connections utilizing the shared buffer. However, this technique does not provide an MCR proportional share of the buffer because idle (i.e., allocated but not used) buffer, which can be defined as

$$\sum_{i=1}^{N} \max\left(0,\ \frac{MCR_i}{\sum_{active} MCR}\,Q_s - Q_i\right),$$

where $Q_s$ is the buffer fill level, $Q_i$ is the buffer segment count for a connection $i$, and

$$\frac{MCR_i}{\sum_{active} MCR}\,Q_s$$

is the fair share of buffer allocated to the connection, is distributed at random between the connections.
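The idle-buffer expression above can be evaluated directly. The following Python sketch does so for a handful of connections; the function names and the sample MCR and occupancy figures are made up for illustration, and every listed connection is assumed to be active.

```python
def fair_shares(mcr, q_s):
    """MCR-proportional share of the buffer fill level q_s for each active connection."""
    total_mcr = sum(mcr)
    return [m / total_mcr * q_s for m in mcr]


def idle_buffer(mcr, occupancy, q_s):
    """Sum of max(0, fair share - Q_i): buffer allocated to connections but not used."""
    shares = fair_shares(mcr, q_s)
    return sum(max(0.0, share - q_i) for share, q_i in zip(shares, occupancy))


# Hypothetical example: three connections with MCRs in the ratio 1:1:2.
mcr = [1.0, 1.0, 2.0]
occupancy = [50.0, 150.0, 200.0]   # Q_i, buffer segments held by each connection
q_s = 400.0                        # buffer fill level Q_s

shares = fair_shares(mcr, q_s)
idle = idle_buffer(mcr, occupancy, q_s)
```

In this example the first connection uses only half of its share, so 50 segments are idle, while the second connection already exceeds its share by the same amount. Nothing in the threshold scheme directs the idle space to connections in proportion to their MCRs.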
Another scheme for fairly allocating buffer space through selective discard is based on dynamic per-VC thresholds. See Choudhury, A. K., and Hahne, E. L., “Dynamic Queue Length Thresholds in a Shared Memory ATM Switch”, Proceedings of I.E.E.E. Infocom 96, March 1996, pages 679 to 686. In this scheme the threshold associated with each VC is periodically updated based on the unused buffer space and the MCR value of a connection. Packet discard occurs when the VC occupancy is greater than the VC threshold. This method reserves buffer space to prevent overflows. The amount of reserved buffer space depends on the number of active connections. When there is only one active connection, the buffer is not fully utilized, i.e., full buffer sharing is not allowed.
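The under-utilization with a single active connection can be seen from a short worked example. It assumes, for illustration only, that the per-VC threshold $T$ is a fixed multiple $\alpha$ of the unused buffer space, with any MCR weighting folded into $\alpha$; $B$ denotes the total buffer size and $Q$ the occupancy of the lone connection. These symbols are introduced here and are not taken from the cited reference.

```latex
T = \alpha\,(B - Q), \qquad Q = T \ \text{at steady state}
\;\Longrightarrow\; Q = \frac{\alpha}{1+\alpha}\,B .
```

With $\alpha = 1$, for instance, the lone connection settles at $Q = B/2$, leaving half of the buffer held in reserve.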
In conclusion, some of the above-mentioned prior art does not fairly distribute buffer space or idle buffer space between traffic flows. Other prior art buffer management schemes also do not allow for full buffer sharing. Another drawback with some prior art buffer management schemes is that they do not address the allocation of buffer space to contending traffic flows defined at multiple levels of aggregation/granularity. The invention seeks to overcome or alleviate some or all of these and other prior art limitations.
In what follows, unless the context dictates otherwise, the term “traffic flow” refers to the most-granular flow of packets defined in a buffer management system. Designers may use their discretion to define the most-granular flow. The term “traffic flow set” refers to an aggregation or grouping of the most-granular traffic flows. In the context of the present invention, a traffic flow set may also consist of a single traffic flow. Thus a traffic flow set as understood herein comprises one or more traffic flows.
Broadly speaking, one aspect of the invention relates to a method of processing packets at a queuing point in a communications device having a shared memory buffer. The method includes receiving and associating packets with one of a plurality of traffic flow sets. These sets are defined so as to logically contain either adaptive traffic flows or non-adaptive traffic flows, but not both. Each traffic flow set is associated with a target memory occupancy size which is dynamically computed in accordance with a pre-determined dynamic fair buffer allocation scheme, such as a preferred recursive fair buffer allocation method described herein. When any one of the traffic flow sets is in a congested state, packets associated therewith are discarded. Congestion is preferably deemed to occur when the actual memory occupancy size of a given traffic flow set reaches the target occupancy size thereof. In addition, packets are randomly discarded for at least the traffic flow sets containing adaptive traffic flows, or alternatively all traffic flow sets, prior to the sets becoming congested. The probability of packet discard within a given traffic flow set is related to the target memory occupancy size thereof. This is preferably subject to the constraint that the probability of packet discard for a given traffic flow set is zero if the target memory occupancy size thereof is below a threshold value (indicative of a relatively non-congested buffer), and reaches one when the given traffic flow set is congested.
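The two discard mechanisms described above, discard on congestion and random discard before congestion, can be sketched in Python as follows. The linear ramp, the `onset_fraction` parameter and the function names are hypothetical; the ramp is driven by the ratio of actual to target occupancy, which is only one of many ways of relating the discard probability to the target size and is not prescribed by the text.

```python
import random


def discard_probability(occupancy, target, onset_fraction=0.5):
    """Hypothetical ramp: 0 up to onset_fraction * target, rising to 1 at the target."""
    if target <= 0:
        return 1.0
    onset = onset_fraction * target
    if occupancy <= onset:
        return 0.0
    if occupancy >= target:
        return 1.0
    return (occupancy - onset) / (target - onset)


def should_discard(occupancy, target, is_adaptive, rng=random, apply_to_all=False):
    """Decide whether to discard an arriving packet of a given traffic flow set."""
    # Congested state: actual occupancy has reached the target occupancy size.
    if occupancy >= target:
        return True
    # Random early discard for adaptive sets (or all sets, if so configured).
    if is_adaptive or apply_to_all:
        return rng.random() < discard_probability(occupancy, target)
    return False
```

Because `target` is recomputed dynamically by the buffer allocation scheme, the same occupancy can yield different discard probabilities at different times.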
The foregoing enables a buffering system operating in accordance with the method to obtain the benefits of random early detection or random early discard since sources of traffic are randomly notified of impending congestion, thereby preventing serious oscillations of network utilization. Some of the drawbacks of the prior art are also avoided since the method ensures that no sources, especially non-adaptive traffic flow sources, consume excessive buffer space due to the fluctuating transmission rates of the adaptive traffic flows. This is due to the logical isolation between adaptive and non-adaptive traffic flows and the fair discard policy enforced by the buffer allocation scheme. Furthermore, unlike the prior art, the probability of packet discard is not static but rather dynamic, in that it is based on the dynamic target occupancy size. This enables the buffer to be utilized to the maximum extent possible under the selected fair buffer allocation scheme.
Potential fair buffer allocation schemes which can be employed by the method include those schemes described in:
Choudhury and Hahne, “Dynamic Queue Length Thresholds in a Shared Memory ATM Switch”, © 1996 IEEE, Ref. No. 0743-166X/96; and
both of which are incorporated herein by reference.
In various embodiments described herein the method employs a novel fair buffer allocation scheme disclosed in applicant's co-pending patent application, U.S. Ser. No. 09/320,471 filed May 27, 1999, which is also described in detail herein. In this scheme the memory buffer is controlled by defining a hierarchy of memory partitions, including at least a top level and a bottom level, wherein each non-bottom level memory partition consists of one or more child memory partitions. The size of each top-level memory partition is pre-determined, and a nominal partition size for the child partitions of a given non-bottom level memory partition is dynamically computed based on the congestion of the given memory partition. The size of each child memory partition is dynamically computed as a weighted amount of its nominal partition size. These steps are iterated in order to dynamically determine the size of each memory partition at each level of the hierarchy. The memory partitions at the bottom-most level of the hierarchy represent space allocated to the most granular traffic flows defined in the system, and the size of each bottom-level partition represents a memory occupancy threshold for such traffic flows.
The memory partitions are preferably “soft” as opposed to “hard” partitions in that if the memory space occupied by packets associated with a given partition exceeds the size of the partition then incoming packets associated with that partition are not automatically discarded. In the embodiments described herein, each memory partition represents buffer space allocated to a set of traffic flows defined at a particular level of granularity. For instance, a third level memory partition may be provisioned in respect of all packets associated with a particular egress port, and a more granular second level memory partition may be associated with a subset of those packets which belong to a particular class of service. Therefore, the size of a given partition can be viewed as a target memory occupancy size for the set of traffic flows corresponding to the given partition. At the lowest level of the hierarchy, however, the partition size functions as a threshold on the amount of memory that may be occupied by the most granular traffic flow defined in the system. When this threshold is exceeded, packet discard is enabled. In this manner, aggregate congestion at higher levels percolates down through the hierarchy to affect the memory occupancy thresholds of the most granular traffic flows. The net result is a fair distribution of buffer space between traffic flow sets defined at each hierarchical level of aggregation or granularity.
In the illustrative embodiments, one or more of the memory partitions at any given hierarchical level is allocated to adaptive traffic flows and non-adaptive traffic flows. Packets associated with memory partitions at a pre-determined hierarchical level are randomly discarded prior to those partitions becoming congested, with the probability of discard being related to the size thereof.
Another aspect of the invention relates to a method of buffering data packets. The method involves:
(a) defining a hierarchy of traffic flow sets, the hierarchy including at least a top level and a bottom level, wherein each non-bottom level traffic flow set comprises one or more child traffic flow subsets and wherein at one non-bottom hierarchical level each set within a group of traffic flow sets comprises either adaptive flows or non-adaptive flows (but not both);
(b) provisioning a target memory occupancy size for each top-level traffic flow set;
(c) dynamically determining a target memory occupancy size for each traffic flow set having a parent traffic flow set based on a congestion measure of the parent traffic flow set;
(d) measuring the actual amount of memory occupied by the packets associated with each bottom level traffic flow;
(e) enabling the discard of packets associated with a given bottom level traffic flow set in the event the actual memory occupancy size of the corresponding bottom level traffic flow exceeds the target memory occupancy size thereof thereby to relieve congestion; and
(f) enabling packets associated with the traffic flow sets containing adaptive flows to be randomly discarded prior to the step of discarding packets for congestion relief.
In the embodiments described herein, the target memory occupancy size for a given traffic flow set is preferably computed by first computing a nominal target occupancy size for the child traffic flow sets of a common parent. The target memory occupancy size for each such child traffic flow set is then adjusted to a weighted amount of the nominal target occupancy size. The nominal target occupancy size for a given group of child traffic flow sets preferably changes in accordance with a pre-specified function in response to the congestion of their common parent traffic flow set. In some of the embodiments described herein, congestion is defined as a disparity between the target and measured memory occupancy sizes of a parent traffic flow set, and geometric and decaying exponential functions are deployed for computing the nominal target occupancy size for the child sets thereof.
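The recursive computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the scheme of the embodiments: the `FlowSet` class, the `decay` and `growth` factors, and the simplified congestion test (measured occupancy exceeding the target) are all hypothetical, and only a geometric update of the nominal size is shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlowSet:
    name: str
    weight: float = 1.0       # share of the parent's nominal size
    occupancy: float = 0.0    # measured occupancy (meaningful for bottom-level sets)
    target: float = 0.0       # target memory occupancy size
    nominal: float = 0.0      # nominal target size offered to the children
    children: List["FlowSet"] = field(default_factory=list)

    def measured(self) -> float:
        """Actual occupancy: own count at the bottom level, else the sum over children."""
        if not self.children:
            return self.occupancy
        return sum(child.measured() for child in self.children)


def update_targets(node: FlowSet, decay: float = 0.5, growth: float = 2.0) -> None:
    """One background pass: shrink or grow the nominal size, then size each child."""
    if not node.children:
        return
    if node.nominal <= 0.0:
        node.nominal = node.target            # start from the parent's own target
    if node.measured() > node.target:         # parent congested: geometric decrease
        node.nominal *= decay
    else:                                     # parent uncongested: geometric recovery
        node.nominal = min(node.target, node.nominal * growth)
    for child in node.children:
        child.target = child.weight * node.nominal
        update_targets(child, decay, growth)


def discard_enabled(leaf: FlowSet) -> bool:
    """Bottom-level threshold test corresponding to step (e)."""
    return leaf.occupancy > leaf.target


# Hypothetical two-level hierarchy: one provisioned buffer shared by two flows.
tcp = FlowSet("tcp", occupancy=300.0)
udp = FlowSet("udp", occupancy=800.0)
root = FlowSet("buffer", target=1000.0, children=[tcp, udp])
update_targets(root)
```

After this pass the buffer as a whole is over its target, so both children are held to a nominal size of 500 and only the heavier flow has discard enabled. Once the aggregate occupancy falls back under the target, the nominal size recovers towards the full provisioned size, so a lone active flow is not prevented from using the entire buffer.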
The invention may be implemented within the context of an ATM communications system as disclosed herein. In these embodiments, the comparison specified in step (e) is preferably carried out prior to or upon reception of the first cell of an ATM adaptation layer (AAL) frame or packet in order to effect early packet discard in accordance with the outcome of the comparison.
In various embodiments disclosed herein, the bottom-level traffic flow sets are logically isolated so as to encompass either adaptive flows or non-adaptive flows, but not both. Random early discard is applied as discussed in greater detail below to at least the traffic flow sets at a pre-selected hierarchical level which contain adaptive flows, such as VCs which carry TCP/IP traffic. Alternatively, random early discard may be applied to all traffic flow sets at a pre-selected hierarchical level. This may be desired if, for instance, it is not known a priori which VC will be carrying TCP/IP traffic and which will be carrying UDP traffic. In either case, the probability of discard is preferably related to the target memory occupancy size of the traffic flow sets at the pre-selected hierarchical level.
The buffering system according to this aspect of the invention scales well to large systems employing many hierarchical levels. This is because there are relatively few state variables associated with each hierarchical level. In addition, most computations may be performed in the background and lookup tables may be used, thereby minimizing processing requirements on time critical packet arrival. This system also enables full buffer sharing, as discussed by way of an example in greater detail below.