We start by presenting, as background information, some basic concepts to facilitate the understanding of the numerous monitoring and policing techniques that are presented afterwards.
Packets
A data sender usually splits data to be sent into small units known as packets. Each packet consists of a header and a payload carrying the data to be delivered. The header contains fields defined by the relevant communication protocol. The great majority of packets carried by commercial networks nowadays are so-called IP packets. IP is the Internet Protocol. This ensures that a network of routers can forward any packet from the source to its destination. IP is a connectionless protocol—that means that the header information in each data packet is sufficiently self-contained for routers to deliver it independently of other packets; each packet could even take a different route to reach the destination.
Distributed Bandwidth Sharing and Congestion
Data traversing the Internet follows a path between a series of routers, controlled by various routing protocols. Each router seeks to move the packet closer to its final destination. If too much traffic traverses the same router in the network, the router can become congested and packets start to experience excessive delays whilst using that network path. Between routers, data also traverses switches and other networking equipment that may also become congested. Throughout the following description the term router congestion will be used to imply congestion of any network equipment, without loss of generality. If sources persist in sending traffic through that router it could become seriously overloaded and even drop traffic (when its buffers overflow). If sources still persist in sending traffic through this bottleneck it could force more routers to become congested, and if the phenomenon keeps spreading, it can lead to a congestion collapse for the whole Internet, as occurred regularly in the mid-1980s.
The solution to that problem has been to ensure that sources take responsibility for the rate at which they send data over the Internet by implementing congestion control mechanisms. Sources monitor feedback from the receiver of the metric that characterises path congestion in order to detect when the path their data is following is getting congested, in which case they react by reducing their throughput—while they may slowly increase their rate when there is no sign of the path becoming congested.
The typical path characterisation metrics that sources monitor are the average roundtrip time (RTT) for the data path, the variance of the roundtrip time (jitter) and the level of congestion on the path. Congestion is one of the parameters controlling rate adaptation of a source sending data over a congested path.
The congestion level can be signalled either implicitly (through congested routers dropping packets) or explicitly (through mechanisms such as explicit congestion notification, described below). Currently the most common option is implicit signalling.
Sources using TCP are able to detect losses, because a packet loss causes a gap in the sequence; whenever a TCP source detects a loss, it is meant to halve its data transmission rate, but no more than once per round trip time, which alleviates the congestion on the router at the bottleneck.
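The rate adaptation described above can be sketched as follows. This is a minimal illustration rather than TCP's actual implementation: the function name `next_rate`, the additive increment and the floor are assumptions made for the example, and the rate is updated once per round trip so that a loss can halve it at most once in that period.

```python
def next_rate(rate, loss_detected, increase=1.0, floor=1.0):
    """Return the sending rate to use for the next round trip.

    rate: current rate (arbitrary units, e.g. packets per round trip).
    loss_detected: True if at least one loss was seen in this round trip.
    increase, floor: illustrative parameters (assumptions, not from the text).
    """
    if loss_detected:
        # Halve once per round trip, however many losses occurred in it.
        return max(rate / 2.0, floor)
    # No sign of congestion: probe slowly for more capacity.
    return rate + increase


# Example: three round trips without loss around one round trip with loss.
rate = 8.0
history = []
for loss in [False, False, True, False]:
    rate = next_rate(rate, loss)
    history.append(rate)
```

In this example the rate climbs from 8 to 10, is halved to 5 when the loss is detected, and then resumes its slow increase.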
Random Early Detection (RED)
Historically, routers would drop packets only when they became completely saturated, which happens when a traffic burst cannot be accommodated in the buffer of the router; this policy is called drop-tail. Random early detection (RED) (discussed in reference “[RED]”, bibliographic details of which are given later) is an improvement whereby routers monitor the average queue length in their buffer and, when the average queue is higher than a given threshold, start to drop packets with a probability which increases with the excess length of the queue over the threshold (see FIG. 3). RED is widely used in today's Internet because it avoids all flows receiving congestion signals at the same time (termed synchronisation), which would otherwise cause oscillations. RED also allows sources to react more promptly to incipient congestion and it keeps queues from growing unnecessarily long.
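A minimal sketch of the RED drop decision is given below. The linear ramp between two thresholds, the averaging weight and all numeric values are assumptions chosen for illustration; the text above only requires that the drop probability increases with the excess of the average queue over the threshold.

```python
import random


def update_average(avg, queue_len, weight=0.002):
    """Exponentially weighted moving average of the queue length."""
    return (1.0 - weight) * avg + weight * queue_len


def drop_probability(avg, min_th=5.0, max_th=15.0, max_p=0.1):
    """Drop probability that grows with the excess of avg over min_th.

    min_th, max_th and max_p are illustrative values (assumptions).
    """
    if avg < min_th:
        return 0.0          # queue is short: never drop early
    if avg >= max_th:
        return 1.0          # queue is persistently long: drop everything
    return max_p * (avg - min_th) / (max_th - min_th)


def should_drop(avg, rng=random.random):
    """Randomised drop decision for one arriving packet."""
    return rng() < drop_probability(avg)
```

Because the decision is randomised per packet, different flows tend to receive their congestion signals at different times, which is the property that avoids synchronisation.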
Explicit Congestion Notification
Explicit Congestion Notification (ECN) (see reference “[RFC3168]”) further improves on RED by using a two-bit ECN field in the IP header to signal congestion. It runs the same algorithm as RED, but instead of dropping a packet, it sets its ECN field to the Congestion Experienced (CE) codepoint. The ECN standard requires the receiver to echo any congestion mark signalled in the data; for instance, a TCP receiver sets the Echo Congestion Experienced (ECE) flag in the TCP header, which the TCP source interprets as if the packet has been dropped for the purpose of its rate control. In turn the source then reacts to the congestion by halving its transmission rate.
ECN was originally defined for DECnet, the proprietary networking protocol developed by the Digital Equipment Corporation [DECbit]. As well as the idea being adopted in IP, it was also adopted in Frame Relay and ATM, but in these latter two protocols the network arranges feedback of the congestion signals internally, and the network enforces traffic limits to prevent congestion build-up (see [ITU-T Rec.I.371]).
The IEEE has standardised an explicit congestion approach where Ethernet switches, not the end systems, arrange to feed back the congestion signals, although the Ethernet device on the sending system is expected to co-operate by reducing its rate in response to the signals. The approach is tailored exclusively for homogeneous environments, such as data centres.
In the previously described approaches, each frame (or packet) carried just a binary flag and the strength of the congestion signal depended on the proportion of marked frames—effectively a unary encoding of the congestion signal in a stream of zeroes and ones. However, the IEEE scheme signals a multibit level of congestion in each feedback frame, hence its common name: quantised congestion notification or QCN (see [IEEE802.1Qau]).
Re-ECN
Re-ECN (see [re-ECN]) utilises a technique called re-feedback (discussed in [re-feedback] and in International application WO2005/096566) whereby packets indicate the congestion they are likely to experience on the rest of their path, not just the congestion already experienced, which is all that ECN indicates. It is similar to ECN but uses an extra, previously unused bit in the packet header. This bit is combined with the two-bit ECN field to create four extra codepoints, as discussed in International application WO2006/079845.
The simplest way to understand the protocol is to think of each packet as having a different “colour” flag (where different “colours” correspond to different codepoints). At the start of a flow, a “green” flag (“FNE”, meaning “Feedback Not Established”) is used to indicate that the sender does not have sufficient knowledge of the path. Green flags are also used when the sender is unsure about the current state of the path.
By default packets are marked with “grey” flags. If they encounter congestion during their progress through the network the ECN marking applied by the congested router will be termed a “red” flag. The destination will feed back a count of the number of red flags it has seen. For every red flag it is informed of, the sender should mark an equivalent number of bytes it sends in a subsequent packet or packets with a “black” flag. The black flag re-echoes or reinserts the congestion feedback back into the forward-travelling stream of packets, hence the name “re-ECN”. These black flags may not be modified once they have been sent by the sender. There is a small possibility that a black packet will in turn be marked red by a congested router, but the codepoints are chosen so that it is still possible to tell the packet was originally marked as black—such packets are described as coloured “black-red”.
At any intermediate node the upstream congestion is given by the proportion of red flagged bytes to total bytes. Thus the continually varying congestion level is effectively encoded in a stream of packets by interpreting the stream of red or non-red markings as a unary encoding of ones or zeroes respectively. Similarly, the congestion level of the whole path is encoded as a stream of black or non-black markings. The expected downstream congestion from any intermediate node can then be estimated from the difference between the proportions of black flags and of red flags, as described in International application WO2006/079845.
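The byte-proportion arithmetic above can be sketched as follows. The function name and the packet representation are hypothetical, and counting a “black-red” packet as both black and red is an assumption based on the statement that its original black marking remains recoverable.

```python
def congestion_estimates(packets):
    """Estimate congestion at an intermediate node from packet colours.

    packets: iterable of (size_in_bytes, colour) pairs observed at the node,
    where colour is "grey", "green", "red", "black" or "black-red".
    Returns (upstream, whole_path, downstream) as fractions of bytes.
    """
    total = red = black = 0
    for size, colour in packets:
        total += size
        if colour in ("red", "black-red"):
            red += size      # congestion already experienced upstream
        if colour in ("black", "black-red"):
            black += size    # sender's re-echo of whole-path congestion
    if total == 0:
        return 0.0, 0.0, 0.0
    upstream = red / total
    whole_path = black / total
    # Expected congestion on the rest of the path is the difference.
    downstream = whole_path - upstream
    return upstream, whole_path, downstream


# Example: 1% of bytes are red and 3% are black.
pkts = [(1000, "grey")] * 96 + [(1000, "red")] + [(1000, "black")] * 3
up, whole, down = congestion_estimates(pkts)
```

For this example stream the node would estimate roughly 1% upstream congestion, 3% whole-path congestion and therefore about 2% expected downstream congestion.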
The IETF is in the process of defining an experimental change to the Internet protocol (IP) based on re-ECN, called Congestion Exposure (ConEx) (see [ConEx-abstr-mech]).
Discussion of Prior Techniques
The distributed congestion control responses to congestion of every data source do not necessarily share bandwidth equitably or efficiently. Firstly this approach relies on sources voluntarily responding in the prescribed way to the presence of congestion. Secondly, even if all sources respond as prescribed, the source of every data flow would not be taking account of how active or inactive it had been over time relative to others. Thirdly, equity should be judged between entities with real-world identities (e.g. users or customers of a network) not abstract data flows. Otherwise some real world entities can simply create many more data flows than others.
Due to this, network operators generally limit usage of a shared network. This is generically termed “policing”.
The physical capacity of a communications link provides a natural physical limit on the bit-rate that the users of that link can achieve. The link provided to attach a customer site (e.g. home or business) to a network physically limits (or physically polices) the customer's traffic.
However, often a logical rather than a physical limit is placed on the bit-rate to or from a customer site. This is because, as the peak bandwidth of access links has increased with advances in technology, average access link utilisation has decreased (currently 1% average utilisation during the peak period is typical). Therefore, when traffic from a large number of customers with low average utilisation is aggregated deeper into the network, it is uneconomic to provision shared capacity for the eventuality that every user might briefly use 100% of their access capacity.
Typically the average traffic from a large aggregate of customers is fairly predictable. It is possible for a network operator to provide enough capacity for this average, plus enough headroom to allow for daily variation. However, at peak times, everyone's experience is then determined by the heaviest users: how many there are and how heavy they are.
Policing
A number of means have been devised to logically police usage of shared capacity. Some are used in production networks, others are merely research proposals:
Token-Bucket Policing:
With reference to FIG. 1a, and as further discussed in [Turner86], the network operator allocates each customer i a contracted rate ui and a contracted burst size bi. A token bucket policer is associated with each customer, which is essentially an account that stores a single number di that characterises the customer's recent activity. Conceptually, di is the time-varying depth of fill of the customer's token bucket, which is filled with tokens at constant rate ui and can store up to bi tokens. A meter measures the customer's traffic and removes tokens from the bucket for every byte transferred. Therefore, a customer sending at time-varying bit-rate xi will remove tokens from the bucket at rate xi.
A policer regulates the rate yi at which the customer can send traffic dependent on the fill depth di of the bucket. As long as the bucket is not empty (di>0), the policer does not impede the customer's data flow xi. But whenever there are insufficient tokens in the bucket (di=0), arriving data is discarded. If the customer is under-utilising the contract, the bucket will be full, and further tokens filling the bucket will be discarded.
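A minimal sketch of such a policer follows. The class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and starting with a full bucket is an assumption. The attributes `rate`, `burst` and `depth` correspond to ui, bi and di in the description above.

```python
class TokenBucketPolicer:
    """Single token bucket: fill at a constant rate, drain per byte sent."""

    def __init__(self, rate, burst):
        self.rate = rate      # contracted rate ui, in bytes per second
        self.burst = burst    # contracted burst size bi, in bytes
        self.depth = burst    # fill depth di; starts full (assumption)
        self.last = 0.0       # time of the previous update, in seconds

    def allow(self, now, size):
        """Return True to forward a packet of `size` bytes, False to discard."""
        # Add tokens for the elapsed time; tokens beyond bi are discarded.
        self.depth = min(self.burst, self.depth + self.rate * (now - self.last))
        self.last = now
        if self.depth >= size:
            self.depth -= size   # the meter removes one token per byte
            return True
        return False             # insufficient tokens: discard the packet


policer = TokenBucketPolicer(rate=1000.0, burst=1500.0)
results = [
    policer.allow(0.0, 1500),   # uses the whole initial burst allowance
    policer.allow(0.0, 100),    # bucket is empty, so this is discarded
    policer.allow(1.0, 1000),   # one second of refill covers this packet
    policer.allow(10.0, 1500),  # long idle period: bucket capped at bi
]
```

The variants listed next change only the final branch: shaping would queue the packet instead of returning False, and marking would forward it with an out-of-contract tag.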
Variants are possible, e.g.:
an overdraft at the bottom of the bucket where the probability of discard increases with the depth of the overdraft;
delay rather than discard (termed shaping rather than policing);
marking as out of contract, rather than discard (see RIO below).
Paired Token Buckets:
A customer may be offered an assurance that they will always be able to use a certain bit-rate (their committed information rate or CIR), but they will also be allowed to use up to a peak information rate (PIR) if shared capacity is available. The two rates are also associated with allowed burst sizes above the rate: respectively the committed burst size (CBS) and the peak burst size (PBS). A CIR/PIR contract is generally policed by paired token buckets, filled at the two rates and with depths of the two burst sizes respectively. This arrangement is typically called a three colour marker (TCM), because it often marks (or ‘colours’) outgoing traffic with one of three different classes of service depending on whether both, one or neither bucket is empty [RFC2697, RFC2698].
The CIR/PIR approach was common in Frame Relay and ATM, and it is common today on a shared link where the access capacity technology includes a mechanism for sharing out the capacity (e.g. time-division multiplexing in cable networks or passive optical networks and code-division multiplexing in cellular networks). Link capacity is provisioned so that it can support the sum of all the committed information rates. The approach is also used for whole networks, not just links, for example differentiated services (DiffServ) networks [RFC2698]. For a network, the committed rate may not be guaranteed—shared capacity may be provisioned so that it has a high probability of satisfying all the committed rates.
Another variant assures just one committed rate, not two, with one associated committed burst size, but a peak burst size is also allowed if available capacity permits [RFC2697].
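The paired-bucket arrangement for a CIR/PIR contract can be sketched as below. The sketch loosely follows the colour-blind mode of the two-rate marker in [RFC2698]; the class name, the colour names returned and the explicit time argument are assumptions made for illustration.

```python
class TwoRateMarker:
    """Paired token buckets for a CIR/PIR contract."""

    def __init__(self, cir, cbs, pir, pbs):
        self.cir, self.cbs = cir, cbs   # committed rate and burst size
        self.pir, self.pbs = pir, pbs   # peak rate and burst size
        self.tc, self.tp = cbs, pbs     # both buckets start full
        self.last = 0.0

    def colour(self, now, size):
        """Return "green", "yellow" or "red" for a packet of `size` bytes."""
        elapsed = now - self.last
        self.last = now
        # Fill each bucket at its own rate, capped at its burst size.
        self.tc = min(self.cbs, self.tc + self.cir * elapsed)
        self.tp = min(self.pbs, self.tp + self.pir * elapsed)
        if self.tp < size:
            return "red"        # exceeds even the peak profile
        self.tp -= size
        if self.tc < size:
            return "yellow"     # within peak but above committed profile
        self.tc -= size
        return "green"          # within the committed profile


marker = TwoRateMarker(cir=1000.0, cbs=1000.0, pir=2000.0, pbs=2000.0)
colours = [marker.colour(0.0, 1000) for _ in range(3)]
colours.append(marker.colour(1.0, 1000))
```

Three back-to-back packets are coloured green, yellow and red in turn as first the committed bucket and then the peak bucket empties; after one second of refill the next packet is green again.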
RED with In/Out (RIO):
RIO (further discussed in [Clark98]) separates the decision on which traffic is out of contract from the decision on whether to sanction out of contract traffic. It comes in two variants, each conceptually the dual of the other:
Sender-based policing: At the ingress to a network any of the above policer designs can be used to determine which traffic is in-contract and which out. But out-of-contract traffic is merely tagged as such, rather than discarded. In fact, the customer can tag their own traffic to indicate which out-of-contract packets are least important to them; then the network operator merely has to check that the traffic tagged as in-contract does actually fit within the contracted traffic profile.
If there is congestion at any forwarding node deeper into the network, packets tagged as out-of-contract can be discarded preferentially before in-contract packets are discarded. The RIO scheme proposed that nodes deeper into the network could simply run two instances of the RED algorithm, one with aggressive thresholds for out-of-contract traffic, and the other with a regular threshold configuration.
Receiver-based policing: In this variant, the traffic is probabilistically marked with standard explicit congestion notification (ECN) if it passes through a congested queue. Then just before arriving at the receiver, the traffic is compared against the contracted profile using one of the policing techniques described above. But instead of discarding packets or tagging them out-of-contract, ECN markings are removed for all traffic within the profile.
Weighted RED (WRED):
WRED (further discussed in [WRED_Cisco]) is a variant of the sender-based version of RIO that has been widely implemented. Like RIO, on entry to the network, traffic is policed to a contract agreed with the customer. And like RIO, the policer tags rather than discards traffic that is out of profile. But rather than tag traffic as either in or out-of-contract, a WRED policer demotes out-of-contract traffic using potentially eight traffic class identifiers. For IP differentiated services, three classes are typically used, as standardised for the assured forwarding class of DiffServ [RFC2597].
On interior routers, up to eight different sets of RED thresholds are configured for each class and one algorithm (rather than the two of RIO) determines the average queue length. Then each packet is compared against the thresholds relevant to its class, so that packets demoted to lower precedence classes will be more likely to be dropped.
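The per-class comparison on an interior router can be sketched as follows. The three profiles, their numeric thresholds and the linear ramp are illustrative assumptions and are not taken from [WRED_Cisco]; what the sketch shows is a single average queue length being compared against class-specific thresholds.

```python
# Hypothetical per-class profiles: class -> (min_th, max_th, max_p).
# Lower-precedence (more demoted) classes get more aggressive thresholds.
WRED_PROFILES = {
    0: (20.0, 40.0, 0.10),   # in-contract traffic
    1: (10.0, 30.0, 0.20),   # demoted once
    2: (5.0, 20.0, 0.50),    # demoted twice
}


def wred_drop_probability(avg_queue, traffic_class, profiles=WRED_PROFILES):
    """Drop probability for a packet of the given class.

    avg_queue is the single average queue length shared by all classes.
    """
    min_th, max_th, max_p = profiles[traffic_class]
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


# With the same average queue, demoted classes are dropped more readily.
p0, p1, p2 = (wred_drop_probability(15.0, c) for c in (0, 1, 2))
```

At an average queue of 15 packets the in-contract class is not dropped at all, while the two demoted classes see progressively higher drop probabilities.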
U.S. Pat. No. 6,904,015 (Chen et al), entitled “Congestion avoidance profiles in a packet switching system”, relates to a technique for implementing the weighted RED algorithm in hardware. In Chen's technique, a traffic conditioner stores a drop probability profile as a collection of configurable profile segments. A multi-stage comparator compares an average queue size (AQS) for a packet queue to the segments, and determines which segment the AQS lies within. This segment is keyed to a corresponding drop probability, which is used to make a packet discard/admit decision for a packet. In a preferred implementation, this computational core is surrounded by a set of registers, the purpose of which is to allow it to serve multiple packet queues and packets with different discard-priorities.
Bottleneck Flow Policing:
A technique sometimes referred to as penalty box policing [Floyd99] involves monitoring the discards from a FIFO queue to identify whether packets from particular flows are more prevalent among the discards than others. Numerous variants and improvements to the original idea were subsequently published, such as RED with Preference Dropping (RED-PD [Mahajan01]), Least Recently Used RED (LRU-RED [Reddy01]), XCHOKe [Chhabra02], and Approx. Fair Dropping (AFD [Pan03]).
The intent of these bottleneck flow policing mechanisms is to identify application data flows with a higher bit-rate than other flows, in order to police their rate down to the same as every other flow.
In-Band-Congestion-Token-Bucket Policing:
Referring to FIG. 1b, this is similar in operation to token-bucket policing, but it takes account of traffic only if it contributed to congestion. A prerequisite is that the proportion of the traffic's contribution to congestion elsewhere must have been tagged onto the traffic itself, as in-band congestion signalling. This is discussed further in [Jacquet08] and International application WO2006/082443.
Typically each packet can either be marked or not, with a probability proportional to the congestion it has contributed to. This might be achieved with explicit congestion notification (ECN [RFC3168]) or congestion exposure (ConEx [ConEx-abstr-mech]). The meter measures only congestion marked packets and ignores the rest. It removes tokens from the congestion-token-bucket only for the bytes of marked packets. The network operator allocates each customer i a contracted congestion-bit-rate of zi and a contracted congestion burst size ci. Conceptually these are represented by a fill-rate and depth as with the traditional token bucket. Again, when a customer's congestion-token-bucket is empty, the policer limits their bit-rate.
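A minimal sketch of the congestion-token-bucket meter is shown below. The class and method names are hypothetical, and because the text does not specify how the policer limits the bit-rate once the bucket is empty, the sketch only reports whether the customer is still within their congestion allowance.

```python
class CongestionTokenBucket:
    """Token bucket that is drained only by congestion-marked bytes."""

    def __init__(self, fill_rate, depth):
        self.fill_rate = fill_rate   # contracted congestion-bit-rate zi
        self.capacity = depth        # contracted congestion burst size ci
        self.tokens = depth          # starts full (assumption)
        self.last = 0.0

    def on_packet(self, now, size, marked):
        """Meter one packet; return True while the bucket is not empty."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + self.fill_rate * elapsed)
        if marked:
            # Only bytes of congestion-marked packets remove tokens.
            self.tokens = max(0.0, self.tokens - size)
        return self.tokens > 0


bucket = CongestionTokenBucket(fill_rate=100.0, depth=3000.0)
within = [
    bucket.on_packet(0.0, 1500, marked=False),  # unmarked: ignored by meter
    bucket.on_packet(0.0, 1500, marked=True),   # half the allowance used
    bucket.on_packet(0.0, 1500, marked=True),   # allowance exhausted
    bucket.on_packet(0.0, 1500, marked=False),  # still empty: policer acts
    bucket.on_packet(10.0, 1500, marked=False), # refilled after ten seconds
]
```

A customer sending large volumes through uncongested paths never drains the bucket, which is the essential difference from the conventional token bucket.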
Variants are possible:
A dual token bucket might be used in which, as well as a defined token-fill-rate, the token-drain-rate is limited to a maximum. International application WO2010/109201 discusses this.
Instead of a binary congestion marking, each packet might be tagged with a real number between 0 and 1 signifying the level of congestion it has experienced. For example the feedback frames in quantised congestion notification (QCN) [IEEE802.1Qau] are tagged in this way. Then, the meter would count the congestion-bytes to be removed from the bucket as the number of bytes in a data frame multiplied by the numeric congestion level associated with the frame.
Weighted Fair Queuing (WFQ):
Referring to FIG. 1c, WFQ partitions capacity between the entities actively using a link, without wasting capacity on inactive entities. Entities might be defined as whole customers or individual data flows. Each entity is associated with a weight, so that deliberately unequal shares can be provided. Traffic from each active entity is partitioned into separate queues. Access to the shared line is arbitrated by a scheduler, which serves each queue for a certain proportion of time, wi/Σw, where wi is the weight associated with entity i and Σw is the sum of the weights of all active entities. This gives each customer an assured minimum proportion of the link capacity Y, equal to wi Y/Σw. If a customer sends more than this, their queue just builds up. If they send less, their queue drains and whenever their queue empties, even in the brief periods between packets, the scheduler will give the other customers a higher proportion of the link, because Σw will not include the inactive user's weight while they have no packet waiting in the queue. This is discussed further in [WFQ89] [WFQ_Cisco].
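The share calculation can be illustrated with a short sketch. The function name and the example weights are assumptions; the formula is the wi Y/Σw expression given above, with Σw taken over only those entities that currently have a packet queued.

```python
def wfq_shares(weights, active, capacity):
    """Return the capacity share of each active entity.

    weights: dict mapping entity -> weight wi.
    active: set of entities that currently have packets queued.
    capacity: link capacity Y (any consistent unit).
    """
    total = sum(weights[e] for e in active)   # sum of weights of active entities
    if total == 0:
        return {}
    return {e: weights[e] * capacity / total for e in active}


weights = {"a": 1, "b": 2, "c": 1}
all_active = wfq_shares(weights, {"a", "b", "c"}, 100)
c_idle = wfq_shares(weights, {"a", "b"}, 100)
```

With all three entities active the shares are 25, 50 and 25 units. When "c" has nothing queued, its weight drops out of the sum and the remaining entities split the whole link in the ratio 1:2.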
Deep Packet Inspection (DPI):
DPI machines use network processors to reconstruct application layer packet streams and identify which packets belong to which applications. It is then possible for the network operator to configure policies that discriminate against certain applications, which it infers are likely to occupy large amounts of capacity, but may not be particularly highly valued by most customers.
DPI machines are also configured to be able to recognise traffic from each individual customer and count the total volume, or the volume of particular applications, against each customer's account. A common approach is to combine these capabilities so as to limit only the peer-to-peer file-sharing traffic of those users that have contributed a large proportion of the total traffic volume during the peak period of the day.
There is no standard DPI machine, the approach being entirely proprietary. But generally, the packet classification stage can be thought of as similar to the stage of all the schemes so far described that checks whether arriving traffic fits a profile, allowing traffic to be classified as in or out-of-contract. Alternatively, as with WRED, a spectrum between in and out can be defined.
Having classified how well traffic complies with a traffic contract, DPI boxes then use the full range of techniques already described to degrade out of contract traffic, ranging from discard to tagging for potential treatment elsewhere in the network if necessary.
In addition, DPI boxes may route certain classifications of traffic differently to improve or degrade its service.
Comcast's Protocol-Agnostic Congestion Management System:
Comcast's system (see [Fairshare]), developed in conjunction with Sandvine, takes the following steps:
It measures the volume of (upstream) traffic from each customer over a period of a few minutes and records the most recent per-customer metric.
The network monitors whether a particular segment is becoming congested.
If it is, the system identifies those users of that segment who have contributed most traffic in the recent past and assigns all their traffic to a lower priority class for a brief period.
Whenever the segment becomes congested, those users' traffic will then receive lower priority service than everyone else, and therefore may be delayed or dropped.
Once those customers reduce their contribution below a threshold, they are no longer assigned lower priority.
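The steps above can be sketched as a periodic update of the set of customers assigned to the lower priority class. This is an illustration under stated assumptions, not a description of the deployed system: the function name, the use of fixed volume thresholds to identify the heaviest users, and the threshold values are all hypothetical.

```python
def update_low_priority(volumes, low_priority, segment_congested,
                        heavy_threshold, release_threshold):
    """Return the new set of customers in the lower priority class.

    volumes: dict mapping customer -> bytes sent in the most recent
        measurement period (a few minutes).
    low_priority: set of customers currently assigned lower priority.
    segment_congested: True if the monitored segment is becoming congested.
    heavy_threshold, release_threshold: hypothetical volume thresholds.
    """
    updated = set(low_priority)
    # Release customers whose recent contribution fell below the threshold.
    for customer in low_priority:
        if volumes.get(customer, 0) < release_threshold:
            updated.discard(customer)
    # Under congestion, demote the customers who contributed most traffic.
    if segment_congested:
        for customer, volume in volumes.items():
            if volume >= heavy_threshold:
                updated.add(customer)
    return updated


first = update_low_priority({"alice": 900, "bob": 50, "carol": 400},
                            set(), True, 800, 300)
still = update_low_priority({"alice": 500, "bob": 50}, first, False, 800, 300)
later = update_low_priority({"alice": 200, "bob": 50}, still, False, 800, 300)
```

In the example, "alice" is demoted while the segment is congested, stays demoted while her volume remains above the release threshold, and is restored once it falls below it.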