“Congestion marking” in data networks is the practice of setting a value in a particular field of a packet header to indicate that the packet has experienced congestion at a hop as it traverses the network as part of a flow. If the packet continues to experience congestion at a later hop, the router at that hop can decide to drop the congestion-marked packet, to alleviate congestion within the network as a whole. Congestion marking in Internet Protocol (IP) networks is well known, and typically operates by a network layer router “marking” a packet (in the IP header) when traffic is above a certain threshold. In particular, IETF RFC 3168 defines the addition of Explicit Congestion Notification (ECN) to IP, making use of the two-bit ECN field in the IP header, adjacent to the DiffServ Codepoint (DSCP), to perform the ECN marking.
The use of congestion marking allows various network control functions to be performed, such as: (1) congestion-reactive protocols such as TCP: performing rate control instead of discarding the packet; (2) PCN (Pre-Congestion Notification): controlling admission of new flows and pre-empting existing flows; and (3) traffic engineering: controlling the shifting of flows to alternate paths.
ECN techniques typically rely on congestion having occurred at some point in the network to trigger the initial ECN marking of packets, and on further congestion occurring to cause ECN-marked packets to be dropped. However, in some deployments it may be better to detect increasing network traffic and take remedial action to control the network load before congestion actually occurs, or just as it is starting to occur. To address such issues, the concept of Pre-Congestion Notification (PCN) has been developed, for use in the context of network layer congestion marking.
PCN (Pre-Congestion Notification) provides an end-to-end Controlled Load (CL) service using edge-to-edge Distributed Measurement-Based Admission Control (DMBAC), described in IETF Internet draft Briscoe et al., “An edge-to-edge Deployment Model for Pre-Congestion Notification: Admission Control over a DiffServ Region”, draft-briscoe-tsvwg-cl-architecture-04.txt, available from http://tools.ietf.org/wg/tsvwg/draft-briscoe-tsvwg-cl-architecture-04.txt, and IETF Internet draft B. Briscoe et al., “Pre-Congestion Notification Marking”, draft-briscoe-tsvwg-cl-phb-02.txt, available from http://www.ietf.org/internet-drafts/draft-briscoe-tsvwg-cl-phb-03.txt. The main objective of measurement-based admission control (MBAC) is to guarantee Quality of Service (QoS) requirements, not only for any incoming flow examined for admission, but also for existing admitted flows.
PCN uses two control algorithms: Flow Admission Control and Flow Pre-emption. These are described in more detail below. For further detail reference is made to the two IETF Internet drafts noted above, which are incorporated herein by reference.
In Flow Admission Control, PCN introduces an algorithm that “admission marks” packets before there is any significant build-up of CL packets in the queue. Admission-marked packets act as an early warning that the traffic level is close to the engineered capacity. The CL service is achieved “edge-to-edge” across the CL-region by using DMBAC (Distributed Measurement-Based Admission Control). The decision to admit a new flow depends on measurement of the existing traffic between the same pair of ingress and egress gateways. The CL-region egress calculates the fraction of packets that are marked using an EWMA (Exponentially Weighted Moving Average), and reports the value (the Congestion Level Estimate) back to the ingress.
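The egress's EWMA calculation can be sketched as a simple per-packet update. The function name and the smoothing weight of 0.1 below are illustrative assumptions, not values taken from the drafts:

```python
def update_cle(cle, marked, weight=0.1):
    """Update the Congestion Level Estimate (CLE) with one arriving packet.

    `marked` is 1 if the packet carries an admission mark, 0 otherwise.
    The CLE is an exponentially weighted moving average of the fraction
    of marked packets; the egress reports it back to the ingress.
    """
    return (1.0 - weight) * cle + weight * marked
```

The ingress would compare the reported CLE against a threshold to decide whether to admit a new flow between the same ingress-egress gateway pair.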
Admission marking is performed as follows. In the current PCN design, the router computes the probability that the packet should be admission marked according to the size of the virtual queue at the router, using the following RED-like algorithm:

size of virtual queue < min-marking-threshold: probability = 0;

min-marking-threshold ≤ size of virtual queue ≤ max-marking-threshold: probability = (size of virtual queue − min-marking-threshold)/(max-marking-threshold − min-marking-threshold);

size of virtual queue > max-marking-threshold: probability = 1.
This gives a probability function as shown in FIG. 3, with the probability of a packet being marked increasing linearly with the virtual queue size at the router between the minimum and maximum marking thresholds. In this respect, by “virtual queue” we mean a simulation of the traffic queue at the router, adapted to simulate a slower sending rate (i.e. rate at which the queue empties) than actually occurs. Thus, the “virtual queue” is a counter which is incremented at the same rate as packets arrive at the real queue, but is decremented at a slower rate than the real queue is actually emptied by packets being sent. For example, the rate of decrement of the virtual queue may be chosen to be some percentage, say 80%, of the actual router sending rate, and hence of the rate of decrement of the real queue.
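The RED-like piecewise rule translates directly into code. This is a minimal sketch, with the function and parameter names assumed for illustration:

```python
def admission_mark_probability(vq_size, min_thresh, max_thresh):
    """RED-like admission-marking probability driven by virtual queue size.

    Below min_thresh no packets are marked; above max_thresh every packet
    is marked; in between the probability rises linearly from 0 to 1.
    """
    if vq_size < min_thresh:
        return 0.0
    if vq_size > max_thresh:
        return 1.0
    return (vq_size - min_thresh) / (max_thresh - min_thresh)
```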
The effect of using this slower virtual “sending rate” (i.e. the rate at which the virtual queue is decremented) is that it allows potential congestion to be detected earlier than would be possible by simply looking at the amount of data queued in the real queue. For example, if one monitored the real queue and waited until it was full before detecting and acting on congestion, then congestion would already have occurred, and packets would already be being lost to buffer overflow, before any remedial action was taken. By looking at the virtual queue instead, because the virtual sending rate (i.e. the rate at which the virtual queue counter is decremented) is less than the actual sending rate, for a given packet arrival rate the virtual queue will always be more “full” (i.e. indicate a higher level) than the actual queue. If the packet arrival rate increases such that the virtual queue “overflows” (i.e. reaches a threshold number, which may be equal to the number of packets the real queue can hold), this is an indication that congestion may be about to occur in the real queue, and action can be taken. Of course, as the real queue has a higher sending rate than the virtual queue, at the point where the virtual queue overflows the real queue will still have capacity, and congestion will not in fact yet have occurred. The technique of using a virtual queue thus provides a mechanism for early detection of potential congestion.
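A virtual queue of this kind can be sketched as a simple counter alongside the real queue. The class and method names, and the 80% drain fraction, are illustrative assumptions:

```python
class VirtualQueue:
    """Counter that mirrors arrivals at the real queue but is drained
    at only a fraction of the real sending rate, so it fills earlier
    than the real queue and gives an early warning of congestion."""

    def __init__(self, drain_fraction=0.8):
        self.level = 0.0
        self.drain_fraction = drain_fraction  # e.g. 80% of the real rate

    def on_arrival(self, amount):
        # incremented at the same rate as arrivals at the real queue
        self.level += amount

    def on_service(self, amount_sent):
        # decremented at only a fraction of the real service rate
        self.level = max(0.0, self.level - self.drain_fraction * amount_sent)
```

With an 80% drain fraction and arrivals exactly matching the real service rate, the virtual queue grows by 20% of the served volume per interval, signalling pre-congestion while the real queue remains empty.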
Flow Pre-emption is a scheme that helps to cope with failures of routers and links. New flows are only admitted if there is sufficient capacity, such that the QoS requirements of the new flows and existing admitted flows can be met. The traditional methods of handling link failure, mitigation and re-routing, can cause severe congestion on some links, and degrade the QoS experienced by ongoing flows and other low-priority traffic. PCN uses rate-based flow pre-emption, so that a sufficient proportion of the previously admitted flows are dropped to ensure that the remaining ones again receive QoS commensurate with the CL service, and at least some QoS is quickly restored to other traffic classes. The steps to perform Flow Pre-emption are: (1) Trigger the ingress gateway to test whether pre-emption may be needed. A router enhanced with PCN may optionally include an algorithm that pre-emption marks packets. Reception of packets thus marked at the egress sends a Pre-emption Alert message to the ingress gateway. (2) Calculate the right amount of traffic to drop. (3) Choose which flows to drop. (4) Tear down the reservations for the chosen flows. The ingress gateway triggers standard tear-down messages for the reservation protocol in use. Again, further details of Flow Pre-emption are given in the IETF drafts referenced above.
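Steps (2) and (3) can be sketched together as follows. The greedy largest-first selection policy here is purely an illustrative assumption; the drafts leave open the choice of which flows to pre-empt:

```python
def flows_to_preempt(flows, sustainable_rate):
    """Pick a subset of admitted flows whose removal brings the aggregate
    rate down to the measured sustainable rate.

    `flows` maps flow-id -> flow rate.  Flows are dropped greedily,
    largest first, until the excess over the sustainable rate is covered.
    """
    excess = sum(flows.values()) - sustainable_rate
    chosen = []
    for fid, rate in sorted(flows.items(), key=lambda kv: -kv[1]):
        if excess <= 0:
            break
        chosen.append(fid)
        excess -= rate
    return chosen
```

The ingress gateway would then issue standard tear-down messages for each chosen flow (step (4)).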
It should be noted that in any deployment of PCN, either one of flow admission control and flow pre-emption may be used, or both control algorithms may be used together. In terms of the PCN marking of packets, admission control marks are used by the admission control algorithm, and pre-emption marks by the flow pre-emption algorithm. Encoding of admission control marks and pre-emption marks into the DSCP is described in B. Briscoe et al., “Pre-Congestion Notification Marking”, draft-briscoe-tsvwg-cl-phb-02.txt, referenced above.
ECN and PCN are network layer (Layer 3 in the OSI model) protocols, used by network routers for congestion control at the network layer. However, it has recently been proposed to introduce congestion signalling into the data link layer (Layer 2 in the OSI model), to allow a data link layer entity, such as an Ethernet switch, to perform rate control in response to received congestion signals. In particular, a technique known as Backward Congestion Notification (BCN) has recently been proposed for use over Ethernet in datacenter networks.
Backward Congestion Notification (BCN) is a mechanism for datacenter networks (DCNs) that was initially developed by Davide Bergamasco at Cisco Systems, and is presently being considered by the IEEE 802.1/802.3 standards committees. BCN allows congestion feedback (at layer 2 only) from a congestion point (up to several hops away) and rate control at the ingress. It is described in JinJing Jiang, Raj Jain and Manoj Wadekar, “Analysis of Backward Congestion Notification (BCN) for Ethernet in Datacenter Applications”, submitted to the 14th International IEEE Conference on Network Protocols (ICNP 2006), Santa Barbara, Calif., Nov. 12-15, 2006, available at http://www.cse.ohio-state.edu/~jain/papers/bcn.htm, and also in the original IEEE presentation given by Bergamasco at the IEEE 802.1 Interim Meeting held in Berlin, Germany, 12 May 2005, available at http://www.ieee802.org/1/files/public/docs2005/new-bergamasco-backward-congestion-notification-0505.pdf. A brief overview of the operation of BCN is given below.
BCN messages use the IEEE 802.1Q tag format, and the key fields of the BCN message are shown in FIG. 2.
A BCN message 30 comprises the following fields. The DA (destination address) 31 of the BCN message is the SA (source address) of the sampled frame. The SA (source address) 34 of the BCN message is the MAC address of the congestion point. The CPID 35 is a congestion point identifier. Field ei 33 carries information about the buffer, which is fed back to the source. For example, in FIG. 1, switch 4 may send a BCN message back to switch 1 (the source address of the sampled frame), indicating the congestion level at the ingress to switch 4 (which the sampled frame traverses). Field 32 indicates the type of the tag message, i.e. that it is a BCN message. Field C 36 signals the capacity of the congested link.
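The fields described above can be summarised as a simple record. This sketch covers only the fields discussed in the text, omitting field widths and any fields not mentioned; the type annotations are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class BCNMessage:
    """Key fields of a BCN message (cf. FIG. 2)."""
    da: str        # destination address = SA of the sampled frame
    sa: str        # source address = MAC address of the congestion point
    tag_type: int  # identifies the tag as a BCN message
    ei: float      # congestion feedback about the switch buffer
    cpid: int      # congestion point identifier
    c: float       # capacity of the congested link
```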
The field “ei” 33 in the BCN message gives information about the condition of the memory buffer of the Ethernet switch which generated the BCN message. The value is a weighted sum of the instantaneous queue offset and the queue variation over the last sampling interval, as shown in Equation 1 below:

ei = −(qoff(t) + W·qdelta(t)) = (Qeq − q(t)) − W·(qa − qd)  (Eq. 1)

where W is the weight; qoff(t) is the instantaneous queue offset, defined as

qoff(t) = q(t) − Qeq  (Eq. 2)

and qdelta(t) is the queue variation over the last sampling interval, defined as the difference between the number of packets that arrived, qa, and the number of packets that were served, qd, since the last sampling event. Here, q(t) is the instantaneous actual queue size, and Qeq is the equilibrium queue size, which would typically be half the total size of the queue.
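The feedback value can be computed as follows. This sketch uses the (Qeq − q(t)) − W·(qa − qd) form, which is the form consistent with the positive/negative interpretation of ei; the function and parameter names are illustrative:

```python
def bcn_feedback(q, q_eq, qa, qd, w):
    """Compute the BCN feedback value ei.

    q    : instantaneous actual queue size q(t)
    q_eq : equilibrium queue size Qeq (typically half the total queue)
    qa   : packets arrived since the last sampling event
    qd   : packets served since the last sampling event
    w    : weight W on the queue-variation term
    """
    qoff = q - q_eq           # Eq. 2: instantaneous queue offset
    qdelta = qa - qd          # queue variation over the interval
    return -(qoff + w * qdelta)   # equals (q_eq - q) - w * (qa - qd)
```

A short queue that is draining gives a positive ei; a long or growing queue gives a negative ei.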
If ei>0, the BCN message is positive, indicating less potential for congestion in the near future. ei>0 will arise when either one of the conditions below is met:

1. the queue length is short and the queue is not increasing; or

2. even though the queue length is currently large, it is decreasing, and so the sources are encouraged to increase their rates.
If ei<0, the BCN message is negative, indicating more potential for congestion in the near future. ei<0 will arise when either one of the conditions below is met:

1. even though the queue is currently small, it is increasing, and so sources are encouraged to decrease their rates; or

2. the queue is large, indicating that the link is congested, and so the sources are asked to decrease their rates.
Upon receiving a BCN message, the source adjusts its rate using an additive increase, multiplicative decrease (AIMD) algorithm, which uses ei as a parameter. Further details are available in the Jiang et al. paper referenced above.
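The source-side AIMD adjustment can be sketched as follows. The gain constants, the rate floor, and the exact functional form are illustrative assumptions here, and differ from the precise scheme analysed in the Jiang et al. paper:

```python
def aimd_rate_update(rate, ei, gi=0.001, gd=0.002, min_rate=1.0):
    """Adjust the source sending rate on receipt of a BCN message.

    A positive ei triggers an additive increase proportional to ei;
    a negative ei triggers a multiplicative decrease scaled by |ei|,
    bounded below by a minimum rate.
    """
    if ei > 0:
        return rate + gi * ei                          # additive increase
    return max(min_rate, rate * (1.0 + gd * ei))       # multiplicative decrease
```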
BCN therefore provides a congestion signalling and sending-rate adaptation scheme for use in Ethernet networks, which can be used to avoid long delays and minimise loss in layer 2 Ethernet networks. PCN, on the other hand, has been proposed to provide congestion control at the network layer: PCN marks applied to packets are used to trigger admission control algorithms, to prevent admission of new flows when the network reaches its engineered load, and/or flow pre-emption algorithms, to “pre-empt”, or terminate, existing admitted flows in the event of router or link failure. To date, however, as shown in FIG. 1, the two schemes, operating as they do in different layers of the protocol stack, have been intended to operate independently of each other, each in its own layer. In this respect, FIG. 1 illustrates how BCN can operate in the data link layer 10 to send BCN messages backwards, in the upstream direction, from an Ethernet switch which is experiencing congestion to the source switch, whereas PCN operates separately in the network layer 20.
No interaction between the two mechanisms has heretofore been envisaged. Nevertheless, each mechanism is effectively acting to detect and act upon the same deleterious phenomenon, i.e. congestion in the network. Moreover, PCN attempts to avoid congestion completely by detecting when a network is reaching its engineered capacity and taking action at the network layer, typically in the form of admission control, in response. Early congestion detection is therefore useful for PCN to attain this objective. However, IP packets in the network layer must be transmitted over a link using the data link layer, and it is in the data link layer traffic queues that congestion will first occur. Congestion will therefore typically be detectable earlier by monitoring the data link layer traffic queues, as is done by BCN. It would therefore be advantageous to take the earlier congestion detection available in a data link layer congestion protocol and feed the congestion indication up into the network layer, so that earlier remedial action can also be taken in the network layer. More generally, such advantages would also be obtained by taking congestion information from a lower layer in the protocol stack for use in a higher layer.