Computer networks are ubiquitous in today's computer usage. Increases in processor speeds, memory, storage, and network bandwidth technologies have resulted in the build-out and deployment of networks with ever-increasing capacities. More recently, the introduction of cloud-based services, such as those provided by Amazon (e.g., Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3)) and Microsoft (e.g., Azure and Office 365), has resulted in additional network build-out for public network infrastructure, in addition to the deployment of massive data centers, which employ private network infrastructure, to support these services. Additionally, the new generation (i.e., 4G) of mobile network data services is expected to significantly impact the utilization of both wireless and land-line networks in the near future. The result of these and other considerations is that the utilization of computer networks is expected to continue to grow at a high rate for the foreseeable future.
The key components for facilitating packet forwarding in a computer network are the switching elements, which are generally referred to herein as network elements and include switches, routers and bridges. A switch has multiple input and output ports, each connected via a link to another switch or other type of network element, wherein inbound packet traffic is received at the input ports and forwarded out of the output ports. Generally, the number of physical input and output ports is equal, and the amount of traffic received at or forwarded out of a given port relative to the other ports is variable. Internally, the input and output ports of a switch are logically connected such that each input port is connected to each output port in a one-to-many configuration. Each of the input and output ports has buffers for temporarily storing (i.e., buffering) packets, and the switch has other intermediate output buffers and/or queues typically associated with different flow classes, Quality of Service (QoS) levels, etc. Under a typical configuration, a packet received at a given input port is initially buffered in an input buffer associated with that input port and classified. Once classified, the packet may be buffered along with other packets classified to the same class or flow in an intermediate output buffer allocated to the class or flow and/or associated with an output port via which the packet is to be forwarded. The packet may then be copied from the intermediate buffer to a smaller output buffer for the port, or the packet data in the intermediate output buffer may otherwise be forwarded to the Physical (PHY) layer interface of the port and converted to an electrical analog, optical, or wireless signal for transmission over the link.
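The buffering path described above can be illustrated with a minimal sketch. It is not an implementation of any particular switch: the names `Packet`, `Switch`, `receive`, `classify`, and `transmit` are hypothetical, the output port is assumed to be already known for each packet, and serving lower-numbered classes first is an assumed policy chosen only to make the example concrete.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Packet:
    dst_port: int        # output port the packet leaves on (assumed precomputed)
    qos_class: int       # hypothetical flow/QoS class label
    payload: bytes = b""


@dataclass
class Switch:
    num_ports: int
    input_buffers: dict = field(default_factory=dict)
    intermediate: dict = field(default_factory=dict)

    def __post_init__(self):
        # One input buffer per input port.
        self.input_buffers = {p: deque() for p in range(self.num_ports)}

    def receive(self, in_port, pkt):
        """Buffer an inbound packet in the input buffer of its input port."""
        self.input_buffers[in_port].append(pkt)

    def classify(self, in_port):
        """Move one packet from an input buffer to the intermediate
        output buffer keyed by (output port, class)."""
        pkt = self.input_buffers[in_port].popleft()
        key = (pkt.dst_port, pkt.qos_class)
        self.intermediate.setdefault(key, deque()).append(pkt)
        return key

    def transmit(self, out_port):
        """Hand the next packet for out_port to the (not modeled) PHY.
        Assumed policy: lowest class number is served first."""
        for key in sorted(k for k in self.intermediate if k[0] == out_port):
            if self.intermediate[key]:
                return self.intermediate[key].popleft()
        return None
```

In this sketch, packets arriving on different input ports but destined for the same output port end up sharing that port's intermediate buffers, which is the merging behavior that makes output-side congestion possible.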
Even as networks get faster, congestion continues to exist. On some levels, network congestion is analogous to vehicle traffic congestion on freeways. For example, consider multiple onramps to a section of freeway. The incoming traffic comes from multiple paths and is merged into a single traffic flow. Similarly, packets may be received by a network element at multiple input ports, yet be forwarded via a single output port. The multiple input ports are analogous to the onramps and the merged traffic flow is analogous to the traffic flow forwarded via the output port.
Traffic management on freeways is often handled by metering the flow of traffic entering via the onramps. Under this approach, stop-and-go light signaling is used to control the rate of traffic entering the merged flow. Like freeways, network switches have limited throughput. However, unlike freeways, where the traffic is simply slowed when congested and all vehicles are allowed to proceed, switches operate at line-speeds that are a function of the underlying physical (layer) transport, such as 1, 10 or 40 Gigabits per second (Gbps). Fluctuations in traffic load are handled by buffering the traffic that is to be merged via an output port in the output and intermediate buffers. However, these buffers are limited in size, and measures must be taken to prevent buffer overfill.
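As a rough illustration of why limited buffer size matters at these line speeds, the following sketch estimates how long a buffer can absorb a sustained overload. The function name and the example figures (a 1 MB buffer, two 10 Gbps inputs merging into one 10 Gbps output) are assumptions chosen for the arithmetic, not values taken from any specific switch.

```python
def time_to_fill_s(buffer_bytes, arrival_bps, drain_bps):
    """Seconds until a buffer of buffer_bytes overflows when traffic
    arrives at arrival_bps and drains at drain_bps (bits per second)."""
    excess_bps = arrival_bps - drain_bps
    if excess_bps <= 0:
        return float("inf")  # the output keeps up; the buffer never fills
    return buffer_bytes * 8 / excess_bps


# Two 10 Gbps inputs merging into one 10 Gbps output, 1 MB of buffering:
# the excess is 10 Gbps, so the buffer fills in about 0.8 milliseconds.
fill_time = time_to_fill_s(1_000_000, 20e9, 10e9)
```

Under these assumed numbers, buffering alone covers well under a millisecond of overload, so it can smooth short fluctuations but not sustained congestion.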
When an output buffer approaches its capacity, the switch generally takes one or more of the following actions. First, it may drop incoming packets. It may also issue flow control notifications to its peer switches, and/or send backward congestion notifications to the congesting source nodes so that they reduce the throughput of the congested flow(s). Oftentimes, these actions are either not efficient or generate undesirable side effects.
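The actions listed above can be sketched as a bounded buffer that reports what the switch would do on each arrival. This is only an illustration under stated assumptions: the class name `OutputBuffer`, the packet-count capacity, and the single `notify_threshold` that stands in for either a flow control or a backward congestion notification are all hypothetical.

```python
from collections import deque


class OutputBuffer:
    def __init__(self, capacity, notify_threshold):
        self.capacity = capacity                  # maximum packets held
        self.notify_threshold = notify_threshold  # assumed "approaching capacity" mark
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, pkt):
        """Return the list of actions taken for this arriving packet."""
        if len(self.queue) >= self.capacity:
            # Buffer is full: the incoming packet is dropped.
            self.dropped += 1
            return ["drop"]
        self.queue.append(pkt)
        actions = ["enqueue"]
        if len(self.queue) >= self.notify_threshold:
            # Buffer is nearing capacity: a notification would be sent
            # to peers and/or to the congesting sources.
            actions.append("notify")
        return actions
```

A real switch would typically measure occupancy in bytes and use separate thresholds per mechanism; the single counter here is a simplification.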
In reliable transport protocols like TCP (Transmission Control Protocol), a packet drop involves retransmission of the dropped packets, which increases end-to-end latency of the whole data transfer, especially if the packet drop is detected by some timer expiration at the destination end nodes. Also, dropped packets may result in reduction of the TCP window size, which reduces the throughput even if the congestion was just transient. Meanwhile, flow control may lead to congestion spreading backward in the network. All incoming traffic to the congested switch may rapidly be stopped because flow control notifications issued to the peers are generic XOFF/XON requests, which do not differentiate between traffic flows that are destined to the congested output buffer and those that are not. Later on, if the congestion persists at the hot spot, the peer nodes themselves become congested because they cannot sink traffic that was destined to the congested switch, and so forth throughout the whole network.
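The side effect of generic XOFF/XON requests can be made concrete with a small sketch. The function `sendable` and its parameters are hypothetical; the `paused_outputs` argument models a finer-grained pause that is included only as a contrast and is not something the generic requests described above provide.

```python
def sendable(pending, paused_all=False, paused_outputs=frozenset()):
    """Return the packets a peer may still send to the congested switch.

    pending: list of (flow_id, output_port_on_congested_switch) tuples.
    paused_all: True models a generic XOFF, which stops every flow.
    paused_outputs: hypothetical per-output pause, for contrast only.
    """
    if paused_all:
        return []
    return [p for p in pending if p[1] not in paused_outputs]


pending = [("f1", 3), ("f2", 5)]          # only output port 3 is congested
generic = sendable(pending, paused_all=True)
selective = sendable(pending, paused_outputs={3})
```

With the generic request, flow `f2` is stopped even though its output port is not congested, which is how the peer's own buffers begin to fill.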
Sending backward congestion notifications to the congesting sources (as the QCN (Quantized Congestion Notification) protocol does) defines a network-wide control loop that may not be efficient when the congestion is merely transient and/or when the congesting flows are not long-lived enough relative to the round trip delay between the hot spot and the source nodes. Note that for transport protocols which, unlike TCP, start flow transmission at the full link speed without handling any congestion window (e.g., FCoE (Fibre Channel over Ethernet) transport), the flow duration is generally too short relative to the round trip delays, and therefore there is not enough time for backward congestion management loops to be established. This is especially true when link speeds are 1 Gbps and beyond. On the other hand, when the transport protocol implements a slow start mechanism like TCP does, end-to-end data transfer delays are unnecessarily extended a priori until the congestion window algorithm stabilizes the flow transmission rate around the network capacity.
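The relationship between flow duration and round trip delay can be checked with simple arithmetic. The sketch below assumes, as a simplification, that a backward notification can only influence a flow that is still transmitting one round trip delay after it starts; the function names and the example figures (a 125,000-byte flow and a 200 microsecond round trip delay) are hypothetical.

```python
def flow_duration_s(flow_bytes, link_bps):
    """Time to transmit a flow sent at full link speed from the start."""
    return flow_bytes * 8 / link_bps


def feedback_arrives_in_time(flow_bytes, link_bps, rtt_s):
    """True if the flow is still transmitting after one round trip delay,
    so a backward notification could still reduce its rate."""
    return flow_duration_s(flow_bytes, link_bps) > rtt_s


# A 125,000-byte flow lasts 1 ms at 1 Gbps but only 100 us at 10 Gbps.
at_1g = feedback_arrives_in_time(125_000, 1e9, 200e-6)
at_10g = feedback_arrives_in_time(125_000, 10e9, 200e-6)
```

Under these assumed figures the control loop has time to act at 1 Gbps but not at 10 Gbps, since the flow has already finished before any notification could reach the source.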