In recent years, “cloud-based” services, high-performance computing (HPC), and other activities employing data centers and the like have seen widespread adoption. Under a typical data center installation, a large number of servers installed in server chassis and server racks are interconnected in communication using network links (e.g., Ethernet) and various switching mechanisms, such as switch blades/modules and “top-of-rack” (ToR) switches. In some installations, additional links, such as InfiniBand or Fibre Channel, may be used for storage and other purposes.
Performance of the network(s) within the data center can be impacted by congestion. During ongoing operations, applications running on compute nodes (typically hosted by physical servers and/or virtual machines, and also referred to as end-nodes, host nodes, or end hosts) send traffic to other applications using a “push” model under which a source node or host pushes traffic toward a destination node or host. Generally, network traffic is sent over Ethernet links using one or more upper layer protocols, such as TCP/IP (Transmission Control Protocol over Internet Protocol). At the Ethernet layer, data is transferred between network ports coupled to the compute nodes/hosts along forwarding paths that may be predetermined or dynamically selected based on real-time traffic considerations. Each link has a finite bandwidth, and when that bandwidth is reached, buffering of packets at the switches coupled to the link increases. As the buffers fill, the network switches attempt to inform sources and/or other switches that the links are congested and that traffic sent toward the congested links should be backed off. If applicable, the switches will also drop packets, which exacerbates the congestion problem, since dropped packets have to be resent under confirmed-delivery protocols such as TCP.
TCP traffic benefits from a built-in end-to-end congestion management method that has been enhanced by several techniques. Data Center Transmission Control Protocol (DCTCP) is the most recent and the most efficient congestion avoidance variant used in today's cloud data centers. It is a TCP-like protocol for data center networks that leverages Explicit Congestion Notification (ECN) in the network to provide multi-bit feedback to the end-nodes. Unfortunately, DCTCP is not applicable to non-TCP traffic such as RDMA over Converged Ethernet (RoCE) or Fibre Channel over Ethernet (FCoE).
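The multi-bit feedback mentioned above can be illustrated with a minimal sketch. The text here does not spell out the DCTCP update rules; the two formulas below follow the publicly documented DCTCP behavior (a running estimate of the fraction of ECN-marked packets, and a window reduction proportional to that estimate). The function names and the gain value `g = 1/16` are illustrative assumptions, not part of this document.

```python
def update_alpha(alpha: float, marked: int, total: int, g: float = 1.0 / 16) -> float:
    """Update the running estimate of the fraction of ECN-marked packets.

    alpha  -- previous estimate, in [0, 1]
    marked -- packets acknowledged with an ECN mark in the last window
    total  -- all packets acknowledged in the last window
    g      -- estimation gain (assumed value; smaller means smoother)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    fraction_marked = marked / total
    return (1.0 - g) * alpha + g * fraction_marked


def reduce_window(cwnd: float, alpha: float) -> float:
    """Shrink the congestion window in proportion to estimated congestion.

    With alpha = 1 (every packet marked) the window is halved, as in
    classic TCP; with a small alpha the reduction is correspondingly mild.
    """
    return cwnd * (1.0 - alpha / 2.0)
```

Because the sender sees a mark (or no mark) for every acknowledged packet, the estimate tracks the degree of congestion rather than its mere presence, which is what distinguishes this scheme from single-bit reactions.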
Currently, there are several techniques for addressing congestion management of non-TCP traffic in data centers. For example, Quantized Congestion Notification (QCN) is a standardized method (defined by IEEE 802.1Qau) for the network switches (also known as congestion points, or CPs) to convey congestion notifications back to the source nodes (also known as reaction points, or RPs). In reaction to the returned Congestion Notification Messages (CNMs), the reaction point reduces the transmission rate for the flow(s) concerned.
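As a rough illustration of the reaction-point behavior described above, the sketch below lowers a flow's sending rate when a CNM arrives. The text does not give the rate-reduction rule; the multiplicative form, the 6-bit feedback range, and the gain constant are assumptions loosely modeled on published descriptions of QCN, and the names `FB_MAX`, `GD`, and `react_to_cnm` are hypothetical.

```python
FB_MAX = 63       # assumed: largest magnitude of a 6-bit quantized feedback value
GD = 1.0 / 128    # assumed decrease gain; caps one reduction at roughly 50%


def react_to_cnm(current_rate_bps: float, quantized_fb: int) -> float:
    """Return the reduced sending rate after one congestion notification.

    quantized_fb is the magnitude of the congestion feedback carried in
    the CNM: 0 means no reduction, FB_MAX means the largest reduction.
    """
    if not 0 <= quantized_fb <= FB_MAX:
        raise ValueError("feedback out of range")
    return current_rate_bps * (1.0 - GD * quantized_fb)
```

For example, under these assumed constants a 10 Gbps flow receiving a CNM with feedback magnitude 32 would drop to 7.5 Gbps.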
QCN relies on the congested switches or end station buffers to sample outgoing frames and to generate a feedback message (CNM) addressed to the source of the sampled frame. The feedback message contains information about the extent of congestion at the CP. The nominal sampling rate is 1%, and it can grow to as much as 10% when the switch output buffers become heavily congested.
Unlike DCTCP, QCN relies on:
   1) Sampled notifications, instead of the systematic reports used in DCTCP.
   2) Feedback on the overall extent of congestion at the switch, for all flows together, instead of feedback on the extent of congestion for the specific (Layer4) flow.
As a result of this sampling approach, the returned QCN packets provide congestion reports on only a small fraction of the traffic sent. This fraction corresponds to the sampling rate at the switch port (1% to 10%) multiplied by the percentage of the offending L4 flow among all the traffic entering the switch port.
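The product described above can be worked through numerically. The following is a minimal sketch; the function name and the 20% flow share used in the example are illustrative assumptions, while the 1% and 10% sampling rates come from the text.

```python
def flow_report_fraction(sampling_rate: float, flow_share: float) -> float:
    """Fraction of traffic at a switch port that yields a report for one flow.

    sampling_rate -- fraction of outgoing frames sampled at the port (0.01 to 0.10)
    flow_share    -- fraction of the port's traffic belonging to the offending flow
    """
    for name, value in (("sampling_rate", sampling_rate), ("flow_share", flow_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sampling_rate * flow_share


# An offending flow assumed to make up 20% of the traffic entering the port:
nominal = flow_report_fraction(0.01, 0.20)    # nominal 1% sampling
congested = flow_report_fraction(0.10, 0.20)  # 10% sampling under heavy congestion
print(f"nominal: {nominal:.2%}, congested: {congested:.2%}")
```

Under these assumed numbers, only about 0.2% of the frames at the port produce a report for the offending flow at the nominal sampling rate, rising to about 2% at the maximum rate.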
One result of the foregoing approach is that the QCN control loop takes much longer to converge than that of DCTCP. This makes QCN efficient only for long-lived data flows, which are generally uncommon when the data center links operate at speeds of 10 Gbps and higher. In addition, QCN has attempted to compensate for its inherent weaknesses by using a greater rate decrease factor at the reaction points, which in turn has led to throughput penalties and longer recovery times at the source nodes. For these reasons, QCN does not perform well at 10 Gbps (and higher rates), and in any case, it does not perform as well as DCTCP.
In addition, since QCN is a Layer2 control protocol, it is confined to the Layer2 Ethernet cloud and cannot extend beyond the IP subnet, as is required for overlay/tunneled environments in which traffic carries an IP header but no TCP header. As defined by IEEE 802.1Qau, QCN is designed to be a self-contained Layer2 solution that is agnostic to higher layer protocols such as FCoE, RoCE, etc.