In order to reduce congestion at bottlenecks in a network, transmissions for rate-limited data flows of a certain priority are limited by rate limits per data flow that are established based upon feedback from the network. Further, the transmissions for all data flows having the certain priority are controlled or governed by a transmission or flow control per priority.
Converged Enhanced Ethernet (CEE) datacenters provide high link speeds and short delays for data flows. In addition to the traditional lossy operation with lossy traffic classes, CEE datacenters may provide lossless operation with lossless traffic classes.
To avoid network congestion with effects such as head-of-line blocking and saturation trees, lossless CEE operation may require a distributed congestion management (CM) according to IEEE 802.1Qau (QCN) with congestion detection at so-called Congestion Points (CPs), the formation of Congestion Notification Messages (CNMs) sent to traffic sources, and rate limitation at the traffic sources in so-called Reaction Points (RPs).
A Congestion Point (CP) is a VLAN-aware bridge or end station port function that monitors a single queue serving one or more priority values. This queue, which may be referred to as the CP buffer, can be placed at the output or input of a CEE switch or bridge, or at the input of an end station port function, and may be traversed by traffic from multiple sources and to multiple destinations.
The occupancy of the CP buffer can change due to temporary differences between the overall arrival rate and the overall departure rate in bytes/second. When multiple priority values are supported, a separate CP buffer is typically provided for each priority value. A QCN CP may determine the strength of congestion by taking into account the CP buffer occupancy as well as its rate of change. A CP buffer can be a simple FIFO queue or, more generally, a portion of a Random Access Memory (RAM).
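The way a CP may combine buffer occupancy and its rate of change can be illustrated with a minimal sketch. It follows the commonly described QCN form in which the feedback is the negated sum of the offset from a target occupancy and a weighted occupancy change. The function name, the parameter names, the weight of 2 and the byte values are assumptions chosen for illustration only and are not taken from the text above.

```python
def congestion_feedback(q_len, q_old, q_eq, w=2.0):
    """Return a QCN-style feedback value; a negative value indicates congestion.

    q_len: current CP buffer occupancy in bytes
    q_old: occupancy at the previous sample in bytes
    q_eq:  assumed target (equilibrium) occupancy in bytes
    w:     assumed weight given to the rate of change
    """
    q_off = q_len - q_eq      # how far the buffer is above the target
    q_delta = q_len - q_old   # how fast the buffer is growing
    return -(q_off + w * q_delta)


# Buffer is 30 kB above target and grew by 20 kB since the last sample.
fb = congestion_feedback(q_len=90_000, q_old=70_000, q_eq=60_000)
```

In this sketch a CP would form a CNM only when the returned value is negative, with the magnitude indicating the strength of congestion.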
For limiting the transmission rate of frames for one or more congestion-controlled flows in response to receiving CNMs, a Reaction Point (RP) may be used as an end-station port function. RPs may be provided by a CEE-compliant Network Interface Card (NIC) or a Converged Network Adapter (CNA), which is a CEE-compliant NIC that provides additional higher-layer functionality. An RP controls the transmission rate of frames for one or more congestion-controlled flows by applying a rate limit that may be updated dynamically. The rate limit is reduced multiplicatively in response to receiving CNMs from congestion points and increased additively when a number of frames have been transmitted without receiving further CNMs, or when a self-increase timer has elapsed.
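The multiplicative decrease and additive increase of the rate limit can be sketched as follows. This is a simplified illustration, not the QCN reaction point algorithm itself: the class name, the fixed decrease factor, the increase step and the frame count are all assumptions, and an actual reaction point may scale the decrease by the feedback carried in the CNM.

```python
from dataclasses import dataclass


@dataclass
class ReactionPoint:
    rate_bps: float                  # current rate limit
    max_rate_bps: float              # upper bound, e.g. the link rate
    min_rate_bps: float = 1e6        # assumed lower bound
    decrease_factor: float = 0.5     # assumed multiplicative factor
    increase_step_bps: float = 5e6   # assumed additive step
    frames_per_increase: int = 100   # assumed frame count between increases
    frames_since_cnm: int = 0

    def on_cnm(self):
        """Reduce the rate limit multiplicatively when a CNM arrives."""
        self.rate_bps = max(self.min_rate_bps,
                            self.rate_bps * self.decrease_factor)
        self.frames_since_cnm = 0

    def on_frame_sent(self):
        """Count frames; increase once enough were sent without a CNM."""
        self.frames_since_cnm += 1
        if self.frames_since_cnm >= self.frames_per_increase:
            self._increase()

    def on_timer_expired(self):
        """Increase when the self-increase timer elapses."""
        self._increase()

    def _increase(self):
        # Additive increase, capped at the maximum rate.
        self.rate_bps = min(self.max_rate_bps,
                            self.rate_bps + self.increase_step_bps)
        self.frames_since_cnm = 0
```

For example, a reaction point starting at 10 Gbit/s would drop to 5 Gbit/s on the first CNM under these assumed parameters and then recover in 5 Mbit/s steps.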
CEE switches and end-stations may employ priority-based flow control at their receive queues for lossless operation. Priority-based Flow Control (PFC) may provide an independent flow control for the priority values and their associated receive queues. It prevents frame loss in receive queues due to lack of space by sending PFC pause frames to the upstream sender when one priority queue or multiple priority queues reach a high-water threshold, and by sending PFC unpause frames when the queues are sufficiently drained to reach a low-water threshold. The receive queues may be protected by both PFC and QCN. In such a case, the receive queues are also QCN congestion points as described above. When a PFC pause frame, i.e., a PFC message with a positive pause duration, reaches a CEE-compliant NIC (also referred to as “NIC” in the following), the NIC pauses transmission for priorities that are selected by the PFC pause frame and for a duration specified in the PFC pause frame.
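The high-water and low-water behavior of a PFC-protected receive queue can be sketched as a simple hysteresis. The class name, the method name and the returned strings are hypothetical; an actual implementation would emit PFC frames for the affected priority rather than return a label.

```python
class PfcReceiveQueue:
    """Hysteresis model of one priority's receive queue (illustrative only)."""

    def __init__(self, high_water, low_water):
        if low_water >= high_water:
            raise ValueError("low_water must be below high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.paused = False  # True while the upstream sender is paused

    def update(self, occupancy):
        """Return 'pause', 'unpause', or None for the new occupancy."""
        if not self.paused and occupancy >= self.high_water:
            self.paused = True
            return "pause"      # ask the upstream sender to stop
        if self.paused and occupancy <= self.low_water:
            self.paused = False
            return "unpause"    # queue has drained sufficiently
        return None             # no state change between the thresholds
```

The gap between the two thresholds prevents the queue from toggling between pause and unpause on every small change in occupancy.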
In accordance with QCN, PFC and Enhanced Transmission Selection (ETS), a NIC transmitter typically uses a hierarchical scheduler. The hierarchical scheduler may comprise a QCN scheduler stage for scheduling flows according to QCN rate limits, and a PFC/ETS scheduler stage for scheduling priorities.
The QCN scheduler stage may have a flow scheduler for each priority and for each port. The respective flow scheduler may select frames for transmission from rate-limited flow queues by taking into account the earliest next departure time of each rate-limited flow according to its current rate limit. Then, a transmission selection function may select frames for transmission from different priorities or traffic classes, taking into account the pause state of each priority as provided, for example, by PFC. Moreover, the transmission selection function may take into account any scheduling constraints, such as strict priority scheduling or bandwidth allocation imposed on priorities or traffic classes.
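The two scheduler stages described above can be sketched as follows. The sketch assumes that every flow is always backlogged, that strict priority is used between priorities with a higher number meaning higher priority, and that rates are given in bytes per second; all class, function and parameter names are hypothetical.

```python
import heapq


class FlowScheduler:
    """Per-priority scheduler over rate-limited flows (illustrative only)."""

    def __init__(self):
        self._heap = []    # entries are (next_departure_time, flow_id)
        self._rates = {}   # flow_id -> rate limit in bytes per second

    def add_flow(self, flow_id, rate_bytes_per_s, now=0.0):
        self._rates[flow_id] = rate_bytes_per_s
        heapq.heappush(self._heap, (now, flow_id))

    def eligible(self, now):
        """True if the earliest next departure time has been reached."""
        return bool(self._heap) and self._heap[0][0] <= now

    def send(self, now, frame_bytes):
        """Pick the flow with the earliest departure time and reschedule it."""
        _, flow_id = heapq.heappop(self._heap)
        next_time = now + frame_bytes / self._rates[flow_id]
        heapq.heappush(self._heap, (next_time, flow_id))
        return flow_id


def select_priority(schedulers, paused, now):
    """Return the highest unpaused priority that has an eligible flow."""
    for prio in sorted(schedulers, reverse=True):
        if prio not in paused and schedulers[prio].eligible(now):
            return prio
    return None
```

Note that in this sketch a paused priority is skipped entirely, so its flows send nothing during the pause even though their per-flow departure times may already have been reached.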
An interoperability problem between QCN and PFC may arise, because the rate limits provided for the congestion-controlled flows of a priority are further reduced by inserting transmission pauses for that priority. If QCN establishes the rate limits in a phase with considerable pause activity, then the effective rate limits are actually lower due to the insertion of transmission pauses. For example, if a transmission link priority is used by multiple sources with an oversubscription ratio N greater than 1, then transmission pauses have to be inserted for the priority with a pause on/off pattern, which effectively activates the priority during a fraction 1/N of the time.
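The effect of the 1/N active fraction on a per-flow rate limit can be illustrated with a small calculation. The function name and the example values are assumptions, and the sketch treats the flow as able to send at its full rate limit whenever the priority is active.

```python
def effective_rate(rate_limit_bps, oversubscription_ratio):
    """Rate actually achieved when the priority is active 1/N of the time."""
    if oversubscription_ratio <= 1:
        return rate_limit_bps  # no pauses needed, limit applies unchanged
    active_fraction = 1.0 / oversubscription_ratio
    return rate_limit_bps * active_fraction


# A 2 Gbit/s rate limit under an oversubscription ratio of 4.
paused_regime = effective_rate(2e9, 4)
pause_free_regime = effective_rate(2e9, 1)
```

Under these assumptions the flow achieves 0.5 Gbit/s while pauses are active and 2 Gbit/s once they stop, so the effective rate changes by a factor of N although the rate limit itself is unchanged.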
When QCN has throttled the multiple sources sufficiently, the priority transitions from a PFC-dominated regime to a PFC-free regime. In the case of many synchronized sources with a high oversubscription ratio N, the PFC-dominated regime has a correspondingly longer duration. One problem may be that the transition of a priority to the PFC-free regime can occur within a short amount of time, rapidly increasing the effective rate limits for all rate-limited as well as for all non-rate-limited flows of the priority. This may result in further PFC pauses on the adjacent link, may aggravate congestion at downstream congestion points, and may force the downstream congestion points to send yet more CNMs. In a system with many congestion points, this interaction between PFC and QCN may reduce system stability and may lengthen the duration of PFC-dominated regimes.
A flow control scheme such as PFC pause is necessary for lossless operation, but is known to introduce side effects such as head-of-line (HOL) blocking and delay jitter. Therefore, in a system providing both QCN and PFC, it may be important to control congestion as much as possible using QCN rate limiting, and to apply PFC pause only as a last resort to avoid frame loss.
The hierarchical scheduling that is necessary for a CEE-compliant NIC transmitter supporting QCN reaction points as well as PFC and ETS may introduce an undesirable coupling between the effective fine-grained per-flow QCN rate limits and the coarse-grained per-priority pause and inter-priority scheduling activities as introduced, for example, by PFC and ETS, respectively. As a result, QCN rate limits of rate-limited flows may become incorrect or inaccurate when pause activity starts or stops gating transmission of an entire priority, or when inter-priority scheduling changes the bandwidth available to a priority.
Accordingly, it is an aspect of the present invention to provide a solution for reducing the undesirable coupling between the effective fine-grained per-flow rate limits and the coarse-grained per-priority pause and inter-priority scheduling activities.