Communication in a computer network involves the exchange of data between two or more entities interconnected by communication links. These entities are typically software programs executing on computer platforms, such as end nodes and intermediate nodes. An example of an intermediate node may be a router or switch which interconnects the communication links to enable transmission of data between the end nodes, such as a server having processor, memory and input/output (I/O) storage resources.
Communication software executing on the end nodes correlates and manages data communication with other nodes. The nodes typically communicate by exchanging discrete packets or frames of data according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP). In this context, a protocol consists of a set of rules defining how the nodes interact with each other. In addition, network software executing on the intermediate nodes allows expansion of communication to other end nodes. Collectively, these entities comprise a communications network and their interconnections are defined by an underlying architecture.
The InfiniBand architecture is an I/O specification that defines a point-to-point, “switched fabric” technology used to, among other things, increase the aggregate data rate between processor and/or storage resources of a server or set of servers. The switched fabric generally comprises multiple point-to-point links that cooperate to provide a high-speed interconnect that may also be used to link individual servers into clusters to increase availability and scalability. The switched fabric technology may be embodied in an InfiniBand switch (hereinafter “IB switch”) configured to receive data traffic (packets) from one or more input ports and forward that traffic to one or more output ports. A forwarding decision, i.e., the decision to switch a packet received at an input port to an output port, is rendered on an address contained in a predetermined field of the packet.
Regulation of data traffic over a communications network having finite resources is known as flow control. These resources may be measured in capacity, speed or any other parameter that can be quantified. A need for flow control arises whenever there is a constraint on the communication rate between two nodes due to a limited capacity of bandwidth or processing resources. At that time, a flow control scheme is required to prevent congestion and provide a high percentage of network utilization. Congestion occurs when two or more flows reach a common “bottleneck” point in the network that cannot support the total aggregate of the flows beyond that point. When that happens, the flows must be “throttled” down to a level that can be supported by the bottleneck point. Flow control is then used to communicate to the source the share of the bottleneck resource that is available for that source.
A simple conventional flow control scheme involves a destination end node (destination) sending a signal to a source end node (source) to essentially stop the source from transmitting its flow of data traffic over a link. This flow control scheme, referred to as link-level on/off flow control, involves the setting of a threshold level on a receive buffer at the destination. The destination generates a feedback flow control signal (e.g., an ON/OFF signal) that instructs the source to stop transmission of the data flow over the link when the threshold level is exceeded. Here, link level refers to a physical link between the source and destination nodes that, in this context, may further include switches. End-to-end control in this technique is achieved through a series of “hop-by-hop” link level flow controlled links acting in concert to control the flow of data from a primary source to an ultimate destination.
For correct operation, the simple link level flow control scheme requires that the depth of the receive buffer be equal to or exceed two round trip times (RTT) of the link. For example, assume the threshold on the buffer is set at one RTT. When the signal to stop occurs there must be one RTT of buffer remaining to capture data of the flow that is “in flight”, i.e., the data traversing the link during the time it takes to send the stop signal to the source and drain the link once the source has stopped. Once the buffer at the destination empties to the one RTT level, a start signal can be sent to the source. Notably, there must be one RTT worth of data in the buffer to maintain the data flow until the start signal can reach the source and the source can fill the link to the receiver.
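The sizing rule described above can be sketched numerically. The following is a minimal illustration only; the function name onoff_buffer_bytes, the 2.5 Gb/s link rate and the 1 microsecond RTT are assumptions chosen for the example and are not taken from any specification.

```python
def onoff_buffer_bytes(link_rate_bps: int, rtt_ns: int) -> tuple[int, int]:
    """Return (threshold_bytes, total_buffer_bytes) for link-level on/off flow control.

    Hypothetical helper: the threshold sits at one RTT of data and the
    buffer holds two RTTs of data, as the sizing rule requires.
    """
    # Bytes the link carries during one round trip time, rounded up.
    bits_in_flight = link_rate_bps * rtt_ns
    rtt_bytes = -(-bits_in_flight // (8 * 10**9))  # integer ceiling division
    threshold = rtt_bytes   # send "stop" once occupancy exceeds one RTT of data
    total = 2 * rtt_bytes   # a second RTT of room absorbs the data still in flight
    return threshold, total


# Assumed example values: a 2.5 Gb/s link with a 1 microsecond RTT.
threshold, total = onoff_buffer_bytes(2_500_000_000, 1000)
print(threshold, total)  # prints: 313 626
```

Under these assumed values the receive buffer needs only a few hundred bytes, which is consistent with the observation that short links require very little buffering.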
IB switches typically utilize flow control with very little buffering because the RTT for a link is typically very small. For example, the buffering in the switch is sized to a depth sufficient to accommodate the RTT delay of the link plus at least one packet because full packets are sent between the nodes. Thus, the size of the IB receive buffer must be sufficient to hold two maximum size packets because flow control information can only be sent on the return path between packets. This depth is needed to ensure that data is not lost, while maintaining full rate transmission over the link. For IB switches and a 2 kilobyte (KB) maximum transfer unit (MTU), the buffering needed is only 4 KB which is more than sufficient for RTTs of typically expected lengths of the links.
IB switches utilize a more sophisticated variant of on/off flow control referred to as credit-based flow control. Rather than utilizing a simple ON/OFF flow control signal, the switch uses a credit-based system. According to this scheme, the destination sends a message to the source indicating an amount of buffering (X) extended to the source for its transmission. That is, the destination extends “credits” (buffers) to the source for a data flow and it then reserves those buffers for that flow. The information contained in the message reflects the ability of the network to deliver data based on the amount of data that the receiving end (destination) can forward. Yet, that information always “lags” current network conditions because of the time it takes to generate and deliver the message. If the extended buffers are not sufficient to accommodate the RTT, this scheme still works because by allocating an exact amount of buffer space, the source does not send more data than it has been credited (hence, a credit-based scheme). In contrast, if the buffers in an on/off flow control scheme are not sufficient to cover the RTT, then it is possible to lose data due to buffer overrun.
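The credit mechanism described above may be illustrated with a small model. This is a sketch under simplifying assumptions, not an implementation of the InfiniBand protocol: the class names Receiver and Sender are hypothetical, one credit is taken to equal one packet buffer, and propagation delay is ignored.

```python
from collections import deque


class Receiver:
    """Destination with a fixed number of packet buffers (hypothetical model)."""

    def __init__(self, buffers: int):
        self.capacity = buffers
        self.queue = deque()

    def initial_credits(self) -> int:
        # Every buffer is empty at start, so all may be extended as credits.
        return self.capacity

    def accept(self, packet) -> None:
        # A sender that respects its credits can never trigger this check.
        if len(self.queue) >= self.capacity:
            raise OverflowError("buffer overrun")
        self.queue.append(packet)

    def forward_one(self) -> int:
        """Forward one buffered packet downstream; return the credits freed."""
        if not self.queue:
            return 0
        self.queue.popleft()
        return 1


class Sender:
    """Source that transmits only while it holds credits (hypothetical model)."""

    def __init__(self):
        self.credits = 0

    def grant(self, count: int) -> None:
        self.credits += count

    def try_send(self, packet, receiver: Receiver) -> bool:
        if self.credits == 0:
            return False          # no credit: hold the packet, never overrun
        self.credits -= 1
        receiver.accept(packet)
        return True


def demo():
    rx = Receiver(buffers=2)
    tx = Sender()
    tx.grant(rx.initial_credits())
    first = [tx.try_send(i, rx) for i in range(4)]  # only two credits exist
    tx.grant(rx.forward_one())                      # one buffer freed, one credit back
    second = tx.try_send(4, rx)
    return first, second


print(demo())  # prints: ([True, True, False, False], True)
```

In this model the source stalls when its credits are exhausted rather than overrunning the buffer, which is the property that distinguishes the credit-based scheme from on/off flow control with an undersized buffer.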
A problem arises when the link-by-link flow control scheme is used in connection with a fair allocation bandwidth policy implemented by the switches of a network. FIG. 1 is a schematic block diagram of a conventional communications network 100 having a plurality of switches interconnected by point-to-point links. A source end node (S1) is connected to a first switch (SW1) and a plurality of source end nodes (S2, S3) is coupled to a second switch (SW2). In addition, there is a plurality of destination end nodes (D1, D2) coupled to a third switch (SW3). Assume that S1 and S2 send data to D1, while S3 sends data to D2.
As noted, the switches implement a type of fair allocation “arbitration” (e.g., round robin) of bandwidth for data flows received over the links that are destined for, e.g., D1 and D2. Such a policy ensures an even distribution of link bandwidth among each data flow. Assume further that there is 1× worth of bandwidth available over links L1 and L5, but 4× worth of available bandwidth over links L2 and L4. Since S1 and S2 are sending data to D1 (and ultimately over L1), ½× bandwidth of L2 is allocated to S1's data flow and ½× bandwidth of L2 is allocated to S2's data flow. Similarly, ½× bandwidth of L3 is allocated to S1's data flow and ½× bandwidth of L4 is allocated to S2's data flow.
Assume now S3 transfers data to D2. It would be desirable to allocate 1× bandwidth over each link coupling S3 to D2 so as to optimize that data flow. However, this is not possible even though L2 and L4 can easily accommodate such bandwidth. This is because the flow control scheme limits the bandwidth according to the fair arbitration policy operating on the switches. That policy fairly allocates ½× of L2 to S1 and S2, and proceeds to allocate the same bandwidth (½×) to S3's data flow. That is, notwithstanding an attempt by S3 to transmit 1× bandwidth over the network of links, the link-level flow control limits that flow to ½×. This is an example of a classic “parking lot” problem where local fairness does not lead to global fairness.
The parking lot problem is easily illustrated as a series of points in a parking lot where cars in different rows of the lot attempt to gain access onto a single exit roadway that runs through the parking lot to an exit. If at each point where the cars merge the drivers allow each other to alternate access to the exit road, the “fair” behavior of the drivers penalizes a driver at the back of the parking lot because that driver is allotted much less access to the exit road than a driver at the point closest to the exit.
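The arithmetic behind the parking lot analogy can be worked through as follows. This is an illustrative sketch only; the function name parking_lot_shares is hypothetical, and it assumes that every row always has a car waiting and that each merge point alternates strictly between its own row and the traffic arriving from farther back.

```python
from fractions import Fraction


def parking_lot_shares(rows: int) -> list[Fraction]:
    """Share of the exit road obtained by each row; index 0 is nearest the exit."""
    if rows < 1:
        raise ValueError("at least one row is required")
    shares = []
    remaining = Fraction(1)  # fraction of the exit road still to be divided
    for _ in range(rows - 1):
        # Local fairness: the row at this merge point takes half of what
        # reaches it, and all rows farther back share the other half.
        remaining /= 2
        shares.append(remaining)
    shares.append(remaining)  # the last row receives whatever is left
    return shares


shares = parking_lot_shares(4)
print([str(s) for s in shares])  # prints: ['1/2', '1/4', '1/8', '1/8']
```

With four rows, a globally fair division would give each row one quarter of the exit road, yet the rows at the back receive only one eighth each while the row nearest the exit receives one half.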
Congestion points in an IP network are typically identified within IP switches by monitoring the average buffer (queue) length and either dropping or marking packets. This works because the data that cannot be sent through the bottleneck point will necessarily build up in the switch buffers. Since the IB network switches have little buffering and link-by-link flow control, those switches are not designed to use the buffers to store data during a contention period; moreover, the switches are designed to specifically not drop any data and to stop the incoming data through link-by-link flow control. As a result, the buffering fills and empties too quickly for an average occupancy to be meaningful as a way to indicate congestion, and dropping packets is not allowed as a way to provide feedback to the source. In this type of a network, the links are subject to congestion spreading effects if the end nodes do not reduce their outputs to an amount sustainable through a bottleneck rate of the network.
One way to solve congestion spreading is to separate flow control feedback by specific source. This is particularly useful within, e.g., an asynchronous transfer mode (ATM) switch, where there may be many virtual circuits (VC). A VC path is created having a specific identifier and the flow control information is sent per VC. This enables, e.g., S1 to transmit its flow at a rate that is different from S2. Flow control is thus not “tied” to arbitration and does not limit all flows.
Another solution is to allow short-term congestion spreading in the presence of “long-term” flows, but to use a longer response time, end-to-end flow control mechanism to adjust the long-term flows to a rate that the network can sustain without exhausting the short-term resources. A long-term flow is a flow that lasts much longer than an end-to-end RTT through the network, e.g., multiple round trip times. That is, long term is measured by a flow that lasts long enough to allow control by feedback from the ultimate destination end node, similar to TCP. Control of a long-term data flow can be subject to a closed-loop control system, but the response time of the loop must be many round trip times. Such a system must consider a control loop time equal to the many RTTs plus the processing time Ptime at both nodes. The data to be sent must take a time to send that is also substantially greater than the RTT and Ptime or there will be nothing to control.
Congestion arises when a flow of transmitted data meets another flow at a point that both must share such that data “backs up” the receive buffer at this point, requiring flow control back to the source. In this case, an end-to-end flow control scheme may be used to “throttle” the flow to a rate that is supported by the network. However, the network-supported rate may change over time. The present invention is directed to controlling “long-term” flows such that, if these flows last long enough, information pertaining to the supported rate is substantially correct (current) and useful in controlling the source.
Known schemes for end-to-end congestion management generally rely on network feedback in the form of dropped packets or marked packets that would have been dropped at network switches. For these schemes it is assumed that the network switches have substantial buffering and that the switches can measure the average utilization of those buffers. Thereafter, when the average exceeds a threshold, packets are randomly either dropped or marked, or eventually the buffers fill and overflow, resulting in lost packets.
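The averaging step on which such schemes depend may be sketched as follows. This is a simplified illustration rather than any particular published algorithm: the class name AverageQueueMonitor, the exponentially weighted average, and the weight and threshold values are all assumptions, and the random drop or mark decision is reduced to a flag indicating that arriving packets have become candidates for dropping or marking.

```python
class AverageQueueMonitor:
    """Tracks a smoothed queue length and flags congestion (hypothetical model)."""

    def __init__(self, weight: float = 0.25, threshold: float = 4.0):
        self.weight = weight        # how strongly each new sample moves the average
        self.threshold = threshold  # average occupancy above which congestion is flagged
        self.average = 0.0

    def sample(self, queue_length: int) -> bool:
        """Fold in one queue-length sample; return True while the average is over threshold."""
        # Exponentially weighted moving average of the instantaneous queue length.
        self.average += self.weight * (queue_length - self.average)
        return self.average > self.threshold


monitor = AverageQueueMonitor(weight=0.25, threshold=4.0)
# A single burst of 12 packets followed by an empty queue, then a sustained backlog.
flags = [monitor.sample(q) for q in (12, 0, 12, 12, 12)]
print(flags)  # prints: [False, False, True, True, True]
```

The isolated burst does not trip the threshold, whereas the sustained backlog does; the approach therefore presupposes buffers deep enough for a backlog to persist across several samples.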
A problem with using such schemes in an IB communications network is the desired property of the network that it specifically avoid the loss of packets. With buffering only for delays associated with transmitting flow control information back to the source of a link and no packet loss, it is not feasible to use these prior art schemes in an IB switch to identify congestion and mark packets. The present invention is directed to a technique that reduces congestion and congestion spreading in the presence of long-term flows traversing a “lossless” communications network configured to avoid packet loss.
One known end-to-end flow control system utilizes a packet-pair, rate-based feedback flow control scheme disclosed in a paper titled, Packet-Pair Flow Control, by S. Keshav, IEEE/ACM Transactions on Networking, February 1995. Keshav discloses the use of packet pairs in a control loop to estimate a system state, e.g., by measuring throughput of the network, which estimation is critical in enabling the control scheme. A source node uses smoothed measurement of arrival times of acknowledgements of the packet pairs to adjust the throughput rate either up (increase) or down (decrease) in the TCP domain. This scheme, like the one disclosed herein, does not rely on feedback from network switches or routers to identify congestion on a data path. However, this scheme uses complex estimators and relies only on those estimators for setting specific transfer rates. Subsequent work showed practical difficulties in using an averaged inter-packet gap as a sole control variable.
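The basic packet-pair measurement may be illustrated in simplified form. The sketch below is not the estimator of the cited paper: it substitutes plain exponential smoothing for the estimators used there, and the function name bottleneck_rate_estimates, the smoothing factor alpha and the sample values are assumptions made for the example. It relies only on the idea that two packets sent back to back are spread apart by the slowest point on the path, so that the spacing of their acknowledgements reflects the rate available at that point.

```python
def bottleneck_rate_estimates(ack_gaps_s, packet_bytes, alpha=0.5):
    """Return a running rate estimate in bits per second, one per measured gap.

    ack_gaps_s   -- spacing, in seconds, between the two acknowledgements of each pair
    packet_bytes -- size of each packet in a pair
    alpha        -- weight given to the newest gap (assumed smoothing factor)
    """
    estimates = []
    smoothed_gap = None
    for gap in ack_gaps_s:
        if smoothed_gap is None:
            smoothed_gap = gap  # the first measurement seeds the average
        else:
            smoothed_gap = alpha * gap + (1 - alpha) * smoothed_gap
        # One packet is served per gap at the bottleneck, so rate = size / gap.
        estimates.append(packet_bytes * 8 / smoothed_gap)
    return estimates


# Assumed example: 1000-byte packets whose acknowledgement spacing grows from 1 ms to 3 ms.
rates = bottleneck_rate_estimates([0.001, 0.003], packet_bytes=1000)
```

In this example the first estimate is approximately 8 Mb/s, and the second falls to approximately 4 Mb/s once the wider gap has been averaged in, indicating that the source should reduce its rate.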