A. Related Work
The TCP incast problem was first reported by D. Nagle et al. in the design of a scalable storage architecture. They found that concurrent traffic between a client and many storage devices overwhelms the network as the number of storage devices increases. This results in multiple packet losses and timeouts, forcing the client to sit idle for a long RTO duration. To mitigate incast congestion, they reduced the client's receive socket buffer size to under 64 kB. They also suggested TCP-level tuning, such as reducing the duplicate ACK threshold and disabling slow start, to avoid retransmission timeouts. However, these measures do not address the fundamental incast problem.
Two main approaches that address the incast problem have been proposed. The first approach reduces the RTOmin from millisecond to microsecond granularity. This solution is very effective for cluster-based storage systems, where the main goal is to improve TCP throughput. Nonetheless, it is not adequate for soft real-time applications such as web search because it still induces high queuing delay. The second approach is to employ congestion avoidance before the buffer overflows. RTT is usually a good congestion indicator in a wide area network, so a delay-based congestion avoidance algorithm such as TCP Vegas may be a good candidate. However, it is well known that the microsecond-granularity RTT in data centers may be too sensitive to distinguish network congestion from delay spikes caused by packet forwarding and processing overhead. Therefore, DCTCP uses Explicit Congestion Notification (ECN) to detect network congestion explicitly, and provides fine-grained congestion window control based on the number of ECN marks. Another approach is ICTCP, which measures the bandwidth of the total incoming traffic to obtain the available bandwidth, and then controls the receive window of each connection based on this information. In these approaches, however, incast congestion is inevitable as the number of workers increases.
B. Limitation of Window-Based Congestion Control
FIG. 1 depicts a typical topology where the incast congestion occurs. To avoid such incast congestion, the total number of outstanding packets should not exceed the network pipe size, which is obtained from the Bandwidth Delay Product (BDP).
This is expressed as:
[Math Figure 1]
\[
\mathrm{BDP} = \text{Link capacity} \times \mathrm{RTT} \geq \sum_{i=1}^{n} w_i \times \mathrm{MSS} \tag{1}
\]
where MSS denotes the Maximum Segment Size, n is the total number of concurrent connections, and w_i is the window size of the i-th connection. The BDP can be extremely small in data center networks. For example, if a network path has 1 Gbps of link capacity and 100 µs of delay, then the BDP is 12.5 kB, or approximately 8.3 MSS when the MSS is 1.5 kB. This implies that \sum_{i=1}^{n} w_i should be less than 8.3 to avoid pipe overflow. In this case, the path can sustain at most 8 TCP connections if the minimal window size is one. In other words, 9 or more TCP connections may cause queuing delay and packet loss if all senders transmit at least one packet simultaneously. For this reason, the existing window-based control schemes do not scale in typical data center network applications that utilize a large number of workers. This insight leads to a rate-based control approach based on BDP measurements for data center environments.
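The arithmetic behind Math Figure 1 can be checked with a short sketch. The function names below (bdp_bytes, max_connections, fits_in_pipe) are hypothetical and chosen only for illustration; the sketch assumes the link capacity is given in bits per second, the RTT in seconds, and the MSS in bytes, with 1 kB taken as 1000 bytes as in the example above.

```python
import math


def bdp_bytes(link_capacity_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes (capacity in bit/s, RTT in seconds)."""
    return link_capacity_bps * rtt_s / 8.0


def max_connections(link_capacity_bps: float, rtt_s: float,
                    mss_bytes: int, min_window: int = 1) -> int:
    """Largest n such that n * min_window * MSS still fits in the BDP."""
    pipe_in_segments = bdp_bytes(link_capacity_bps, rtt_s) / mss_bytes
    return math.floor(pipe_in_segments / min_window)


def fits_in_pipe(windows: list[int], link_capacity_bps: float,
                 rtt_s: float, mss_bytes: int) -> bool:
    """Check the condition of Math Figure 1: sum(w_i) * MSS <= BDP."""
    return sum(windows) * mss_bytes <= bdp_bytes(link_capacity_bps, rtt_s)


if __name__ == "__main__":
    capacity = 1e9   # 1 Gbps link
    rtt = 100e-6     # 100 microseconds of delay
    mss = 1500       # 1.5 kB segments

    print("BDP (bytes):", bdp_bytes(capacity, rtt))
    print("BDP (segments):", bdp_bytes(capacity, rtt) / mss)
    print("Max connections at window 1:", max_connections(capacity, rtt, mss))
    print("8 senders fit:", fits_in_pipe([1] * 8, capacity, rtt, mss))
    print("9 senders fit:", fits_in_pipe([1] * 9, capacity, rtt, mss))
```

With these inputs the pipe holds roughly 8.3 segments, so eight senders with a window of one segment fit and a ninth does not. Because a window cannot shrink below one segment, this bound does not improve however the windows are adjusted.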