Large-scale and cloud datacenters are becoming increasingly popular. Small, medium, and large businesses alike are turning to these datacenters for their data storage needs, computational tasks, applications, and IT jobs. This helps them eliminate the expensive, and often very complex, task of building and maintaining their own infrastructure. A datacenter is typically architected with numerous interconnected storage devices, switches, and servers that may be shared across multiple users. Users access the datacenter over wide area networks that rely on IP-based protocols to transmit their data back and forth. As datacenters grow, so does the number of packets delivered across the networks and the need to keep their transmission reliable while maintaining application throughput.
A common problem affecting datacenter networks is packet loss and reduced application throughput due to incast collapse. Incast collapse occurs when multiple servers simultaneously send data to a destination server such that the number of packets sent is larger than the available buffer space at the network switch to which the destination server is connected. The highly bursty traffic of many simultaneously arriving packets overflows the switch buffers in a short period of time, causing intense packet losses and thus leading to timeouts. Incast collapse tends to afflict applications (e.g., search and data storage) that follow a “partition-aggregate” model: a single server (“S”) processing a request sends sub-requests to a large number (“N”) of other servers in parallel (“partition”), then waits for their answers before giving its own response (“aggregate”).
The incast collapse problem arises because the answers being aggregated are sent as network packets by all N servers at about the same time, i.e., they are “synchronized”. The server S is connected to the datacenter network via an edge switch “E”, and so these N packets (or, more generally, N*M packets, where M is usually small) all arrive at E at the same time. As most datacenter networks employ inexpensive edge switches with relatively limited buffering, the number of simultaneously arriving packets for S may be larger than the available buffer space at E. The result is that some packets are dropped, which can lead to excessive TCP timeouts, thereby causing serious violations of throughput and latency targets for these applications.
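The overflow condition described above can be illustrated with a minimal sketch. It assumes, for simplicity, that the whole synchronized burst of N*M packets reaches the port of E facing S before any packet is drained. The function name, the parameter names, and the buffer size of 64 packets are hypothetical and chosen only for illustration.

```python
def incast_drops(num_senders, packets_per_sender, buffer_packets):
    """Return (queued, dropped) packet counts for one synchronized burst.

    Simplifying assumption: all N*M packets arrive at the edge switch
    port before the port forwards any of them, so only `buffer_packets`
    of them can be held and the remainder is dropped.
    """
    arriving = num_senders * packets_per_sender   # N * M
    queued = min(arriving, buffer_packets)        # what the buffer can hold
    dropped = arriving - queued                   # overflow is lost
    return queued, dropped


if __name__ == "__main__":
    # Hypothetical buffer of 64 packets, M = 2 packets per sender.
    for n in (8, 32, 128):
        queued, dropped = incast_drops(n, packets_per_sender=2, buffer_packets=64)
        print(f"N={n:3d}: queued={queued}, dropped={dropped}")
```

Under these assumed values, the burst fits in the buffer for N = 8 and N = 32, while N = 128 overflows it and 192 of the 256 packets are dropped. A real switch drains its queue while packets arrive and may share buffer space across ports, so the sketch only shows the direction of the effect.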
Since datacenter applications usually rely on TCP to provide reliable, congestion-controlled transport, the effect of a packet loss is that TCP must retransmit the lost packet. In many circumstances, TCP relies on timeouts to resend lost packets; traditional TCP timeouts are no shorter than a few hundred milliseconds. These timeouts create large application-level latencies and reduced application throughput. With datacenters often requiring overall application response times of a few hundred milliseconds for 99.99% or more of the requests, packet loss due to incast collapse can have a significant impact on datacenter network performance.
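The latency effect can be sketched as follows. Because the aggregating server waits for every answer, a single timeout-driven retransmission delays the whole response. The function and parameter names are hypothetical, and the timeout of 200 milliseconds is an assumed illustrative value, not a figure taken from any particular TCP implementation.

```python
def aggregate_response_ms(worker_latencies_ms, dropped_workers, rto_ms=200.0):
    """Completion time of an aggregate that waits for every worker's answer.

    worker_latencies_ms: loss-free answer latency of each worker, in ms.
    dropped_workers: indices of workers whose answer was dropped once and
        recovered only after a retransmission timeout.
    rto_ms: assumed retransmission timeout (illustrative value).
    """
    if not worker_latencies_ms:
        return 0.0
    # The slowest answer determines when the aggregate can be produced.
    return max(
        latency + (rto_ms if index in dropped_workers else 0.0)
        for index, latency in enumerate(worker_latencies_ms)
    )


if __name__ == "__main__":
    latencies = [1.0, 2.0, 1.5]
    print(aggregate_response_ms(latencies, dropped_workers=set()))  # 2.0
    print(aggregate_response_ms(latencies, dropped_workers={2}))    # 201.5
```

In this example one dropped answer raises the response time from 2.0 to 201.5 milliseconds, which by itself consumes most of a response-time budget of a few hundred milliseconds.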