Large enterprises today predominantly use virtualized data centers for their information technology (IT) infrastructure. Virtualization provides two advantages to the enterprise computing landscape. The first advantage is that virtualization can significantly improve efficiency, as physical machines have become considerably more powerful with the advent of multicore architectures with a large number of cores per physical CPU. Further, memory has become extremely cheap. For example, it is not uncommon to see hundreds of gigabytes of RAM available in commodity servers. Thus, one can consolidate a large number of virtual machines onto one physical machine. The second advantage is that virtualization provides significant control over the infrastructure. As computing resources become fungible, as in the cloud model, provisioning and management of the compute infrastructure become very easy. Thus, enterprise IT staff prefer virtualized clusters in data centers for their management advantages, in addition to the efficiency and better return on investment (ROI) that virtualization provides.
While virtualization is becoming widely adopted worldwide, modern operating systems and network protocols have historically not been designed with virtualization in mind. Therefore, traditional operating systems (OSs) have limitations that make them perform less efficiently in virtualized environments. As a layer of indirection, in the form of a hypervisor, is added to a physical server to abstract the CPU, memory and I/O resources, new types of performance bottlenecks are created that did not exist before, such as a reduction in network protocol throughput (e.g., Transmission Control Protocol over Internet Protocol (TCP/IP) throughput).
Virtual machines (VMs) are typically assigned virtual computing instances called vCPUs (or virtual CPUs). As virtualized servers become heavily consolidated in data centers, a large number of VMs share the available CPU resources, i.e., the available physical cores (physical CPUs or pCPUs). The ratio of the vCPUs allocated to all running VMs to the total available pCPUs is typically known as the overcommit ratio. The level of overcommit varies significantly across environments, but it is rarely close to 1. The main reason is that, in many virtualized environments, the average CPU utilization is quite low. For this reason, a high overcommit ratio is desirable to get the best ROI from the available compute resources.
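By way of illustration only, the overcommit ratio defined above can be computed as in the following minimal sketch. The function name `overcommit_ratio` and the host configuration (16 physical cores, 28 VMs) are hypothetical and are chosen purely for the example.

```python
def overcommit_ratio(vcpus_per_vm, physical_cores):
    """Total vCPUs allocated to all running VMs divided by available pCPUs."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical host: 16 physical cores running 20 VMs with 2 vCPUs each
# and 8 VMs with 4 vCPUs each (72 vCPUs in total).
vm_vcpus = [2] * 20 + [4] * 8
print(overcommit_ratio(vm_vcpus, 16))  # 4.5
```

A ratio of 1 would mean that every vCPU is backed by its own physical core; the hypothetical host above allocates 4.5 vCPUs per physical core.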
Unfortunately, server consolidation has a significant negative impact on the performance of transport protocols such as TCP. In virtualized data centers, there is often a lot of server-to-server traffic running over TCP. The network latencies (measured as the time it takes from one server's NIC to the other server's NIC) are typically on the order of a few microseconds. Hypervisors, such as VMware, have become extremely efficient at keeping the number of instructions executed to process an individual packet very small. Therefore, when packets arrive from the network while the VM is scheduled, they experience very little additional latency due to virtualization. The key problem, however, is that when a given VM is not scheduled, network data transfer for a given connection within that VM effectively stops, since TCP requires both ends to be active for data transfer to progress. Even when only one end is transmitting data to the other end, the other end must still respond with acknowledgements before the transmitting end can transmit more data.
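The stall described above can be illustrated with a deliberately simplified toy model. It is a sketch under stated assumptions, not a model of any real TCP stack or hypervisor scheduler: the sender transmits one window at a time and waits for its acknowledgement, and the receiving VM alternates between a scheduled interval (`on_us`) and a descheduled interval (`off_us`). The function name and all parameter values are hypothetical.

```python
def transfer_time_us(num_windows, rtt_us, on_us, off_us):
    """Time to deliver num_windows windows of data in a toy model.

    Assumptions (illustrative only): the receiver VM is scheduled for
    on_us, then descheduled for off_us, repeating from t = 0. Each window
    is acknowledged rtt_us after it is sent, unless the receiver is
    descheduled at that instant, in which case the acknowledgement waits
    until the receiver's next scheduled interval begins.
    """
    period = on_us + off_us
    t = 0
    for _ in range(num_windows):
        ack_time = t + rtt_us
        phase = ack_time % period
        if phase >= on_us:  # receiver is descheduled: the ACK is delayed
            ack_time += period - phase
        t = ack_time  # the sender may transmit the next window only now
    return t


# Hypothetical numbers: 100 us round trip, 10 ms scheduled, 20 ms descheduled.
dedicated = transfer_time_us(1000, 100, 10_000, 0)
contended = transfer_time_us(1000, 100, 10_000, 20_000)
print(dedicated, contended)  # 100000 300000
```

With these assumed values the transfer takes three times as long, even though the per-packet network latency is unchanged; the additional time is spent entirely waiting for the descheduled receiver.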
Empirical analysis has shown that traffic patterns in real enterprise clusters follow what is known as a power law distribution. Effectively, out of a given number of VMs, only a small number will actually generate traffic at any given time. Further, this power law also applies in the time domain. That is, a given VM will generate traffic every once in a while, and not all the time. Under these conditions, the VM transmitting or receiving the traffic does not use all of the available network resources if other compute-oriented VMs sharing the available CPU resources cause the network-intensive VMs to be scheduled in and out, thus degrading TCP performance significantly.
As servers become more consolidated, as occurs in environments such as the Virtual Desktop Infrastructure (VDI) space, the throughput degradation is even more significant. Since TCP is a bi-directional protocol, the TCP throughput degradation is observed in both directions, on the receive side and on the send side. The problem is even worse when a virtualized TCP sender is transmitting packets to a virtualized TCP receiver, since both ends are scheduled independently, which means that either end can be descheduled at a given time independently of the other. Since there is a much higher probability that their scheduling rounds are not aligned, the throughput degradation is roughly double that observed when only one of the ends is virtualized and contending for CPU resources.
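One way to reason about the "roughly double" figure is the following back-of-the-envelope sketch. It rests on an assumption that is not part of the foregoing description: each virtualized end is scheduled an independent fraction `p_scheduled` of the time, and data moves only while every virtualized end is scheduled. The function name and the sample values are hypothetical.

```python
def throughput_loss(p_scheduled, virtualized_ends):
    """Fraction of time the connection cannot make progress.

    Assumption (illustrative only): each virtualized end is scheduled
    independently with probability p_scheduled, and the connection
    progresses only while all virtualized ends are scheduled.
    """
    if not 0.0 <= p_scheduled <= 1.0:
        raise ValueError("p_scheduled must be between 0 and 1")
    if virtualized_ends not in (1, 2):
        raise ValueError("virtualized_ends must be 1 or 2")
    return 1.0 - p_scheduled ** virtualized_ends


for p in (0.9, 0.7, 0.5):
    one_end = throughput_loss(p, 1)
    both_ends = throughput_loss(p, 2)
    print(f"p={p}: one end {one_end:.2f}, both ends {both_ends:.2f}, "
          f"ratio {both_ends / one_end:.2f}")
```

Under this simplified model the ratio of the two losses is 1 + p, so the degradation approaches double when each VM is scheduled most of the time and is somewhat less than double under heavier contention.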
Various approaches to improve TCP processing in virtualized environments exist today. One approach is to keep the CPU overcommit ratio very low (close to 1). In this case, CPU contention does not arise and the problem does not manifest itself. The drawback of this approach is that the main benefit of virtualization, namely server consolidation, is largely lost.
A second approach is to have the VM offload the TCP processing to dedicated hardware referred to as a TCP Offload Engine (TOE). Since TOEs have dedicated hardware to offload the TCP processing, TCP processing can be performed even when the VM is not scheduled. Unfortunately, this approach requires specialized hardware that can be expensive and quite hard to change and reconfigure. Further, it may require proprietary drivers in the guest OSs, which may be difficult to deploy in many environments such as the cloud. For these and possibly other reasons, this approach has not proved particularly popular in today's commodity data center networks.
A third possible approach is to change the scheduler to favor network-bound VMs that transmit and receive data packets. Unfortunately, it is difficult to implement this third approach since there is always an inherent need to ensure fairness across different VMs that contend for CPU resources.
A fourth approach is to offload the protocol responsibilities of congestion control and acknowledgement generation to a hypervisor with the help of a specialized plugin. This is a less intrusive option since it does not fully terminate TCP connections; yet, because hypervisors are typically scheduled on dedicated CPU cores or are given higher priority, it can significantly boost the TCP performance of different VMs. This approach has been previously proposed in the following two academic papers: (1) vSnoop: Improving TCP Throughput in Virtualized Environments via Acknowledgement Offload, Ardalan Kangarlou, Sahan Gamage, Ramana Rao Kompella, Dongyan Xu, in the Proceedings of ACM Supercomputing, New Orleans, La., November 2010 and (2) Opportunistic Flooding to Improve TCP Transmit Performance in Virtualized Clouds, Sahan Gamage, Ardalan Kangarlou, Ramana Rao Kompella, Dongyan Xu, in the Proceedings of ACM Symposium on Cloud Computing, (SOCC 2011), Cascais, Portugal, October 2011.
However, the Xen hypervisor approach described in these two papers has various limitations. For example, on the receive path, vSnoop acknowledges packets only if there is room in a small buffer, called a “shared buffer”, located in the virtual NIC (vNIC) between the hypervisor and the guest OS. The vSnoop approach is thus dependent on the specific vNIC buffer of the Xen hypervisor, and is restricted by the design and implementation of that buffer. If there is no room in that buffer, vSnoop cannot acknowledge packets, since the packets may be lost. Further, in a realistic deployment scenario, accessing the buffer is both challenging and intrusive. Another limitation is on the transmit path. The particular implementation described in these papers uses a Xen hypervisor, which has a proprietary virtual device channel, called the Xen device channel, that is used to coordinate between the TCP stack in the guest and the vFlood module. This particular design requires intrusive changes to the hypervisor-guest interface boundary, which is not desirable.
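The receive-path constraint described above can be sketched as follows. This is an illustrative reading of that constraint only and is not code from vSnoop or Xen; the class `SharedBuffer`, the function `try_early_ack`, and the buffer capacity are hypothetical names and values introduced for the example.

```python
from collections import deque


class SharedBuffer:
    """Hypothetical stand-in for the small vNIC buffer shared between
    the hypervisor and the guest OS."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()

    def has_room(self):
        return len(self.packets) < self.capacity


def try_early_ack(shared_buffer, seq):
    """Return True if the hypervisor may acknowledge the packet on the
    guest's behalf.

    The packet is acknowledged early only when it can be placed in the
    shared buffer; otherwise it might be lost before the guest runs, so
    no acknowledgement is generated for it.
    """
    if shared_buffer.has_room():
        shared_buffer.packets.append(seq)
        return True
    return False


buf = SharedBuffer(capacity=2)  # assumed capacity, for illustration only
decisions = [try_early_ack(buf, seq) for seq in (1, 2, 3)]
print(decisions)  # [True, True, False]
```

Once the assumed two-slot buffer is full, the third packet is not acknowledged early, which is the sense in which this approach is bounded by the size and design of the vNIC buffer.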
Thus, what is needed is a system and method for improving TCP performance in virtualized environments that is both effective and practically deployable.