Computing in virtualized cloud environments has become a common practice for many businesses so as to reduce capital expenditures. Many hosting companies (which host the machines, such as servers, used by businesses) have found that the utilization of their machines is low.
In order to keep the machine utilization high, the hosting companies that maintain the host hardware typically attempt to keep just enough machines to serve the computing load and to allow multiple virtual machines to coexist on the same physical hardware. High central processing unit (CPU) utilization has been observed in many virtualized workloads.
The aforementioned trend means that the host machines are constantly occupied by applications from different businesses, and different contexts (processes or threads) are frequently executed on the same machine. Although this is ideal for achieving high utilization, the performance of guest applications suffers from frequent context switching (the process of storing and restoring the state of a process or thread so that execution can be resumed from the same point at a later time). The memory subsystem has to maintain consistency across the different contexts, and hence processors traditionally flushed caches and translation lookaside buffers (TLBs) on a context switch. A TLB stores recent translations of virtual memory addresses to physical memory addresses. Modern processors, however, adopt a more efficient approach in which each TLB entry contains an address space identifier (ASID). Tagging each entry with an ASID eliminates the need to flush the TLB upon a context switch, and when the swapped-out context returns, some of its previously cached entries will still be present. These optimizations worked well with traditional benchmarks, where the working set, or memory footprint, was manageable between context switches, but this no longer holds for emerging workloads. The memory footprint of emerging workloads is orders of magnitude larger than that of traditional workloads, and hence the capacity they require of TLBs and data caches is much larger. As a result, the cache and TLB contents of the previous context are frequently evicted from the capacity-constrained caches and TLBs, because the currently running application needs a larger amount of memory. Although some prior work optimizes context switches, very little of the literature addresses the context switch scenarios caused by the huge footprints of emerging workloads, which flood data caches and TLBs.
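The effect described above, where ASID tagging removes the flush but capacity pressure still evicts the previous context's entries, can be illustrated with a minimal sketch. It assumes a fully associative TLB with least-recently-used replacement; the class name TaggedTLB, the helper surviving_entries, and the sizes used are hypothetical and chosen only for illustration, not taken from any particular processor.

```python
from collections import OrderedDict


class TaggedTLB:
    """Toy fully associative, LRU-replaced TLB whose entries carry an ASID."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (asid, virtual page number) -> placeholder

    def access(self, asid, vpn):
        """Return True on a hit; on a miss, install the entry (evicting LRU)."""
        key = (asid, vpn)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used entry
        self.entries[key] = None  # the actual translation is irrelevant here
        return False

    def resident(self, asid):
        """Count entries currently held on behalf of one address space."""
        return sum(1 for (a, _) in self.entries if a == asid)


def surviving_entries(capacity, footprint_a, footprint_b):
    """Context A touches footprint_a pages, then context B touches footprint_b.

    No flush happens at the switch (entries are ASID-tagged); the return value
    is how many of A's entries are still resident when A would resume.
    """
    tlb = TaggedTLB(capacity)
    for vpn in range(footprint_a):
        tlb.access(asid=1, vpn=vpn)
    for vpn in range(footprint_b):
        tlb.access(asid=2, vpn=vpn)
    return tlb.resident(asid=1)


if __name__ == "__main__":
    print(surviving_entries(64, 16, 16))    # small footprints: A's entries survive
    print(surviving_entries(64, 16, 1024))  # large footprint: A's entries are evicted
```

With small footprints, both contexts fit and the returning context finds its entries intact; once the second context's footprint exceeds the remaining capacity, the tag alone no longer helps.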
The performance overhead of address translation in virtualized systems is considerable, as many TLB misses incur a full two-dimensional page walk. The page walk in a virtualized system begins with a guest virtual address (gVA) when an application makes a memory request. Because the guest and host systems keep their own page tables, the gVA has to be translated to a host physical address (hPA). First, the gVA is translated to the guest physical address (gPA), which is the host virtual address (hVA). The hVA is then translated to the hPA. This involves walking down a two-dimensional page table. Current x86-64 (the 64-bit version of the x86 instruction set) employs a 4-level page table, so the two-dimensional page walk may require up to 24 memory accesses. Making the situation worse, emerging architectures introduce a 5-level page table, which lengthens the page walk further. Also, even though the L1 and L2 TLBs keep growing, they are not large enough to handle the huge footprint of emerging applications, and expensive page walks are becoming frequent.
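The figure of 24 accesses follows from a simple count, sketched below. The sketch assumes a worst case with no page-walk caches or nested TLB hits; the function name nested_walk_accesses is hypothetical. Each guest page-table pointer is itself a gPA that needs a full host walk before the guest entry can be read, and the final gPA of the data needs one more host walk.

```python
def nested_walk_accesses(guest_levels: int, host_levels: int) -> int:
    """Worst-case memory accesses for a two-dimensional (nested) page walk.

    Assumes nothing is cached: every guest page-table pointer is a guest
    physical address that must first be translated by a full host walk.
    """
    # Per guest level: one host walk to locate the guest table, plus one
    # access to read the guest page-table entry itself.
    per_guest_level = host_levels + 1
    # After the last guest level, the resulting gPA of the data still needs
    # one final host walk to obtain the hPA.
    final_host_walk = host_levels
    return guest_levels * per_guest_level + final_host_walk


if __name__ == "__main__":
    print(nested_walk_accesses(4, 4))  # 4-level guest and host tables
    print(nested_walk_accesses(5, 5))  # 5-level guest and host tables
```

Under these assumptions the count is equivalent to (guest_levels + 1) * (host_levels + 1) - 1, which gives 24 for 4-level tables and 35 for 5-level tables in both dimensions.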
Context switches in virtualized workloads are expensive. Since both the guest and host processes share the hardware TLBs, context switches across virtual machines can impact performance severely by evicting a large fraction of the TLB entries held by processes executing on any one virtual machine.
Conventional page walkers as well as addressable large-capacity translation caches (such as the Oracle® SPARC® translation storage buffer) generate accesses that get cached in the data caches. In fact, these translation schemes rely on successful caching of translation (or intermediate page walk) entries in order to reduce the cost of page walks. There has also been some recent work that attempts to alleviate the address translation problem by implementing a very large L3 TLB that is a part of the addressable memory. The advantage of this scheme, titled “part of memory TLB (POM-TLB),” is that since the TLB is very large (several orders of magnitude larger than conventional on-chip TLBs), it has room to hold most required translations, and hence most page walks are eliminated. However, since the TLB request is serviced from the dynamic random access memory (DRAM), the latency suffers. The POM-TLB entries are cached in fast data caches to reduce the latency problem. However, all of the aforementioned caching schemes suffer from cache contention due to the additional load that the cached translation entries place on the data caches.
As L2 TLB miss rates go up, the number of translation-related accesses goes up proportionately, resulting in congestion in the data caches. Since a large number of TLB entries are stored in the data caches, the hit rate of ordinary data traffic is reduced. When these cache congestion effects are added on top of the cache thrashing due to context switching, which is common in modern virtualized systems, the resulting performance degradation is not negligible.