In this description and claims, a “working set” is abstractly defined to be a collection of instructions and/or associated data. For example, a working set could describe all instructions and data allocated by a particular process or thread, a particular data structure within a process's address space, or a thread's or process's most frequently accessed subset of instructions and data. A working set may belong, for example, to any of the following entities: a process, thread or fiber, or an application or service composed of multiple processes. In this description and claims, “entity” is defined as the container or owner of the working set.
As data and instructions are required by the processor of a computer, they are transferred from the main memory of the computer to the processor. The latency inherent in obtaining items from the main memory may be quite large. A cache is a memory that is smaller and accessed more quickly by the processor than the main memory. The processor cache may be located on the chip with the processor, on the processor socket or elsewhere. A page is the unit of memory that is used in main memory management and allocation. Each page is composed of several cache lines, which are the units used in cache memory management. A failed attempt by the processor to access an item in its cache, known as a “cache miss”, causes the item to be accessed from the main memory, which adds latency.
Applications running on the computer describe the location of data and instructions using virtual addresses that refer to a virtual address space. The operating system maps or translates the virtual addresses to corresponding physical addresses as needed. If the processor cache is a physically-indexed cache, a portion of the physical address, known as the cache index bits, is used to determine where in the cache the item will be placed. Therefore, translation or mapping of a virtual address to a physical address by the operating system implicitly selects the location in the cache where the page will be stored. Since the processor cache is smaller than the main memory and only a portion of the physical address is used as the cache index bits, several physical addresses will map to the same location in the cache.
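By way of illustration only, the mapping from a physical address to a cache location may be sketched as follows. The sketch assumes a direct-mapped, physically-indexed cache; the 64-byte cache line size, the 4 MB cache size and the function name cache_index are assumptions chosen for the example and are not taken from this description.

```python
LINE_SIZE = 64                    # assumed cache line size in bytes
CACHE_SIZE = 4 * 1024 * 1024      # assumed 4 MB direct-mapped cache
NUM_LINES = CACHE_SIZE // LINE_SIZE


def cache_index(phys_addr: int) -> int:
    """Return the cache line slot selected by the cache index bits.

    The low-order bits (offset within a line) are discarded, and the
    next bits select one of NUM_LINES slots. Higher-order bits are not
    used for indexing, so many physical addresses share a slot.
    """
    return (phys_addr // LINE_SIZE) % NUM_LINES


# Two physical addresses exactly one cache size apart collide.
addr_a = 0x1000
addr_b = 0x1000 + CACHE_SIZE
same_slot = cache_index(addr_a) == cache_index(addr_b)
```

In this sketch, addr_a and addr_b differ only in bits above the cache index bits, so same_slot is true and the two items compete for one cache location.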
Consider the example where two threads, thread 1 and thread 2, are simultaneously scheduled on two processor cores that share a cache. Assume that thread 1 accesses a relatively small working set, working set A, that fits within the cache, and that thread 2 accesses a much larger working set, working set B, that exceeds the size of the cache. Current page coloring algorithms may assign the virtual addresses of working set A to physical addresses having the same cache index bits as working set B. Also, processors typically use least recently used (LRU) cache replacement policies or similar policies, which strive to keep the most recently used data in the cache. Thus, if thread 2 accesses working set B faster than thread 1 accesses working set A, the processor will allocate the data of working set B into the shared cache and evict the data of working set A. As a result, thread 1 will encounter more cache misses than it would have if thread 2 were not scheduled simultaneously on the adjacent processor core. Thread 1 will therefore experience performance degradation.
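The eviction behavior in the above example may be illustrated with a small simulation. This is a simplified sketch only: the shared cache is modeled as a single LRU-managed pool of a few entries, and the capacity, working set sizes, access ratio and the function name misses_for_thread_1 are all assumptions made for the illustration.

```python
from collections import OrderedDict


def misses_for_thread_1(capacity, a_size, b_size, b_per_a, rounds):
    """Count thread 1's misses on working set A in a shared LRU cache.

    In each round, thread 1 touches one item of A, then thread 2
    touches b_per_a items of B (b_per_a == 0 models thread 1 alone).
    """
    cache = OrderedDict()  # keys ordered from least to most recently used

    def access(item):
        if item in cache:
            cache.move_to_end(item)    # hit: mark as most recently used
            return True
        if len(cache) >= capacity:
            cache.popitem(last=False)  # miss: evict least recently used
        cache[item] = True
        return False

    misses = 0
    b_next = 0
    for i in range(rounds):
        if not access(("A", i % a_size)):
            misses += 1
        for _ in range(b_per_a):
            access(("B", b_next % b_size))
            b_next += 1
    return misses


# Working set A (4 items) fits in the 8-entry cache; B (64 items) does not.
alone = misses_for_thread_1(capacity=8, a_size=4, b_size=64, b_per_a=0, rounds=100)
shared = misses_for_thread_1(capacity=8, a_size=4, b_size=64, b_per_a=4, rounds=100)
```

Under these assumed parameters, thread 1 running alone misses only on its first touch of each item of working set A, whereas with thread 2 streaming through working set B every access by thread 1 misses, because working set B displaces working set A between reuses.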
Even on a single-core processor, cache competition between heterogeneous working sets may occur when two or more threads or applications are executed simultaneously or time-share a processor. Likewise, cache competition may occur between the working set of a single application and operating system processes or threads.
Most processor caches are N-way set-associative caches. In an N-way set-associative cache, a page whose physical address has a given value of cache index bits can be stored in any of N locations in the cache.
Page coloring is the mechanism an operating system may use to map virtual addresses to physical addresses with specific cache index bits in order to control processor cache placement. The value of the physical address bits determining the cache index is known as the page color. For example, if the page size is 4 kilobytes (KB) and the cache is a 4 megabyte (MB) 1-way (direct-mapped) cache, then there are 1024 distinct page colors. If the page size is 4 KB and the cache is an 8-way set-associative 4 MB cache, then there are 128 distinct page colors. Furthermore, page coloring also influences the location of physical pages within the main memory system. When page coloring is not employed, virtual addresses are mapped to physical addresses without regard for the value of the cache index bits in the physical address.
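The page color counts in the above example follow from dividing the cache size by the product of the associativity and the page size. A minimal sketch of that arithmetic is given below; the function names num_page_colors and page_color are hypothetical and are used only for illustration.

```python
KB = 1024
MB = 1024 * KB


def num_page_colors(cache_size: int, page_size: int, ways: int) -> int:
    """Number of distinct page colors: size of one way divided by page size."""
    return cache_size // (ways * page_size)


def page_color(phys_addr: int, cache_size: int, page_size: int, ways: int) -> int:
    """Color of the physical page containing phys_addr."""
    page_number = phys_addr // page_size
    return page_number % num_page_colors(cache_size, page_size, ways)


colors_direct_mapped = num_page_colors(4 * MB, 4 * KB, ways=1)
colors_eight_way = num_page_colors(4 * MB, 4 * KB, ways=8)
```

With a 4 KB page and a 4 MB cache, colors_direct_mapped evaluates to 1024 and colors_eight_way evaluates to 128, matching the figures stated above. In the 8-way case, physical pages whose page numbers differ by a multiple of 128 have the same color.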
Current page coloring algorithms tend to distribute the pages of a working set uniformly across the processor cache to the extent possible. Some operating systems implement a page coloring algorithm known as bin hopping, in which pages that are sequentially allocated are mapped to sequential page colors, irrespective of their virtual addresses. Bin hopping exploits temporal locality because the pages it maps close in time tend to be placed in cache locations having different page colors. Bin hopping prevents related pages of a single working set from competing for the same cache lines. However, bin hopping may exacerbate the competition for cache lines between different working sets because all working sets are spread across all page colors.
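Bin hopping as described above may be sketched as a counter that advances by one page color on each allocation. The class name BinHoppingAllocator, the single shared counter and the choice of four page colors are assumptions made for this illustration and do not describe any particular operating system's implementation.

```python
class BinHoppingAllocator:
    """Hypothetical allocator: each new page gets the next color in sequence."""

    def __init__(self, num_colors: int):
        self.num_colors = num_colors
        self.next_color = 0

    def allocate(self, virtual_page: int) -> int:
        # The virtual page number is deliberately ignored: the color
        # depends only on allocation order, not on the virtual address.
        color = self.next_color
        self.next_color = (self.next_color + 1) % self.num_colors
        return color


allocator = BinHoppingAllocator(num_colors=4)

# Working set A allocates 4 pages, then working set B allocates 8 pages.
colors_a = [allocator.allocate(vp) for vp in (0x100, 0x7F0, 0x101, 0x300)]
colors_b = [allocator.allocate(vp) for vp in range(0x900, 0x908)]

shared_colors = set(colors_a) & set(colors_b)
```

In this sketch the four pages of working set A receive four different colors and therefore do not compete with one another, but working set B also receives every color, so shared_colors contains all four colors and the two working sets compete throughout the cache.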