Advances in semiconductor processing and logic design have permitted an increase in the amount of logic that may be present on integrated circuit devices. As a result, computer system configurations have evolved from a single or multiple integrated circuits in a system to multiple cores, multiple hardware threads, and multiple logical processors present on individual integrated circuits. A processor or integrated circuit typically comprises a single physical processor die, where the processor die may include any number of cores, hardware threads, or logical processors. The ever-increasing number of processing elements (cores, hardware threads, and logical processors) on integrated circuits enables more tasks to be accomplished in parallel. However, the execution of more threads and tasks puts an increased premium on shared resources, such as memory, and the management thereof.
Typically, cache memory includes a memory between a shared system memory and the execution units of a processor to hold information in closer proximity to the execution units. In addition, a cache is typically smaller than main system memory, which allows the cache to be constructed from more expensive, faster memory, such as Static Random Access Memory (SRAM). Both the proximity to the execution units and the speed allow caches to provide faster access to data and instructions. Caches are often identified by their proximity to the execution units of a processor. For example, a first-level (L1) cache may be close to execution units residing on the same physical processor. Due to this proximity and placement, the first-level cache is often the smallest and quickest cache. A computer system may also hold higher-level, or farther-out, caches, such as a second-level cache, which may also reside on the processor but be placed between the first-level cache and main memory. A third-level cache may be placed on the processor or elsewhere in the computer system, such as at a controller hub, between the second-level cache and main memory.
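The nearest-first ordering of cache levels described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `CacheLevel` and `access`, the capacities, the cycle counts, the LRU replacement policy, and the choice to fill every level on a miss are hypothetical and are not taken from any particular processor.

```python
from collections import OrderedDict


class CacheLevel:
    """One level of a cache hierarchy, modeled as a small LRU store."""

    def __init__(self, name, capacity, latency):
        self.name = name
        self.capacity = capacity  # number of lines held (illustrative value)
        self.latency = latency    # cost of probing this level, in cycles (illustrative)
        self.lines = OrderedDict()

    def lookup(self, addr):
        """Return True on a hit and mark the line as most recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def fill(self, addr):
        """Install a line, evicting the least recently used line if full."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def access(levels, addr, memory_latency=200):
    """Probe levels nearest-first; return (where the line was found, cycles spent)."""
    cycles = 0
    for i, level in enumerate(levels):
        cycles += level.latency
        if level.lookup(addr):
            # Copy the line into every closer level so the next access is faster.
            for closer in levels[:i]:
                closer.fill(addr)
            return level.name, cycles
    # Missed in every cache level: pay the main-memory cost and fill all levels.
    cycles += memory_latency
    for level in levels:
        level.fill(addr)
    return "memory", cycles


if __name__ == "__main__":
    hierarchy = [
        CacheLevel("L1", capacity=2, latency=4),
        CacheLevel("L2", capacity=4, latency=12),
        CacheLevel("L3", capacity=8, latency=40),
    ]
    for addr in (0x10, 0x10, 0x20, 0x30, 0x10):
        print(hex(addr), access(hierarchy, addr))
```

In this model the first touch of a line pays for every level plus main memory, a repeated touch is served by the smallest and closest level, and a line pushed out of the first level is still found one level farther out at a moderate cost.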
With the increasing number of processing elements per processor, the demands on caches have become more complex and greater in number. In fact, when heterogeneous applications are being executed on a single processor, the demands from each individual application may vary widely, with some applications needing more cache space than others for efficient execution. In that instance, a centralized, shared cache memory may be better suited to allocate space efficiently by providing more space to those applications that need it. However, the latency associated with a centralized, shared cache potentially degrades performance, especially when compared to a distributed cache system. In a distributed cache system, the caches are able to be placed physically closer to execution units, reducing latency. Unfortunately, previous distributed systems often relegate an application to a single slice of the distributed cache, especially when the distributed caches are private caches, that is, caches that primarily hold data for an associated processing element, such as a core or hardware thread. Therefore, a distributed system is typically unable to efficiently allocate extra cache space to applications requiring it. For example, an application requiring more than its private cache space has previously been unable to hold the additional lines in other private caches, even when those caches have available capacity.
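The limitation described above can be made concrete with a small simulation. This is a sketch under stated assumptions, not a model of any real design: the names `PrivateSlice` and `run`, the slice capacity of four lines, the LRU replacement policy, and the cyclic access trace are all hypothetical.

```python
from collections import OrderedDict


class PrivateSlice:
    """A private cache slice that holds lines only for its owning core."""

    def __init__(self, owner, capacity):
        self.owner = owner
        self.capacity = capacity  # lines the slice can hold (illustrative value)
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, install the line with LRU eviction."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)
        return False


def run(slices, core, trace):
    """Replay a trace against the slice owned by `core` only; return (hits, misses)."""
    hits = sum(slices[core].access(addr) for addr in trace)
    return hits, len(trace) - hits


if __name__ == "__main__":
    # Two cores, each with a four-line private slice. Core 1 is idle.
    slices = {0: PrivateSlice(0, 4), 1: PrivateSlice(1, 4)}
    # Core 0 cycles through a six-line working set three times.
    trace = list(range(6)) * 3
    print("private slice only:", run(slices, 0, trace))
    print("lines held by idle slice:", len(slices[1].lines))
    # For comparison: the same trace with the combined capacity of both slices.
    print("combined capacity:", run({0: PrivateSlice(0, 8)}, 0, trace))
```

Under these assumptions the six-line working set thrashes the four-line slice and every access misses, while the neighboring slice stays empty. Given the combined capacity of both slices, the same trace misses only on the first pass.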