Modern electronic devices often include multiple processors, each sometimes referred to as a processing unit (PU), that each include core logic (a “core”), a level one (L1) cache, and a level 2 (L2) cache. Typically, each core can access only its own dedicated L2 cache, and cannot normally access the L2 cache of a nearby PU.
One skilled in the art will understand that there are many scenarios in which a core does not use its dedicated L2 cache to the maximum extent possible. For example, this may occur when a core executes code that uses the L2 cache only slightly or code from locked cache ways, when a core is powered down or in sleep mode, or when a core has been disabled, as, for example, in response to a detected manufacturing defect. These examples are but a sample of the many common scenarios in which a core underutilizes its dedicated L2 cache.
In light of this underutilization, there have been several attempts to improve cache performance, including some systems wherein one or more PUs share certain levels of their caches with each other. Each of the current approaches suffers from one or more disadvantages. Generally, one set of solutions focuses on castout handling, wherein the PU selects a cache line to “cast out” of its cache, ordinarily in order to make room for an incoming cache block that will be stored in the cache location currently occupied by the cache line selected for castout.
For example, one simple solution is to evict or “cast out” all cache lines to memory. That is, the simplest solution is to write back castout cache lines to memory when they are cast out. The castout lines can subsequently be retrieved over a common coherent bus, to which all L2 caches (and their associated PUs) are attached. However, this approach suffers from the obvious drawback that casting out all lines all the way to memory is inefficient and hinders performance. Further, this method does not enable one core to share another core's cache when that cache is underutilized. Additionally, this approach does not allow a cache to be employed when its core is powered down in sleep mode or has been deactivated because of a core manufacturing defect.
Another conventional approach provides a dedicated victim cache for each L2 cache. In this approach, evicted lines are cast out to the victim cache, and the victim cache is typically configured to hold only cache lines evicted from the L2 cache on a cache miss. This approach, however, adds an extra cache and supporting hardware, which consumes a greater area and power than the L2 cache by itself. Additionally, typical victim caches ordinarily allot space for only one or two lines per congruence class, compared to the six to eight lines in a standard cache, which therefore provides only a limited solution.
In another approach, hereinafter referred to as the Former approach, the PUs couple to a common L3 cache, and the L3 cache preselects one of three neighboring L2 caches to serve as a makeshift victim cache. Once the L3 cache selects the victim cache, the L3 cache and victim cache perform a request/grant handshake via a private communication, followed by a data-only transfer on a bus coupling the L3 and L2 caches.
The Former approach suffers from the disadvantage that it lacks a mechanism to track whether a cache line has been previously moved. As such, evicted lines in the Former system can circulate from cache to cache indefinitely, which can cause unnecessary bandwidth costs and hamper system performance. Further, the Former victim cache, the castout target cache, must accept the incoming cache line, which can require the victim cache to evict a cache line that it otherwise would have kept in the cache. As such, the Former approach can enhance the performance of one cache at the expense of another.
In another approach, hereinafter referred to as the Garg approach, illustrated by U.S. Pat. No. 7,076,609, the cores share two L2 caches, splitting the associativity across the L2 caches equally. The PUs share combined replacement controls, such as, for example, for L2 miss detection and handling. Specifically, the Garg approach allocates a new line, retrieved from memory in response to an L2 cache miss, into either of the L2s, depending on the replacement policy at that time. Further, the Garg approach searches both L2 caches simultaneously in response to an L1 miss.
As such, the Garg approach provides a shared, multi-bank level 2 cache, with a wide associativity. The Garg approach therefore also suffers from the disadvantages of a single shared cache. Specifically, Garg line replacement methods must search multiple L2 caches, which increases search time. Further, because the associativity in Garg extends across L2 caches, each Garg L2 cache must be searched whenever any one L2 cache must be searched, not only in the event of a local L2 cache miss. Additionally, because no Garg cache contains all the associativity for a particular congruence class, a cache replacement placed in one L2 cache will still miss in a local L2 not containing the cache line, which would ordinarily hit in a conventional system.
Therefore, there is a need for a system and/or method for optimizing neighboring cache usage in a multiprocessor environment that addresses at least some of the problems and disadvantages associated with conventional systems and methods.