1. Technical Field
The present invention relates generally to an improved data processing system, and in particular, to an improved method and apparatus for caching data in a memory.
2. Description of Related Art
Most early data processing systems consisted basically of a central processing unit (CPU), a main memory, and some sort of secondary input/output (“I/O”) capability. In these earlier systems, the main memory was the limiting element. Typically, the main memory was designed first and the CPU was then created to match the speed of the memory. This matching was performed to optimize the processing speed and remains necessary even with today's high speed computers. Over time, logic circuit speeds increased along with the capacity requirements of main memory. With the need for increasing capacity in the main memory, the speed of the main memory could not keep up with the increasing speed of the CPU. Consequently, a gap developed between the main memory cycle time and the processor cycle time, which resulted in suboptimal processing speeds. As a result, a cache memory was developed to bridge the gap between the memory and the processor cycle time.
Using a cache to bridge the performance gap between a processor and main memory has become important in data processing systems of various designs, from personal computers to workstations to data processing systems with high performance processors. A cache memory is an auxiliary memory that provides a buffering capability through which a relatively slow main memory can interface with a processor at the processor's cycle time to optimize the performance of the data processing system. Requests are first sent to the cache to determine whether the data or instructions requested are present in the cache memory. A “hit” occurs when the desired information is found in the cache. A “miss” occurs when a request or access to the cache does not produce the desired information. In response to a miss, one of the cache “lines” is replaced with a new one. The method used to select the line to replace is called a replacement policy.
A number of different schemes for organizing a cache memory exist. For example, a fully associative mapping organization may be employed whereby a data address may exist in any location in the cache, or a direct mapping scheme may be employed in a cache memory whereby a data address may exist in only one location in the cache. A set associative scheme may be employed by partitioning the cache into distinct classes of lines, wherein each class contains a small fixed number of lines. This approach is somewhere between a direct mapped and a fully associative cache. The classes of lines are usually referred to as “congruence classes.” The lines within a congruence class are usually referred to as sets (or members); the number of sets indicates the number of locations in which a given address can reside in a set associative cache.
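The mapping of an address to a congruence class can be illustrated with a minimal sketch. The function names, the modulo indexing, and the example sizes below are assumptions chosen for illustration; they are not taken from any particular cache design.

```python
def num_congruence_classes(cache_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of congruence classes for a cache of the given geometry.

    ways == 1 corresponds to a direct mapped cache; a result of 1
    corresponds to a fully associative cache.
    """
    return cache_bytes // (line_bytes * ways)


def congruence_class(address: int, line_bytes: int, classes: int) -> int:
    """Index of the congruence class in which the address may reside.

    Assumes simple modulo indexing on the line number (an illustrative choice).
    """
    return (address // line_bytes) % classes


def line_tag(address: int, line_bytes: int, classes: int) -> int:
    """Tag that distinguishes lines mapping to the same congruence class."""
    return (address // line_bytes) // classes


# Hypothetical geometry: 32 KB cache, 64-byte lines, 4 members per class.
classes = num_congruence_classes(32 * 1024, 64, 4)   # 128 classes
a = 4660
b = a + 64 * classes   # one full "stride" later: same class, different tag
```

In this sketch the two addresses `a` and `b` compete for the same four member positions, which is the situation in which a replacement policy must choose a victim.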
One commonly used type of replacement policy is the least-recently-used (LRU) policy. An LRU policy is built upon the premise that the least recently used cache line in a congruence class is the least worthy of being retained. So, when it becomes necessary to evict a cache line to make room for a new one, an LRU policy chooses as a victim the cache line held in the least recently accessed set (or member) within a congruence class.
For an LRU policy, two types of operations must be carried out against the LRU state (which is maintained for each congruence class in a cache).
A most-recently-used-update (MRU-update) operation typically occurs due to a cache hit. It adjusts the LRU state such that the “hit” member is ordered ahead of all other members in that congruence class, establishing the cache line in that member position as the most worthy member in the congruence class.
A least-recently-used-victim-selection (LRU-victim-selection) operation typically occurs when a cache miss requires that a member be allocated to hold a cache line arriving from elsewhere in the storage hierarchy. The operation determines which cache line is the least worthy of being retained in the congruence class, evicts that cache line, and places the newly arriving cache line in its member position.
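The two operations against the LRU state can be sketched as follows. This is a minimal illustration, not a hardware implementation: the class name, the list-based ordering, and the choice to treat a newly installed line as most recently used are all assumptions.

```python
class CongruenceClass:
    """LRU state and contents of one congruence class (illustrative only)."""

    def __init__(self, ways: int):
        self.lines = [None] * ways          # tag held in each member position
        self.order = list(range(ways))      # member numbers, MRU first, LRU last

    def mru_update(self, member: int) -> None:
        """Order the given member ahead of all other members."""
        self.order.remove(member)
        self.order.insert(0, member)

    def lru_victim_select(self) -> int:
        """Return the member position holding the least worthy line."""
        return self.order[-1]

    def access(self, tag) -> bool:
        """Return True on a hit, False on a miss (the line is then installed)."""
        if tag in self.lines:
            self.mru_update(self.lines.index(tag))
            return True
        member = self.lru_victim_select()
        self.lines[member] = tag            # evict the old line, place the new one
        self.mru_update(member)             # assumption: new line becomes MRU
        return False


cc = CongruenceClass(ways=4)
results = [cc.access(t) for t in ["A", "B", "C", "D", "A", "E"]]
# "A" was refreshed by its hit, so "B" is the least recently used line
# when "E" arrives and is the one evicted.
```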
Several factors complicate the behavior of LRU replacement policies in multi-level cache hierarchies, particularly when those hierarchies contain nth level caches that are shared by multiple structures at level n−1. For example, a processor may contain a first level instruction cache and a first level data cache. These may be backed by a second level cache that includes both instructions and data. Such a structure is designed so that processor requests for cache lines that miss in the first level caches have a high likelihood of being found in the second level cache.
As described earlier, the LRU replacement policy in the first level caches would update as most-recently-used those cache lines that are used most often by the processor. Cache lines that are less important (or worthy) to the processor, since they are used less often, would be less likely to be marked as most-recently-used. Thus, the more frequently used lines tend to remain in the first level cache, while the less frequently used lines tend to be evicted from the first level cache.
The LRU policy in the second level cache would update as most-recently-used those cache lines that are requested from the second level cache when a first level cache miss occurs. These lines would tend to be those lines which were evicted from the first level cache, and are less worthy to the processor than the cache lines which tend to hit in the first level caches. Thus, the cache lines that most often are not found in the first level caches, but are repeatedly needed by the processor, are the cache lines most likely to remain in the second level cache, due to the fact that they are more likely to be beneficially affected by MRU-updates.
Ironically then, the cache lines which are most worthy to the processor are less likely to benefit from MRU-updates in the second level cache, and hence, are more likely to be evicted from the second level cache than the cache lines which are less worthy to the processor.
This behavior can be quite pronounced when multiple first level (or n−1 level) caches are backed by the same second level (or nth level) cache, especially when those first level caches have differing patterns of miss traffic. For example, many applications have small instruction footprints but high rates of data turnover (i.e., data footprints that exceed the size of the second level cache), resulting in very few first level instruction cache misses relative to first level data cache misses, and requiring that significantly less capacity in the second level cache be allocated for instructions than for data.
In such an application, even though instructions require a smaller portion of the second level cache, the instructions, so well behaved in the first level instruction cache, would tend to be evicted from the larger, shared, second level cache. This occurs because the first level instruction cache seldom misses, while the first level data cache frequently misses and the data footprint exceeds the capacity of the second level cache. Such application behaviors will hereafter be referred to as “unbalanced” caching behaviors.
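The unbalanced behavior described above can be reproduced with a small simulation. The sketch below is only a toy model under stated assumptions: each cache is treated as a single fully associative LRU structure, the capacities are tiny hypothetical values, and the trace (one hot instruction line interleaved with streaming data lines) is invented for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache (a simplifying assumption)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()          # LRU entry first, MRU entry last

    def access(self, line) -> bool:
        if line in self.lines:
            self.lines.move_to_end(line)    # MRU-update on a hit
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[line] = True
        return False


def simulate(trace, l1i, l1d, l2):
    """Run the trace; the second level cache sees only first level misses."""
    l2_requests = {"I": 0, "D": 0}
    for kind, line in trace:
        l1 = l1i if kind == "I" else l1d
        if not l1.access(line):
            l2_requests[kind] += 1
            l2.access(line)
    return l2_requests


# Hypothetical workload: one hot instruction line, a stream of data lines.
trace = []
for i in range(8):
    trace += [("I", "code0"), ("D", f"d{i}")]

l1i, l1d, l2 = LRUCache(2), LRUCache(2), LRUCache(4)
l2_requests = simulate(trace, l1i, l1d, l2)
```

In this toy run the instruction line is requested from the second level cache only once, never receives another MRU-update there, and is pushed out by the data stream even though it is the line the processor uses most often.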
Another type of application might have an instruction footprint that is too large to be contained in the first level instruction cache and a data footprint that is too large to be contained in the first level data cache. In this case, both instruction and data caches miss frequently enough relative to each other to better balance the likelihood of MRU-updates in the second level cache for instruction cache lines versus for data cache lines. The natural pattern of requests fairly balances the allocation of second level cache capacity between instruction cache lines and data cache lines, with this allocation more accurately representing the true worthiness of these cache lines to the processor. Such application behaviors will hereafter be referred to as “balanced” caching behaviors.
Inclusion occurs if a block of data is present in an L1 cache of a given processing unit, and this block of data also is present in other caches, such as the L2 and L3 caches, of that processing unit. If a system structure requires the property of inclusion between the second level cache and the first level caches, there can be significant performance consequences to applications with unbalanced caching behaviors. When cache lines, well behaved in a first level cache and important to the processor, are evicted from the second level cache (since they seldom receive the benefit of an MRU-update in the second level cache), inclusion dictates that they must be evicted from the first level cache as well.
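The cost of enforcing inclusion can be sketched by counting first level misses on a frequently used line with and without back-invalidation. This is a hedged toy model: the function name, the single fully associative first level cache, the capacities, and the trace are all assumptions made for illustration.

```python
from collections import OrderedDict


def hot_line_l1_misses(trace, l1_cap, l2_cap, inclusive, hot="I"):
    """Count first level misses on the hot line in a two-level LRU hierarchy."""
    l1, l2 = OrderedDict(), OrderedDict()       # LRU entry first, MRU last
    misses = 0
    for line in trace:
        if line in l1:
            l1.move_to_end(line)                # L1 hit: L2 never sees it
            continue
        if line == hot:
            misses += 1
        # L1 miss: the request is forwarded to the second level cache.
        if line in l2:
            l2.move_to_end(line)
        else:
            if len(l2) >= l2_cap:
                victim, _ = l2.popitem(last=False)
                if inclusive:
                    l1.pop(victim, None)        # inclusion: evict from L1 too
            l2[line] = True
        # Install the line in the first level cache.
        if len(l1) >= l1_cap:
            l1.popitem(last=False)
        l1[line] = True
    return misses


# Hypothetical trace: the hot line "I" is touched between every data access.
trace = []
for i in range(8):
    trace += ["I", f"d{i}"]

without_inclusion = hot_line_l1_misses(trace, l1_cap=2, l2_cap=4, inclusive=False)
with_inclusion = hot_line_l1_misses(trace, l1_cap=2, l2_cap=4, inclusive=True)
```

Without inclusion the hot line misses only on its first (cold) access in this run. With inclusion it misses again after the data stream forces it out of the second level cache, even though it was never the least recently used line in the first level cache.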
In some cases, this has been overcome by limiting inclusion to data, and architecting software managed coherency schemes for instructions, or in other cases, by directly snooping first level caches. As the number of processors in a system increases, such schemes become less and less viable.
Further, if a second level cache is shared by multiple processors (or processing threads), the caching (i.e., performance) behaviors can be negatively affected by the level of “balance” between instructions and data within the application thread on any given processor as well as the relative “balance” between the application threads as a whole. Conversely, caching behaviors can be positively impacted when multiple application threads share data or instructions.
Those of ordinary skill in the art will recognize numerous schemes for biasing to overcome the consequences of unbalanced behaviors in fully inclusive caches, either within an application thread or among multiple such threads. Such schemes typically involve establishing multiple cache partitions and restricting the use of those partitions to certain types of operations. This can be accomplished by augmenting a standard replacement policy, such as LRU, to respect the partitions.
For example, a small fixed size region of the second level cache can be restricted for use by instruction cache lines only, with the remainder allocated to other (e.g., data) cache lines. Such an approach provides benefit to an “unbalanced” application. However, it might be detrimental to a well balanced application whose instruction footprint exceeds the capacity of the small fixed size region. Increasing the size of the region dedicated for instructions in response to this concern might be detrimental to the “unbalanced” application or to a “balanced” application with a larger ratio of data footprint size relative to instruction footprint size, since increasing the instruction region would decrease the data region.
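A static instruction/data partition of this kind can be sketched by restricting LRU victim selection to the members that the incoming line type is allowed to use. The class name, the rule that the first `inst_ways` members form the instruction region, and the example sizes are assumptions for illustration; the sketch also assumes `0 < inst_ways < ways`.

```python
class PartitionedCongruenceClass:
    """LRU congruence class with a fixed instruction region (illustrative)."""

    def __init__(self, ways: int, inst_ways: int):
        self.lines = [None] * ways
        self.order = list(range(ways))               # MRU first, LRU last
        self.inst_members = set(range(inst_ways))    # reserved for instructions

    def lru_victim_select(self, is_inst: bool) -> int:
        """Least recently used member within the partition for this line type."""
        allowed = [m for m in self.order if (m in self.inst_members) == is_inst]
        return allowed[-1]

    def access(self, tag, is_inst: bool) -> bool:
        if tag in self.lines:
            member, hit = self.lines.index(tag), True
        else:
            member, hit = self.lru_victim_select(is_inst), False
            self.lines[member] = tag
        self.order.remove(member)                    # MRU-update
        self.order.insert(0, member)
        return hit


pc = PartitionedCongruenceClass(ways=4, inst_ways=1)
pc.access("code0", is_inst=True)
for i in range(8):
    pc.access(f"d{i}", is_inst=False)                # streaming data
protected = "code0" in pc.lines                      # data could not evict it

pc.access("code1", is_inst=True)                     # second instruction line
thrashed = "code0" not in pc.lines                   # region too small for both
```

The first check illustrates the benefit to an unbalanced workload, and the second illustrates the drawback when the instruction footprint exceeds the fixed region, even though the data region may hold lines of lesser value.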
In the case of multiple processors, a second level cache might be divided into equal sized partitions, one for each processor. Such an approach can possibly provide benefit when “unbalanced” and “balanced” applications of varying degrees share a second level cache. Such an approach can be detrimental when one application has significantly less demand for the second level cache than another application, but that other application is prevented from utilizing any of the second level cache outside of its allotted partition. Such an approach might also reduce the synergy that might otherwise occur when multiple application threads exhibit a high degree of sharing of instructions and/or data.
While static partitioning schemes in shared, inclusive, second level caches can improve performance for applications with unbalanced caching behaviors, these same schemes can be detrimental to the performance of other applications with different levels of balance or sharing.
Therefore, it would be advantageous to have an improved method, apparatus, and computer instructions to dynamically manage caching behavior in a data processing system to improve performance.