1. Field
The present invention disclosure relates to the art of computing. More particularly, this invention disclosure is directed toward a use of a translation look-aside buffer (TLB) for determining a metric for selective caching.
2. Description of Related Technology
In computing, a cache is a component interposed between a processor unit and a main memory. A cache stores data and/or instructions that may be the results of an earlier processor unit computation and/or may be duplicates of data and/or instructions stored in other memory structures, e.g., another cache or the main memory. Future requests for the data and/or instructions by the processor unit can be served faster from a cache than if the data and/or instructions had to be recomputed or requested from the slower main memory. Thus, when the processor unit requests access to data and/or instructions at a location in the main memory, the processor unit first checks for the data and/or instructions in the cache. A cache comprises blocks of fixed size, called cache lines. A cache line includes a copy of a portion of the data and/or instructions from the main memory or from another cache, as well as the address of the requested main memory or other cache location, and a status, called a tag. The status describes an attribute of a cache line, e.g., whether the cache line is modified with respect to the main memory, how recently the cache line has been accessed by the processor unit, whether the cache line is read-only or readable and writeable, what processor unit capabilities, e.g., permissions, are required to access the cache line, and other attributes known to a person of ordinary skill in the art. If the data and/or instructions are found in the cache, a cache hit has occurred and the processor unit immediately reads the data from, or writes the data into, the cache line. However, if the processor unit does not find the data and/or instructions in the cache, a cache miss has occurred; the cache allocates a new entry and copies the data and/or instructions from the main memory into the entry, and the processor unit's request is then fulfilled from the contents of the cache.
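By way of a non-limiting illustration, the hit and miss behavior described above may be sketched as follows. The sketch assumes a direct-mapped, read-only cache; the names `CacheLine`, `DirectMappedCache`, `LINE_SIZE`, and `NUM_LINES`, as well as the sizes chosen, are hypothetical and are not taken from any particular implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

LINE_SIZE = 4   # words per cache line (assumed for illustration)
NUM_LINES = 8   # number of cache lines (assumed for illustration)


@dataclass
class CacheLine:
    tag: int                 # identifies which memory block the line holds
    data: List[int]          # copy of a portion of main memory
    modified: bool = False   # example status attribute; unused in this read-only sketch


class DirectMappedCache:
    def __init__(self, memory: List[int]) -> None:
        self.memory = memory
        self.lines: List[Optional[CacheLine]] = [None] * NUM_LINES
        self.hits = 0
        self.misses = 0

    def read(self, address: int) -> int:
        block = address // LINE_SIZE   # memory block containing the address
        index = block % NUM_LINES      # cache line the block maps to
        tag = block // NUM_LINES       # distinguishes blocks sharing an index
        line = self.lines[index]
        if line is not None and line.tag == tag:
            self.hits += 1             # cache hit: serve from the cache line
        else:
            self.misses += 1           # cache miss: fill the entry from main memory
            base = block * LINE_SIZE
            line = CacheLine(tag=tag, data=self.memory[base:base + LINE_SIZE])
            self.lines[index] = line   # any previous occupant is replaced
        return line.data[address % LINE_SIZE]


memory = list(range(100, 164))         # 64 words of main memory
cache = DirectMappedCache(memory)
values = [cache.read(a) for a in (0, 1, 2, 32, 0)]
```

In this sketch, addresses 0 and 32 map to the same cache line, so the final read of address 0 misses again even though the address was cached earlier.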
However, if the cache is full, the cache must evict some previously stored cache lines to fill a new cache line. Enlarging a cache mitigates the need for eviction, thus improving hit rates, but increases latency. To address the tradeoff between latency and hit rate, multiple levels of cache are introduced, with a small, fast cache at level 1 (L1) being backed up by larger, slower caches at level 2 (L2), and optionally higher levels (L3, L4). Multi-level caches are generally checked starting with the fastest L1 cache; if the L1 cache hits, the processor unit uses this cache; if the L1 cache misses, the next fastest cache (L2) is checked, and so on, before the main memory is checked. The highest-level cache, which is checked before accessing the main memory, is usually referred to as the last level cache (LLC).
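The order in which the cache levels are checked may be illustrated by the following minimal sketch. The capacities, the latencies in cycles, the least-recently-used replacement within each level, and the names `Level` and `access` are assumptions chosen for illustration only.

```python
from collections import OrderedDict
from typing import List, Tuple


class Level:
    def __init__(self, name: str, capacity: int, latency: int) -> None:
        self.name = name
        self.capacity = capacity      # number of cache lines (assumed)
        self.latency = latency        # lookup cost in cycles (assumed)
        self.lines: "OrderedDict[int, bool]" = OrderedDict()

    def lookup(self, address: int) -> bool:
        if address in self.lines:
            self.lines.move_to_end(address)   # mark as most recently used
            return True
        return False

    def fill(self, address: int) -> None:
        if address in self.lines:
            self.lines.move_to_end(address)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used line
        self.lines[address] = True


def access(levels: List[Level], address: int,
           memory_latency: int = 200) -> Tuple[str, int]:
    """Check L1 first, then each slower level, then main memory."""
    cycles = 0
    for i, level in enumerate(levels):
        cycles += level.latency
        if level.lookup(address):
            for faster in levels[:i]:         # copy the line into the faster levels
                faster.fill(address)
            return level.name, cycles
    cycles += memory_latency                  # every level missed
    for level in levels:
        level.fill(address)
    return "memory", cycles


levels = [Level("L1", 2, 4), Level("L2", 4, 12), Level("LLC", 8, 40)]
first = access(levels, 0x40)    # misses everywhere, served by main memory
second = access(levels, 0x40)   # now hits in L1
access(levels, 0x80)
access(levels, 0xC0)            # L1 holds only two lines, so 0x40 leaves L1
third = access(levels, 0x40)    # L1 misses, L2 hits
```

Under these assumed latencies, the first access costs 256 cycles, the second 4 cycles, and the third 16 cycles, which reflects the tradeoff between a small, fast L1 cache and the larger, slower levels behind it.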
Since the caches are capacity constrained to ensure better latency performance than the main memory, cache thrashing may occur, wherein cache lines are evicted from a cache by a cache eviction policy before the cache lines are reused, resulting in fewer hits in the cache.
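Cache thrashing may be illustrated by the following minimal sketch, which assumes a least-recently-used eviction policy and a cyclic access pattern; both the policy and the helper name `lru_hits` are assumptions made for illustration, as the text above does not specify an eviction policy.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hits(trace: Iterable[int], capacity: int) -> int:
    """Count hits for an address trace in a cache with LRU eviction."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    for address in trace:
        if address in cache:
            hits += 1
            cache.move_to_end(address)       # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict the least recently used line
            cache[address] = True
    return hits


trace = list(range(5)) * 10                  # five lines accessed cyclically, 50 accesses
hits_thrashing = lru_hits(trace, capacity=4) # working set exceeds capacity by one line
hits_fitting = lru_hits(trace, capacity=5)   # working set fits
```

With a capacity of four lines, each line is evicted just before it is accessed again, so the sketch records no hits; with a capacity of five lines, every access after the first five hits.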
On the other hand, it is known to a person of ordinary skill in the art that not all cache lines in a cache are reused; in extreme cases, as many as 90% or more of the cache lines are never re-accessed between fill and eviction. Therefore, cache utilization can be maximized by selective caching, i.e., a technique wherein the most valuable cache lines, i.e., cache lines that are likely to be reused, are kept in the capacity constrained cache.
One proposed selective caching technique uses a metric comprising a shadow tag. A shadow tag is a per-processor-core tag, which is used to model the cache miss rate of a processor core for a given arbitrary fraction of cache capacity, cf. Qureshi, Moinuddin K., and Yale N. Patt. “Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches.” International Symposium on Microarchitecture (MICRO), 2006.
However, such a technique does not scale well. Consider, by means of an example, a multi-processor unit with 48 processor cores. The number of shadow tag bits required to maintain access history for each cache line is directly proportional to the number of processor cores. That is, for a 48-processor-core multi-processor unit, 48 sets of shadow tag access history bits must be maintained for each cache line in each cache that the processor core may access. Additionally, cache shadow tag read/write bandwidth is required to maintain, access, and update these access history bits. Therefore, storing, e.g., 1024 shadow tags per processor core for each cache in, e.g., the 48-processor-core multi-processor unit would require a prohibitive memory overhead and read/write bandwidth.
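The proportionality between the shadow tag storage and the number of processor cores may be illustrated by the following minimal sketch. The figures of 48 processor cores and 1024 shadow tags per processor core are taken from the example above; the width of 40 bits per shadow tag entry and the function name `shadow_tag_storage_bits` are assumptions introduced only to make the arithmetic concrete.

```python
def shadow_tag_storage_bits(cores: int, tags_per_core: int,
                            bits_per_tag: int, caches: int = 1) -> int:
    """Storage grows linearly with the number of cores and of caches."""
    return cores * tags_per_core * bits_per_tag * caches


BITS_PER_TAG = 40  # assumed entry width; not specified in the text above

bits_48 = shadow_tag_storage_bits(cores=48, tags_per_core=1024,
                                  bits_per_tag=BITS_PER_TAG)
bits_4 = shadow_tag_storage_bits(cores=4, tags_per_core=1024,
                                 bits_per_tag=BITS_PER_TAG)

kib_48 = bits_48 / 8 / 1024   # storage per cache for 48 cores, in KiB
kib_4 = bits_4 / 8 / 1024     # storage per cache for 4 cores, in KiB
```

Under the assumed entry width, the storage per cache is 240 KiB for 48 processor cores compared with 20 KiB for 4 processor cores, and the total is multiplied again by the number of caches that carry shadow tags.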
Accordingly, there is a need in the art for a method of selective caching and an apparatus implementing the method, providing a solution to the above identified problems, as well as providing additional advantages.