A major motivation driving continuing improvement in computer technology is the consumer demand for more speed and power. One of the major obstacles to increasing the speed of a computer is the speed at which data can be accessed from memory. Accordingly, there is a strong emphasis on improving memory access times. The use of cache memory has helped to overcome this obstacle by providing a small amount of very fast memory that is used to store a copy of frequently accessed information. This cache memory is used along with the much larger but slower main memory. When a processor requests data from main memory, and the data resides in the cache, then a cache ‘read’ hit occurs, and the data from the cache can be quickly returned to the processor. If the data is not in the cache, then a cache ‘miss’ is indicated, whereupon the data is retrieved from the main memory.
At the same time that increased memory retrieval speed is sought, it is desirable to reduce chip power consumption, provided that doing so does not significantly reduce performance. As one approach to reducing chip power consumption, the cache resources of a chip may be partly or completely put on standby (partially or completely powered down) as programs run. The choice can be made dynamically and automatically, and revised as programs run.
A simple optimizing policy for cache size is defined as follows. All measurements are taken over a fixed interval Dt (which could be a time interval, cycle count or item count). The miss rate M is the number of workload events over Dt in which an item sought in the cache is not located and so had to be found in main memory. Depending upon cache storage policy, an item not found in the cache might or might not be added to the cache. The cache size C is fixed over each Dt but can be adjusted every Dt. If the miss rate M is below a threshold, then the cache size C is decreased. Otherwise, the cache size C is increased.
However, the workload may include in principle only one-time items (items seen once and then for all practical purposes never again). In this extreme case, caching has no performance benefit. Following this cache sizing policy with such a workload will result in no performance benefit but instead will cause eventual selection of maximum possible cache size with highest possible power consumption.
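The sizing policy and its behavior on one-time traffic can be sketched as follows. This is a minimal illustration, not an implementation from any cited work: the function names are hypothetical, and doubling or halving the size is an assumption (the policy only says the size is "increased" or "decreased").

```python
def next_cache_size(misses, threshold, size, c_min, c_max):
    """Choose the cache size for the next interval Dt.

    Assumption: the size is halved or doubled and clamped to [c_min, c_max].
    """
    if misses < threshold:
        return max(c_min, size // 2)
    return min(c_max, size * 2)


def run_one_time_workload(items_per_interval, threshold, c_min, c_max, intervals):
    """Apply the policy to traffic in which no item ever repeats."""
    size = c_min
    history = []
    for _ in range(intervals):
        # Every item is new, so every lookup misses at any cache size.
        misses = items_per_interval
        size = next_cache_size(misses, threshold, size, c_min, c_max)
        history.append(size)
    return history


if __name__ == "__main__":
    # The size ratchets up to c_max and stays there, with no hits gained.
    print(run_one_time_workload(5000, 4000, 256, 1024, 5))
```

Because the miss count never responds to the size, the policy has no signal that would tell it to stop growing the cache.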
A method called Dynamically Resizable i-Cache (DRI) was proposed by Se-Hyun Yang et al. This approach is described in the Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, 2001. The DRI increases or decreases the memory allocated to cached instructions using a comparison of the current miss rate M with a predetermined target miss rate value. A mask shift by one bit is used on the memory index field, so that the memory change is always by a positive or negative power of two. The number of bits used to index to a cache slot (logically a variable equivalent to cache size) is denoted N. Thus, the amount of memory is doubled or halved with every change. According to Yang et al., the opportunity for changing N might typically occur once every one million instructions.
The DRI method compares M to a target miss rate value. The size of the cache C is 2^N (2 raised to the power N) instructions. If the observed miss rate M is too high, then N is increased. If M is too low, then N is decreased. The simplest policy for cache consistency is that the cache is flushed every time N is changed, but more sophisticated approaches are described by Yang et al. There are many details that would be implementation-dependent, such as defining threshold miss rate values as well as a time interval over which to take the observed miss rate measurement. In fact, the DRI method apparently must find by trial and error an optimal cache miss rate threshold for any given benchmark. Then, as the benchmark is run, the memory changes are made in an effort to stay near the target miss rate.
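The index-mask idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the design of Yang et al.: the function names and the bounds `n_min` and `n_max` are hypothetical, and details such as flushing on resize are omitted.

```python
def slot_index(address, n_bits):
    """Select the low n_bits of the address as the cache slot index."""
    mask = (1 << n_bits) - 1
    return address & mask


def adjust_index_bits(n_bits, miss_rate, target, n_min, n_max):
    """Move N by one bit toward the target miss rate.

    Assumption: N is clamped to [n_min, n_max]; the cache holds 2**N slots.
    """
    if miss_rate > target:
        return min(n_max, n_bits + 1)  # too many misses: double the cache
    if miss_rate < target:
        return max(n_min, n_bits - 1)  # few misses: halve the cache
    return n_bits


if __name__ == "__main__":
    n = adjust_index_bits(10, miss_rate=0.2, target=0.1, n_min=8, n_max=12)
    print(n, 1 << n, slot_index(0b110101, 3))
```

Changing N by one shifts the mask by one bit, which is why each resize doubles or halves the number of addressable slots.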
According to work reported by Se-Hyun Yang at the SDI/LCS Seminar Series at Carnegie Mellon on Jan. 18, 2001, “Simulations using the SPEC95 benchmarks show that a 64 K DRI reduces, on average, both the leakage energy-delay product and average size by 62%, with less than 4% impact on execution time.” This is a very good result, but it comes at the expense of tuning the target miss rate on a benchmark-by-benchmark basis.
To improve DRI, others have proposed methods that turn individual lines of cache memory on and off. Line-by-line adjustment avoids purging the entire cache as the size changes, but is more complex to design and fabricate. For example, S. Kaxiras et al. in IEEE Workshop on Power Aware Computer Systems Nov. 2000 proposed turning lines off if they have not been used for a number of cycles. However, for each benchmark, the optimal number of cycles (called a “decay time”) to wait before turning off an unused line must be found by trial and error. The optimal delay for a benchmark called “jpeg” is about 13*1024 cycles (* denotes multiplication) and the optimal delay for a benchmark called “li” is about 97*1024 cycles. Furthermore, even within one application, different time delays are optimal at different times. Note that the optimal delay time for the extreme case, in which all items are one-time items, is zero cycles.
As reported in the Proceedings of the 28th International Symposium on Computer Architecture (ISCA), Jun. 2001, pages 240-251, subsequent work by Kaxiras et al. notes that when an item is initially cached, it can be expected to be referenced many times in the short term and then lapse into a period of nonuse. An adaptive method with multiple levels of cache hierarchy and “inclusion bits” indicating the presence of a cache line in a higher level is proposed. Again, this approach is more complex to design and fabricate than DRI.
Another line-by-line approach is Adaptive Mode Control (AMC) by H. Zhou et al., reported in ACM Transactions on Embedded Computing Systems, Vol. 2, No. 3, August 2003, pages 347-372. AMC regards cache lines as having tag and data parts. AMC applies to both instruction and data caching, according to the authors. AMC keeps all tag lines “awake” all the time and only dynamically adjusts the number of awake data lines. Keeping the tag lines awake enables an ongoing measurement of how many misses are “ideal misses” (misses that would occur even if the entire cache were fully on) and “sleep misses” (misses that occur due to some data lines being asleep). AMC compares the number of all misses with a “performance factor” multiplied by the number of ideal misses. A feedback mechanism increases the delay time of a line if the number of misses is >150% of a “performance factor” (>1) multiplied by the number of ideal misses. It decreases the delay time if the number of misses is <50% of this same factor. The performance factor is set at configuration time according to the desired balance between performance degradation and static power savings. Complexity of implementation is again an issue.
If traffic is in the extreme case that all items are one-time items, then caching has no benefit. If in addition the miss rate is above the threshold for cache size adjustment, then the cache size in DRI or the number of active lines in the other instances of prior art will increase, ultimately to the maximum possible. This will not benefit performance but will cause worst-case power consumption.
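The AMC feedback rule described above can be sketched as follows. This is an illustration only: the function name, the doubling and halving step, and the interval bounds are assumptions, and the lower threshold is read here as 50% of the performance factor multiplied by the number of ideal misses.

```python
def adjust_delay_time(delay, all_misses, ideal_misses, performance_factor,
                      min_delay=1, max_delay=1 << 20):
    """Return the next delay time (cycles before an idle data line sleeps).

    Assumptions: the delay doubles or halves, clamped to [min_delay, max_delay].
    """
    target = performance_factor * ideal_misses
    if all_misses > 1.5 * target:
        # Too many sleep misses: keep lines awake longer.
        return min(max_delay, delay * 2)
    if all_misses < 0.5 * target:
        # Misses are well under budget: let lines sleep sooner.
        return max(min_delay, delay // 2)
    return delay


if __name__ == "__main__":
    # 400 misses observed, 100 of them ideal, performance factor 2.0.
    print(adjust_delay_time(1024, 400, 100, 2.0))
```

The number of sleep misses is the difference between all misses and ideal misses, which is measurable only because the tag lines stay awake.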
In particular, DRI is the simplest method of the three methods described above, but it can power up the entire cache resource when actually the ideal answer is to turn the cache completely off (the extreme case of all one-time traffic with a positive target miss rate).
When an item arrives for lookup, a hash function is applied to its label. The hash function may be simple (selection of some label bits) or complex (a mathematical function applied to some or all label bits). The value of the hash function is an index into the active cache memory. The index derived from an item may point to a memory location with zero, exactly one, or more than one stored (cached) memory. If the memory location has zero memories, then there is a miss and the lookup must be performed in main memory. The item might or might not be entered into the cache, depending upon cache policy. If there is exactly one stored memory, then the cache points to the one stored memory. The full label of the item is then compared to a stored value. If there is a match, then the action stored with the memory is applied. If there is not a match, then there is a miss and the item must be sought in main memory. Meanwhile, the cache memory might or might not be updated to store the missed item, depending upon cache policy. If there are two or more memories with the hit cache index, then the full label of the item may be used in a Patricia tree (see D. Knuth, The Art of Computer Programming, Addison-Wesley, Reading, Mass., 2nd ed., 1998, Vol. 3, p. 498). The Patricia tree tests label bits until, at most, one stored memory might fit the item. The full item label is then compared with the stored memory. If there is a match, then the stored action is applied. If there is not a match, then the item must be sought in main memory. The cache might or might not be updated to include the item, depending upon cache policy.
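The lookup steps above can be sketched as follows. This is a minimal illustration under stated assumptions: labels are integers, the hash selects the low label bits, and a short list scan stands in for the Patricia tree when a slot holds more than one memory. The class and method names are hypothetical.

```python
class HashedCache:
    """Toy cache: hash the label to a slot, then compare full labels."""

    def __init__(self, n_bits):
        self.mask = (1 << n_bits) - 1
        self.slots = {}  # slot index -> list of (label, action)

    def index(self, label):
        # "Simple" hash: select the low n_bits of the label.
        return label & self.mask

    def lookup(self, label):
        """Return the stored action, or None on a miss."""
        for stored_label, action in self.slots.get(self.index(label), []):
            if stored_label == label:  # full-label comparison
                return action
        return None  # miss: the caller must consult main memory

    def insert(self, label, action):
        """Store or overwrite an entry (whether to do so is a policy choice)."""
        bucket = self.slots.setdefault(self.index(label), [])
        for i, (stored_label, _) in enumerate(bucket):
            if stored_label == label:
                bucket[i] = (label, action)
                return
        bucket.append((label, action))


if __name__ == "__main__":
    cache = HashedCache(4)
    cache.insert(0x12, "forward")
    cache.insert(0x22, "drop")   # same slot as 0x12
    print(cache.lookup(0x12), cache.lookup(0x22), cache.lookup(0x32))
```

The final full-label comparison is what separates a true hit from two different labels that merely share a slot index.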
If there are few or no Frequent Flyers (items that occur frequently in the traffic stream), then rates at which slots (index values) are hit by hash values in the cache are typically random. With more than 1 K (=2^10) possible items and more than 1 K slots, it is unlikely that the hit rate on any one slot will differ greatly from the average (the total number of items divided by the number of slots). For example, if there are 1 K items over an interval Dt and 1 K slots, then the probability that a randomly selected item is in a slot hit by a total of at least eight items is only about 0.001. If there are 1 M (=2^20) items and 1 M slots, then the probability that a randomly selected item is in a slot hit by a total of at least 10 items is about one in a million or 0.000001. However, if there are Frequent Flyers among the items, then they may occur at much higher rates, resulting in much higher hit counts for precisely the slots hit by Frequent Flyers. That is, with a cache that is too large, the hit rates of index values are highly skewed (a few are hit many times, most are hit seldom or not at all).
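The contrast between random and skewed slot hit counts can be illustrated with a small simulation. This is a sketch under stated assumptions: labels are random 32-bit integers, the slot is the label modulo the number of slots, and the traffic generator and its parameters are hypothetical.

```python
import random
from collections import Counter


def make_traffic(n_items, n_frequent, proportion, rng):
    """Generate labels; a given proportion comes from a few Frequent Flyers."""
    frequent = [rng.getrandbits(32) for _ in range(n_frequent)]
    traffic = []
    for _ in range(n_items):
        if frequent and rng.random() < proportion:
            traffic.append(rng.choice(frequent))
        else:
            traffic.append(rng.getrandbits(32))  # effectively a one-time item
    return traffic


def slot_hit_counts(traffic, n_slots):
    """Count how many items land in each slot (slot = label mod n_slots)."""
    return Counter(label % n_slots for label in traffic)


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = slot_hit_counts(make_traffic(1024, 0, 0.0, rng), 1024)
    skewed = slot_hit_counts(make_traffic(1024, 8, 0.5, rng), 1024)
    print("busiest slot, no Frequent Flyers:", max(uniform.values()))
    print("busiest slot, 8 Frequent Flyers: ", max(skewed.values()))
```

Without Frequent Flyers the busiest slot stays close to the average of one item per slot, while with them a handful of slots absorb about half of the traffic.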
If there are S slots and X items and if items are mapped at random to slots, then the expected number S_k of slots with exactly k items is

\[
S_k = S \, C_{X,k} \left(\frac{1}{S}\right)^{k} \left(1-\frac{1}{S}\right)^{X-k} = C_{X,k} \left(\frac{1}{S}\right)^{k-1} \left(1-\frac{1}{S}\right)^{X-k}
\]

Here, \( C_{X,k} = \dfrac{X!}{(X-k)!\,k!} \).

Lookup Mechanisms
For main memory, the prior art typically includes lookup by means of Direct Table and Patricia Tree searches. The key is hashed to N bits (perhaps by just using the first N bits of the full key or by a nontrivial hash function). Each N-bit value is an index into a Direct Table (DT). The number S of slots is thus 2^N. Each slot in the DT contains no memories, contains exactly one memory, or points to a Patricia tree in which two or more memories are stored as leaves. At the leaf of the Patricia tree, the full key must be compared to a stored copy of the full key. If there is a match, then the slot points to the data needed for the item. If there is not a match, then a miss is declared and a leaf is added to the Patricia tree by well-known methods of prior art.
As previously noted, a cache is a set of items that are stored in a memory that typically is much smaller and faster than main memory. A key can be first or simultaneously sought in cache memory. Full comparison with key values is needed for a match, but an intermediate step might be a hash function. Numerous cache policies are possible for selecting which items to store in the cache.
The typical behavior of prior art cache sizing is shown on the charts represented by FIGS. 1 and 2. For example, this can be the above Automatic Cache Sizing algorithm with the variable skew locked at 0 (so never detecting skewness). In these two charts, the number of Frequent Flyers is constantly 512 (2^9) and their Proportion in all traffic is constantly 0.5. The value of Workload is initially 256 (2^8) per Dt (for 100 time steps), then 8192 or 2^13 (for 200 time steps), then 1024 or 2^10 (for 300 time steps), then 16384 or 2^14 (for 100 time steps), then 1024 (for 300 time steps). The value of Mmax is 4000. For cache limits, Cmin=2^8 and Cmax=2^10.
Cache use for this sequence is as follows. Initially, the cache is not needed. Then even the full cache cannot quite keep M below 4000 (even with full cache, M is about 4096). Then the cache is not needed. Then it is needed again but cannot keep M below 4000 (even with full cache, M is about 8193). The value of DecProb is 1/32.
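The miss counts quoted for this sequence can be approximated with simple arithmetic. The sketch below assumes that, with the full cache, every Frequent Flyer lookup hits and every other lookup misses, and that with no cache every lookup misses; the function names are hypothetical. It gives about 4096 and about 8192 misses for the two heavy phases, in line with the approximate figures quoted above.

```python
M_MAX = 4000        # miss threshold from the text
PROPORTION = 0.5    # share of traffic that is Frequent Flyers


def misses_without_cache(workload):
    """With no cache, every lookup goes to main memory."""
    return workload


def misses_with_full_cache(workload, proportion=PROPORTION):
    """Assumption: all Frequent Flyers hit, all other items miss."""
    return round((1 - proportion) * workload)


def describe_phase(workload):
    """Say whether the cache is needed and whether it can meet M_MAX."""
    if misses_without_cache(workload) <= M_MAX:
        return "cache not needed"
    if misses_with_full_cache(workload) <= M_MAX:
        return "cache needed and sufficient"
    return "cache needed but insufficient"


if __name__ == "__main__":
    for w in (256, 8192, 1024, 16384, 1024):
        print(w, misses_with_full_cache(w), describe_phase(w))
```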
FIGS. 1 and 2 are representative of prior art in the sense that M>Mmax triggers cache size increase, whether or not that is actually beneficial. In fact, in FIG. 1, the use of the full cache wastes power and does not confer any performance benefit.
The same system with skew enabled is modeled in FIG. 2. The model uses the number FF of Frequent Flyers, the Proportion (Prop) of Frequent Flyers, the Workload (W), and the size of the Cache C to determine mathematically whether or not the hit distribution would be skewed (skew=1). The rule used is: skew=1 if FF+(1−Prop)*W<0.5*C; otherwise, skew=0.
An approach to cache sizing and performance evaluation is described in U.S. Pat. No. 6,493,810. The concept of “adequate performance” in this patent apparently is predicated on some predetermined formula or analysis. The patent would appear to use an a priori definition of performance, on the basis of which it determines that performance is low and increases cache size. In particular, the patent estimates cache-miss rates for an average user and a frequent user. Then it uses an “evaluation of performance impact of disk I/O rate,” which appears to be an analytic estimate of performance based on theoretical miss rates. However, this determination does not take into account whether a recent increase in cache size actually resulted in improved performance.
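The skew rule can be written directly as a small function. This is a sketch only; the function name is hypothetical and the input values in the usage example are chosen for illustration rather than taken from the figures.

```python
def skew_flag(ff, prop, workload, cache_size):
    """Return 1 if FF + (1 - Prop) * W < 0.5 * C, otherwise 0."""
    occupied_estimate = ff + (1 - prop) * workload
    return 1 if occupied_estimate < 0.5 * cache_size else 0


if __name__ == "__main__":
    # Illustrative values: 16 Frequent Flyers, half the traffic, W = 256.
    print(skew_flag(ff=16, prop=0.5, workload=256, cache_size=1024))
    # A heavy workload touches more than half of the slots.
    print(skew_flag(ff=512, prop=0.5, workload=8192, cache_size=1024))
```

The left-hand side estimates how many distinct slots the traffic touches in one interval, so the flag is set when fewer than half of the slots are expected to be hit.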