In computing, faster memory (cache) is more expensive than slower memory (such as RAM). Accordingly, slower memory is often employed for bulk storage. When data is needed by a processor, a faster response is achieved if the data can be obtained from the cache (a cache “hit”) rather than from the slower memory. A cache is logically placed between one or more processors and the slower memory. Some systems employ multiple layers of caches that become increasingly fast the closer they are logically placed to the processor. When data is required, the cache is queried first. If the cache does not have the required data (a cache “miss”), the cache obtains the data from the next slower level of memory (which, if the data is not already present there, likewise obtains it from a still slower level). The data is then kept in the cache for so long as the faster memory has room for it. When the cache is full, data must be removed therefrom to allow additional data to be stored therein. Accordingly, various algorithms (replacement policies) exist to determine what data is pushed out of the cache.
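By way of illustration only, the query-then-fall-through behavior described above can be sketched as follows. The function name `read_through`, the use of dictionaries to stand in for cache levels, and the returned level index are assumptions made for this sketch; capacity limits and replacement are deliberately ignored here.

```python
def read_through(address, levels, main_memory):
    """Look up `address` in a list of cache levels ordered fastest first.

    Each level is modeled as a plain dict (an assumption of this sketch).
    Returns (value, depth), where depth is the index of the level that
    supplied the data, or len(levels) if main memory supplied it.
    """
    for depth, level in enumerate(levels):
        if address in level:
            # Cache hit at this level.
            value = level[address]
            break
    else:
        # Every cache level missed, so fall back to the slowest memory.
        depth = len(levels)
        value = main_memory[address]

    # Keep a copy in each faster level that missed, so that later
    # requests for the same address can be served from there.
    for level in levels[:depth]:
        level[address] = value
    return value, depth


if __name__ == "__main__":
    caches = [{}, {}]              # two empty cache levels
    memory = {7: "payload"}        # slower backing memory
    print(read_through(7, caches, memory))  # first call: served by main memory
    print(read_through(7, caches, memory))  # second call: served by the fastest level
```

In this sketch the first request misses in every level and is filled from main memory, while the second request for the same address hits in the fastest level.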
One such replacement policy is the least recently used (LRU) algorithm, which removes the data that has gone the longest without being accessed. With the above construct, subsequent calls for data that is already present in the cache (cache hits) can be served quickly, thereby speeding up the process. Each unit of memory that can be transferred to the cache is called a cache line. Cache hits often result in the desired data being provided to a processor in a few cycles. Cache misses, however, often require many more cycles to get the requested data to the processor. Caches can be used for both reads and writes by the processor.
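A minimal software sketch of an LRU replacement policy is given below for illustration. The class name `LRUCache`, its `read` method, and the hit and miss counters are hypothetical names chosen for this sketch; a hardware cache would track recency per set of cache lines rather than in a single ordered dictionary.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache holding at most `capacity` cache lines (illustrative only)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.lines = OrderedDict()  # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def read(self, address, backing_store):
        if address in self.lines:
            # Cache hit: mark the line as most recently used.
            self.hits += 1
            self.lines.move_to_end(address)
            return self.lines[address]

        # Cache miss: fetch the data from the slower memory.
        self.misses += 1
        value = backing_store[address]
        if len(self.lines) >= self.capacity:
            # Cache is full: push out the least recently used line.
            self.lines.popitem(last=False)
        self.lines[address] = value
        return value


if __name__ == "__main__":
    memory = {addr: addr * 10 for addr in range(8)}
    cache = LRUCache(capacity=2)
    for addr in [0, 1, 0, 2, 1]:
        cache.read(addr, memory)
    print(cache.hits, cache.misses)  # prints "1 4"
    print(list(cache.lines))         # prints "[2, 1]"
```

In the usage shown, only the second request for address 0 hits. The request for address 2 pushes out address 1 (the least recently used line at that point), so the later request for address 1 misses again.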
Accordingly, the above process provides such speed advantages only when the data is located in the cache. As described above, this occurs on the second and subsequent calls for the data; the speed advantages are not present the first time that data is called. To alleviate this problem, prefetch algorithms have been developed that attempt to predict what information (e.g., instructions or data) will be needed and place that information in the cache before it is called for (predictive algorithms). As previously noted, including one piece of data in the cache usually requires expulsion of another piece of data. Thus, it is not feasible to put all data in the cache. Furthermore, predictive systems typically require additional hardware to compute the predictive algorithms. Accordingly, the time saved by successful prediction must be great enough to overcome the time and hardware invested in generating the prediction. Existing prefetching solutions generally use large tables to identify known addresses used in the past and to provide complex rules that dictate prefetching based on a multitude of variables.
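The trade-off noted above can be illustrated with the commonly used average access time estimate (hit time plus miss rate multiplied by miss penalty). The sketch below is an assumption-laden illustration: the function name, the per-access `overhead_cycles` term used to model the cost of generating predictions, and all numeric values are hypothetical and are not taken from any particular system. Hardware area and power costs are not modeled.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles,
                          overhead_cycles=0.0):
    """Estimate the average cycles per memory access.

    `overhead_cycles` is a hypothetical per-access cost attributed to
    the prediction logic (an assumption of this sketch).
    """
    return hit_cycles + miss_rate * miss_penalty_cycles + overhead_cycles


if __name__ == "__main__":
    # Illustrative numbers only.
    baseline = average_access_cycles(4, 0.10, 200)
    prefetching = average_access_cycles(4, 0.06, 200, overhead_cycles=2)

    # Prefetching is worthwhile in this model only if the cycles saved
    # by avoided misses exceed the cycles spent on prediction.
    print("baseline:", baseline)
    print("with prefetching:", prefetching)
    print("worthwhile:", prefetching < baseline)
```

Under these assumed values the reduction in miss rate outweighs the prediction overhead; with a larger overhead or a smaller reduction in misses, the comparison would reverse.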
Stride prefetchers and Markov prefetchers, for example, generally use large tables to identify the known addresses. Such techniques do not generally capture the overall spatial locality of workloads being carried out by the one or more processors and can potentially be very conservative. Known stride prefetching algorithms, for instance, detect that a cache miss has occurred and predict that an address that is offset by a distance from the missed address is likely to be missed in the near future. As such, when an address miss occurs, an address is prefetched that is offset by a distance from the missed address. If desired, when there is a hit in the buffer, the address that is offset by a distance from the hit address can also be prefetched. These prefetchers typically use large tables and may incur additional hardware and/or power overheads.
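The miss-triggered and hit-triggered prefetching just described can be sketched in simplified form as follows. This sketch assumes a single fixed stride supplied by the caller and a small prefetch buffer; the class name `StridePrefetcher`, the `prefetch_on_hit` flag, and the buffer size are hypothetical. It omits the per-instruction tables that practical stride prefetchers use to learn the stride.

```python
from collections import deque


class StridePrefetcher:
    """Toy fixed-stride prefetcher with a small prefetch buffer."""

    def __init__(self, stride, buffer_size=4, prefetch_on_hit=True):
        self.stride = stride
        self.prefetch_on_hit = prefetch_on_hit
        # When the buffer is full, the oldest prefetched address is dropped.
        self.buffer = deque(maxlen=buffer_size)
        self.hits = 0
        self.misses = 0

    def _prefetch(self, address):
        if address not in self.buffer:
            self.buffer.append(address)

    def access(self, address):
        """Return True if `address` was already in the prefetch buffer."""
        if address in self.buffer:
            # Hit in the buffer: consume the entry and, if enabled,
            # prefetch the address one stride beyond the hit address.
            self.hits += 1
            self.buffer.remove(address)
            if self.prefetch_on_hit:
                self._prefetch(address + self.stride)
            return True

        # Miss: prefetch the address one stride beyond the missed address.
        self.misses += 1
        self._prefetch(address + self.stride)
        return False


if __name__ == "__main__":
    prefetcher = StridePrefetcher(stride=64)
    for addr in [0, 64, 128, 192, 1000]:
        prefetcher.access(addr)
    print(prefetcher.hits, prefetcher.misses)  # prints "3 2"
```

In the usage shown, the regular 64-byte access pattern hits after the first miss, while the unrelated access to address 1000 misses. With `prefetch_on_hit=False`, the same sequence alternates between misses and hits, which illustrates why prefetching on buffer hits may be desirable.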
Accordingly, there exists a need for a device and method that provide an improved predictive caching technique.