1. Field of the Invention
The present invention relates to processors and computing devices. More specifically, the present invention relates to a method and apparatus in a processor for prefetching data in arrays.
2. Description of the Related Art
A data prefetch cache is typically used to prefetch large amounts of data having little or no temporal locality without disturbing a conventional first-level data cache. The data prefetch cache is thus used for masking load latencies. In many applications, such as scientific computation, data prefetch is used when iterating over the elements of a large array with little re-use of accessed elements. If a first-level cache were used for such accesses, the accessed array elements could replace other data that is re-used, such as scalar variables in a loop. Once such re-used data is replaced in the first-level cache, the replaced data items must be repeatedly reloaded, a condition known as thrashing. Data prefetch avoids thrashing because array elements are prefetched to the data prefetch cache and then loaded from that cache, so that the first-level cache is not polluted by little-used data. Typically, the data prefetch cache is a fully associative cache that is much smaller than the first-level cache. The size of the data prefetch cache is determined by the total number of load operations that can be active at one time.
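The thrashing effect described above can be illustrated with a small simulation. The following is a minimal sketch only, assuming a hypothetical direct-mapped first-level cache of 8 lines of 32 bytes, 8-byte array elements, and arbitrarily chosen addresses; none of these parameters or names (Cache, cache_access, scalar_misses) come from the description above. For simplicity the data prefetch cache itself is not modeled: when array_bypasses_l1 is true, array accesses are assumed to be served elsewhere and never touch the first-level cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LINE_BYTES 32u /* hypothetical line size in bytes */
#define NUM_LINES  8u  /* hypothetical, deliberately tiny first-level cache */

/* Direct-mapped cache model: one tag per line, no data stored. */
typedef struct {
    bool valid[NUM_LINES];
    unsigned long tag[NUM_LINES];
} Cache;

/* Returns true on a hit. On a miss, installs the line, which evicts
 * whatever previously occupied that cache index. */
static bool cache_access(Cache *c, unsigned long addr)
{
    assert(c != NULL);
    unsigned long line = addr / LINE_BYTES;
    unsigned idx = (unsigned)(line % NUM_LINES);
    if (c->valid[idx] && c->tag[idx] == line) {
        return true;
    }
    c->valid[idx] = true;
    c->tag[idx] = line;
    return false;
}

/* Counts first-level misses on a single re-used scalar while a loop
 * streams over n array elements of 8 bytes each. */
unsigned scalar_misses(size_t n, bool array_bypasses_l1)
{
    Cache l1 = {0};
    const unsigned long scalar_addr = 0;   /* assumed address, maps to index 0 */
    const unsigned long array_base = 4096; /* assumed address, also maps to index 0 */
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        /* The scalar is read on every iteration. */
        if (!cache_access(&l1, scalar_addr)) {
            misses++;
        }
        /* The array element is read once and never re-used. */
        if (!array_bypasses_l1) {
            (void)cache_access(&l1, array_base + 8ul * i);
        }
    }
    return misses;
}
```

In this model, scalar_misses(64, true) reports a single compulsory miss on the scalar, whereas scalar_misses(64, false) reports repeated misses because the streamed array lines periodically map to the same index as the scalar and evict it.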
A conventional data prefetch cache has several disadvantages. One disadvantage of a software-controlled prefetch technique is that additional prefetch code typically must be inserted either before a loop body or within a loop body, thereby increasing the run-time overhead of the code.
Another disadvantage of a software-controlled prefetch technique is that the number of software execution cycles between a prefetching operation and an operation that uses the prefetched data is strictly and statically defined by the code structure, while the memory access latency of a data access is variable. If the memory latency exceeds the software execution time, the processor stalls. The strict static definition of the code structure is inherently disadvantageous since the code structure cannot adjust to dynamic variations in memory access latency. Performance of the processor may suffer due to an increase in processor stalls while the processor awaits a transfer of data from memory. The effect of stalls on processor performance is magnified in software-pipelined loops due to the accumulation of timing delays.
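By way of illustration only, a software-controlled prefetch inside a loop body may resemble the following sketch. It assumes a compiler that provides the GCC/Clang builtin __builtin_prefetch; the function name sum_with_prefetch and the value of PREFETCH_DISTANCE are hypothetical and are not taken from any particular implementation. The sketch shows both disadvantages discussed above: the prefetch adds instructions to every iteration, and the distance between the prefetch and the use of the data is a compile-time constant that cannot track run-time memory latency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements. It is fixed when the
 * code is compiled, so it cannot adapt to variable memory latency. */
#define PREFETCH_DISTANCE 16

/* Sums n elements, prefetching a fixed number of iterations ahead. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Extra work in the loop body: a bounds check plus the prefetch. */
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0 requests a read; locality = 0 hints that the data
             * has little or no temporal re-use. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 0);
        }
        sum += a[i];
    }
    return sum;
}
```

If the memory latency exceeds the time taken to execute PREFETCH_DISTANCE iterations, the load of a[i] still waits on memory; if the latency is much shorter, the prefetched line occupies cache space longer than necessary.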
The aforementioned problem of thrashing also arises in the data prefetch cache itself, in which useful data may be replaced.
A further disadvantage is that, for a dedicated prefetch buffer, a complicated associative structure is commonly needed.