Data prefetching is a technique often employed by computer processors to improve execution performance by retrieving data from slow-access storage, typically main memory, to fast-access local storage, typically cache memory, before the data are actually needed for processing. Data prefetching strategies typically leverage situations in which sequential data items are stored contiguously in statically-allocated memory, such as is typically the case with array-based data that are to be retrieved and processed in the order in which they are stored. For example, when the following programming loop is used to access a data array:
for (int i=0; i<1024; i++) { array1[i] = array1[i] + 1;}
the i-th element of the array “array1” is accessed at each iteration. Thus, array elements that are going to be accessed in future iterations may be prefetched before the future iterations occur.
In hardware-based prefetching, a computer processor includes a mechanism that monitors the stream of instructions of a program during its execution, recognizes elements that the program might access in the future based on this stream, and prefetches such elements into the processor's cache. In the above programming loop example, a type of hardware-based prefetching known as “strided prefetching” may be used to identify instructions that access data at a computer memory address, determine that the same instruction at the same instruction address is executed multiple times, each time accessing data at a different computer memory address, and determine the distance from one such computer memory address to the next, known as a “stride.” Once a consistent stride pattern is established for such an instruction at a given instruction address, data may be prefetched from computer memory addresses that are one or more strides ahead of the computer memory address most recently accessed by the instruction. In order to monitor such instructions, hardware-based strided prefetching mechanisms typically maintain, in a history table, one stride-tracking record for each such instruction, the stride-tracking record indicating the address of the instruction and tracking the stride between the computer memory addresses accessed each time the same instruction is executed. Establishing a consistent stride typically takes three executions of a prefetching candidate instruction: its stride is determined during the second execution and verified during the third. Thus, in the above example, if a consistent stride is verified when array1[2] is fetched from computer memory, prefetching can begin with the computer memory location at the next stride.
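The three-execution rule described above can be illustrated with a small software model of a single stride-tracking record. This is only a sketch under stated assumptions, not a description of any particular processor: the record layout, the names `stride_record`, `stride_observe` and `first_prefetch_index`, the instruction and array addresses, and the prefetch distance parameter are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stride-tracking record for one memory-access instruction. */
typedef struct {
    uint64_t instr_addr; /* address of the tracked instruction */
    uint64_t last_addr;  /* data address it accessed most recently */
    int64_t  stride;     /* most recently observed stride, in bytes */
    unsigned seen;       /* executions observed so far, saturating at 3 */
} stride_record;

/* Feed one execution of the tracked instruction into its record.
 * Returns true, and writes the address to prefetch, when the stride seen on
 * this execution matches the stride seen on the previous execution. */
bool stride_observe(stride_record *r, uint64_t data_addr,
                    unsigned distance, uint64_t *prefetch_addr)
{
    assert(r != NULL && prefetch_addr != NULL);
    if (r->seen == 0) {            /* 1st execution: nothing to compare yet */
        r->last_addr = data_addr;
        r->seen = 1;
        return false;
    }
    int64_t stride = (int64_t)(data_addr - r->last_addr);
    /* 2nd execution only determines the stride; 3rd and later verify it. */
    bool verified = (r->seen >= 2) && (stride == r->stride) && (stride != 0);
    r->stride = stride;
    r->last_addr = data_addr;
    if (r->seen < 3)
        r->seen++;
    if (verified)
        *prefetch_addr = data_addr + (uint64_t)(stride * (int64_t)distance);
    return verified;
}

/* Replays "for (i = 0; i < n; i += step) access array1[i]" for a single
 * tracked instruction and returns the index i at which the first prefetch
 * is issued, or -1 if the loop ends before a stride is verified. */
long first_prefetch_index(unsigned n, unsigned step, unsigned elem_size)
{
    assert(step > 0);
    stride_record r = { .instr_addr = 0x1000 }; /* hypothetical address */
    const uint64_t base = 0x100000;             /* hypothetical &array1[0] */
    for (unsigned i = 0; i < n; i += step) {
        uint64_t target;
        if (stride_observe(&r, base + (uint64_t)i * elem_size, 1, &target))
            return (long)i;
    }
    return -1;
}
```

Under these assumptions, `first_prefetch_index(1024, 1, 4)` returns 2, which corresponds to the stride being verified when array1[2] is fetched.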
Unfortunately, hardware-based strided prefetching is complicated by optimizing compilers that attempt to improve a program's execution performance by employing “loop unrolling” techniques, whereby loop instructions that would otherwise be performed in repeated iterations are transformed into a repeated sequence of instructions that require fewer iterations. Thus, in the above programming loop example, the loop may be transformed into separate instructions in a loop-unrolled format equivalent to the following instructions:
for (int i=0; i<1024; i+=5) { array1[i] = array1[i] + 1; array1[i+1] = array1[i+1] + 1; array1[i+2] = array1[i+2] + 1; array1[i+3] = array1[i+3] + 1; array1[i+4] = array1[i+4] + 1;}
If hardware-based strided prefetching is then applied in the manner described above, the single array access instruction of the original loop will have been transformed into five corresponding instructions requiring memory access, each having a different instruction address, so five separate stride-tracking records will be required to track the strides between the computer memory addresses accessed by their corresponding instructions. Where a computer processor is configured with a limited number of stride-tracking records, this can result in thrashing of the history table, aliasing when mapping instruction addresses to stride-tracking records, or contention, any of which may reduce the effectiveness of the hardware-based prefetching mechanism. Also, given that in the loop-unrolled example above a consistent stride can only be verified for the instruction array1[i] = array1[i] + 1 during its third iteration, when fetching array1[10], prefetching might not occur at all when short loops are loop-unrolled, because such a loop may finish before any stride has been verified.
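The effect of loop unrolling on a small history table can be sketched with a toy simulation. Everything here is an assumption made for illustration: the four-entry direct-mapped table, indexing by instruction address, 4-byte instructions and array elements, the addresses used, and the names `table_record`, `history_table`, `table_observe` and `simulate_loop`. As in the text above, each unrolled statement is modeled as one memory-access instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_ENTRIES 4u /* hypothetical, deliberately tiny history table */

typedef struct {
    bool     valid;
    uint64_t instr_addr; /* instruction that owns this record */
    uint64_t last_addr;  /* data address it accessed most recently */
    int64_t  stride;     /* most recently observed stride, in bytes */
    unsigned seen;       /* executions observed, saturating at 3 */
} table_record;

typedef struct {
    table_record entries[TABLE_ENTRIES];
    unsigned evictions;  /* records overwritten by a different instruction */
    unsigned prefetches; /* executions on which a stride was verified */
} history_table;

/* Record one memory access made by the instruction at instr_addr. */
void table_observe(history_table *t, uint64_t instr_addr, uint64_t data_addr)
{
    /* Assumed direct-mapped indexing on 4-byte-aligned instruction addresses. */
    table_record *r = &t->entries[(instr_addr >> 2) % TABLE_ENTRIES];
    if (!r->valid || r->instr_addr != instr_addr) {
        if (r->valid)
            t->evictions++;        /* aliasing: another instruction loses its record */
        *r = (table_record){ .valid = true, .instr_addr = instr_addr,
                             .last_addr = data_addr, .seen = 1 };
        return;
    }
    int64_t stride = (int64_t)(data_addr - r->last_addr);
    if (r->seen >= 2 && stride == r->stride && stride != 0)
        t->prefetches++;           /* stride verified: a prefetch would be issued */
    r->stride = stride;
    r->last_addr = data_addr;
    if (r->seen < 3)
        r->seen++;
}

/* Replays the loop over n_elems elements, unrolled by the given factor
 * (1 means not unrolled), with one access instruction per unrolled statement. */
history_table simulate_loop(unsigned unroll, unsigned n_elems)
{
    assert(unroll > 0);
    history_table t = {0};
    for (unsigned i = 0; i < n_elems; i += unroll)
        for (unsigned k = 0; k < unroll && i + k < n_elems; k++)
            table_observe(&t, 0x1000 + 4u * k,              /* hypothetical code address */
                          0x100000 + 4u * (uint64_t)(i + k)); /* hypothetical &array1[i+k] */
    return t;
}
```

In this model, a 10-element loop that is not unrolled (`simulate_loop(1, 10)`) verifies a stride on 8 of its accesses, whereas the same loop unrolled by five (`simulate_loop(5, 10)`) verifies none, since each instruction executes only twice. With 20 elements and an unroll factor of five, the first and fifth instructions map to the same table entry and evict each other on every execution, so only the other three instructions ever have a stride verified.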