Computing devices are implemented with a computer processor and a memory device, and can also include hierarchical levels of cache memory (referred to herein as a “cache”). A cache is utilized by the computer processor in a computing device to reduce the average data access time from main memory, and the computing device can be implemented with multiple levels of cache that include smaller, faster caches backed up by larger, slower caches that store copies of data for faster access. For example, a computing device may include a level one (L1) cache, a level two (L2) cache (and so on), and the memory device (or main memory). Generally, the processor of the computing device processes load data instructions (LDR) to load data required for a particular operation, and operates to first check the L1 cache for the data. For a data hit (i.e., the data is contained in the L1 cache), the processor continues operation at a higher speed. If the data is not contained in the L1 cache (referred to as a data miss), the processor checks the next level of cache, e.g., the L2 cache (and so on), before the memory device (or main memory) is checked for the data.
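The lookup order described above can be illustrated with a minimal sketch. It is not an implementation from this description: the `load` function, the dictionary-based cache levels, and the fill-on-miss policy are assumptions made for illustration, and capacity limits and eviction are ignored.

```python
# Hypothetical model: each cache level is a dict mapping address -> data.
def load(address, levels, main_memory):
    """Return (data, source), checking each cache level in order before main memory."""
    for name, cache in levels:
        if address in cache:  # data hit at this level
            return cache[address], name
    # Data miss in every cache level: fall back to main memory.
    data = main_memory[address]
    # Assumed policy: copy the data into every level so a later load hits in L1.
    for _, cache in levels:
        cache[address] = data
    return data, "memory"


main_memory = {addr: addr * 10 for addr in range(0, 64, 4)}
l1, l2 = {}, {0x8: 80}
levels = [("L1", l1), ("L2", l2)]

first = load(0x4, levels, main_memory)   # misses L1 and L2, served by main memory
second = load(0x4, levels, main_memory)  # now hits in L1
third = load(0x8, levels, main_memory)   # misses L1, hits in L2
```

In this sketch the second load of the same address is served from the L1 cache, which is the case in which the processor continues operation at a higher speed.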
Generally, the load data instructions are processed to access memory (e.g., the cache levels and memory device) in a contiguous manner. If the sequence of accessing memory can be accurately predicted, then cache lines of the data can be copied to one or more levels of the cache ahead of time, thus increasing the cache hit rate. This is commonly referred to as prefetching, in which a cache line (e.g., a cache block of data) is copied from the memory device into a cache before it is needed. The prefetch mechanism requests cache lines that are copied to the localized caches well in advance of the load data instructions being processed to access the cache data from a particular cache line.
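The effect of prefetching on the cache hit rate can be shown with a small simulation. This is a sketch under stated assumptions: the 64-byte `LINE_SIZE`, the `hit_rate` helper, and the `prefetch_distance` parameter are hypothetical, the cache is unbounded, and a prefetched line is treated as available immediately.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes


def hit_rate(addresses, prefetch_distance=0):
    """Fraction of loads that hit, with an unbounded cache and instant prefetch."""
    cached_lines = set()
    hits = 0
    for address in addresses:
        line = address // LINE_SIZE
        if line in cached_lines:
            hits += 1
        else:
            cached_lines.add(line)  # demand fill on a miss
        # Copy the following line(s) into the cache ahead of the loads that need them.
        for ahead in range(1, prefetch_distance + 1):
            cached_lines.add(line + ahead)
    return hits / len(addresses)


addresses = range(0, 1024, 4)  # 256 contiguous word loads spanning 16 lines
without_prefetch = hit_rate(addresses)                    # 240 of 256 loads hit
with_prefetch = hit_rate(addresses, prefetch_distance=1)  # 255 of 256 loads hit
```

Without prefetch, the first load to each of the 16 lines misses. With a prefetch distance of one line, only the very first load misses in this idealized model.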
A software program that is executed by a processor of a computing device may process the load data instructions in loops that access memory in an incremental order, such as to access a memory location that is incremented every time the loop is incremented. For example, load data instructions are generated by a loop that increments a memory address, such as for an instruction (LDR) that accesses the memory addresses {0x0, 0x4, 0x8, . . . } as LDR 0x0, LDR 0x4, LDR 0x8, . . . LDR 0x120, LDR 0x124, etc. In this example, the LDR instructions are generated when a loop increments the address by four bytes (a word), and the software program accesses a contiguous word every iteration. Because the distance between the addresses of successive instructions is four bytes, the prefetch mechanism can generate a prefetch access to obtain the word well ahead of a load data instruction for the word being executed.
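As a brief illustration of the address sequence in this example, the sketch below generates the load addresses that such a loop would produce. The `ldr_addresses` helper and its parameters are hypothetical names used only for this illustration.

```python
def ldr_addresses(base, word_size, count):
    """Addresses touched by a loop that advances one word per iteration."""
    return [base + i * word_size for i in range(count)]


addresses = ldr_addresses(base=0x0, word_size=4, count=74)
trace = [f"LDR {hex(a)}" for a in addresses]

# trace[:3] is ["LDR 0x0", "LDR 0x4", "LDR 0x8"];
# iterations 72 and 73 produce "LDR 0x120" and "LDR 0x124".
```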
The distance between the memory addresses of successive LDR instructions is commonly called the stride, and may be any value from one byte to multiples of a cache line. A computing device can implement stride detection logic that calculates the stride between an observed load address and the last accessed address for a particular stream of load data instructions, as tracked by a program counter. Once the stride is calculated, multiple predicted preload instructions for incremental values of the stride can be processed to prefetch the corresponding data. For example, the incremental memory access for a cache line can be represented as {0,1,2,3,4,5,6} and the stride [1,1,1,1,1,1] is one (1) as calculated by the stride detection logic, where future instruction transactions can be predicted as {7,8,9,10,11,12 . . . }. As in this example, the stride detection forms the basis of the prefetch logic, and if the stride detection logic detects that the stride is no longer the same as the previous history of strides, the prefetch logic stops prefetching the predicted data.
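A minimal sketch of stride detection keyed by program counter follows. It is an illustration under assumptions rather than the logic of any particular device: the `StrideDetector` class, the `confirmations` threshold, and the `degree` (number of predicted addresses) are hypothetical.

```python
class StrideDetector:
    """Tracks the last address and stride per load stream (keyed by program counter)."""

    def __init__(self, confirmations=2):
        # Assumed policy: the same stride must repeat this many times before predicting.
        self.confirmations = confirmations
        self.entries = {}  # pc -> (last_address, last_stride, matches)

    def observe(self, pc, address, degree=3):
        """Record a load; return predicted addresses, or [] while not confident."""
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = (address, None, 0)
            return []
        last_address, last_stride, matches = entry
        stride = address - last_address
        # A stride that differs from the history resets confidence and stops prefetching.
        matches = matches + 1 if stride == last_stride else 0
        self.entries[pc] = (address, stride, matches)
        if matches >= self.confirmations and stride != 0:
            return [address + stride * k for k in range(1, degree + 1)]
        return []


detector = StrideDetector()
predictions = [detector.observe(pc=0x400, address=a) for a in range(7)]
# After the access to 6, the sketch predicts the next three addresses: [7, 8, 9].
```

The first few loads return no predictions because the stride has not yet repeated often enough to be trusted. The threshold is a design choice that trades prefetch coverage against wasted prefetches.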
The stride distance calculation is straightforward when the stream of load data instructions is in order. Further, to achieve higher CPU performance, memory disambiguation can be implemented that allows the load data instructions to be out of order for data access in memory. A common technique for memory disambiguation is Read-after-Read (RAR) disambiguation, which provides that earlier loads can outrun later loads, even if the memory address dependencies are not known. The address dependencies are resolved later once the physical address of the later instruction is determined.
However, since the load data instructions may be out of order, the stride that is calculated between the instructions is not accurate for every disambiguation, which interrupts the prefetch logic. For example, the memory access for a cache line can be represented as {0,3,4,1,6,2,5} and the stride is [3,1,−3,5,−4,3]. The detected stride is not uniform, and in this example, memory accesses 1 and 2 are pushed aside for disambiguation to allow accesses 3 and 4 to pass through before access 1 is loaded into the data stream. Similarly, access 5 is pushed aside for disambiguation to allow access 6 to pass through before accesses 2 and 5 are loaded into the data stream. In this example, the underlying stride is still one (1); however, the prefetch logic cannot determine the next predicted stride and will interrupt prefetching due to the non-uniform strides detected when the load instructions are out of order. The non-ordered memory disambiguation interrupts prefetching the data for subsequent LDR instructions, which can then increase the average data access time.
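The arithmetic in this example can be checked with a short sketch. The `strides` and `is_uniform` helpers are hypothetical names introduced for illustration; they compute the difference between each observed access and the one before it, as a simple stride detector would.

```python
def strides(accesses):
    """Difference between each access and the access observed just before it."""
    return [b - a for a, b in zip(accesses, accesses[1:])]


def is_uniform(stride_history):
    """True when every observed stride is the same value."""
    return len(set(stride_history)) == 1


in_order = [0, 1, 2, 3, 4, 5, 6]
out_of_order = [0, 3, 4, 1, 6, 2, 5]

in_order_strides = strides(in_order)          # [1, 1, 1, 1, 1, 1]
out_of_order_strides = strides(out_of_order)  # [3, 1, -3, 5, -4, 3]

# The same accesses in program order still have a uniform stride of one.
underlying_strides = strides(sorted(out_of_order))  # [1, 1, 1, 1, 1, 1]
```

Both streams touch the same locations, so the underlying stride is unchanged. Only the observed order differs, which is why a detector that compares consecutive observed addresses sees no uniform stride.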