Computers typically use a memory hierarchy that includes multiple levels of memory, each having a different access latency. Memory at the lowest level has the shortest access latency, while memory at the highest level has the longest. In general, the faster the memory, the more expensive it is to implement. Therefore, a lower memory level will typically have less storage space than a higher memory level. An illustrative memory hierarchy includes: very high speed machine registers; an L1 cache, which is a small amount of very fast memory located on or near a processor chip; an L2 cache, which is fast memory located adjacent to the processor; random access memory (RAM), which is dynamic memory located in the computer; a hard drive, which is relatively slow but voluminous storage; and remote storage, which is typically used for backup, long-term storage, and the like. In this manner, a significant amount of storage space can be provided in a computing environment while the cost of the memory as a percentage of the overall cost of the computing environment remains reasonable.
Typically, the processing speed of the processor is approximately matched to the access speed of the fastest memory (e.g., L1 cache). As a result, when data is available in the fastest memory, the processor does not need to wait for the data, thereby avoiding processor stalls. However, several processing cycles can be lost when the required data is not available in the fastest memory level (i.e., a memory miss occurs at that level) and therefore must be retrieved from a slower memory level (e.g., RAM). In this case, the data is copied from the slower memory to the fastest memory, where it is made available to the processor. Execution of the application can pause for several processing cycles while the data is being transferred between memory levels, degrading application performance. In general, recent increases in the processing speed of processors have exceeded increases in the access speed of various types of memories, thereby increasing the impact of memory misses on the processing performance of the computing environment.
Numerous approaches seek to reduce the frequency of memory misses and/or tolerate memory access latency. A common strategy to hide memory access latency is data prefetching. Data prefetching includes copying data from a slower memory to a faster memory, such as an L1 cache, before it is required by an application being executed. Subsequently, when the application requests the data, it will already be available in the fast memory and can be processed without waiting for the data to be copied from the slower memory.
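The effect described above can be illustrated with a toy model. The following Python sketch is illustrative only: the `ToyCache` class, its LRU replacement policy, the four-line capacity, and the prefetch distance are all assumptions introduced here, and the model ignores timing entirely (a real prefetch must also be issued early enough to complete before the data is needed). It counts demand misses for a sequential scan with and without prefetching one block ahead.

```python
from collections import OrderedDict

class ToyCache:
    """Hypothetical fully associative LRU cache that tracks block numbers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # block number -> True, oldest first
        self.demand_misses = 0

    def _fill(self, block):
        # Bring a block into the cache, evicting the least recently used one.
        if block in self.lines:
            self.lines.move_to_end(block)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[block] = True

    def access(self, block):
        # A demand access by the application; a miss would stall the processor.
        if block not in self.lines:
            self.demand_misses += 1
        self._fill(block)

    def prefetch(self, block):
        # A prefetch fills the cache but is never counted as a demand miss.
        self._fill(block)

def scan(num_blocks, distance):
    """Scan blocks 0..num_blocks-1, prefetching `distance` blocks ahead (0 = off)."""
    cache = ToyCache(capacity=4)
    for b in range(num_blocks):
        if distance:
            cache.prefetch(b + distance)
        cache.access(b)
    return cache.demand_misses

print(scan(16, 0), scan(16, 1))  # prints "16 1"
```

In this model only the very first access misses once prefetching is enabled, because every later block has been copied into the cache before the application requests it.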
In general, data prefetching can be accomplished by software alone, hardware alone, or a combination of the two. Software-based data prefetching relies on compile-time analysis to insert and schedule explicit fetch instructions within an application. Hardware-based data prefetching employs special hardware in a computer that monitors the storage reference patterns of the application in an attempt to infer prefetching opportunities.
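One simple way a hardware prefetcher can infer a prefetching opportunity is stride detection. The Python sketch below is a software stand-in for such logic, under stated assumptions: the function name `stride_predictions` and the rule "predict once the same non-zero stride has been seen twice in a row" are chosen for illustration and do not describe any particular processor.

```python
def stride_predictions(addresses):
    """For each access, return the address a simple stride detector
    would prefetch next, or None if it has no confident prediction."""
    predictions = []
    last_addr = None
    last_stride = None
    for addr in addresses:
        stride = None if last_addr is None else addr - last_addr
        # Predict only when the same non-zero stride repeats.
        if stride is not None and stride != 0 and stride == last_stride:
            predictions.append(addr + stride)
        else:
            predictions.append(None)
        last_addr, last_stride = addr, stride
    return predictions

# Regular, array-like accesses: the detector locks on after two equal strides.
print(stride_predictions([0, 8, 16, 24]))   # prints "[None, None, 24, 32]"
# Irregular accesses: no stride repeats, so nothing is predicted.
print(stride_predictions([0, 40, 8, 96]))   # prints "[None, None, None, None]"
```

The detector sees only the addresses observed at execution time, which is why it needs a warm-up period and has nothing to offer when the strides never repeat.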
However, both approaches have drawbacks. For software-based data prefetching, the inserted instructions must be executed by the main processor, thereby adding to the overhead of executing the application. While hardware-based data prefetching does not require any additional instructions, it is frequently less accurate than software-based data prefetching since it relies only on execution-time information to predict data accesses, rather than on compile-time information, and therefore can detect only the simplest access patterns (i.e., constant-stride patterns). To this extent, in some approaches, a combination of software and hardware-based data prefetching is utilized, in which compile-time application information is used to direct a hardware prefetcher, thereby reducing the overall amount of software overhead. However, these approaches are limited to data prefetching at a single level of memory, e.g., an L1 cache.
One problem with prefetching data is cache pollution, in which data that is not immediately required by the application is prefetched into a cache and displaces other data that is required sooner, causing a cache miss when the application requests the displaced data. Conventional data prefetching approaches are effective at hiding latency due to cache misses for application data that is stored using regular data structures, such as an array, where memory access strides can be determined by a compiler. Similarly, using dynamic value profiling, conventional data prefetching approaches can also be effective for applications that have regular memory access strides at runtime.
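Cache pollution can be made concrete with a small trace-driven model. The Python sketch below is a hedged illustration: the function `demand_misses`, the two-line LRU cache, and the block names "A", "B", "X", and "Y" are all hypothetical. It compares an application that reuses two blocks against the same application interrupted by two prefetches of blocks it does not need yet.

```python
from collections import OrderedDict

def demand_misses(trace, capacity):
    """Count demand misses for a trace of ("access" | "prefetch", block)
    events in a hypothetical fully associative LRU cache."""
    lines = OrderedDict()
    misses = 0
    for kind, block in trace:
        if block in lines:
            lines.move_to_end(block)        # hit: refresh LRU position
            continue
        if kind == "access":
            misses += 1                     # only demand accesses count
        if len(lines) >= capacity:
            lines.popitem(last=False)       # evict least recently used
        lines[block] = True
    return misses

clean = [("access", "A"), ("access", "B"),
         ("access", "A"), ("access", "B")]
polluted = [("access", "A"), ("access", "B"),
            ("prefetch", "X"), ("prefetch", "Y"),   # not needed yet
            ("access", "A"), ("access", "B")]

print(demand_misses(clean, capacity=2))     # prints "2"
print(demand_misses(polluted, capacity=2))  # prints "4"
```

In the polluted trace the two premature prefetches evict "A" and "B", so the application misses again on data it had already loaded.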
However, many applications (e.g., non-numeric applications) use recursive data structures, such as linked lists, trees, graphs, and the like. These data structures employ pointers to link data nodes and form the overall data structure. At runtime, these data structures frequently have irregular memory access strides, resulting in very poor spatial data locality. As a result, the effectiveness of conventional data prefetching approaches is limited.
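The difference in access strides can be seen in a small example. In the Python sketch below, the addresses are invented purely for illustration: `array_addresses` models contiguous 16-byte array elements, and `heap` models a linked list whose nodes happen to be scattered across memory. The helper names are likewise hypothetical.

```python
# Hypothetical heap: node address -> (payload, address of next node or None).
heap = {
    0x5000: ("a", 0x1240),
    0x1240: ("b", 0x9F00),
    0x9F00: ("c", 0x3310),
    0x3310: ("d", None),
}

def traversal_addresses(heap, head):
    """Follow the next pointers and return node addresses in visit order."""
    addresses = []
    addr = head
    while addr is not None:
        addresses.append(addr)
        addr = heap[addr][1]
    return addresses

def strides(addresses):
    """Differences between consecutive addresses."""
    return [b - a for a, b in zip(addresses, addresses[1:])]

array_addresses = [0x1000 + 16 * i for i in range(4)]
array_strides = strides(array_addresses)
list_strides = strides(traversal_addresses(heap, 0x5000))

print(array_strides)  # prints "[16, 16, 16]"
print(list_strides)   # three different strides, some of them negative
```

The array produces a single repeated stride that a compiler or a stride detector can exploit, whereas the address of each list node is known only after the previous node has been loaded.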
Some proposals seek to effectively prefetch data for irregular memory access strides. For example, some approaches sequentially prefetch pointer chains using the natural pointers in the linked list. However, these approaches do not exploit any memory parallelism, limiting their effectiveness. Other approaches insert additional pointers into the linked list to connect non-consecutive list elements. In these approaches, jump pointers are inserted at compile time to create memory parallelism. However, the jump pointers increase runtime overhead and can also contribute to additional cache misses. Still other approaches seek to predict cache misses or prefetch multiple chains of pointers sequentially. However, the effectiveness of these approaches is also limited for some applications.
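The jump-pointer idea can be sketched as follows. This Python example is a simplified illustration and not any specific published scheme: the `Node` class, the `jump` field, the jump distance `k`, and the helper functions are all assumptions. The jump pointers are added here in a separate pass for clarity, whereas the approaches described above insert them at compile time, and the prefetch itself is only recorded rather than performed.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # natural pointer to the following node
        self.jump = None   # extra pointer to the node k positions ahead

def build_list(values):
    """Build a singly linked list and return its head (or None)."""
    head = prev = None
    for v in values:
        node = Node(v)
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
    return head

def add_jump_pointers(head, k):
    """Link each node to the node k positions ahead, where one exists."""
    lead = head
    for _ in range(k):
        if lead is None:
            return
        lead = lead.next
    trail = head
    while lead is not None:
        trail.jump = lead
        trail, lead = trail.next, lead.next

def traverse(head):
    """Visit every node; record which nodes a prefetch would be issued for."""
    visited, prefetched = [], []
    node = head
    while node is not None:
        if node.jump is not None:
            prefetched.append(node.jump.value)  # stand-in for a real prefetch
        visited.append(node.value)
        node = node.next
    return visited, prefetched

head = build_list(range(6))
add_jump_pointers(head, k=2)
print(traverse(head))  # prints "([0, 1, 2, 3, 4, 5], [2, 3, 4, 5])"
```

Because a jump pointer is available without walking the intervening nodes, a prefetch for a later node can overlap with work on the current one. The cost is visible as well: every node carries an extra field, and the first `k` nodes are never the target of a prefetch.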
In view of the foregoing, a need exists to overcome one or more of the deficiencies in the related art.