The demise of Moore's law has been greatly exaggerated, and processor speeds continue to double every 18 months. By comparison, memory speed has been increasing at the relatively glacial rate of 10% per year. The unfortunate, though inevitable, consequence of these trends is a rapidly growing processor-memory performance gap. Computer architects have tried to mitigate the performance impact of this imbalance with small, high-speed cache memories that store recently accessed data. This solution is effective only if most of the data referenced by a program is available in the cache. Unfortunately, many general-purpose programs, which use dynamic, pointer-based data structures, suffer from high cache miss rates and are limited by their memory system performance.
Prefetching data ahead of use has the potential to tolerate the growing processor-memory performance gap by overlapping long-latency memory accesses with useful computation. Prefetching techniques have been tried with scientific code that accesses dense arrays in loop nests. These techniques rely on static compiler analyses to predict the program's data accesses and insert prefetch instructions at appropriate program points.
However, the reference pattern of general-purpose programs, which use dynamic, pointer-based data structures, is much more complex, and the same techniques do not apply. Thus, a solution for general-purpose programs, especially pointer-chasing code written in languages such as C and C++, remains unknown.
Prefetching is one way to deal with this growing disparity between processor speed and memory access speed. The general idea is to predict what data will be needed and fetch it before it is needed, so that the processor has the data when it is required. As the gap between memory speed and processor speed widens, predictions must reach further and further ahead in order to have the data in place when the processor needs it. Current prefetch solutions fall into two categories: hardware prefetching and software prefetching.
Hardware prefetching is incorporated into the processor. The problem with hardware prefetching is that it relies on programs exhibiting spatial locality. Spatial locality is the premise that if a program touches some data object, it is likely to next touch another data object at a nearby memory address. So when a program asks for a data object, the hardware prefetches data objects in the memory space near the fetched object. The problem with the spatial locality assumption is that it holds only for some types of programs. For example, it holds for scientific programs, which often store information in spatially concentrated arrays. But for many modern programs, which use pointers, it does not. So hardware prefetching does not work very well as a general-purpose solution.
Software prefetching statically evaluates the code sequence and tries to predict what the program will access ahead of time. The problem with this static methodology arises when the program under analysis uses pointers. Since the pointer targets are not loaded into memory during static analysis, the prefetch addresses are unknown. Thus, if the program has pointers in a dependence chain, the static analysis breaks down. Again, programs that use arrays for data storage can benefit from this sort of static code sequence analysis, but for general-purpose modern programs, present software prefetch schemes do not work. They cannot determine which addresses the pointers are accessing far enough ahead of time to make the solution viable. Static software analysis breaks down because of memory access dependencies that cannot be resolved statically.
With static software prefetch techniques, the analysis can determine where a pointer points and fetch that address, but that is only one address ahead. For example, in FIG. 1, a static analysis can determine where a data object 102 points 104, and fetch the object 106 at that address. However, that object 106 needs to be fetched before a pointer 108 to the next object 110 can be determined. This creates a prohibitive timing dependence chain, because each object must be fetched before the address of the next object is known.
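The contrast between array code and pointer-chasing code can be sketched in C as follows. This is only an illustration under stated assumptions: `PREFETCH` is a hypothetical wrapper around the GCC/Clang builtin `__builtin_prefetch` (it becomes a no-op on other compilers), and `PREFETCH_DISTANCE`, `struct node`, `sum_array`, and `sum_list` are illustrative names that do not come from the described technology.

```c
#include <assert.h>
#include <stddef.h>

/* PREFETCH is a hypothetical wrapper. __builtin_prefetch is a GCC/Clang
 * builtin; on other compilers the macro does nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))
#endif

#define PREFETCH_DISTANCE 16 /* illustrative value, not tuned */

struct node {
    long value;
    struct node *next;
};

/* Array case: the address of a future element is simply base + index,
 * so a prefetch can be issued many iterations ahead. */
long sum_array(const long *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH(&a[i + PREFETCH_DISTANCE]);
        sum += a[i];
    }
    return sum;
}

/* Pointer-chasing case: only p->next is known without another load.
 * The address of the node after that is stored inside p->next, which
 * must itself arrive from memory first, so the prefetch can run only
 * one hop ahead. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            PREFETCH(p->next);
        sum += p->value;
    }
    return sum;
}
```

In `sum_array` the prefetch distance can be raised as the processor-memory gap grows; in `sum_list` no such knob exists, which is the dependence chain described above.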
The present technology is directed towards dynamically creating and injecting code into a running program. The injected code identifies the first few data references in a given hot data stream (i.e., the prefix) and prefetches the remaining data elements in the stream (i.e., the suffix) so that they are available when needed by the processor. A hot data stream has two valuable properties. First, it is hot, meaning that it occurs frequently, which makes it a good target for optimization. Second, its elements occur over and over again in the same order. So for a hot data stream, once the prefix is seen, the suffix is prefetched so that it is in memory by the time the processor needs it. Because the hot data stream identification code and the prefetch code are injected at run time, there are no timing dependencies on the pointers, since the memory addresses of the data are already known. This is a form of optimization, since the data is available sooner.
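The prefix-then-suffix behavior can be sketched in C as follows. This is a minimal sketch under simplifying assumptions, not the injected code itself: a stream element is represented here by a bare address, the mismatch handling simply restarts the match, and `struct hot_stream`, `on_reference`, and the `PREFETCH` wrapper (around the GCC/Clang builtin `__builtin_prefetch`) are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* PREFETCH is a hypothetical wrapper. __builtin_prefetch is a GCC/Clang
 * builtin; on other compilers the macro does nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(p) __builtin_prefetch((p))
#else
#define PREFETCH(p) ((void)(p))
#endif

/* One hot data stream, recorded at run time as a sequence of addresses
 * (a simplifying assumption made for this sketch). */
struct hot_stream {
    const void *const *addrs; /* addresses in the order they recur */
    size_t len;               /* total number of elements          */
    size_t prefix_len;        /* elements that form the prefix     */
    size_t matched;           /* prefix elements matched so far    */
};

/* Called for each observed data reference. Returns the number of
 * suffix prefetches issued (0 until the whole prefix has been seen). */
size_t on_reference(struct hot_stream *s, const void *addr)
{
    assert(s->prefix_len > 0 && s->prefix_len <= s->len);

    if (addr == s->addrs[s->matched])
        s->matched++;                               /* prefix continues */
    else
        s->matched = (addr == s->addrs[0]) ? 1 : 0; /* restart match    */

    if (s->matched < s->prefix_len)
        return 0;

    /* Prefix complete: every suffix address is already known, so all
     * prefetches can be issued at once with no pointer dependence. */
    for (size_t i = s->prefix_len; i < s->len; i++)
        PREFETCH(s->addrs[i]);
    s->matched = 0;
    return s->len - s->prefix_len;
}
```

Because the suffix addresses were captured from the running program, the loop issues all prefetches back to back instead of waiting for each object to arrive before locating the next.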
For an additional optimization, a deterministic finite state machine (DFSM) is built to help create the conceptual logic that is injected into the program for prefix identification and suffix prefetching. In one implementation, a DFSM is built for each of multiple hot data streams. For a further optimization, a single global DFSM is built for multiple hot data streams, which allows states to be reused across those streams. The global DFSM is used to create the conceptual logic that is injected into the executing program. As before, once the elements in the prefix are identified by the injected code, the elements in the corresponding suffix are prefetched by the injected code.
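One way to picture a single global DFSM shared by several streams is the small C sketch below. It is an illustration under stated assumptions only: data references are abstracted to integer symbols, two hypothetical streams with two-element prefixes are hard-coded, each state records just one match position per stream, and all names (`prefix`, `delta`, `build_dfsm`, `matched_stream`) are invented for this example.

```c
#include <assert.h>

enum { NSTREAMS = 2, PREFIX_LEN = 2, NSYMBOLS = 4 };
enum { RADIX = PREFIX_LEN + 1, NSTATES = RADIX * RADIX };

/* Prefixes of two hypothetical hot data streams over symbols 0..3.
 * Both begin with symbol 0, so they can share the state reached
 * after that first reference. */
static const int prefix[NSTREAMS][PREFIX_LEN] = { {0, 1}, {0, 2} };

/* delta[state][symbol] is the next state. A state encodes, in base
 * RADIX, how many prefix elements of each stream have been matched. */
static int delta[NSTATES][NSYMBOLS];

/* Advance the match count m of stream v on symbol sym. */
static int step_one(int v, int m, int sym)
{
    if (m < PREFIX_LEN && prefix[v][m] == sym)
        return m + 1;                      /* prefix continues */
    return (prefix[v][0] == sym) ? 1 : 0;  /* restart or reset */
}

/* Fill the transition table for every state and symbol. */
void build_dfsm(void)
{
    for (int s = 0; s < NSTATES; s++) {
        for (int a = 0; a < NSYMBOLS; a++) {
            int next = 0, weight = 1, rest = s;
            for (int v = 0; v < NSTREAMS; v++) {
                next += weight * step_one(v, rest % RADIX, a);
                rest /= RADIX;
                weight *= RADIX;
            }
            assert(next < NSTATES);
            delta[s][a] = next;
        }
    }
}

/* Index of a stream whose prefix is complete in state s, or -1.
 * A real system would prefetch that stream's suffix at this point. */
int matched_stream(int s)
{
    for (int v = 0; v < NSTREAMS; v++) {
        if (s % RADIX == PREFIX_LEN)
            return v;
        s /= RADIX;
    }
    return -1;
}
```

After `build_dfsm()`, a single table lookup per reference tracks both streams at once: the state reached on symbol 0 serves both prefixes, and the following symbol decides which suffix, if any, would be prefetched.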
Additional features and advantages will be made apparent from the following detailed description of the illustrated embodiment which proceeds with reference to the accompanying drawings.