With processor speed increasing much more rapidly than memory access speed, there is a growing performance gap between processor and memory in computers. More particularly, processor speed continues to adhere to Moore's law (approximately doubling every 18 months). By comparison, memory access speed has been increasing at the relatively glacial rate of 10% per year. Computer architects have tried to mitigate the performance impact of this imbalance with small high-speed cache memories that store recently accessed data. This solution is effective only if most of the data referenced by a program is available in the cache. Unfortunately, many general-purpose programs, which use dynamic, pointer-based data structures, often suffer from high cache miss rates, and therefore are limited by memory system performance.
Due to the increasing processor-memory performance gap, memory system optimizations have the potential to significantly improve program performance. One such optimization involves prefetching data ahead of its use by the program, which has the potential of alleviating the processor-memory performance gap by overlapping long latency memory accesses with useful computation. Successful prefetching is accurate (i.e., it correctly anticipates the data objects that will be accessed in the future) and timely (i.e., it fetches the data early enough so that it is available in the cache when required). For example, T. Mowry, M. Lam and A. Gupta, “Design And Analysis Of A Compiler Algorithm For Prefetching,” Architectural Support For Programming Languages And Operating Systems (ASPLOS) (1992) describe an automatic prefetching technique for scientific codes that access dense arrays in tightly nested loops, which relies on static compiler analyses to predict the program's data accesses and insert prefetch instructions at appropriate program points. However, the reference pattern of general-purpose programs, which use dynamic, pointer-based data structures, is much more complex, and the same techniques do not apply.
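To make the idea of prefetching dense array accesses concrete, the following is a minimal hand-written sketch of a loop that requests data a fixed number of iterations ahead of its use. It is not the compiler algorithm of the cited reference. The function name sum_with_prefetch and the distance parameter are hypothetical, and __builtin_prefetch is a GCC/Clang builtin, so the sketch assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a[0..n-1] while prefetching the element `distance` iterations ahead.
 * `distance` is a hypothetical tuning parameter: large enough to hide the
 * memory latency, small enough that the line is still cached when used. */
double sum_with_prefetch(const double *a, size_t n, size_t distance)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Only prefetch addresses inside the array (i + distance < n). */
        if (distance < n - i) {
            /* Second argument 0 = read access, third argument 3 = keep in all cache levels. */
            __builtin_prefetch(&a[i + distance], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

The access pattern here is known from the loop structure alone, which is why static analysis suffices for such codes. A pointer-chasing traversal offers no comparable way to compute the address needed several iterations ahead.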
An alternative to static analyses for predicting data access patterns is to perform program data reference profiling. Recent research has shown that programs possess a small number of “hot data streams,” which are data reference sequences that frequently repeat in the same order, and these account for around 90% of a program's data references and more than 80% of cache misses. (See, e.g., T. M. Chilimbi, “Efficient Representations And Abstractions For Quantifying And Exploiting Data Reference Locality,” Proceedings Of The ACM SIGPLAN '01 Conference On Programming Language Design And Implementation (June 2001); and S. Rubin, R. Bodik and T. Chilimbi, “An Efficient Profile-Analysis Framework For Data-Layout Optimizations,” Principles Of Programming Languages, POPL'02 (January 2002).) These hot data streams can be prefetched accurately since they repeat frequently in the same order and thus are predictable. They are long enough (15–20 object references on average) so that they can be prefetched ahead of use in a timely manner.
In prior work, Chilimbi instrumented a program to collect the trace of its data memory references, and then used a compression technique called Sequitur to process the trace off-line and extract hot data streams. (See, T. M. Chilimbi, “Efficient Representations And Abstractions For Quantifying And Exploiting Data Reference Locality,” Proceedings Of The ACM SIGPLAN '01 Conference On Programming Language Design And Implementation (June 2001).) Chilimbi further demonstrated that these hot data streams are fairly stable across program inputs and could serve as the basis for an off-line static prefetching scheme. (See, T. M. Chilimbi, “On The Stability Of Temporal Data Reference Profiles,” International Conference On Parallel Architectures And Compilation Techniques (PACT) (2001).) However, this off-line static prefetching scheme may not be appropriate for programs with distinct phase behavior.
Dynamic optimization uses profile information from the current execution of a program to decide what and how to optimize. This can provide an advantage over static and even feedback-directed optimization, such as in the case of the programs with distinct phase behavior. On the other hand, dynamic optimization must be more concerned with the profiling overhead, since the slow-down from profiling has to be recovered by the speed-up from optimization.
One common way to reduce the overhead of profiling is through use of sampling: instead of recording all the information that may be useful for optimization, sample a small, but representative fraction of it. In a typical example, sampling counts the frequency of individual events such as calls or loads. (See, J. Anderson et al., “Continuous Profiling: Where Have All The Cycles Gone?,” ACM Transactions On Computer Systems (TOCS) (1997).) Other dynamic optimizations exploit causality between two or more events. One example is prefetching with Markov-predictors using pairs of data accesses. (See, D. Joseph and D. Grunwald, “Prefetching Using Markov Predictors,” International Symposium On Computer Architecture (ISCA) (1997).) Some recent transparent native code optimizers focus on single-entry, multiple-exit code regions. (See, e.g., V. Bala, E. Duesterwald and S. Banerjia, “Dynamo: A Transparent Dynamic Optimization System,” Programming Languages Design And Implementation (PLDI) (2000); and D. Deaver, R. Gorton and N. Rubin, “Wiggins/Redstone: An On-Line Program Specializer,” Hot Chips (1999).) Another example provides cache-conscious data placement during generational garbage collection to lay out sequences of data objects. (See, T. Chilimbi, B. Davidson and J. Larus, “Cache-Conscious Structure Definition,” Programming Languages Design And Implementation (PLDI) (1999); and T. Chilimbi and J. Larus, “Using Generational Garbage Collection To Implement Cache-Conscious Data Placement,” International Symposium On Memory Management (ISMM) (1998).) However, for lack of low-overhead temporal profilers, these systems usually employ event profilers. But, as Ball and Larus point out, event (node or edge) profiling may misidentify frequencies of event sequences. (See, T. Ball and J. Larus, “Efficient Path Profiling,” International Symposium On Microarchitecture (MICRO) (1996).)
The sequence of all events occurring during execution of a program is generally referred to as the “trace.” A “burst,” on the other hand, is a subsequence of the trace. Arnold and Ryder present a framework that samples bursts. (See, M. Arnold and B. Ryder, “A Framework For Reducing The Cost Of Instrumented Code,” Programming Languages Design And Implementation (PLDI) (2001).) In their framework, the code of each procedure is duplicated. (Id., at FIG. 2.) Both versions of the code contain the original instructions, but only one version is instrumented to also collect profile information. The other version only contains checks at procedure entries and loop back-edges that decrement a counter “nCheck,” which is initialized to “nCheck0.” Most of the time, the (non-instrumented) checking code is executed. Only when the nCheck counter reaches zero is a single intraprocedural acyclic path of the instrumented code executed, after which nCheck is reset to nCheck0.
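The counter-based switching described above can be sketched roughly as follows. This is an illustrative model only, not the Arnold-Ryder implementation: the Sampler struct, the check and sum_to functions, and the samples field (which stands in for real profile data) are hypothetical names, and the two code versions are modeled as two branches of one loop body rather than as duplicated procedures.

```c
#include <assert.h>

typedef struct {
    int nCheck0;   /* reset value, i.e. the sampling period */
    int nCheck;    /* countdown decremented by every check */
    long samples;  /* acyclic paths executed in the instrumented version */
} Sampler;

Sampler sampler_init(int nCheck0)
{
    assert(nCheck0 > 0);
    Sampler s = { nCheck0, nCheck0, 0 };
    return s;
}

/* Check placed at procedure entries and loop back-edges. Returns 1 when the
 * next acyclic path should run in the instrumented version. */
int check(Sampler *s)
{
    if (--s->nCheck > 0)
        return 0;
    s->nCheck = s->nCheck0;   /* counter reached zero: reset and sample */
    return 1;
}

long sum_to(Sampler *s, long n)
{
    long sum = 0;
    int instrumented = check(s);      /* check at procedure entry */
    for (long i = 0; i < n; i++) {
        if (instrumented) {
            s->samples++;             /* instrumented version: record profile info */
            sum += i;
        } else {
            sum += i;                 /* checking version: original work only */
        }
        instrumented = check(s);      /* check at loop back-edge */
    }
    return sum;
}
```

In this model each sampled burst covers exactly one loop iteration, because the check at the next back-edge immediately returns control to the checking version.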
A limitation of the Arnold-Ryder framework is that it stays in the instrumented code only for the time between two checks. Since it has checks at every procedure entry and loop back-edge, the framework captures a burst of only one acyclic intraprocedural path's worth of trace. In other words, only the burst between one check (e.g., at a procedure entry) and the next check (e.g., at the next loop back-edge) is captured. Because of this limitation, the framework can fail to profile many longer “hot data stream” bursts, and thus fail to optimize such hot data streams. Consider for example the code fragment:
    for (i=0; i<n; i++)
        if ( . . . ) f( );
        else g( );

Because the Arnold-Ryder framework ends burst profiling at loop back-edges, the framework would be unable to distinguish the traces fgfgfgfg and ffffgggg. For optimizing single-entry multiple-exit regions of programs, this profiling limitation may make the difference between executing optimized code most of the time or not.
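The ambiguity between the two traces can be illustrated with a small sketch. It assumes a trace is written as a string of the characters 'f' and 'g'. The names EventCounts, PairCounts, count_events and count_pairs are hypothetical. Counting single events models what one-iteration bursts can reveal, while counting adjacent pairs models what a burst spanning two iterations could reveal.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int f, g; } EventCounts;
typedef struct { int ff, fg, gf, gg; } PairCounts;

/* Frequencies of individual events: all that one-iteration bursts can yield. */
EventCounts count_events(const char *trace)
{
    assert(trace != NULL);
    EventCounts c = { 0, 0 };
    for (size_t i = 0; trace[i] != '\0'; i++) {
        if (trace[i] == 'f') c.f++;
        else if (trace[i] == 'g') c.g++;
    }
    return c;
}

/* Frequencies of adjacent event pairs: requires bursts that cross a back-edge. */
PairCounts count_pairs(const char *trace)
{
    assert(trace != NULL);
    PairCounts c = { 0, 0, 0, 0 };
    for (size_t i = 0; trace[i] != '\0' && trace[i + 1] != '\0'; i++) {
        char a = trace[i], b = trace[i + 1];
        if (a == 'f' && b == 'f') c.ff++;
        else if (a == 'f' && b == 'g') c.fg++;
        else if (a == 'g' && b == 'f') c.gf++;
        else if (a == 'g' && b == 'g') c.gg++;
    }
    return c;
}
```

For both fgfgfgfg and ffffgggg, count_events reports four calls to f and four to g, so the two traces look identical. count_pairs separates them: the first trace contains no ff or gg pairs, while the second contains three of each.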
Another limitation of the Arnold-Ryder framework is that the overhead of the framework can still be too high for dynamic optimization of machine executable code binaries. The Arnold-Ryder framework was implemented for a Java virtual machine execution environment, where the program is a set of Java class files. These Java programs typically have a higher execution overhead, so that the overhead of the instrumentation checks is small relative to the execution time of the comparatively slow program. The overhead of the Arnold-Ryder framework's instrumentation checks may make dynamic optimization with the framework impractical in other settings for programs with lower execution overhead (such as statically compiled machine code programs).
A further problem is that the overhead of hot data stream detection has been overly high for use in dynamic optimization systems, such as the Arnold-Ryder framework.