The performance of a computer program is usually difficult to characterize. Programs do not perform uniformly well or uniformly poorly. Rather, programs have stretches of adequate performance punctuated by performance-degrading events. The overall observed performance of a specific program depends on the frequency of such events and their relationship to one another and to the rest of the program.
Program performance is measured by retirement throughput, that is, the rate at which instructions retire. Because instructions retire sequentially, in program order, a performance-degrading event such as a long-latency instruction blocks the retirement of every instruction behind it and degrades performance. Examples of performance-degrading long-latency events include branch mispredictions and instruction and data cache misses.
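The effect of a single long-latency instruction on sequential retirement can be illustrated with a minimal sketch. The model below is a deliberate simplification and not a description of any particular processor: it assumes that every instruction begins executing at cycle 0, that instruction i completes at cycle latencies[i], and that up to a hypothetical retirement width of instructions retire per cycle, strictly in program order. The function name and parameters are illustrative only.

```python
def retire_cycles(latencies, width=4):
    """Return the cycle in which the last instruction retires.

    Assumed toy model: all instructions start at cycle 0, instruction i
    completes at cycle latencies[i], and at most `width` completed
    instructions retire per cycle, strictly in program order.
    """
    cycle = 0
    head = 0  # index of the oldest instruction that has not yet retired
    n = len(latencies)
    while head < n:
        cycle += 1
        retired = 0
        # Retire from the head only; a head that has not completed yet
        # blocks every younger instruction, even if those are finished.
        while head < n and retired < width and latencies[head] <= cycle:
            head += 1
            retired += 1
    return cycle


# Eight single-cycle instructions retire in two cycles at width 4.
print(retire_cycles([1] * 8))  # → 2

# Replacing one of them with a 200-cycle instruction (for example, a load
# that misses the cache) stalls retirement until that instruction completes.
print(retire_cycles([1, 1, 200, 1, 1, 1, 1, 1]))  # → 201
```

In this sketch the five instructions after the slow one are complete from cycle 1 onward, yet none of them can retire before cycle 200, which is why one such event can dominate the observed performance.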
Several solutions have been proposed to reduce the frequency and observed latency of these performance-degrading events. For example, one solution runs a subset of the instructions that feed into the performance-degrading events ahead of the general execution of the program. Running this subset early resolves the events in advance, by determining the outcomes of branches and prefetching the needed data into the cache. This approach improves performance only if a small subset of the program can be identified that can be issued sufficiently early and that resolves the events with sufficient accuracy. The approach also requires additional hardware, for example a separate pipeline that allows the identified subset to run ahead. However, identifying a minimal program subset with maximum accuracy requires sophisticated program analysis, whereas the hardware is typically constrained by its limited view of the program and by the simplicity of the analysis it can perform.
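One way to picture the subset of instructions that feed a performance-degrading event is as a backward slice over register dependences. The following is a minimal sketch under stated assumptions, not the analysis used by any of the solutions referred to above: it handles straight-line code only, tracks register dependences but not memory dependences, and represents each instruction as a hypothetical (destination, sources) pair.

```python
def backward_slice(program, target):
    """Return the indices of the instructions that feed program[target].

    Assumptions: `program` is straight-line code given as a list of
    (dest_register, [source_registers]) pairs; only register dependences
    are followed, and memory dependences are ignored.
    """
    needed = set(program[target][1])  # registers whose producers are still sought
    slice_idx = [target]
    for i in range(target - 1, -1, -1):
        dest, srcs = program[i]
        if dest in needed:
            # Instruction i is the most recent writer of a needed register,
            # so it joins the slice and its own sources become needed.
            needed.discard(dest)
            needed.update(srcs)
            slice_idx.append(i)
    return sorted(slice_idx)


# Illustrative program; instruction 4 stands for a load that tends to miss.
program = [
    ("r1", ["r0"]),        # 0: r1 = load [r0]
    ("r2", ["r5"]),        # 1: r2 = r5 + 1        (unrelated)
    ("r3", ["r1"]),        # 2: r3 = r1 * 8
    ("r6", ["r2", "r5"]),  # 3: r6 = r2 + r5       (unrelated)
    ("r4", ["r3", "r0"]),  # 4: r4 = load [r3 + r0]
]
print(backward_slice(program, 4))  # → [0, 2, 4]
```

Here only three of the five instructions are needed to compute the address of the final load, so running those three ahead of the main program could issue the load early. Real programs contain control flow and memory dependences that this sketch ignores, which is part of why an accurate minimal subset is hard to identify.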