Modern high-performance processors are designed to execute multiple instructions per clock cycle. Certain instructions, however, produce a result only after a potentially large number of cycles. Such instructions may be known as “long latency” instructions, because a long time interval exists between the time the instruction is issued and the time its result becomes available. A long latency may occur, for example, when data required by an instruction must be loaded from a higher level of a memory hierarchy. Such a load operation may therefore have a “load-use” penalty associated with it. That is, after a program issues such a load instruction, the data may not be available for multiple cycles, even if the data exists (i.e., “hits”) in a cache memory associated with the processor.
Processors typically allow execution to continue while a long latency instruction is outstanding. Often, however, the requested data is needed relatively soon (e.g., within several clock cycles), because the processor has too little remaining work that can be done without that data. In such a case the processor must wait for the data before it can proceed. Accordingly, a need exists to improve processor performance in such situations.
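By way of illustration only, the effect described above may be approximated by a simple calculation. The following is a minimal sketch under stated assumptions, not a model of any particular processor: the function name `stall_cycles`, its parameters, and the cycle counts used are all hypothetical, and the model assumes that a single load is outstanding and that independent work fully overlaps with the load.

```python
def stall_cycles(load_use_latency: int, independent_work: int) -> int:
    """Estimate the cycles a processor waits on one outstanding load.

    load_use_latency: cycles from issuing the load until its data can be used
                      (hypothetical value, not taken from any real processor).
    independent_work: cycles of other instructions that can execute
                      without the loaded data (also hypothetical).
    """
    # Independent work hides part of the latency; any remainder is a wait.
    return max(0, load_use_latency - independent_work)


# Illustrative numbers only.
little_work = stall_cycles(load_use_latency=4, independent_work=1)
ample_work = stall_cycles(load_use_latency=4, independent_work=6)
print(little_work, ample_work)
```

In this simplified model, waiting occurs only when the load-use latency exceeds the amount of work that does not depend on the loaded data, which corresponds to the situation in which the data is needed within several clock cycles.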