The typical computer has a random access memory hierarchy including one or more levels of on-processor cache memory, a main memory (located off the processor chip) and a mass storage device (e.g., a hard disk drive). Typically, accessing the first level of cache memory (L1 cache) is fastest (i.e., has the lowest latency) and accessing the mass storage device is slowest. The latencies associated with accessing intermediate levels of the memory hierarchy fall between these two extremes. In addition to increasing in latency, the various levels of the memory hierarchy typically increase in size from the highest level of the memory hierarchy (the L1 cache) to the lowest level of the memory hierarchy (the mass storage device).
In the typical case, cache memory has an inclusive nature. Thus, when data is retrieved from a given level of the memory system (e.g., a hard disk drive), it is written into all levels of the cache (e.g., the L1 cache, the level 2 (L2) cache, the level 3 (L3) cache). This practice maximizes the likelihood that data needed for a later instruction is present in the highest levels of the cache, thereby reducing the number of accesses to slower memory resources and the number of cache misses (i.e., failed attempts to retrieve data from a cache level that does not contain the desired data).
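The inclusive fill behavior described above can be illustrated with a small model. This is a minimal sketch, not a description of any particular processor: the class name `InclusiveHierarchy`, the latencies, the capacities and the least-recently-used eviction policy are all assumptions chosen for illustration, and the back-invalidation that a real inclusive cache performs on eviction is omitted.

```python
from collections import OrderedDict

# Hypothetical (name, latency in cycles, capacity in entries) per cache level.
# These numbers are illustrative assumptions, not figures from the text.
LEVELS = [("L1", 1, 2), ("L2", 10, 4), ("L3", 40, 8)]
BACKING_LATENCY = 200  # assumed cost of going past the last cache level


class InclusiveHierarchy:
    """Toy model of an inclusive cache hierarchy with LRU eviction."""

    def __init__(self, levels=LEVELS, backing_latency=BACKING_LATENCY):
        self.levels = [(name, lat, cap, OrderedDict()) for name, lat, cap in levels]
        self.backing_latency = backing_latency

    def load(self, addr):
        """Return (source, latency) and fill every level faster than the source."""
        source, latency, hit_index = "memory", self.backing_latency, len(self.levels)
        for i, (name, lat, _cap, store) in enumerate(self.levels):
            if addr in store:
                store.move_to_end(addr)  # mark as most recently used
                source, latency, hit_index = name, lat, i
                break
        # Inclusive fill: write the data into all levels that missed.
        for _name, _lat, cap, store in self.levels[:hit_index]:
            store[addr] = True
            if len(store) > cap:
                store.popitem(last=False)  # evict the least recently used entry
        return source, latency
```

In this model, the first load of an address is served from the backing store and later loads hit in the L1 cache until the entry is evicted, after which the larger L2 cache still holds it.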
Architectural or micro-architectural effects often cause performance penalties when accessing the memory hierarchy. The best-known example of such effects is the classic load, wherein data is loaded from some level of memory to the processor. The latency of such a load operation may range from one cycle (assuming the first level cache has a latency of one cycle) to hundreds of cycles, depending on which memory level currently contains the data to be loaded. Because the actual latency varies, uses of the data being loaded must, in order to avoid stalls, be delayed by an amount of time that is hard, if not impossible, to predict.
Out-of-order processors solve this issue by deferring load uses until the data being loaded is available. In other words, load uses are frozen until their data is ready for use (i.e., actually located in the processor's registers), while other instructions are allowed to execute. This freezing feature is built into the hardware, and the software does not have to account for it. The decision to freeze or delay a use is made dynamically (i.e., at execution time).
In-order processors lack this dynamic reordering feature. Instead, software for in-order processors must decide beforehand, that is, statically, when to schedule a load's use. (Typically, the software is a compiler, so the expressions “statically” and “at compile-time” can be used interchangeably.) Aggressive static techniques typically bet on the best case and, thus, schedule the use instruction at the earliest possible time after the load is executed. A penalty is then paid each time the load takes longer than that to complete, since usually all instructions following the use are delayed as well. Defensive software separates the use from the load by an amount of time chosen to account for expected delays. When such delays do not happen, precious cycles are spent waiting uselessly. Prior art in-order processors force one to choose a strategy, aggressive or defensive, once (i.e., at compile time) and to stay with that choice for all run-time executions of the load-use sequence.
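The trade-off between the aggressive and defensive strategies can be made concrete with a rough cost model. The following is a sketch under stated assumptions: the function `schedule_cost`, the latency trace and the two gap values are hypothetical, and the model assumes that cycles in the load-to-use gap cannot be filled with other useful work.

```python
def schedule_cost(gap, latencies):
    """Return (stall_cycles, idle_cycles) for a fixed load-to-use gap.

    gap       -- cycles the compiler places between a load and its use
    latencies -- actual load latency observed on each execution
    """
    # The load was slower than the gap: the in-order pipeline stalls at the use.
    stall = sum(max(0, lat - gap) for lat in latencies)
    # The load was faster than the gap: data sat ready while the use waited.
    idle = sum(max(0, gap - lat) for lat in latencies)
    return stall, idle


# Hypothetical run-time latencies: mostly L1 hits, with two slower accesses.
trace = [1, 1, 10, 1, 200, 1]

aggressive = schedule_cost(1, trace)   # bet on the one-cycle best case
defensive = schedule_cost(10, trace)   # budget for a ten-cycle load
```

With this trace, the aggressive schedule never idles but absorbs every slow load as a stall, while the defensive schedule hides the ten-cycle load at the price of idle cycles on every fast one. Neither fixed gap is best for all executions, which is the limitation the paragraph above describes.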
Software techniques having some similarity to the techniques disclosed below are known. For example, code versioning is a technique that develops specialized copies of software code and inserts decision code to determine which version of the code executes. In this technique, the versions of the code being versioned are different from one another. Frequently, one version is highly simplified for the relevant special case (e.g., “multiply X*Y” may be versioned such that, if either X or Y equals 0, the multiplication step is skipped). Versioning is not meant to hide micro-architectural effects, but is instead used to expedite code execution. Code versioning has, however, been used to address high-level power effects (e.g., execute one set of code if the computer is connected to a source of commercial power, but execute a stripped-down version of the code if the computer is running on batteries, to reduce power consumption).
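The multiplication example above has the following general shape. This is only an illustrative sketch: the function names are hypothetical, integer operands are assumed, and in Python the specialized path is not actually faster; the point is the structure of two distinct versions selected by inserted decision code.

```python
def multiply_general(x, y):
    """General version: always performs the multiplication."""
    return x * y


def multiply_zero(x, y):
    """Specialized version for the case where an operand is zero.

    The multiplication step is skipped entirely (integer operands assumed).
    """
    return 0


def multiply_versioned(x, y):
    """Decision code: choose which version of the code executes."""
    if x == 0 or y == 0:
        return multiply_zero(x, y)
    return multiply_general(x, y)
```

Note that the two versions contain different code, which distinguishes this form of versioning from simply executing the same instructions under different conditions.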
As another example, instruction fusion is a technique which replaces two identical instructions predicated by opposite conditions (e.g., instruction (1): if X, do Y; instruction (2): if not X, do Y) with one instruction with no predicate (e.g., do Y). The goal of this technique is to reduce code size.
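A minimal sketch of such a fusion pass over a list of predicated instructions is shown below. The `Instr` representation and the `fuse` function are hypothetical, and the sketch assumes that only adjacent instruction pairs are considered and that the fused operation does not itself modify the predicate.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str                     # the operation, e.g. "Y"
    pred: Optional[str] = None  # predicate name; None means unpredicated
    negated: bool = False       # True for "if not pred"


def fuse(instrs):
    """Replace adjacent identical ops under opposite predicates with one op."""
    out = []
    i = 0
    while i < len(instrs):
        a = instrs[i]
        b = instrs[i + 1] if i + 1 < len(instrs) else None
        if (b is not None and a.pred is not None and a.op == b.op
                and a.pred == b.pred and a.negated != b.negated):
            out.append(Instr(a.op))  # emit a single unpredicated instruction
            i += 2
        else:
            out.append(a)            # keep the instruction unchanged
            i += 1
    return out
```

For example, the pair "if X, do Y" and "if not X, do Y" collapses to a single "do Y", so the output list is one instruction shorter than the input.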