Generally, a computer program is an ordered set or sequence of instructions to be processed, or executed, by a computer processor. The processor fetches the program instructions and executes them. Normally, instructions are fetched sequentially, with breaks in the sequence occurring when a branch or jump instruction is encountered. The order in which the instructions are fetched is the program order.
Many modern microprocessors allow instructions to execute out of order. In particular, instructions are executed from a set of already fetched instructions, out of program order, but still require certain dependencies, such as register and memory dependencies, to be preserved. A register dependency results from an ordered pair of instructions where the later instruction needs a register value produced by the earlier instruction. A memory dependency results from an ordered pair of memory instructions where the later instruction reads a value stored in memory by the earlier instruction.
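The two kinds of dependency defined above can be illustrated with a minimal sketch. The `Instr` record and its field names are hypothetical and are introduced here only for illustration; the check is deliberately simplified and ignores any intervening instruction that overwrites the register or memory location.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record (not taken from any cited reference)."""
    name: str
    dest_reg: Optional[str] = None      # register written, if any
    src_regs: Tuple[str, ...] = ()      # registers read
    store_addr: Optional[int] = None    # memory address written, if any
    load_addr: Optional[int] = None     # memory address read, if any


def register_dependency(earlier: Instr, later: Instr) -> bool:
    """True if `later` needs a register value produced by `earlier`."""
    return earlier.dest_reg is not None and earlier.dest_reg in later.src_regs


def memory_dependency(earlier: Instr, later: Instr) -> bool:
    """True if `later` reads the memory location written by `earlier`."""
    return earlier.store_addr is not None and earlier.store_addr == later.load_addr


add = Instr("add r1, r2, r3", dest_reg="r1", src_regs=("r2", "r3"))
mul = Instr("mul r4, r1, r5", dest_reg="r4", src_regs=("r1", "r5"))
st = Instr("store r4 -> [0x100]", src_regs=("r4",), store_addr=0x100)
ld = Instr("load r6 <- [0x100]", dest_reg="r6", load_addr=0x100)
```

In this example the pair (`add`, `mul`) carries a register dependency through `r1`, and the pair (`st`, `ld`) carries a memory dependency through address `0x100`. Either pair must keep its relative order even when other instructions are reordered around it.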
Thus, on the one hand, the out-of-order execution of instructions improves performance because it allows more instructions to complete in the same amount of time by efficiently executing independent operations. On the other hand, problems may occur when executing load and store instructions out of order.
The terms load, load instruction and load operation instruction are used herein interchangeably and refer to instructions which cause data to be loaded, or read, from memory. This includes the usual load instructions, as well as move, compare, add, and so on where these instructions require the reading of data from memory. Similarly, store, store instruction and store operation instruction are used interchangeably and refer to instructions which cause data to be written to memory.
When a load instruction executes before an older, i.e., earlier fetched, store instruction referencing the same address, the load may retrieve an incorrect value because the data the load should use has not yet been stored at the address by the store instruction. Hardware detects this memory dependency violation, and squashes the load instruction and its subsequent dependent instructions, i.e. these instructions are ignored and must be re-executed (replayed). Because valuable time and resources have been wasted, such hardware recovery degrades processor performance.
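The violation described above can be sketched in software as follows. This is an illustrative model only, not a description of any particular hardware: the `MemOp` record and the `find_violations` function are hypothetical names, and the model compares full addresses, whereas real detection logic may compare partial addresses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MemOp:
    """Hypothetical memory operation record."""
    seq: int     # position in program order (smaller means older)
    kind: str    # "load" or "store"
    addr: int    # referenced memory address


def find_violations(execution_order: List[MemOp]) -> List[Tuple[int, int]]:
    """Return (load_seq, store_seq) pairs in which a load executed
    before an older store that references the same address."""
    executed_loads: List[MemOp] = []
    violations: List[Tuple[int, int]] = []
    for op in execution_order:
        if op.kind == "load":
            executed_loads.append(op)
        else:
            # A store checks every load that has already executed.
            for ld in executed_loads:
                if ld.addr == op.addr and ld.seq > op.seq:
                    violations.append((ld.seq, op.seq))
    return violations


store0 = MemOp(seq=0, kind="store", addr=0x10)
load1 = MemOp(seq=1, kind="load", addr=0x10)
load2 = MemOp(seq=2, kind="load", addr=0x20)
```

When the operations execute in the order `load1`, `load2`, `store0`, the model reports that `load1` must be squashed and replayed, while `load2`, which references a different address, is unaffected.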
Memory reference tagging stores have been proposed in which information from prior memory order violations is used to predict subsequent memory dependencies, and consequently to prevent reordering of certain instructions as appropriate. This general method is described in U.S. Pat. No. 5,619,662 (Steely), "Memory Reference Tagging", dated Apr. 8, 1997 (hereinafter '662 patent), and incorporated herein by this reference in its entirety.
The '662 patent describes a means of detecting dependent load and store instructions which have executed out of order, by using a write buffer to keep track of executed load and store instructions until it is determined whether or not they have executed in their proper order. Four different approaches to handling such detection of an out-of-order execution are described. In the first, part of the referenced (or target) memory address is used as a tag to be associated with each of the out-of-order instructions. If these instructions later appear again in the instruction queue, the fact that they have identical tags will cause them to be issued in program order.
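One possible reading of the first approach is sketched below. The tag width, the queue representation and the function names are assumptions made for illustration and are not taken from the '662 patent.

```python
from typing import List, Optional, Tuple

TAG_BITS = 4  # assumed tag width; chosen arbitrarily for this sketch


def make_tag(addr: int, bits: int = TAG_BITS) -> int:
    """Use the low-order bits of the target address as the tag."""
    return addr & ((1 << bits) - 1)


# Each queue entry is (tag, issued). A tag of None means the instruction
# was never involved in a detected out-of-order execution.
QueueEntry = Tuple[Optional[int], bool]


def may_issue(queue: List[QueueEntry], index: int) -> bool:
    """An instruction may issue unless an older, not yet issued
    instruction in the queue carries the same tag."""
    tag, _ = queue[index]
    if tag is None:
        return True
    return all(issued or older_tag != tag for older_tag, issued in queue[:index])
```

For example, with the queue `[(4, False), (7, False), (4, False)]` the third entry must wait for the first because the tags match, while the second entry is free to issue ahead of both.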
The second approach uses an assigned "problem number" as a tag. Two instructions referencing the same memory address and with the same problem number are not re-ordered.
The third approach simply associates a tag bit with an instruction to indicate that other memory reference instructions should not be re-ordered around the tagged instruction.
Finally, the fourth approach turns off reordering for some number of instructions when entering a subroutine.
U.S. Pat. No. 5,615,350 (Hesson), "Apparatus to Dynamically Control the Out-of-Order Execution of Load-Store Instructions in a Processor Capable of Dispatching, Issuing and Executing Multiple Instructions in a Single Processor Cycle," issued Mar. 25, 1997, teaches a store barrier cache for constraining the reordering of loads and stores to reduce memory order violations. A store barrier cache keeps track of stores that tend to cause memory-order violations. The cache is accessed in parallel with the instruction cache, and is notified when a memory-order violation occurs. The store causing the violation is noted in the store barrier cache as problematic. The next time the store is fetched, the processor inhibits all subsequent loads from executing until that store executes. However, processor performance in large machines still suffers because loads which will not cause memory order violations are often prohibited from executing as early as possible, reducing the processor's efficiency.
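The behavior of a store barrier cache, as summarized above, can be approximated with the following sketch. The class and function names are hypothetical, the cache is modeled as an unbounded set indexed by the store's program counter, and no claim is made that this matches the structure disclosed by Hesson.

```python
from typing import List, Set, Tuple


class StoreBarrierCache:
    """Remembers stores that have caused memory-order violations."""

    def __init__(self) -> None:
        self._problem_stores: Set[int] = set()

    def record_violation(self, store_pc: int) -> None:
        self._problem_stores.add(store_pc)

    def is_barrier(self, store_pc: int) -> bool:
        return store_pc in self._problem_stores


# Each window entry is (pc, kind, executed), listed in program order.
WindowEntry = Tuple[int, str, bool]


def loads_blocked(window: List[WindowEntry], cache: StoreBarrierCache) -> List[int]:
    """Return the PCs of loads that must wait because an older barrier
    store has not yet executed."""
    blocked: List[int] = []
    barrier_pending = False
    for pc, kind, executed in window:
        if kind == "store" and not executed and cache.is_barrier(pc):
            barrier_pending = True
        elif kind == "load" and not executed and barrier_pending:
            blocked.append(pc)
    return blocked
```

Note that once a store is marked as a barrier, every later load in the window is held back, including loads that reference unrelated addresses. This is the source of the lost performance noted above.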
Moshovos et al., in "Dynamic Speculation and Synchronization of Data Dependence", CS Department, University of Wisconsin-Madison, refine the goals for memory dependence prediction: not only should loads be delayed when necessary to avoid memory-order violations, they should also be released as soon as possible when they can safely execute. Their solution is a set of small, fully-associative tables which hold load/store pairs that have caused memory-order violations in the past, and which are used to synchronize the execution of the two instructions to avoid more violations in the future. However, when the table size is limited to reasonable hardware constraints, performance falls significantly short of what an oracle (perfect) memory dependence predictor would provide. In addition, the tables are complex and cumbersome to generate and maintain, and are slow to access.
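The general idea of a small table of load/store pairs can be sketched as follows. This is not a reproduction of the tables in the cited paper: the single-store-per-load mapping, the least-recently-used replacement and all names are assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Set


class PairTable:
    """Small fully-associative table of (load PC -> store PC) pairs
    that have caused memory-order violations."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self._pairs: "OrderedDict[int, int]" = OrderedDict()

    def record_violation(self, load_pc: int, store_pc: int) -> None:
        self._pairs[load_pc] = store_pc
        self._pairs.move_to_end(load_pc)
        if len(self._pairs) > self.capacity:
            self._pairs.popitem(last=False)  # evict least recently used pair

    def must_wait(self, load_pc: int, pending_store_pcs: Set[int]) -> bool:
        """A load waits only while its paired store is still pending;
        it is released as soon as that store has executed."""
        store_pc = self._pairs.get(load_pc)
        if store_pc is None:
            return False
        self._pairs.move_to_end(load_pc)
        return store_pc in pending_store_pcs
```

The capacity limit illustrates the drawback noted above: once a pair is evicted, the corresponding load is no longer synchronized with its store and may again cause a violation.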
Moshovos et al. continued their work in "Streamlining Inter-operation Memory Communication via Data Dependence Prediction", MICRO-30, December, 1997, and extended it to using memory dependence prediction to reduce load latency. They do this by a mechanism they call memory cloaking in which a load/store pair has an identifier that is used to bypass the store's data value directly to the consumers of the load. As such, this solution involves the passing of data to the consumer which is typically beyond an instruction scheduler.
Tyson and Austin, "Improving the Accuracy and Performance of Memory Communication through Renaming", MICRO-30, December, 1997, suggest that the memory trap penalty can be effectively reduced by re-executing only the load instruction and its tree of dependent instructions, without affecting other instructions fetched after the load but which are not dependent on it. They combine memory dependency prediction and load value prediction to directly satisfy load instructions without having to access the cache.
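Selective re-execution of a load and its tree of dependent instructions can be illustrated with a simple register dataflow walk. The instruction encoding and the function name are hypothetical, and the sketch tracks dependencies through registers only; dependencies carried through memory are ignored for brevity.

```python
from typing import List, Optional, Tuple

# Each instruction is (destination register or None, source registers),
# listed in program order.
Instr = Tuple[Optional[str], Tuple[str, ...]]


def dependent_tree(instrs: List[Instr], load_index: int) -> List[int]:
    """Return the indices of the load and of every later instruction
    that directly or indirectly consumes the load's result."""
    tainted = {instrs[load_index][0]}
    replay = [load_index]
    for i in range(load_index + 1, len(instrs)):
        dest, srcs = instrs[i]
        if tainted & set(srcs):
            replay.append(i)
            if dest is not None:
                tainted.add(dest)
        elif dest in tainted:
            # Register overwritten with a value independent of the load.
            tainted.discard(dest)
    return replay


program: List[Instr] = [
    ("r1", ()),            # 0: load r1
    ("r2", ("r1",)),       # 1: depends on the load
    ("r3", ("r4",)),       # 2: independent
    ("r5", ("r2", "r3")),  # 3: depends on the load through r2
]
```

For `program`, only instructions 0, 1 and 3 would need to be re-executed after a mis-speculated load at index 0; instruction 2 keeps its result.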
Another solution to the problem of reordering dependent load instructions is to remember recently squashed loads, and to force those loads to wait for all prior stores the next time they are fetched.
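A minimal sketch of this last scheme follows. The table is modeled as an unbounded set of load program counters, which is an assumption; a practical implementation would bound the table and therefore remember only recently squashed loads. The names are hypothetical.

```python
from typing import Set


class SquashedLoadTable:
    """Remembers loads that were squashed for a memory-order violation."""

    def __init__(self) -> None:
        self._load_pcs: Set[int] = set()

    def record_squash(self, load_pc: int) -> None:
        self._load_pcs.add(load_pc)

    def may_issue(self, load_pc: int, older_stores_pending: int) -> bool:
        """A remembered load waits until no older store is pending;
        any other load may issue immediately."""
        if load_pc in self._load_pcs:
            return older_stores_pending == 0
        return True
```

Like the store barrier cache, this scheme is conservative: a remembered load waits for every prior store, not only the store on which it actually depends.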