The present invention relates, in general, to the field of microprocessor design and the issuance of memory instructions therefor. More particularly, the present invention relates to a system, apparatus and method for restraining over-eager load boosting past stores in an out-of-order processor.
Early computer processors (also called microprocessors) included a central processing unit or instruction execution unit that executed only one instruction at a time. As used herein the term processor includes complete instruction set computers ("CISC"), reduced instruction set computers ("RISC") and hybrids. In response to the need for improved performance several techniques have been used to extend the capabilities of these early processors including pipelining, superpipelining, superscaling, speculative instruction execution, and out-of-order instruction execution.
Pipelined architectures break the execution of instructions into a number of stages where each stage corresponds to one step in the execution of the instruction. Pipelined designs increase the rate at which instructions can be executed by allowing a new instruction to begin execution before a previous instruction is finished executing. Pipelined architectures have been extended to "superpipelined" or "extended pipeline" architectures where each execution pipeline is broken down into even smaller stages (i.e., microinstruction granularity is increased). Superpipelining increases the number of instructions that can be executed in the pipeline at any given time.
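By way of illustration only, the throughput benefit of pipelining can be sketched with an idealized cycle count. The sketch assumes every stage takes exactly one clock cycle and that no stalls occur. The function names are hypothetical and do not describe any particular processor.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Idealized pipeline: a new instruction enters every cycle once the
    pipeline is filled (assumes one cycle per stage and no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


# Example: 100 instructions on a 5-stage design.
unpipelined = cycles_unpipelined(100, 5)  # 500 cycles
pipelined = cycles_pipelined(100, 5)      # 104 cycles
```

Under these assumptions the pipelined design approaches one completed instruction per cycle as the instruction count grows.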
"Superscalar" processors generally refer to a class of microprocessor architectures that include multiple pipelines that process instructions in parallel. Superscalar processors typically execute more than one instruction per clock cycle, on average. Superscalar processors allow parallel instruction execution in two or more instruction execution pipelines. The number of instructions that may be processed is increased due to parallel execution. Each of the execution pipelines may have a differing number of stages. Some of the pipelines may be optimized for specialized functions such as integer operations or floating point operations, and in some cases execution pipelines are optimized for processing graphic, multimedia, or complex math instructions.
The goal of superscalar and superpipeline processors is to execute multiple instructions per cycle ("IPC"). Instruction-level parallelism ("ILP") available in programs can be exploited to realize this goal; however, this potential parallelism requires that instructions be dispatched for execution at a sufficient rate. Conditional branching instructions create a problem for instruction fetching because the instruction fetch unit ("IFU") cannot know with certainty which instructions to fetch until the conditional branch instruction is resolved. Also, when a branch is detected, the target address of the instructions following the branch must be predicted to supply those instructions for execution.
Recent processor architectures use a branch prediction unit to predict the outcome of branch instructions allowing the fetch unit to fetch subsequent instructions according to the predicted outcome. Branch prediction techniques are known that can predict branch outcomes with greater than 95% accuracy. These instructions are "speculatively executed" to allow the processor to make forward progress during the time the branch instruction is resolved. When the prediction is correct, the results of the speculative execution can be used as correct results, greatly improving processor speed and efficiency. When the prediction is incorrect, the completely or partially executed instructions must be flushed from the processor and execution of the correct branch initiated.
Early processors executed instructions in an order determined by the compiled machine-language program running on the processor and so are referred to as "in-order" or "sequential" processors. In superscalar processors multiple pipelines can simultaneously process instructions only when there are no data dependencies between the instructions in each pipeline. Data dependencies cause one or more pipelines to "stall" waiting for the dependent data to become available. This is further complicated in superpipelined processors where, because many instructions are executed simultaneously in each pipeline, the potential quantity of data dependencies is large. Hence, greater parallelism and higher performance are achieved by "out-of-order" processors that include multiple pipelines in which instructions are processed in parallel in any efficient order that takes advantage of opportunities for parallel processing that may be provided by the instruction code.
Although out-of-order processing greatly improves throughput, it also increases complexity as compared to simple sequential processors. In fact, due to out-of-order instruction scheduling, a load instruction may be boosted past an older store which stores data to the same location. In this case, the load may hit in the cache and return an older value. Such cases need to be detected and corrected to ensure correct program execution.
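The hazard described above may be illustrated with a minimal sketch in which a dictionary stands in for the data cache. The `run` helper, the addresses, and the values are hypothetical and chosen only for illustration. The sketch models nothing beyond the order in which the two memory operations reach the cache.

```python
def run(ops, issue_order, initial_cache):
    """Apply memory operations to a dict acting as the data cache, in the
    given issue order, and return the value seen by each load (keyed by the
    load's position in program order)."""
    cache = dict(initial_cache)
    loaded = {}
    for i in issue_order:
        kind, addr, value = ops[i]
        if kind == "store":
            cache[addr] = value
        else:
            loaded[i] = cache[addr]
    return loaded


# Program order: an older store to 0x40 followed by a younger load from 0x40.
ops = [("store", 0x40, 99), ("load", 0x40, None)]
initial = {0x40: 1}

in_order = run(ops, [0, 1], initial)  # load sees the stored value 99
boosted = run(ops, [1, 0], initial)   # load issued first sees the old value 1
```

In the boosted case the load hits in the cache and returns the value that was present before the older store executed, which is the situation that must be detected and corrected.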
Because of this, conventional processor design simply allowed for memory instructions to be issued in order so that they can resolve in order. Nevertheless, in those instances where register space is unavailable and program code issues a store to a certain memory location instead (i.e., "register spilling") and then loads from the same location, it is possible that the program compiler may not be able to determine that the store and load would map to the same memory location. This is referred to as a lack of memory disambiguation, that is, a lack of knowledge as to how the addresses used by the program correspond to what is actually mapped. Stated another way, the memory location map was constructed without knowledge that the store and load instructions were directed to the same memory location.
In out-of-order instruction execution, memory instructions, including loads and stores, may be issued in any order. In this regard, copending U.S. patent application Ser. No. 08/882,311 entitled AN APPARATUS AND METHOD FOR MAINTAINING PROGRAM CORRECTNESS WHILE ALLOWING LOADS TO BE BOOSTED PAST STORES IN AN OUT-OF-ORDER MACHINE and identified as Docket No. P2365/37178.830080.000 filed concurrently herewith by Ramesh Panwar, P. K. Chidambaran, and Ricky C. Hetherington discloses a system, apparatus and method for ensuring program correctness in an out-of-order processor despite younger loads being boosted past an older store, through the use of a memory disambiguation buffer ("MDB"). The memory disambiguation buffer stores all memory operations that have not yet been retired. Each entry has several fields, among which are the data and the addresses of the memory operations. An incoming load checks its address against the addresses of all the stores. If there is a match against an older store, then the load must have received old data from the data cache, and the load operation is replayed to seek data from the memory disambiguation buffer on the replay. If, on the other hand, there are no matches against any older store, the load is assumed to have received the right data from the data cache (assuming a data cache hit). An incoming store checks its address against the addresses of all younger loads. If there is a match against any younger load, then the younger load is replayed along with all of its dependents.
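The address checks performed by such a memory disambiguation buffer may be sketched as follows. This is a simplified software model offered for illustration only, not the circuitry of the referenced application. The class, method, and field names are hypothetical, program order is assumed to be represented by a sequence number in which a smaller number denotes an older operation, and the replay of dependents is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemOp:
    seq: int                    # program order; smaller means older (assumed)
    kind: str                   # "load" or "store"
    addr: int
    data: Optional[int] = None  # store data; None for loads


@dataclass
class MemoryDisambiguationBuffer:
    entries: List[MemOp] = field(default_factory=list)

    def issue_load(self, seq: int, addr: int) -> bool:
        """Record an incoming load. Return True if it matches an older store
        already in the buffer and must therefore be replayed."""
        replay = any(
            e.kind == "store" and e.addr == addr and e.seq < seq
            for e in self.entries
        )
        self.entries.append(MemOp(seq, "load", addr))
        return replay

    def issue_store(self, seq: int, addr: int, data: int) -> List[int]:
        """Record an incoming store. Return the sequence numbers of younger
        loads to the same address, which must be replayed."""
        replays = [
            e.seq for e in self.entries
            if e.kind == "load" and e.addr == addr and e.seq > seq
        ]
        self.entries.append(MemOp(seq, "store", addr, data))
        return replays

    def retire(self, seq: int) -> None:
        """Remove a retired memory operation from the buffer."""
        self.entries = [e for e in self.entries if e.seq != seq]


mdb = MemoryDisambiguationBuffer()
first = mdb.issue_load(seq=2, addr=0x100)             # no stores yet: no replay
hits = mdb.issue_store(seq=1, addr=0x100, data=7)     # older store finds younger load 2
later = mdb.issue_load(seq=3, addr=0x100)             # matches older store: replay
other = mdb.issue_load(seq=4, addr=0x200)             # different address: no replay
```

In this model the load with sequence number 2 was boosted past the older store with sequence number 1, and the collision is discovered only when the store arrives and checks the younger loads.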
Through the use of this type of speculative out-of-order execution of memory instructions, younger loads can be boosted past older stores and the memory disambiguation buffer exists to allow correct program execution. The memory disambiguation buffer replays (or rewinds) any load (and its chain of dependents) that was boosted past a colliding store due to over-eager scheduling. However, a penalty is exacted on the execution performance due to the rewind and replay of such loads and their dependents. Therefore, a means for restraining over-eager load boosting is required.