In an Out-Of-Order (“OOO”) microprocessor, instructions are allowed to issue and execute out of their program order. For example, the scheduler in an OOO processor can issue and execute a load before a previous store writes to the same memory location. Memory dependencies can therefore exist between such a load and a prior store that needs to access the same memory address. The OOO machine typically needs to address the errors that result when a load returns wrong data because it accesses a memory location before a prior store has written to that location. This type of error is known as a memory read-after-write (RAW) violation. Further, such a load and store pair is typically referred to as “aliasing” because both instructions target the same memory address.
Memory disambiguation is a set of techniques employed by high-performance out-of-order execution microprocessors that execute memory access instructions (loads and stores) out of program order. The mechanisms for performing memory disambiguation, implemented using digital logic inside the microprocessor core, detect true dependencies between memory operations at execution time and allow the processor to recover when a dependence has been violated. They also eliminate spurious memory dependencies and allow for greater instruction-level parallelism by allowing safe out-of-order execution of loads and stores.
FIGS. 1A-1B illustrate a typical RAW violation that can arise as a result of out-of-order execution of loads and stores in an OOO machine. FIG. 1A shows that a later load instruction 152 loads data from the same memory address [0x4000] 153 that a previous store 151 writes to. Accordingly, the load 152 should store the same value in register r3 as that stored in register r5 by store instruction 151. As shown in FIG. 1B, if the load instruction 152 is executed before the store 151, it will load incorrect data. This is known as a RAW violation, as discussed above. In order to recover from the violation, the pipeline will need to be flushed, and the load instruction, along with other instructions dependent on it, will need to re-execute. Because of the high computational penalty of a flush operation, it is important to avoid this kind of re-execution in a high-performance CPU.
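By way of illustration only, the effect shown in FIGS. 1A-1B can be reproduced with a minimal software model. The sketch below is not part of any described hardware: the function name `run`, the initial memory contents (0), and the value held in r5 (42) are assumptions chosen for the example, and it assumes that store 151 writes the value of r5 to address [0x4000].

```python
def run(order):
    """Execute a store and a load in the given order and return r3.

    Hypothetical model: memory and registers are plain dictionaries.
    """
    memory = {0x4000: 0}             # assumed stale value before the store
    regs = {"r5": 42, "r3": None}    # assumed value to be stored

    def store():                     # models store 151: [0x4000] <- r5
        memory[0x4000] = regs["r5"]

    def load():                      # models load 152: r3 <- [0x4000]
        regs["r3"] = memory[0x4000]

    ops = {"store": store, "load": load}
    for name in order:
        ops[name]()
    return regs["r3"]


print(run(["store", "load"]))   # program order: r3 receives 42
print(run(["load", "store"]))   # load first: r3 receives the stale 0
```

In the second call the load reads the location before the store has written it, which is the RAW violation described above.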
FIG. 2 illustrates a more detailed example of how a conventional OOO microprocessor handles a memory read-after-write (“RAW”) violation. Instruction 1 257, Instruction 2 258, Instruction 3 259, and Instruction 4 260 are in program order. However, in an OOO machine, Instruction 3 259, the load instruction, can execute during cycle 2 before Instruction 2 258, which is a store instruction that executes in cycle 4 and accesses the same memory location [0x4000] as the load instruction 259. If load instruction 259 executes in an earlier cycle than the store instruction 258, it will get wrong data from memory location [0x4000]. Accordingly, the wrong data will be stored in register r9 by load instruction 259. Further, Instruction 4 260 may execute in cycle 3 using the wrong data from the load instruction 259.
In order to correct the errors resulting from this RAW violation, both instructions 259 and 260 are invalidated and need to re-execute following a pipeline flush. The load instruction 259 will receive the correct data from the store instruction 258 during the re-execution. However, a severe computational penalty is paid in order to perform the pipeline flush and re-execution.
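The selection of instructions to invalidate in the FIG. 2 example can be sketched as follows. This is a simplified model under stated assumptions, not a description of actual flush logic: the `Instr` record, the function `instructions_to_invalidate`, the operation performed by Instruction 1, the cycle in which it executes, and the registers r5 and r10 are all hypothetical. A real pipeline flush may also discard younger instructions that are not data-dependent on the load.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    seq: int                      # position in program order
    kind: str                     # "store", "load", or "alu"
    cycle: int                    # cycle in which the instruction executed
    addr: Optional[int] = None    # memory address, for loads and stores
    dest: Optional[str] = None    # destination register, if any
    srcs: Tuple[str, ...] = ()    # source registers


def instructions_to_invalidate(program):
    """Return the sequence numbers of instructions that consumed wrong data."""
    bad_regs = set()   # registers currently holding wrong data
    invalid = []
    for ins in sorted(program, key=lambda i: i.seq):
        # A load violates RAW if an older store to the same address
        # executed in a later cycle than the load did.
        raw_violation = ins.kind == "load" and any(
            s.kind == "store" and s.seq < ins.seq
            and s.addr == ins.addr and s.cycle > ins.cycle
            for s in program
        )
        consumes_bad_value = any(r in bad_regs for r in ins.srcs)
        if raw_violation or consumes_bad_value:
            invalid.append(ins.seq)
            if ins.dest is not None:
                bad_regs.add(ins.dest)
        elif ins.dest is not None:
            bad_regs.discard(ins.dest)   # overwritten with a good value
    return invalid


fig2 = [
    Instr(1, "alu", cycle=1, dest="r5"),                      # hypothetical
    Instr(2, "store", cycle=4, addr=0x4000, srcs=("r5",)),    # Instruction 2 258
    Instr(3, "load", cycle=2, addr=0x4000, dest="r9"),        # Instruction 3 259
    Instr(4, "alu", cycle=3, dest="r10", srcs=("r9",)),       # Instruction 4 260
]
print(instructions_to_invalidate(fig2))   # [3, 4]
```

The result identifies the load and its dependent instruction, corresponding to instructions 259 and 260 in the text.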
FIG. 3 illustrates a pipeline for a conventional OOO microprocessor. Instructions are fetched at the fetch stage 302 and placed in the instruction fetch queue (IFQ) (not shown) within fetch stage 302. The instructions are generally the original assembly instructions found in the executable program. These instructions reference the architectural registers, which are stored in register file 310. If the first fetched instruction were to be interrupted or to raise an exception, the architectural register file 310 stores the results of all instructions until that point. Stated differently, the architectural register file stores the state that needs to be saved and restored in order to return to the program during debugging or otherwise.
In an OOO microprocessor, the instructions execute out of order while still preserving data dependence constraints. Because instructions may finish in an arbitrary order, the architectural register file 310 cannot be modified by the instructions as they finish, because doing so would make it difficult to restore the register values accurately in the event of an exception or an interrupt. Hence, every instruction that enters the pipeline is provided a temporary register where it can save its result. The temporary registers are eventually written into the architectural register file in program order. Thus, even though instructions are being executed out of order, the contents of the architectural register file change as though the instructions were being executed in program order.
The ROB 308 facilitates this process. After the instructions are dispatched from the fetch unit 302, they are decoded by decode module 304 and are placed in the ROB 308 and issue queue 306 (IQ). The ROB 308 and IQ 306 may be part of a scheduler module 372. As instructions are issued out of IQ 306 out of order, they are executed by execute module 312.
In one embodiment, the write back module 314 will write the resulting values from those instructions back to the temporary registers in ROB 308 and rely on the ROB 308 to facilitate committing the instructions in order. However, in a different embodiment, write back module 314 writes the values resulting from instruction execution directly into register file 310 without sorting them. The values are written to physical storage in the register file 310 in whatever order they are produced and are then retired to the architectural files in order at the retirement stage using a ROB-initiated protocol.
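The first embodiment above, in which results are held in temporary registers in the ROB and committed in program order, might be modeled as in the following sketch. The class `ReorderBuffer` and its methods `allocate`, `write_back`, and `commit` are hypothetical names introduced for illustration, and the sketch assumes that entries are allocated in program order.

```python
from collections import OrderedDict


class ReorderBuffer:
    """Hypothetical ROB model: out-of-order write back, in-order commit."""

    def __init__(self):
        self.entries = OrderedDict()   # seq -> entry, kept in allocation order
        self.arch_regs = {}            # models the architectural register file

    def allocate(self, seq, dest):
        # Assumed to be called in program order.
        self.entries[seq] = {"dest": dest, "value": None, "done": False}

    def write_back(self, seq, value):
        # Results may arrive in any order; they go to the temporary slot only.
        entry = self.entries[seq]
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # Retire from the head only while the oldest entry has finished.
        committed = []
        while self.entries:
            seq, entry = next(iter(self.entries.items()))
            if not entry["done"]:
                break
            self.arch_regs[entry["dest"]] = entry["value"]
            del self.entries[seq]
            committed.append(seq)
        return committed


rob = ReorderBuffer()
for seq, dest in [(1, "r1"), (2, "r2"), (3, "r3")]:
    rob.allocate(seq, dest)

rob.write_back(3, 30)
print(rob.commit())   # [] : instruction 3 finished first but cannot commit yet
rob.write_back(1, 10)
print(rob.commit())   # [1]
rob.write_back(2, 20)
print(rob.commit())   # [2, 3]
```

Although instruction 3 finishes first, the architectural state is updated only in the order 1, 2, 3.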
The instructions issued out of order from the IQ 306 may also comprise loads and stores. As explained above, when loads and stores are issued out of order from the IQ 306, there are memory dependencies between them that need to be resolved before those instructions can be committed. Accordingly, the store instructions are stored in order in a Load Store Queue (LSQ) 316 while the dependencies between the loads and stores are resolved with the help of ROB 308.
A load instruction uses registers in the register file 310 to compute an effective address and, subsequently, brings the data from that address in memory 318 into a register in register file 310. The store similarly uses registers in the register file 310 to compute an effective address, then transfers data from a register into that address in memory 318. Hence, loads and stores must first wait for register dependencies to be resolved in order to compute their respective effective address. Accordingly, each store instruction is queued in order in a load/store queue (LSQ) 316 while it is waiting for a register value to be produced—when it receives the broadcast regarding its availability, the effective address computation part of the store is issued.
The Load Store Queue (“LSQ”) is a component in a conventional OOO microprocessor pipeline that aids memory disambiguation. One of the key requirements for the LSQ is availability of information that allows age order determination between loads and stores. Stated differently, the LSQ requires information that allows it to order the various loads and stores based on age. For example, for a memory load operation to successfully complete, the LSQ must confirm that all stores that are older in age order present no RAW hazard and that no younger loads incorrectly create hazards with loads to the same address.
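The age-order check on older stores described above can be sketched as follows. The record `MemOp`, the function `conflicting_store`, and the convention that a smaller `age` value means an older operation are assumptions made for this example; the sketch covers only the store-to-load RAW check and not the check among loads.

```python
from typing import NamedTuple


class MemOp(NamedTuple):
    age: int                 # smaller value = older in program order (assumed)
    addr: int                # effective address
    completed: bool = False  # True once a store has written its data


def conflicting_store(load, stores):
    """Return the youngest older, uncompleted store to the load's address.

    Returns None when no older store presents a RAW hazard.
    """
    hazards = [
        s for s in stores
        if s.age < load.age and s.addr == load.addr and not s.completed
    ]
    return max(hazards, key=lambda s: s.age, default=None)


stores = [MemOp(1, 0x4000), MemOp(2, 0x5000), MemOp(5, 0x4000)]
print(conflicting_store(MemOp(3, 0x4000), stores))   # the store with age 1
print(conflicting_store(MemOp(3, 0x6000), stores))   # None
```

The store with age 5 is ignored for the load with age 3 because it is younger than the load, which is why age information must be available to the LSQ.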
In an In-Order machine, this is a relatively easy design because the operations arrive at the LSQ in program order and, consequently, also in age order. In an OOO processor, however, the memory operations arrive at the LSQ out of order. One of the problems this causes is that all the stores older than a given load operation must be monitored by some module within the microprocessor pipeline, e.g., a scheduler 372. This monitoring is needed to assist the load operation in determining the completion status of all older stores, which in turn is needed to make a final decision as to whether the data the load operation has acquired is correct or whether it has the potential to encounter a hazard with a conflicting store.
Store instructions are queued in order in an LSQ of a conventional OOO processor because, when stores are issued out of order from the IQ 306, there are memory dependencies between the loads and the store instructions that need to be resolved before they can access memory 318, as discussed above. For example, a load can access the memory only after it is confirmed that there are no prior stores that refer to the same address. It is, once again, the ROB 308 that is used to keep track of the various dependencies between the stores and the loads.
Further, in conventional OOO processors, the scheduler 372 can also comprise an index array 340 that the ROB 308 communicates with in order to track the various dependencies. The index array 340 is used to store tags that the ROB 308 assigns to all load and store instructions that are dispatched from IQ 306. These tags are used to designate slots in the LSQ 316 for the store instructions, so that the instructions can be allocated in the LSQ 316 in program order. This, in turn, allows memory 318 to be accessed by the store instructions in program order. As a result, in conventional OOO processors, additional storage can be required within the scheduler 372 for an index array 340 that stores tags for the respective locations of store instructions in the LSQ. Further, additional communication overhead is required to communicate tag related information between the scheduler 372 and LSQ 316.
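One possible reading of the tag mechanism described above is sketched below. The class `TaggedStoreQueue` and its methods `assign_tag` and `insert` are hypothetical, and the sketch assumes that tags are assigned in program order, that each tag directly names one LSQ slot, and that the queue does not wrap around; none of these details is specified above.

```python
class TaggedStoreQueue:
    """Hypothetical model of an index array designating LSQ slots for stores."""

    def __init__(self, size):
        self.slots = [None] * size   # models store slots in the LSQ
        self.index_array = {}        # seq -> tag (slot number), kept by the scheduler
        self.next_tag = 0

    def assign_tag(self, seq):
        # Assumed to be called in program order, so tags follow program order.
        if self.next_tag >= len(self.slots):
            raise RuntimeError("store queue full")
        tag = self.next_tag
        self.index_array[seq] = tag
        self.next_tag += 1
        return tag

    def insert(self, seq, addr):
        # A store issued out of order still lands in its designated slot.
        self.slots[self.index_array[seq]] = (seq, addr)


q = TaggedStoreQueue(size=4)
for seq in (10, 11, 12):
    q.assign_tag(seq)

# The stores issue out of order, yet occupy the queue in program order.
q.insert(12, 0x4008)
q.insert(10, 0x4000)
q.insert(11, 0x4004)
print([entry[0] for entry in q.slots if entry is not None])   # [10, 11, 12]
```

Every `insert` requires a lookup in the index array, which in this model stands in for the tag communication between the scheduler and the LSQ described above.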
A significant disadvantage of conventional OOO processor methods of addressing memory operation disambiguation, then, is that using the scheduler 372 with index array 340, for example, to track dependencies in the LSQ introduces unnecessary delay because of the latency involved in communicating tags between the LSQ 316 and scheduler 372. Further, the requirement to queue stores in LSQ 316 in program order in conventional OOO processor designs results in an additional computational penalty. Finally, in conventional OOO processors, LSQ 316 may need to constantly check a load operation to determine the completion status of all older stores with respect to the load operation, which can also introduce additional computational cost.