In an Out-Of-Order (“OOO”) microprocessor, instructions are allowed to issue and execute out of their program order. For example, the scheduler in an OOO processor can issue and execute a load before a previous store writes to the same memory location. Memory dependencies can therefore exist between such a load and a prior store that needs to access the same memory address. The OOO machine typically needs to address the errors that result when a load returns wrong data because it accesses a memory location before an earlier store has written to that location. This type of error is known as a memory read-after-write (RAW) violation. Further, such a load and store pair is typically referred to as “aliasing” because both instructions target the same memory address.
FIGS. 1A-1C illustrate certain typical problems that can arise as a result of out-of-order execution of loads and stores in an OOO machine. FIG. 1A shows that a later load instruction 152 loads data from the same memory address [0x4000] 153 that is referenced in a previous store instruction 151. Accordingly, the load 152 should place in register r3 the same value that store instruction 151 writes to memory from register r5. As shown in FIG. 1B, if the load instruction 152 is executed before the store, it will load incorrect data. This is known as a RAW violation as indicated above. In order to recover from the violation, the pipeline will need to be flushed and the load instruction, along with other instructions dependent on it, will need to re-execute. Because of the high computational penalty of a flush operation, it is important to prevent this kind of re-execution in a high performance CPU.
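By way of illustration only, the ordering problem of FIGS. 1A and 1B can be sketched as a toy model in Python. The register value 42, the stale memory value 7, and the function name `run` are assumptions chosen for the example and do not appear in the figures.

```python
# Toy model of FIGS. 1A/1B: store 151 writes r5 to [0x4000], and the
# younger load 152 reads [0x4000] into r3. All values are illustrative.

def run(order):
    """Execute the two memory operations in the given order and return r3."""
    regs = {"r5": 42, "r3": None}   # r5 holds the data to be stored (assumed value)
    memory = {0x4000: 7}            # stale contents before the store (assumed value)
    for op in order:
        if op == "store":
            memory[0x4000] = regs["r5"]     # store instruction 151
        elif op == "load":
            regs["r3"] = memory[0x4000]     # load instruction 152
    return regs["r3"]

in_order = run(["store", "load"])       # program order, as in FIG. 1A
out_of_order = run(["load", "store"])   # load executed early, as in FIG. 1B
raw_violation = in_order != out_of_order
```

In this sketch the early load returns the stale value, which is the condition that would force a flush and re-execution.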
Another type of common problem that results from out-of-order execution of loads and stores is a read-after-write (RAW) delay. FIG. 1C illustrates a load instruction 152 that is executed after store instruction 151, but the store instruction 151 is blocked by a long latency memory access resulting from load instruction 162. This results in a RAW delay.
A store has both store address (SA) and store data (SD) components. It is possible that an SA, e.g., [0x4000] as shown in FIG. 1C, can be issued well before an SD, e.g., r5 in FIG. 1C, because the SD is waiting for another register source to be ready, e.g., r5 from load instruction 162. Store instruction 151 cannot execute until the proper value is loaded into register r5 by load instruction 162. It is, therefore, important to avoid executing load 152 while store instruction 151 is waiting on load instruction 162, so that resources in the pipeline, e.g., the load store queue (LSQ), can be used by other loads which do not have to wait.
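The split between the SA and SD components can be sketched, under stated assumptions, as a small data structure. The class `Store`, the helper `load_status`, and the status strings are hypothetical names introduced for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Store:
    address: int            # SA component, assumed to be already issued
    data: Optional[int]     # SD component; None while its source register is not ready


def load_status(load_address: int, older_stores: List[Store]) -> str:
    """Return "wait" if an aliasing older store still lacks its data, else "ready"."""
    for store in older_stores:
        if store.address == load_address and store.data is None:
            return "wait"
    return "ready"


# FIG. 1C: the SA [0x4000] of store 151 is known, but its SD (r5) waits on load 162.
store_151 = Store(address=0x4000, data=None)
blocked = load_status(0x4000, [store_151])      # load 152 aliases store 151
unrelated = load_status(0x5000, [store_151])    # a load to a different address

store_151.data = 42                             # load 162 delivers r5 (assumed value)
after_data = load_status(0x4000, [store_151])
```

Only the aliasing load is held back in this sketch; a load to a different address is unaffected by the pending store data.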
FIG. 2 illustrates a more detailed example of how a conventional OOO microprocessor handles a memory read-after-write (“RAW”) violation. Instruction 1 257, Instruction 2 258, Instruction 3 259, and Instruction 4 260 are in program order. However, in an OOO machine, Instruction 3 259, the load instruction, can execute during cycle 2 before Instruction 2 258, which is a store instruction that executes in cycle 4 and accesses the same memory location [0x4000] as the load instruction 259. If load instruction 259 executes in an earlier cycle than the store instruction 258, it will get wrong data from memory location [0x4000]. Accordingly, the wrong data will be stored in register r9 by load instruction 259. Further, Instruction 4 260 may execute in cycle 3 using the wrong data from the load instruction 259.
In order to correct the errors resulting from this RAW violation, both instructions 259 and 260 are invalidated and need to re-execute following a pipeline flush. The load instruction 259 will receive the correct data from the store instruction 258 during the re-execution. However, a severe computational penalty is paid in order to perform the pipeline flush and re-execution.
Conventional methods of addressing the issues associated with RAW violations are problematic because, as will be explained in connection with FIG. 3, they have no way of tracking explicit dependence information between loads and their aliasing stores and, accordingly, result in unnecessary delays. Further, conventional OOO microprocessors lack any effective means of preventing memory RAW delays. FIG. 3 illustrates an exemplary pipeline for a conventional OOO microprocessor. Instructions are fetched at the fetch stage 302 and placed in the instruction fetch queue (IFQ) (not shown) within fetch stage 302. The instructions are generally the original assembly instructions found in the executable program.
These instructions reference the architectural registers, which are stored in register file 310. If the first fetched instruction were to be interrupted or raise an exception, the architectural register file 310 stores the results of all instructions until that point. Stated differently, the architectural register file stores the state that needs to be saved and restored in order to return to the program during debugging or otherwise.
In an OOO microprocessor, the instructions execute out of order while still preserving data dependence constraints. Because instructions may finish in an arbitrary order, the architectural register file 310 cannot be modified by the instructions as they finish, because doing so would make it difficult to restore the register values accurately in the event of an exception or an interrupt. Hence, every instruction that enters the pipeline is provided a temporary register where it can save its result. The temporary registers are eventually written into the architectural register file in program order. Thus, even though instructions are being executed out of order, the contents of the architectural register file change as though the instructions were being executed in program order.
The ROB 308 can facilitate this process. After the instructions are dispatched from the fetch unit 302, they are decoded by decode module 304 and are placed in the ROB 308 and issue queue 306 (IQ). The ROB 308 and IQ 306 may be part of a scheduler module 372. As scheduler module 372 issues or dispatches instructions out of IQ 306 out of order, they are executed by execute module 312.
In a conventional OOO micro-architecture, the write back module 314 first writes the resulting values from those instructions back to the temporary registers in ROB 308. The ROB 308 keeps track of the program order in which instructions entered the pipeline and, for each of these instructions, the ROB maintains temporary register storage. When the oldest instructions in the ROB produce a valid result, those instructions can be safely “committed.” That is, the results of those instructions can be made permanent since there is no earlier instruction that can raise a mispredict or exception that may undo the effect of those instructions. When instructions are ready to be committed, the ROB 308 will move the corresponding values in the temporary registers for those instructions to the architectural register file 310. Therefore, through the ROB's in-order commit process, the results in the register file 310 are made permanent and architecturally visible.
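The in-order commit behavior described above can be sketched as follows. This is a minimal illustration rather than a description of any particular ROB design; the class `ReorderBuffer`, its method names, and the register values are assumptions made for the example.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results are written back in any order but committed in program order."""

    def __init__(self):
        self.entries = deque()   # oldest instruction sits at the left
        self.arch_regs = {}      # stands in for the architectural register file

    def allocate(self, dest):
        """Reserve temporary storage for an instruction entering the pipeline."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_back(self, entry, value):
        """Record a result in the temporary register only."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head, stopping at the first unfinished one."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            committed.append(entry["dest"])
        return committed


rob = ReorderBuffer()
older = rob.allocate("r1")
younger = rob.allocate("r2")

rob.write_back(younger, 20)      # the younger instruction finishes first
first_commit = rob.commit()      # nothing retires: the older entry is not done

rob.write_back(older, 10)
second_commit = rob.commit()     # both retire, oldest first
```

The younger result stays in temporary storage until the older instruction completes, so the architectural state changes only in program order.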
The instructions issued out of order from the IQ 306 may also comprise loads and stores. As explained above, when loads and stores are issued out of order from the IQ 306, there are memory dependencies between them that need to be resolved before those instructions can be committed. Accordingly, the load and store instructions are stored in a Load Store Queue (LSQ) 316 while the dependencies between them are resolved with the help of ROB 308.
Conventional OOO machines handle RAW violations by using, for example, a Store to Load Predictor module 356. Store to Load Predictor module 356 is used to predict data dependencies between loads and previous stores. If a RAW violation takes place, the PC of the problematic load is stored in a table in module 356. Subsequently, if scheduler 372 attempts to issue a load out of order from IQ 306, it will check the table in module 356 to make sure that the PC of the load does not match any entry in the table. If the load does match a prior entry in the table, then the scheduler will ensure that the load is not issued until all stores prior to the load are issued. This is inefficient because not all previous stores will be relevant to a problematic load. Only the stores that access the same memory location (e.g., of memory 318) as the problematic load, i.e., only the aliasing stores, need to be issued prior to the load. However, schedulers in conventional OOO processors do not have visibility into the memory locations accessed by the load and store instructions and, therefore, cannot discriminate between the prior stores.
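The table lookup described above, and the reason it is overly conservative, can be sketched as follows. The class name, the PC values 0x100 and 0x200, and the second store address 0x8000 are assumptions for illustration; the function `only_aliasing_stores_issued` is a hypothetical address-aware check included only for contrast, since a conventional scheduler does not have this information.

```python
class StoreToLoadPredictor:
    """Toy PC-indexed table of loads that previously caused a RAW violation."""

    def __init__(self):
        self.violating_load_pcs = set()

    def record_violation(self, load_pc):
        self.violating_load_pcs.add(load_pc)

    def may_issue(self, load_pc, older_stores):
        """older_stores: list of (address, issued) pairs for stores older than the load."""
        if load_pc not in self.violating_load_pcs:
            return True   # no recorded violation: the load may issue out of order
        # A flagged load waits for ALL older stores; addresses are not consulted.
        return all(issued for _, issued in older_stores)


def only_aliasing_stores_issued(load_address, older_stores):
    """Hypothetical check that considers only stores to the load's own address."""
    return all(issued for address, issued in older_stores if address == load_address)


predictor = StoreToLoadPredictor()
predictor.record_violation(0x100)   # the load at PC 0x100 violated earlier

# The aliasing store to [0x4000] has issued; an unrelated store to [0x8000] has not.
older_stores = [(0x4000, True), (0x8000, False)]

conservative = predictor.may_issue(0x100, older_stores)
address_aware = only_aliasing_stores_issued(0x4000, older_stores)
untracked = predictor.may_issue(0x200, older_stores)
```

In this sketch the flagged load is held back by the unrelated store even though its only aliasing store has already issued, which is the inefficiency noted above.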
Further, as discussed above, conventional OOO processors lack any effective means of preventing memory RAW delays. To guarantee correctness in the Load Store Queue (LSQ), when a conventional OOO processor finds an aliasing store in the Store Queue (SQ) still in the process of writing data that a problematic load, e.g., instruction 152 in FIG. 1C, wants to read, the LSQ will send a dependent throughput miss to the scheduler for this load, effectively putting this load into a sleep state. Subsequently, the LSQ will retry the load once store data is ready. There are obvious performance and power costs associated with these retries. For example, the problematic load 152 will occupy space in the LSQ, which could otherwise have been used by another load that did not have to wait for an aliasing store.
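The occupancy cost described above can be illustrated with a simplified cycle-level sketch. The function `run_lsq`, the LSQ capacities, the one-cycle load latency, and the store-data-ready cycle are all assumptions made for the example rather than properties of any particular processor.

```python
def run_lsq(capacity, loads, store_data_ready_cycle, max_cycles=20):
    """Return {load name: completion cycle} for a toy LSQ.

    loads: list of (name, aliases_pending_store) in arrival order.
    A load that aliases the pending store sleeps, keeping its LSQ entry,
    until the store data is ready; any other load completes in one cycle.
    """
    pending = list(loads)
    lsq = []
    done = {}
    for cycle in range(max_cycles):
        # Fill free LSQ entries in arrival order.
        while pending and len(lsq) < capacity:
            lsq.append(pending.pop(0))
        still_waiting = []
        for name, aliases in lsq:
            if aliases and cycle < store_data_ready_cycle:
                still_waiting.append((name, aliases))   # asleep, entry stays occupied
            else:
                done[name] = cycle                      # completes, entry is freed
        lsq = still_waiting
        if not pending and not lsq:
            break
    return done


loads = [("load_152", True), ("other_load", False)]

# With a single entry, the sleeping load 152 keeps the independent load out.
scarce = run_lsq(capacity=1, loads=loads, store_data_ready_cycle=3)

# With two entries, the independent load completes immediately.
roomy = run_lsq(capacity=2, loads=loads, store_data_ready_cycle=3)
```

Under these assumptions, the independent load finishes at cycle 4 when the entry is scarce and at cycle 0 when a spare entry exists, while load 152 finishes at cycle 3 in both cases.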