Microprocessors have a load instruction that loads data from a source memory location to a register of the microprocessor and a store instruction that stores data from a register of the microprocessor to a destination memory location. Commonly, the microprocessor will encounter a load instruction that specifies a source memory address that overlaps with the destination memory address of an older store instruction. That is, the older store instruction is writing data to a memory address from which the load instruction is reading. This situation is commonly referred to as a store collision. In order to achieve correct program execution in the presence of a store collision, the microprocessor must ensure that the load instruction receives the data written by the older address-overlapping store instruction.
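By way of illustration only, the address-overlap condition that defines a store collision may be sketched as a comparison of two byte ranges. The function name `ranges_overlap` and the example addresses and sizes are assumptions made for this sketch and do not appear in any particular microprocessor.

```python
def ranges_overlap(load_addr, load_size, store_addr, store_size):
    """Return True if the half-open byte ranges [addr, addr + size) share a byte."""
    return (load_addr < store_addr + store_size
            and store_addr < load_addr + load_size)

# Hypothetical case: a 4-byte store to 0x1000 and a 2-byte load from 0x1002.
collides = ranges_overlap(0x1002, 2, 0x1000, 4)

# Hypothetical case: a load that begins just past the last byte of the store.
disjoint = ranges_overlap(0x1004, 4, 0x1000, 4)
```

In this sketch the first pair constitutes a store collision and the second does not, since the load begins at the first byte beyond the store's range.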
Out-of-order execution microprocessors execute instructions out of program order. This can be problematic in the context of a store collision because the load instruction may be issued for execution before the older store instruction, thereby causing the load instruction to receive stale data rather than the store data. In such a case, the load instruction must not be allowed to retire the load data to its architectural destination register. Rather, the load instruction must receive the correct store data and retire the correct data to the destination register.
One way to cause the load instruction to receive the correct store data is to perform a replay. That is, the microprocessor detects the situation described above and forces the load instruction to be re-issued and re-executed after the store instruction has executed. Upon subsequent execution, the load instruction will receive the correct store data since the store instruction has been executed.
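The replay sequence described above may be sketched, under simplifying assumptions, as follows. The single-byte `memory` dictionary, the `Load` class, the `needs_replay` flag, and the `execute_store` helper are hypothetical constructs for this illustration; an actual microprocessor performs the detection in hardware and compares address ranges rather than exact addresses.

```python
memory = {0x1000: 0xAA}  # hypothetical memory cell holding the old value

class Load:
    def __init__(self, addr):
        self.addr = addr
        self.data = None
        self.needs_replay = False

    def execute(self):
        # Read whatever value memory currently holds at the load address.
        self.data = memory[self.addr]

def execute_store(addr, value, younger_executed_loads):
    """Write the store data, then flag younger loads that already read addr."""
    memory[addr] = value
    for load in younger_executed_loads:
        if load.addr == addr:
            load.needs_replay = True

load = Load(0x1000)
load.execute()                       # load issued ahead of the older store
stale = load.data                    # the load received the old value

execute_store(0x1000, 0xBB, [load])  # older store executes and detects the collision

if load.needs_replay:                # replay: re-issue and re-execute the load
    load.execute()
    load.needs_replay = False
```

After the replay, `load.data` holds the value written by the store, whereas `stale` records the incorrect value that must not be retired.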
However, replays can be relatively expensive, particularly in microprocessors that are deeply pipelined. First, the store instruction may be dependent on other instructions, and indeed may be at the end of a long chain of dependencies, such that it may not execute for many clock cycles; thus, the load instruction may have to wait many clock cycles before it can be replayed. The longer the load instruction must wait to be replayed, the larger the penalty to process the load instruction. Second, the load instruction must pass back through the relevant pipeline stages when it is re-issued and re-executed, which takes additional clock cycles. The more pipeline stages the load instruction must pass back through, the larger the penalty in clock cycles to process the load instruction.
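The two contributions to the replay penalty may be approximated, for illustration only, by a simple additive model. The function `replay_penalty`, the assumption of one clock cycle per pipeline stage, and the example cycle counts are all hypothetical and are not measurements of any microprocessor.

```python
def replay_penalty(store_wait_cycles, replayed_stages, cycles_per_stage=1):
    """Estimate replay cost as wait time plus time to re-traverse the pipeline."""
    return store_wait_cycles + replayed_stages * cycles_per_stage

# Hypothetical shallow pipeline: 5 stages re-traversed, store ready after 3 cycles.
shallow = replay_penalty(store_wait_cycles=3, replayed_stages=5)

# Hypothetical deep pipeline with a long dependency chain ahead of the store.
deep = replay_penalty(store_wait_cycles=20, replayed_stages=12)
```

Under these assumed values the penalty grows with either term, which is consistent with the observation that deeper pipelines and longer dependency chains make replays more costly.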
The system of U.S. Pat. No. 6,006,326 issued to Panwar et al. attempts to address this problem by employing a special array that stores color bits associated with load and store instructions. The color bits array includes entries corresponding to the instruction cache entries and is read when a load or store instruction is read from the instruction cache. The color bits of an entry in the array are updated to the same color value to indicate a dependency between a load and store instruction in response to a replay that was caused by issuing the load ahead of the store. When the store and load instructions are again placed in the pipeline for execution, the dependency checking logic detects that they have the same color and reports the dependency to the instruction scheduling logic, which does not schedule the load instruction until the similarly colored store instruction has been scheduled.
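The color-bits approach, as characterized in the preceding paragraph, may be sketched as follows. This is a simplified illustration and not the implementation of the referenced patent: the class `ColorBitsArray`, the `NO_COLOR` value, the unbounded color counter, and the helper `can_schedule_load` are assumptions introduced here, and an actual array would use a fixed number of color bits per entry.

```python
NO_COLOR = 0  # assumed marker for an entry with no recorded dependency

class ColorBitsArray:
    def __init__(self, num_icache_entries):
        # One color value per instruction cache entry.
        self.colors = [NO_COLOR] * num_icache_entries
        self.next_color = 1  # simplification: real color fields have finite width

    def record_replay(self, load_entry, store_entry):
        """After a replay, give the load and the store the same color."""
        color = self.next_color
        self.next_color += 1
        self.colors[load_entry] = color
        self.colors[store_entry] = color

    def depends_on(self, load_entry, store_entry):
        """Dependency check: both entries colored, and with the same color."""
        color = self.colors[load_entry]
        return color != NO_COLOR and color == self.colors[store_entry]

def can_schedule_load(array, load_entry, store_entry, store_scheduled):
    """Hold a load until its similarly colored store has been scheduled."""
    return store_scheduled or not array.depends_on(load_entry, store_entry)

array = ColorBitsArray(num_icache_entries=8)
before = can_schedule_load(array, 5, 2, store_scheduled=False)

array.record_replay(load_entry=5, store_entry=2)
held = can_schedule_load(array, 5, 2, store_scheduled=False)
released = can_schedule_load(array, 5, 2, store_scheduled=True)
```

In this sketch the load is freely schedulable before any replay has been recorded, is held once it shares a color with the store, and is released once the store has been scheduled.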
Because the color bits array must store color bits for each load and store instruction in the instruction cache, the size of the color bits array is a function of the instruction cache size. Thus, a potential disadvantage of the color bits array is that it may require a significant amount of storage on the microprocessor, since the number of entries of the instruction cache is typically relatively large. A relatively large color bits array may consume significant amounts of power and die area of the microprocessor.
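The dependence of the array size on the instruction cache size may be illustrated with a simple calculation. The entry counts and the four-bit color width below are hypothetical values chosen for the example and are not taken from the referenced patent.

```python
def color_array_bits(icache_entries, color_bits_per_entry):
    """Total storage, in bits, for one color field per instruction cache entry."""
    return icache_entries * color_bits_per_entry

# Hypothetical instruction caches of two sizes, each with 4 color bits per entry.
small = color_array_bits(icache_entries=1024, color_bits_per_entry=4)
large = color_array_bits(icache_entries=16384, color_bits_per_entry=4)
```

With these assumed values the storage grows in direct proportion to the number of instruction cache entries, which is the scaling that the paragraph above identifies as a potential disadvantage.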
Therefore, what is needed is an improved mechanism for reducing the number of load instruction replays in the presence of store collisions in an out-of-order execution microprocessor.