Programs frequently use store and load instructions. A store instruction moves data from a register of the processor to memory, and a load instruction moves data from memory to a register of the processor. Microprocessors frequently execute instruction streams in which one or more store instructions precede a load instruction and the load data is at the same memory location as the data written by one or more of the preceding store instructions. In these cases, in order to execute the program correctly, the microprocessor must ensure that the load instruction receives the store data produced by the newest preceding store instruction to that location. One way to accomplish correct program execution is for the load instruction to stall until the store instruction has written the data to memory (i.e., system memory or cache), and then for the load instruction to read the data from memory. However, stalling the load in this manner reduces performance. Therefore, modern microprocessors transfer the store data from the functional unit in which the store instruction resides (e.g., a store queue) to the functional unit in which the load instruction resides (e.g., a load unit). This is commonly referred to as a store forward operation, store forwarding, or store-to-load forwarding.
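By way of illustration only, the store forwarding behavior described above may be modeled in software as follows. This is a minimal sketch, not a description of any particular hardware; the names StoreEntry and execute_load, and the representation of memory as a dictionary, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:
    """One pending store in a hypothetical store queue."""
    address: int  # memory address the store will write
    data: int     # store data not yet written to memory


def execute_load(load_address, store_queue, memory):
    """Return (value, forwarded) for a load.

    store_queue holds only stores older than the load, oldest first.
    """
    # Scan from newest to oldest so the newest matching store wins.
    for entry in reversed(store_queue):
        if entry.address == load_address:
            return entry.data, True  # store data forwarded to the load
    # No older store to this address: read the data from memory.
    return memory.get(load_address, 0), False
```

In this model, a load to an address written by two pending stores receives the data of the newer store, and a load with no matching store reads memory without waiting for the stores to retire.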
In order to detect whether it needs to forward store data to a load instruction, the microprocessor compares the load memory address with the store memory addresses of older store instructions to see whether they match. For strict accuracy, the microprocessor needs to compare the physical address of the load with the physical addresses of the stores. However, translating the load virtual address into the load physical address takes time. So, in order to avoid delaying the address comparison, a modern microprocessor compares the load virtual address with the older store virtual addresses in parallel with the translation of the load virtual address to the load physical address, and forwards store data based on the virtual address comparison. The microprocessor then performs the physical address comparison to verify that the store forwarding based on the virtual address comparison was correct. If the physical address comparison shows that the forwarding was incorrect, the microprocessor corrects the mistake by replaying the load.
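The speculate-then-verify sequence described above may be sketched as follows. The sketch is illustrative only and rests on several assumptions that are not part of the description: a 4 KiB page size, a page table modeled as a dictionary from virtual page number to physical page number, and the hypothetical function names translate, newest_match, and speculative_load. In hardware the virtual comparison and the translation proceed in parallel; the sketch performs them in sequence.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


def translate(vaddr, page_table):
    """Translate a virtual address using a dict of VPN -> PPN."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    return (page_table[vaddr >> PAGE_SHIFT] << PAGE_SHIFT) | offset


def newest_match(addresses, target):
    """Index of the newest (last) address equal to target, or None."""
    for i in range(len(addresses) - 1, -1, -1):
        if addresses[i] == target:
            return i
    return None


def speculative_load(load_va, stores, page_table, memory):
    """stores is a list of (virtual_address, data), oldest first.

    Returns (value, replayed).
    """
    # Speculative forwarding decision from the virtual address comparison.
    virt_hit = newest_match([va for va, _ in stores], load_va)
    # Verification once the physical addresses are available.
    load_pa = translate(load_va, page_table)
    phys_hit = newest_match(
        [translate(va, page_table) for va, _ in stores], load_pa)
    # A disagreement means the speculation was wrong: replay the load.
    replayed = virt_hit != phys_hit
    if phys_hit is not None:
        return stores[phys_hit][1], replayed
    return memory.get(load_pa, 0), replayed
```

For example, if virtual pages 1 and 2 both map to physical page 5, a store to virtual address 0x1010 followed by a load from 0x2010 misses in the virtual comparison but matches in the physical comparison, so the load is replayed and receives the store data.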
Furthermore, because a comparison of the full virtual addresses consumes time, power, and chip real estate, and may limit the maximum clock frequency at which the microprocessor can operate, modern microprocessors tend to compare only a portion of the virtual address rather than the full virtual address. This may increase the number of false store collision detections and, consequently, the amount of incorrect forwarding. One solution to this problem is described in U.S. patent application Ser. No. 12/197,632 (CNTR.2405), filed Aug. 25, 2008, which is hereby incorporated by reference. However, more accurate ways of detecting store collisions for the purpose of store forwarding are still needed.
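The following sketch illustrates how a partial virtual address comparison can report a false store collision. Which address bits are compared is not specified above; the choice of the low 12 bits, and the function names partial_match and full_match, are assumptions made solely for the example.

```python
def partial_match(load_va, store_va, compared_bits=12):
    """Compare only the low compared_bits bits of two virtual addresses."""
    mask = (1 << compared_bits) - 1
    return (load_va & mask) == (store_va & mask)


def full_match(load_va, store_va):
    """Compare the full virtual addresses."""
    return load_va == store_va


# Two different addresses that share their low 12 bits.
load_va = 0x70001234
store_va = 0x90001234
false_collision = (partial_match(load_va, store_va)
                   and not full_match(load_va, store_va))
```

Here false_collision is true: the partial comparison detects a collision that the full comparison would reject, so forwarding based on the partial comparison would be incorrect and would have to be corrected later.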
Additionally, the time required to perform store forwarding using the virtual address comparison-based scheme may be hidden by the virtual-to-physical address translation time (i.e., TLB lookup time) and the cache tag and data array lookup time. However, if this ceases to be the case, an alternate way to detect store collisions for the purpose of store forwarding will be needed.
Finally, the virtual address comparison-based store collision detection scheme requires a relatively large number of address comparators, which consume a relatively large amount of space on the microprocessor die and a relatively large amount of power. Therefore, what is needed is a way to detect store collisions for the purpose of store forwarding that uses die real estate and power more efficiently.