1. Field of the Invention
This invention is related to the field of processors and, more particularly, to memory dependency checking and forwarding of store data to subsequent loads.
2. Description of the Related Art
Superscalar processors attempt to achieve high performance by issuing and executing multiple instructions per clock cycle and by employing the highest possible clock frequency consistent with the design. Over time, the number of instructions that superscalar processors can concurrently issue and/or execute has been increasing in order to further increase performance.
Unfortunately, as more instructions are executed concurrently, it becomes more important to rapidly process loads. Loads are accesses to external memory (as opposed to internal registers) in which the data stored at the memory location accessed by the load is transferred into the processor (e.g. into an internal register). By contrast, stores are accesses to external memory in which data produced by the processor is stored into the memory location accessed by the store. While loads and stores are defined to access external memory, one or more caches internal to the processor may be employed to decrease memory latency for accesses which hit in the caches.
Since loads transfer data from memory into the processor, typically so that the data may be operated upon by subsequent instruction operations, it is important to process the loads rapidly in order to provide the data to the subsequent instruction operations. If the data is not provided rapidly, the subsequent instruction operations stall. If other instructions are not available for scheduling for execution, overall instruction throughput may decrease, and performance may be reduced accordingly. As superscalar processors attempt to issue/execute larger numbers of instructions concurrently, these effects may increase. Accordingly, the need for rapid load processing may increase as well.
Additionally, the increase in the number of instructions concurrently issued/executed in a processor may lead to an increase in the average number of stores residing in a store queue. Typically, stores are not committed to memory (cache or external) until after the stores are known to be non-speculative. For example, stores may not be committed until retired. Each store is held in the store queue, together with its store address (generated using the address operands of the store) and the data to be stored, until the store can be committed to memory.
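The store queue behavior described above can be illustrated with a minimal sketch. The names used here (StoreQueueEntry, StoreQueue, allocate, commit_retired) and the modeling of memory as a byte-addressed dictionary are assumptions made for illustration only and do not describe any particular processor. The sketch also assumes that stores commit in program order, which the description above does not require.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StoreQueueEntry:
    address: int                  # store address generated from the address operands
    size: int                     # number of bytes written by the store
    data: Optional[bytes] = None  # None until the store data becomes available
    retired: bool = False         # True once the store is known to be non-speculative


class StoreQueue:
    """Hypothetical store queue holding stores until they can be committed."""

    def __init__(self) -> None:
        self.entries: List[StoreQueueEntry] = []  # program order, oldest first

    def allocate(self, address: int, size: int,
                 data: Optional[bytes] = None) -> StoreQueueEntry:
        entry = StoreQueueEntry(address, size, data)
        self.entries.append(entry)
        return entry

    def commit_retired(self, memory: Dict[int, int]) -> None:
        # Assumption: commit in order, stopping at the first store that is
        # either still speculative or still waiting for its data.
        while (self.entries and self.entries[0].retired
               and self.entries[0].data is not None):
            entry = self.entries.pop(0)
            for offset, value in enumerate(entry.data):
                memory[entry.address + offset] = value
```

In this sketch, a store that has not retired blocks younger stores behind it, so the queue occupancy grows as more instructions are in flight.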
While a larger number of stores in the store queue may not present a performance problem alone, the larger number of stores may indirectly present a performance problem for the rapid processing of loads. As the number of stores within the store queue increases, the likelihood that data accessed by a load is in the store queue (as opposed to the cache/external memory) increases. Furthermore, the likelihood that some bytes accessed by the load are modified by one preceding store in the store queue while other bytes accessed by the load are modified by another preceding store in the store queue may increase as well. Moreover, the likelihood that store data to be used by the load is not available in the store queue increases. The more frequently these events occur, the larger the barrier to rapid load processing may become.
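The three cases described above (load data residing in the store queue, different bytes of a load supplied by different preceding stores, and store data not available) can be illustrated with a simplified software model. This is a sketch under stated assumptions, not a description of hardware: the function name resolve_load, the tuple layout of a store, the byte-addressed dictionary standing in for the cache/external memory, and the choice to return None when needed store data is unavailable are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical store record: (address, size, data); data is None when the
# store data is not available in the store queue.
Store = Tuple[int, int, Optional[bytes]]


def resolve_load(addr: int, size: int, older_stores: List[Store],
                 memory: Dict[int, int]) -> Optional[bytes]:
    """Return the bytes a load would observe, or None if it must wait.

    older_stores lists the stores preceding the load, oldest first.
    """
    result = []
    for byte_addr in range(addr, addr + size):
        value = memory.get(byte_addr, 0)  # default: byte comes from memory
        # Scan from youngest to oldest so that the most recent preceding
        # store to this byte supplies its value.
        for st_addr, st_size, st_data in reversed(older_stores):
            if st_addr <= byte_addr < st_addr + st_size:
                if st_data is None:
                    return None  # needed store data is unavailable
                value = st_data[byte_addr - st_addr]
                break
        result.append(value)
    return bytes(result)
```

For example, with memory = {0x10: 0xAA, 0x11: 0xBB, 0x12: 0xCC, 0x13: 0xDD} and preceding stores [(0x10, 2, b"\x01\x02"), (0x11, 2, b"\x03\x04")], a four-byte load at 0x10 takes its first byte from the older store, its next two bytes from the younger store, and its last byte from memory, yielding b"\x01\x03\x04\xdd". The per-byte scan over every queued store suggests why a larger store queue can make rapid load processing more difficult.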