1. Field of the Invention
This invention is related to the field of processors and, more particularly, to dependency checking and forwarding from a store queue within processors.
2. Description of the Related Art
Processors often include store queues to buffer store memory operations which have been executed but which are still speculative and/or have been retired but not yet committed to memory. The store memory operations may be held in the store queue until they are retired. Subsequent to retirement, the store memory operations may be committed to the cache and/or memory. As used herein, a memory operation is an operation specifying a transfer of data between a processor and a main memory (although the transfer may be completed in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Memory operations may be an implicit part of an instruction which includes a memory operation, or may be explicit load/store instructions. Load memory operations may be more succinctly referred to herein as “loads”. Similarly, store memory operations may be more succinctly referred to as “stores”.
While executing stores speculatively and queueing them in the store queue may allow for increased performance in a number of fashions (e.g. by providing early store address calculation for detecting load-hit-store scenarios, by allowing for cache line fills to be started if a store misses and the cache is operating in a write allocate mode, and/or by removing the stores from the instruction execution pipeline and allowing other, subsequent instructions to execute), subsequent loads may access the memory locations updated by the stores in the store queue. While processor performance is not necessarily directly affected by having stores queued in the store queue, performance may be affected if subsequent loads are delayed due to accessing memory locations updated by stores in the store queue.
Still further, loads and stores may generally access arbitrary bytes within memory. Thus, it is possible that a given load may access one or more bytes updated by one store in the store queue and one or more additional bytes updated by another store in the store queue. As used herein, a store queue entry storing a store memory operation is referred to as being “hit” by a load memory operation if at least one byte updated by the store memory operation is accessed by the load memory operation. The circuitry for detecting the various cases of loads hitting one or more store queue entries may be quite complex. Thus, the circuitry may occupy a large area of semiconductor substrate and/or may increase latency in performing loads. A mechanism for correctly handling loads hitting in the store queue, which conserves the amount of circuitry used and decreases average load latency, is desired.
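The definition of a hit given above may be illustrated by a short software sketch. It is offered for explanation only and does not describe any circuitry; the function name and the representation of each memory operation as a starting address and a size in bytes are assumptions made for the illustration.

```python
def load_hits_store(load_addr: int, load_size: int,
                    store_addr: int, store_size: int) -> bool:
    """Return True if the load accesses at least one byte updated by the store.

    Hypothetical model: each operation touches the byte range
    [addr, addr + size). Two ranges share a byte exactly when each
    one starts before the other one ends.
    """
    return (load_addr < store_addr + store_size and
            store_addr < load_addr + load_size)
```

Under these assumptions, a 4-byte load at address 0x100 hits a 4-byte store at 0x102 (bytes 0x102 and 0x103 are shared) but does not hit a 4-byte store at 0x104.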
It is noted that loads, stores, and other instruction operations may be referred to herein as being older or younger than other instruction operations. A first instruction is older than a second instruction if the first instruction precedes the second instruction in program order (i.e. the order of the instructions in the program being executed). A first instruction is younger than a second instruction if the first instruction is subsequent to the second instruction in program order.
The problems outlined above are in large part solved by a processor including a store queue as described herein. The store queue is configured to detect a hit on a store queue entry for a load being executed by the processor, and to forward data from the store queue entry to provide a result for the load. The store queue data is provided to the data cache, along with an indication of how much data is being provided (e.g. byte enables). The data cache may then fill in any additional data accessed by the load from cache data, and provide a load result. Additionally, the store queue is configured to detect if more than one store queue entry is hit (i.e. that more than one store within the store queue updates at least one byte accessed by the load), referred to as a multimatch. If a multimatch is detected, the store queue may signal a retry of the load. Subsequently, the load may be reexecuted and may not multimatch (as entries are deleted upon completion of the corresponding stores). The load may complete when it does not multimatch. The combination of forwarding from the youngest store (which is older than the load) and retrying on multimatch cases may, in one embodiment, provide for less complicated store queue forwarding circuitry while still allowing for store queue forwarding (which may decrease average load latency).
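The merging of store queue data with cache data described above may be illustrated by the following software sketch. It is a minimal model under stated assumptions, not a description of the data cache circuitry: the function name, the representation of the byte enables as an integer in which bit i corresponds to byte i of the load, and the use of equal-length byte strings are all hypothetical.

```python
def merge_load_data(store_data: bytes, byte_enables: int,
                    cache_data: bytes) -> bytes:
    """Build a load result from forwarded store data and cache data.

    Assumed encoding: bit i of byte_enables is set when byte i of the
    load is supplied by the store queue; every other byte is filled in
    from the cache.
    """
    if len(store_data) != len(cache_data):
        raise ValueError("store and cache data must cover the same bytes")
    merged = bytearray(cache_data)
    for i in range(len(merged)):
        if (byte_enables >> i) & 1:
            merged[i] = store_data[i]  # byte forwarded from the store queue
    return bytes(merged)
```

For example, with byte enables 0b1100 the two upper bytes of a 4-byte load come from the store queue and the two lower bytes come from the cache.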
In one embodiment, the store queue independently detects hits on the upper and lower portions of each store queue entry (e.g. doubleword portions) and forwards from the upper and lower portions independently. Thus, a load may hit one store queue entry for the lower portion of the data accessed by the load and a different store queue entry for the upper portion of the data accessed by the load without multimatch detection. Such a configuration may optimize code sequences in which two separate stores update the upper and lower portions and a subsequent load accesses both the upper and lower portions without substantially complicating the store queue forwarding circuitry. Thus, the optimized code sequence may achieve lower average load latency.
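The independent handling of the upper and lower portions may be illustrated by the following sketch. Several details are assumptions made for the illustration and are not taken from the embodiment: each portion is taken to be 4 bytes wide, every entry is taken to have already matched the address of the same aligned block as the load, and each store and the load are represented by a byte mask in which bit i stands for byte i of that block.

```python
HALF_BYTES = 4                       # assumed width of each portion
LOWER = (1 << HALF_BYTES) - 1        # mask selecting the lower portion
UPPER = LOWER << HALF_BYTES          # mask selecting the upper portion

def half_hits(store_masks, load_mask, half):
    """Indices of entries that update a byte the load accesses in this half."""
    return [i for i, m in enumerate(store_masks) if m & load_mask & half]

def check_load(store_masks, load_mask):
    """Return (lower hits, upper hits, multimatch) for one load.

    Multimatch is reported only when more than one entry is hit
    within the same half.
    """
    lower = half_hits(store_masks, load_mask, LOWER)
    upper = half_hits(store_masks, load_mask, UPPER)
    multimatch = len(lower) > 1 or len(upper) > 1
    return lower, upper, multimatch
```

In this model, one store to the lower portion (mask 0x0F) and another to the upper portion (mask 0xF0) followed by a load of both portions (mask 0xFF) yields one hit per half and no multimatch, whereas two stores within the same half (masks 0x03 and 0x0C) followed by a load of that half do produce a multimatch.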
Broadly speaking, a store queue is contemplated. The store queue comprises a first buffer and a multimatch circuit. The first buffer includes at least a first entry and a second entry, wherein each entry is configured to store information corresponding to a store memory operation. Additionally, the first buffer includes circuitry configured to assert a first match signal in response to detecting a load memory operation hitting the first entry and further configured to assert a second match signal in response to the load memory operation hitting the second entry. Coupled to receive the first match signal and the second match signal, the multimatch circuit is configured to assert a multimatch signal responsive to an assertion of both the first match signal and the second match signal. Additionally, a processor is contemplated comprising the store queue and a data cache coupled to the store queue. The data cache is configured to merge cache data with store queue data to produce load data corresponding to the load memory operation. Still further, a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable.
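The function of the multimatch circuit may be illustrated in software as follows. The sketch generalizes the two match signals described above to an arbitrary number of entries, which is an assumption made for the illustration; the function name and the packing of the match signals into an integer (bit i asserted when entry i is hit) are likewise hypothetical.

```python
def multimatch(match_signals: int) -> bool:
    """Return True when two or more match signals are asserted.

    Clearing the lowest set bit with m & (m - 1) leaves a nonzero
    value only if at least one other bit was also set.
    """
    return (match_signals & (match_signals - 1)) != 0
```

With two entries this reduces to the case described above: the result is asserted only for the value 0b11, in which both the first and the second match signals are asserted.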
Moreover, a method is contemplated. Load information corresponding to a load memory operation is received in a store queue, the store queue including a plurality of entries, each of the plurality of entries configured to store information corresponding to a store memory operation. A multimatch signal is asserted in response to the load memory operation hitting two or more of the plurality of entries. The load memory operation is retried responsive to asserting the multimatch signal.
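The method may be illustrated by the following sketch, in which a load is retried until it no longer hits two or more entries. The model is an assumption made for the illustration: entries are represented as (address, size) pairs ordered oldest first, and exactly one store (the oldest) is taken to complete and leave the queue between successive attempts.

```python
from collections import deque

def overlaps(a_addr, a_size, b_addr, b_size):
    """True if byte ranges [a_addr, a_addr+a_size) and [b_addr, b_addr+b_size) share a byte."""
    return a_addr < b_addr + b_size and b_addr < a_addr + a_size

def run_load(store_queue: deque, load_addr: int, load_size: int) -> int:
    """Execute a load, retrying on multimatch; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        hits = sum(overlaps(load_addr, load_size, addr, size)
                   for addr, size in store_queue)
        if hits < 2:
            return attempts          # zero or one entry hit: the load completes
        store_queue.popleft()        # assumed: the oldest store completes, then retry
```

For instance, with queued stores at (0, 4), (2, 4) and (100, 4), an 8-byte load at address 0 multimatches on its first attempt and completes on its second, after the oldest store has left the queue.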