1. Field of the Invention
This invention is related to processors and, more particularly, load hit store detection mechanisms within load/store units of processors.
2. Description of the Related Art
Processors are increasingly being designed using techniques to increase the number of instructions executed per second. Superscalar techniques involve providing multiple execution units and attempting to execute multiple instructions in parallel. Pipelining, or superpipelining, techniques involve overlapping the execution of different instructions using pipeline stages. Each stage performs a portion of the instruction execution process (involving fetch, decode, execution, and result commit, among others), and passes the instruction on to the next stage. While each instruction still executes in the same amount of time, the overlapping of instruction execution allows for the effective execution rate to be higher. Typical processors employ a combination of these techniques and others to increase the instruction execution rate.
As processors employ wider superscalar configurations and/or deeper instruction pipelines, memory latency becomes an even larger issue than it was previously. While virtually all modern processors employ one or more caches to decrease memory latency, even access to these caches is beginning to impact performance.
More particularly, as processors allow larger numbers of instructions to be in-flight within the processors, the number of load and store memory operations which are in-flight increases as well. As used here, an instruction is “in-flight” if the instruction has been fetched into the instruction pipeline (either speculatively or non-speculatively) but has not yet completed execution by committing its results (either to architected registers or memory locations). Additionally, the term “memory operation” refers to an operation which specifies a transfer of data between a processor and memory (although the transfer may be accomplished in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Load memory operations may be referred to herein more succinctly as “loads”, and similarly store memory operations may be referred to as “stores”. Memory operations may be implicit within an instruction which directly accesses a memory operand to perform its defined function (e.g. arithmetic, logic, etc.), or may be an explicit instruction which performs the data transfer only, depending upon the instruction set employed by the processor. Generally, memory operations specify the affected memory location via an address generated from one or more operands of the memory operation. This address will be referred to herein as a “data address” generally, or a load address (when the corresponding memory operation is a load) or a store address (when the corresponding memory operation is a store). On the other hand, addresses which locate the instructions themselves within memory are referred to as “instruction addresses”.
Since memory operations are part of the instruction stream, having more instructions in-flight leads to having more memory operations in-flight. Unfortunately, adding additional ports to the data cache to allow more operations to occur in parallel is generally not feasible beyond a few ports (e.g. 2) due to increases in both cache access time and area occupied by the data cache circuitry. Accordingly, relatively larger buffers for memory operations are often employed. Scanning these buffers for memory operations to access the data cache is generally complex and, accordingly, slow. The scanning may substantially impact the load memory operation latency, even for cache hits.
Additionally, data caches are finite storage for which some loads and stores will miss. A memory operation is a “hit” in a cache if the data accessed by the memory operation is stored in cache at the time of access, and is a “miss” if the data accessed by the memory operation is not stored in cache at the time of access. When a load memory operation misses a data cache, the data is typically loaded into the cache. Store memory operations which miss the data cache may or may not cause the data to be loaded into the cache. Data is stored in caches in units referred to as “cache lines”, which are the minimum number of contiguous bytes to be allocated and deallocated storage within the cache. Since many memory operations are being attempted, it becomes more likely that numerous cache misses will be experienced. Furthermore, in many common cases, one miss within a cache line may rapidly be followed by a large number of additional misses to that cache line. These misses may fill, or come close to filling, the buffers allocated within the processor for memory operations. Accordingly, relatively deep buffers may be employed.
An additional problem which becomes even more onerous as processors employ wider superscalar configurations and/or deeper pipelines is the issue of store to load forwarding. As more memory operations may be queued up prior to completion, it becomes more likely that load memory operations will hit prior store memory operations still in the buffers. Furthermore, as speculative instruction execution increases due to the larger number of instructions in-flight within the processor, it becomes more likely that loads will hit on multiple stores within the buffers. Complex hit prioritization logic is generally used to determine which of the multiple stores is the youngest store (i.e. the store nearest the load in program order, and hence the store which generates the data that the load should receive). The complex logic may induce additional delay, increasing the load latency. A method for rapidly forwarding store data to dependent loads is needed.
The problems outlined above are in large part solved by a load/store unit as described herein. The load/store unit includes a buffer configured to retain store memory operations which have probed the data cache. Additionally, each entry in the buffer includes a last-in-buffer (LIB) indication which identifies whether or not the store in that entry is the last store in the buffer (i.e. the youngest store) to update the memory locations specified by the corresponding store address. Load addresses are compared to the store addresses, and the comparison result is qualified with the corresponding LIB indication such that only the youngest store is identified as a hit. Since at most one load hit store is detected, complex prioritization logic may be eliminated. Load latency may be reduced due to the elimination of the prioritization logic from the load's cache access/dependency check logic.
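The LIB-qualified comparison described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware described herein: the names `StoreEntry` and `load_hit_store`, the integer tags, and the example addresses are hypothetical, and a Python list stands in for the buffer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoreEntry:
    tag: int       # hypothetical instruction tag, for illustration only
    address: int   # store address
    data: int      # store data available for forwarding
    lib: bool      # True when this store is last-in-buffer for its address


def load_hit_store(buffer: List[StoreEntry], load_address: int) -> Optional[StoreEntry]:
    """Return the single store hit by the load, or None if there is no hit.

    The address match is qualified with the LIB indication, so at most one
    entry can match and no age-based prioritization among hits is needed.
    """
    hits = [e for e in buffer if e.address == load_address and e.lib]
    assert len(hits) <= 1, "LIB invariant violated"
    return hits[0] if hits else None


# Two stores to 0x100 are in the buffer; only the younger one has LIB set.
buffer = [
    StoreEntry(tag=1, address=0x100, data=11, lib=False),
    StoreEntry(tag=2, address=0x200, data=22, lib=True),
    StoreEntry(tag=3, address=0x100, data=33, lib=True),
]
hit = load_hit_store(buffer, 0x100)
```

In this model the load to 0x100 matches two addresses, but only the entry with its LIB indication set is reported, which is the behavior that allows the prioritization logic to be omitted.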
In one embodiment, the buffer also stores loads which have probed the data cache. Loads may reprobe from the buffer after a subsequent store has been placed in the buffer. To properly associate the load with the youngest store which is older than the load, the buffer records the store instruction tag of a store which is hit by the load during the initial probe (according to the LIB indications during the initial probe). During reprobes, the LIB indications are ignored and instead the store instruction tags are compared to the recorded store instruction tag to detect the store which is hit by the load.
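The difference between the initial probe and a reprobe may be easier to see in a small software model. The following is a sketch only: the names `Store`, `Load`, `initial_probe`, and `reprobe`, and the use of integer tags, are assumptions made for illustration and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Store:
    tag: int       # hypothetical store instruction tag
    address: int
    lib: bool      # last-in-buffer indication


@dataclass
class Load:
    address: int
    hit_store_tag: Optional[int] = None  # recorded during the initial probe


def initial_probe(stores: List[Store], load: Load) -> Optional[Store]:
    """Match on address qualified by LIB, and record the hit store's tag."""
    for s in stores:
        if s.address == load.address and s.lib:
            load.hit_store_tag = s.tag
            return s
    return None


def reprobe(stores: List[Store], load: Load) -> Optional[Store]:
    """Ignore LIB; match only the tag recorded during the initial probe."""
    if load.hit_store_tag is None:
        return None
    for s in stores:
        if s.tag == load.hit_store_tag:
            return s
    return None


stores = [Store(tag=1, address=0x100, lib=True)]
load = Load(address=0x100)
first = initial_probe(stores, load)

# A store younger than the load arrives later and takes over the LIB state.
stores[0].lib = False
stores.append(Store(tag=2, address=0x100, lib=True))
second = reprobe(stores, load)
```

Here a LIB-qualified match during the reprobe would select the store with tag 2, which is younger than the load; matching on the recorded tag keeps the load associated with the store having tag 1.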
As stores are inserted into the buffer, the store address is compared in the same fashion as the load address. If a hit is detected, the LIB indication for the hit store is set to a state indicating that the hit store is not the last in buffer to update the corresponding store address. The LIB indication for the newly inserted store is set to a last-in-buffer state indicating that the newly inserted store is the last in buffer to update the corresponding store address. In this manner, one LIB indication per different address in the buffer is in the last-in-buffer state.
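The LIB update on store insertion can be sketched in software as follows. The names `StoreEntry` and `insert_store`, the list-based buffer, and the example tags and addresses are hypothetical and serve only to illustrate the update rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StoreEntry:
    tag: int           # hypothetical instruction tag
    address: int
    lib: bool = True   # a newly inserted store starts in the last-in-buffer state


def insert_store(buffer: List[StoreEntry], tag: int, address: int) -> None:
    """Insert a store, clearing LIB on any prior store it hits."""
    for entry in buffer:
        # Same LIB-qualified compare that is used for load addresses.
        if entry.address == address and entry.lib:
            entry.lib = False
    buffer.append(StoreEntry(tag=tag, address=address, lib=True))


buffer: List[StoreEntry] = []
insert_store(buffer, tag=1, address=0x100)
insert_store(buffer, tag=2, address=0x200)
insert_store(buffer, tag=3, address=0x100)  # hits tag 1 and takes over LIB

lib_tags = [e.tag for e in buffer if e.lib]
lib_addresses = [e.address for e in buffer if e.lib]
```

After the three insertions, exactly one entry per distinct address remains in the last-in-buffer state.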
It is noted that the LIB indication may be maintained and used based on address ranges, rather than a full address compare. This may allow more rapid forwarding and may ease implementation. A subsequent, more accurate check of the addresses may be used to determine if the correct data was forwarded and to take corrective action if incorrect data was forwarded.
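One way to picture a range-based compare followed by a more accurate check is sketched below. The choice of aligned 8-byte ranges (`RANGE_SHIFT = 3`), the tuple layout, and the function names are assumptions made purely for illustration; the text above does not specify how ranges are defined or what corrective action is taken.

```python
from typing import List, Optional, Tuple

RANGE_SHIFT = 3  # assumption: addresses are grouped into aligned 8-byte ranges


def same_range(a: int, b: int) -> bool:
    """Fast, approximate compare on the address range only."""
    return (a >> RANGE_SHIFT) == (b >> RANGE_SHIFT)


def forward(stores: List[Tuple[int, int, bool]],
            load_address: int) -> Optional[Tuple[int, bool]]:
    """Forward data from the LIB store in the same range as the load.

    stores holds (address, data, lib) tuples. The result is (data, verified),
    where verified is the outcome of the later full-address check; a False
    value means incorrect data was forwarded and corrective action is needed.
    """
    for address, data, lib in stores:
        if lib and same_range(address, load_address):
            return data, address == load_address
    return None


stores = [(0x100, 11, True), (0x208, 22, True)]
exact = forward(stores, 0x100)    # same range and same full address
partial = forward(stores, 0x104)  # same range, different full address
none = forward(stores, 0x300)     # no store in this range
```

The range compare uses fewer address bits than a full compare, which is the source of the speed advantage; the cost is that a case such as the load to 0x104 must be caught by the later check.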
Broadly speaking, a load/store unit is contemplated. The load/store unit comprises a buffer including a plurality of entries and control logic coupled to the buffer. Each of the plurality of entries is configured to store a data address and a last-in-buffer (LIB) indication. The LIB indication, in a first state, is indicative that a corresponding store memory operation is a youngest store memory operation within the buffer to update a memory location identified by the data address. The control logic is also coupled to receive a first data address probing a data cache. The control logic is configured to identify a first entry of the plurality of entries for which: (i) the data address stored in the first entry matches the first data address, and (ii) the LIB indication stored in the first entry is in the first state. Additionally, a processor is contemplated. The processor comprises the load/store unit coupled to a data cache. The data cache is configured to store data. Still further, a computer system is contemplated. The computer system comprises the processor and an I/O device coupled to the processor. The I/O device is configured to communicate between the computer system and another computer system to which the I/O device is coupled.
Moreover, a method is contemplated. A data cache is probed with a first data address corresponding to a first memory operation. A first entry is identified within a buffer of memory operations. The first entry stores a second data address of a second memory operation and a last-in-buffer (LIB) indication. Identifying the first entry includes: (i) determining that the second address matches the first address; and (ii) determining that the LIB indication is in a first state indicative that the second memory operation comprises a store memory operation which is youngest in the buffer to update a memory location identified by the second address.