1. Field of the Invention
This invention is related to the field of processors and, more particularly, to load/store units within processors.
2. Description of the Related Art
Processors are increasingly designed using techniques that increase the number of instructions executed per second. Superscalar techniques involve providing multiple execution units and attempting to execute multiple instructions in parallel. Pipelining, or superpipelining, techniques involve overlapping the execution of different instructions using pipeline stages. Each stage performs a portion of the instruction execution process (such as fetch, decode, execution, and result commit), and passes the instruction on to the next stage. While each instruction still executes in the same amount of time, the overlapping of instruction execution allows the effective execution rate to be higher. Typical processors employ a combination of these and other techniques to increase the instruction execution rate.
As processors employ wider superscalar configurations and/or deeper instruction pipelines, memory latency becomes an even larger issue than it was previously. While virtually all modern processors employ one or more caches to decrease memory latency, even access to these caches is beginning to impact performance.
More particularly, as processors allow larger numbers of instructions to be in-flight within the processors, the number of load and store memory operations which are in-flight increases as well. As used herein, an instruction is "in-flight" if the instruction has been fetched into the instruction pipeline (either speculatively or non-speculatively) but has not yet completed execution by committing its results (either to architected registers or memory locations). Additionally, the term "memory operation" refers to an operation which specifies a transfer of data between a processor and memory (although the transfer may be accomplished in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Load memory operations may be referred to herein more succinctly as "loads", and similarly store memory operations may be referred to as "stores". Memory operations may be implicit within an instruction which directly accesses a memory operand to perform its defined function (e.g. arithmetic, logic, etc.), or may be an explicit instruction which performs the data transfer only, depending upon the instruction set employed by the processor. Generally, memory operations specify the affected memory location via an address generated from one or more operands of the memory operation. This address will be referred to herein as a "data address" generally, or a load address (when the corresponding memory operation is a load) or a store address (when the corresponding memory operation is a store). On the other hand, addresses which locate the instructions themselves within memory are referred to as "instruction addresses".
Since memory operations are part of the instruction stream, having more instructions in-flight leads to having more memory operations in-flight. Unfortunately, adding additional ports to the data cache to allow more operations to occur in parallel is generally not feasible beyond a few ports (e.g. 2) due to increases in both cache access time and area occupied by the data cache circuitry. Accordingly, relatively larger buffers for memory operations are often employed. Scanning these buffers for memory operations to access the data cache is generally complex and, accordingly, slow. The scanning may substantially impact the load memory operation latency, even for cache hits.
Additionally, data caches are finite storage in which some loads and stores will miss. A memory operation is a "hit" in a cache if the data accessed by the memory operation is stored in the cache at the time of access, and is a "miss" if the data accessed by the memory operation is not stored in the cache at the time of access. When a load memory operation misses a data cache, the data is typically loaded into the cache. Store memory operations which miss the data cache may or may not cause the data to be loaded into the cache. Data is stored in caches in units referred to as "cache lines", which are the minimum number of contiguous bytes for which storage is allocated and deallocated within the cache. Since many memory operations are being attempted, it becomes more likely that numerous cache misses will be experienced. Furthermore, in many common cases, one miss within a cache line may rapidly be followed by a large number of additional misses to that cache line. These misses may fill, or come close to filling, the buffers allocated within the processor for memory operations. An efficient scheme for buffering memory operations is therefore needed.
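As an illustrative, non-limiting sketch, the cache-line hit/miss behavior described above may be modeled as follows. The line size and all names (LINE_SIZE, SimpleCache) are assumptions for illustration only and are not part of the disclosure:

```python
# Illustrative sketch: hit/miss determination at cache-line granularity.
# LINE_SIZE and class/function names are assumed, not from the disclosure.

LINE_SIZE = 64  # bytes per cache line (an assumed, typical size)

def line_address(addr: int) -> int:
    """Strip the offset bits so an address maps to its cache line."""
    return addr // LINE_SIZE

class SimpleCache:
    def __init__(self):
        self.lines = set()  # set of resident cache-line addresses

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, allocate the full line."""
        line = line_address(addr)
        hit = line in self.lines
        if not hit:
            self.lines.add(line)  # a miss typically fills the entire line
        return hit

cache = SimpleCache()
assert cache.access(0x1000) is False  # first touch of the line misses
assert cache.access(0x1004) is True   # subsequent access to the same line hits
```

The sketch illustrates why one miss may rapidly be followed by further accesses to the same line: every byte within a line shares the same line address.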
An additional problem which becomes even more onerous as processors employ wider superscalar configurations and/or deeper pipelines is the issue of store-to-load forwarding. As more memory operations may be queued up prior to completion, it becomes more likely that load memory operations will hit prior store memory operations still in the buffers. Furthermore, as speculative instruction execution increases due to the larger number of instructions in-flight within the processor, it becomes more likely that loads will attempt to execute before the stores have received their store data. While loads which hit older stores having corresponding store data available may receive that data from the buffers, loads which hit older stores for which corresponding store data is not available generally are rescheduled for a later time. As the amount of time to schedule the load, execute the load, access the data cache (and detect the hit on the store), and forward the data increases, the delay from the corresponding store data being provided to that data being forwarded as the load data tends to increase. Furthermore, deeper buffers tend to increase the amount of time between scheduling attempts (to allow other memory operations to be scheduled). Performance of the processor, which may be quite dependent on load latency, may therefore suffer. A mechanism for minimizing load delay for loads which hit stores for which store data is unavailable is therefore desired.
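The three outcomes described above (forwarding from the buffer, waiting because the store's data has not arrived, and an ordinary cache access) may be sketched as follows. The entry fields and return values are illustrative assumptions, not the disclosure's signal names:

```python
# Illustrative sketch of checking a load against buffered older stores.
# Field names and outcome strings are assumed for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    address: int
    data: Optional[int]  # None until the store's data operand arrives

def check_load(load_address: int, store_buffer: list) -> str:
    """Scan older stores, youngest first, for an address match."""
    for entry in reversed(store_buffer):
        if entry.address == load_address:
            if entry.data is not None:
                return "forward"        # data available: forward from buffer
            return "data-not-ready"     # hit, but the store data is missing
    return "cache"                      # no hit: read the data cache

buf = [StoreEntry(0x100, 7), StoreEntry(0x200, None)]
assert check_load(0x100, buf) == "forward"
assert check_load(0x200, buf) == "data-not-ready"
assert check_load(0x300, buf) == "cache"
```

The "data-not-ready" case is the one a conventional design resolves by rescheduling the load, incurring the latency discussed above.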
The problems outlined above are in large part solved by a processor employing a dependency link file as described herein. Upon detection of a load which hits a store for which store data is not available, the processor allocates an entry within the dependency link file for the load. The entry stores a load identifier identifying the load and a store data identifier identifying a source of the store data. The dependency link file monitors results generated by execution units within the processor to detect the store data being provided. The dependency link file then causes the store data to be forwarded as the load data in response to detecting that the store data is provided. The latency from store data being provided to the load data being forwarded may thereby be minimized. Particularly, the load data may be forwarded without requiring that the load memory operation be scheduled. Performance of the microprocessor may be increased due to the reduced load latency achievable in the above-mentioned cases.
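The dependency link file described above may be sketched, purely as an illustrative model, as a table of (load identifier, store data identifier) pairs that monitors broadcast results. All class, field, and method names here are assumptions for illustration:

```python
# Illustrative sketch of a dependency link file. Names (DependencyLinkFile,
# load_id, store_data_id, snoop_result) are assumed, not from the disclosure.

class DependencyLinkFile:
    def __init__(self):
        self.entries = []  # (load_id, store_data_id) pairs

    def allocate(self, load_id, store_data_id):
        """Record a load that hit a store whose data is not yet available."""
        self.entries.append((load_id, store_data_id))

    def snoop_result(self, result_tag, result_data):
        """Monitor execution-unit results; when the store data's source
        broadcasts, forward that data as the load data without rescheduling."""
        forwarded, remaining = [], []
        for load_id, store_data_id in self.entries:
            if store_data_id == result_tag:
                forwarded.append((load_id, result_data))
            else:
                remaining.append((load_id, store_data_id))
        self.entries = remaining
        return forwarded

dlf = DependencyLinkFile()
dlf.allocate(load_id=5, store_data_id=12)
assert dlf.snoop_result(12, 0xABCD) == [(5, 0xABCD)]  # forwarded as load data
assert dlf.entries == []                              # entry freed
```

The key property illustrated is that the load is never rescheduled: the forwarding is triggered by observing the store data's source produce its result.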
Broadly speaking, a load/store unit is contemplated comprising a first buffer, first control logic coupled to the first buffer, second control logic, and a second buffer coupled to the second control logic. The first buffer comprises a first plurality of entries, each of the first plurality of entries being configured to store a store address and a corresponding store data of a respective store memory operation. The first control logic is configured to detect a first load memory operation having a first load address which hits a first store address within a first entry of the first plurality of entries and for which a first corresponding store data is not stored within the first entry. Coupled to receive a signal from the first control logic indicative of detecting the first load memory operation, the second control logic is configured to allocate a second entry of a second plurality of entries in the second buffer to the first load memory operation in response to the signal. The second entry is configured to store a first load identifier identifying the first load memory operation and a first store data identifier identifying a source of the first corresponding store data in response to the second control logic allocating the second entry.
A processor is contemplated comprising a data cache and a load/store unit. The load/store unit includes a first buffer comprising a plurality of entries. The load/store unit is configured to allocate a first entry of the plurality of entries to a first load memory operation in response to detecting that a first load address of the first load memory operation hits a first store address of a first store memory operation for which a first store data is not available during a probe of the first load memory operation to the data cache. The first entry stores a first load identifier identifying the first load memory operation and a first store data identifier identifying a source of the first store data. Additionally, a computer system is contemplated comprising the processor and an input/output (I/O) device. The I/O device provides communication between the computer system and another computer system to which the I/O device is coupled.
Moreover, a method for performing a load memory operation is contemplated. A data cache is probed with the load memory operation. The load memory operation is detected as hitting a store memory operation for which corresponding store data is not available during the probing. A load identifier identifying the load memory operation and a store data identifier identifying a source of the store data are recorded to a first buffer. The store data is detected as being provided by receiving the store data identifier from the source. The load identifier is forwarded from the first buffer and the store data is forwarded in response to detecting that the store data is being provided.
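The contemplated method may be sketched end to end as follows; this is an illustrative model only, and every name (perform_load, snoop, data_tag) is an assumption rather than part of the disclosure:

```python
# Illustrative end-to-end sketch of the contemplated method.
# All names and data layouts are assumed for illustration.

def perform_load(load_id, load_addr, cache_lines, store_buffer, link_file):
    """Probe; if the load hits a store with no data yet, record the link."""
    for store in store_buffer:
        if store["address"] == load_addr and store["data"] is None:
            # Record the load id and the store data's source identifier.
            link_file.append({"load_id": load_id,
                              "store_data_id": store["data_tag"]})
            return None  # data is forwarded later, when the source produces it
    return cache_lines.get(load_addr)  # ordinary cache read otherwise

def snoop(result_tag, result_data, link_file):
    """When the source provides the data, forward it as the load's data."""
    done = [e["load_id"] for e in link_file if e["store_data_id"] == result_tag]
    link_file[:] = [e for e in link_file if e["store_data_id"] != result_tag]
    return [(lid, result_data) for lid in done]

links = []
stores = [{"address": 0x40, "data": None, "data_tag": 9}]
assert perform_load(1, 0x40, {}, stores, links) is None  # linked, not read
assert snoop(9, 123, links) == [(1, 123)]                # forwarded on arrival
```

The sketch mirrors the method steps in order: probe, detect the hit on a data-less store, record the identifiers, detect the data being provided, and forward.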