1. Field of the Invention
This invention is related to the field of processors and, more particularly, to load/store units within processors.
2. Description of the Related Art
Processors are increasingly being designed using techniques that increase the number of instructions executed per second. Superscalar techniques involve providing multiple execution units and attempting to execute multiple instructions in parallel. Pipelining, or superpipelining, techniques involve overlapping the execution of different instructions using pipeline stages. Each stage performs a portion of the instruction execution process (involving fetch, decode, execution, and result commit, among others), and passes the instruction on to the next stage. While each instruction still executes in the same amount of time, the overlapping of instruction execution allows for the effective execution rate to be higher. Typical processors employ a combination of these techniques and others to increase the instruction execution rate.
As processors employ wider superscalar configurations and/or deeper instruction pipelines, memory latency becomes an even larger issue than it was previously. While virtually all modern processors employ one or more caches to decrease memory latency, even access to these caches is beginning to impact performance.
More particularly, as processors allow larger numbers of instructions to be in-flight within the processors, the number of load and store memory operations which are in-flight increases as well. As used here, an instruction is "in-flight" if the instruction has been fetched into the instruction pipeline (either speculatively or non-speculatively) but has not yet completed execution by committing its results (either to architected registers or memory locations). Additionally, the term "memory operation" refers to an operation which specifies a transfer of data between a processor and memory (although the transfer may be accomplished in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Load memory operations may be referred to herein more succinctly as "loads", and similarly store memory operations may be referred to as "stores". Memory operations may be implicit within an instruction which directly accesses a memory operand to perform its defined function (e.g. arithmetic, logic, etc.), or may be an explicit instruction which performs the data transfer only, depending upon the instruction set employed by the processor. Generally, memory operations specify the affected memory location via an address generated from one or more operands of the memory operation. This address will be referred to herein as a "data address" generally, or a load address (when the corresponding memory operation is a load) or a store address (when the corresponding memory operation is a store). On the other hand, addresses which locate the instructions themselves within memory are referred to as "instruction addresses".
Since memory operations are part of the instruction stream, having more instructions in-flight leads to having more memory operations in-flight. Unfortunately, adding ports to the data cache to allow more operations to occur in parallel is generally not feasible beyond a few ports (e.g. 2) due to increases in both cache access time and area occupied by the data cache circuitry. Accordingly, relatively larger buffers for memory operations are often employed. Scanning these buffers for memory operations to access the data cache is generally complex and, accordingly, slow. The scanning may substantially impact the load memory operation latency, even for cache hits.
Additionally, data caches are finite storage for which some loads and stores will miss. A memory operation is a "hit" in a cache if the data accessed by the memory operation is stored in cache at the time of access, and is a "miss" if the data accessed by the memory operation is not stored in cache at the time of access. When a load memory operation misses a data cache, the data is typically loaded into the cache. Store memory operations which miss the data cache may or may not cause the data to be loaded into the cache. Data is stored in caches in units referred to as "cache lines", which are the minimum number of contiguous bytes to be allocated and deallocated storage within the cache. Since many memory operations are being attempted, it becomes more likely that numerous cache misses will be experienced. Furthermore, in many common cases, one miss within a cache line may rapidly be followed by a large number of additional misses to that cache line. These misses may fill, or come close to filling, the buffers allocated within the processor for memory operations. An efficient scheme for buffering memory operations is therefore needed.
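By way of illustration only, the clustering of misses within a single cache line may be sketched as follows. The 64-byte line size, the addresses, and the function names are assumptions chosen for the example and are not taken from any particular processor.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value for illustration


def cache_line(address: int, line_size: int = LINE_SIZE) -> int:
    """Return the index of the cache line containing the given byte address."""
    return address // line_size


def count_misses_per_line(addresses, cached_lines=frozenset()):
    """Count misses per cache line for accesses issued before any fill completes."""
    misses = {}
    for addr in addresses:
        line = cache_line(addr)
        if line not in cached_lines:
            misses[line] = misses.get(line, 0) + 1
    return misses


# Eight sequential 8-byte loads starting at a line-aligned address.
sequential_loads = [0x1000 + 8 * i for i in range(8)]
misses = count_misses_per_line(sequential_loads)
print(misses)
```

In this sketch all eight loads fall within the same cache line, so a single outstanding fill leaves eight memory operations waiting in the buffers.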
Another problem which becomes even more difficult as processors employ wider superscalar configurations and/or deeper pipelines is the maintenance of strong memory ordering. Some instruction set architectures require strong ordering (e.g. the x86 instruction set architecture). Generally, memory operations are strongly ordered if they appear to have occurred in the program order specified. While performing stores in program order may not cause much performance impact (because the store data can be forwarded to subsequent loads and other instructions are generally not directly dependent upon stores), performing loads in order may have a large impact. For example, a load may miss the data cache, and subsequent loads may be capable of hitting in the data cache. Performance may be gained by allowing the load hits to proceed, forwarding data to dependent instructions which may then execute. However, if the load hits are allowed to proceed while the load miss is being serviced, it is possible to violate strong memory ordering rules.
For example, if a first processor performs a store to address A1 followed by a store to address A2 and a second processor performs a load to address A2 (which misses in the data cache of the second processor) followed by a load to address A1 (which hits in the data cache of the second processor), strong memory ordering rules may be violated. Strong memory ordering rules require, in the above situation, that if the load to address A2 receives the store data from the store to address A2, then the load to address A1 must receive the store data from the store to address A1. However, if the load to address A1 is allowed to complete while the load to address A2 is being serviced, then the following scenario may occur: (i) the load to address A1 may receive data prior to the store to address A1; (ii) the store to address A1 may complete; (iii) the store to address A2 may complete; and (iv) the load to address A2 may complete and receive the data provided by the store to address A2. This outcome would be incorrect. A mechanism which allows load hits to proceed while load misses are serviced, and which maintains strong ordering rules, is desired.
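The scenario above may be traced with a small model. The following is a simplified sketch: the shared memory is a dictionary, the data cache of the second processor is not modeled separately, and the names run_scenario and violates_strong_ordering are hypothetical.

```python
def run_scenario(allow_hit_to_pass_miss: bool) -> dict:
    """Replay the two-processor example and return the values the loads observe."""
    memory = {"A1": "old", "A2": "old"}  # shared memory before the stores
    observed = {}

    if allow_hit_to_pass_miss:
        # Steps (i) through (iv): the load hit to A1 completes before the stores,
        # while the load miss to A2 completes after them.
        events = [("load", "A1"), ("store", "A1"), ("store", "A2"), ("load", "A2")]
    else:
        # The load to A1 waits until the older load miss to A2 has completed.
        events = [("store", "A1"), ("store", "A2"), ("load", "A2"), ("load", "A1")]

    for kind, addr in events:
        if kind == "store":
            memory[addr] = "new"           # first processor's store completes
        else:
            observed[addr] = memory[addr]  # second processor's load receives data
    return observed


def violates_strong_ordering(observed: dict) -> bool:
    """Seeing the store to A2 without the earlier store to A1 is a violation."""
    return observed["A2"] == "new" and observed["A1"] != "new"


print(violates_strong_ordering(run_scenario(True)))
print(violates_strong_ordering(run_scenario(False)))
```

Only the first interleaving yields the incorrect outcome, in which the second processor observes the later store to A2 but not the earlier store to A1.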
The problems outlined above are in large part solved by a processor employing a post-cache (LS2) buffer as described herein. Loads are stored into the LS2 buffer after probing the data cache. The load/store unit snoops the loads in the LS2 buffer against snoop requests received from an external bus. If a snoop invalidate request hits a load within the LS2 buffer and that load hit in the data cache during its initial probe, the load/store unit scans the LS2 buffer for older loads which are misses. If older load misses are detected, a synchronization indication is set for the load misses. Subsequently, one of the load misses completes and the load/store unit transmits a synchronization signal with the status for the load miss. The processor synchronizes to the instruction corresponding to the load miss, thereby discarding the load hit which was subsequently snoop hit. The discarded instructions are refetched and reexecuted, thereby causing the load hit to reexecute subsequent to an earlier load miss. Advantageously, load hits may generally proceed ahead of load misses and strong memory ordering rules may still be enforced. Performance of the processor may be increased while maintaining correct operation in cases where strong memory ordering rules may be violated.
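The snoop handling summarized above may be sketched in software as follows. This is an illustrative model only and does not represent the circuitry of any embodiment: the class and field names (LS2Buffer, LoadEntry, tag, sync) are hypothetical, program order is assumed to be given by an integer tag in which smaller values are older, and synchronization is modeled simply as dropping the younger buffer entries.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    tag: int            # program order; smaller means older (assumption)
    line_address: int   # address compared against snoop requests
    hit: bool           # result of the initial data cache probe
    sync: bool = False  # synchronization indication


class LS2Buffer:
    """Toy model of a post-cache buffer holding loads after they probe the cache."""

    def __init__(self):
        self.entries = []

    def add_load(self, tag: int, line_address: int, hit: bool) -> LoadEntry:
        entry = LoadEntry(tag, line_address, hit)
        self.entries.append(entry)
        return entry

    def snoop_invalidate(self, snoop_line_address: int) -> None:
        """On a snoop hit to a load hit, mark every older load miss."""
        for load in self.entries:
            if load.hit and load.line_address == snoop_line_address:
                for older in self.entries:
                    if older.tag < load.tag and not older.hit:
                        older.sync = True

    def complete_miss(self, tag: int) -> bool:
        """Complete a load miss; return True if the processor must synchronize."""
        entry = next(e for e in self.entries if e.tag == tag)
        self.entries.remove(entry)
        if entry.sync:
            # Younger operations are discarded so they can be refetched.
            self.entries = [e for e in self.entries if e.tag < tag]
        return entry.sync


buf = LS2Buffer()
buf.add_load(tag=1, line_address=0xA2, hit=False)  # older load, misses
buf.add_load(tag=2, line_address=0xA1, hit=True)   # younger load, hits
buf.snoop_invalidate(0xA1)                         # another processor stores to A1
must_sync = buf.complete_miss(1)
print(must_sync, buf.entries)
```

In this sketch the younger load hit is removed from the buffer when the older miss completes, modeling its refetch and reexecution. An actual processor would discard all instructions subsequent to the load miss, not only the buffered loads.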
Broadly speaking, a processor is contemplated comprising a bus interface unit and a load/store unit. The bus interface unit is configured to detect a snoop invalidate operation upon a bus to which the processor is coupled. Coupled to receive a snoop invalidate request from the bus interface unit in response to the snoop invalidate operation, the load/store unit includes a buffer and control logic coupled to the buffer. The buffer is configured to store load memory operations subsequent to the load memory operations probing a data cache. The control logic, responsive to a snoop hit corresponding to the snoop invalidate request on a first load memory operation within the buffer, is configured to set a synchronization indication corresponding to a second load memory operation within the buffer. The second load memory operation is prior to the first load memory operation in program order. A computer system is contemplated comprising the processor and an input/output (I/O) device. The I/O device provides communication between the computer system and another computer system to which the I/O device is coupled.
Additionally, a method for maintaining ordering of load memory operations is contemplated. A snoop invalidate request is received. A snoop address corresponding to the snoop invalidate request is compared to data addresses in a buffer storing a first load memory operation which hits in a data cache and a second load memory operation, prior to the first load memory operation in program order, which misses the data cache. The snoop address is determined to hit the first load memory operation responsive to the compare, and a synchronization indication corresponding to the second load memory operation is set responsive to determining that the snoop address hits the first load memory operation.