1. Field of the Invention
This invention is related to the field of processors and, more particularly, to store to load forward mechanisms within processors.
2. Description of the Related Art
Processors often include store queues to buffer store memory operations which have been executed but which are still speculative. The store memory operations may be held in the store queue until they are retired. Subsequent to retirement, the store memory operations may be committed to the cache and/or memory. As used herein, a memory operation is an operation specifying a transfer of data between a processor and a main memory (although the transfer may be completed in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Memory operations may be an implicit part of an instruction which includes a memory operation, or may be explicit load/store instructions. Load memory operations may be more succinctly referred to herein as "loads". Similarly, store memory operations may be more succinctly referred to as "stores".
While executing stores speculatively and queueing them in the store queue may allow for increased performance (by removing the stores from the instruction execution pipeline and allowing other, subsequent instructions to execute), subsequent loads may access the memory locations updated by the stores in the store queue. While processor performance is not necessarily directly affected by having stores queued in the store queue, performance may be affected if subsequent loads are delayed due to accessing memory locations updated by stores in the store queue. Often, store queues are designed to forward data stored therein if a load hits the store queue. As used herein, a store queue entry storing a store memory operation is referred to as being "hit" by a load memory operation if at least one byte updated by the store memory operation is accessed by the load memory operation.
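The definition of a hit given above reduces to a byte-range overlap test. The following Python sketch is illustrative only and is not taken from the described processor; the function name and the representation of an access as an address plus a size in bytes are assumptions made for the example.

```python
def store_hits_load(store_addr, store_size, load_addr, load_size):
    """Return True if at least one byte updated by the store is accessed by the load.

    Each access is modeled (as an assumption) as the half-open byte range
    [addr, addr + size). Two such ranges share a byte exactly when each one
    starts before the other one ends.
    """
    return (store_addr < load_addr + load_size
            and load_addr < store_addr + store_size)
```

For instance, a 4-byte store at address 0x100 is hit by a 4-byte load at 0x102, since bytes 0x102 and 0x103 are common to both, but not by a 4-byte load at 0x104.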
To further increase performance, it is desirable to execute younger loads out of order with respect to older stores. The younger loads may often have no dependency on the older stores, and thus need not await the execution of the older stores. Since the loads provide operands for execution of dependent instructions, executing the loads allows for still other instructions to be executed. However, merely detecting hits in the store queue as loads are executing may not lead to correct program execution if younger loads are allowed to execute out of order with respect to older stores, since certain older stores may not have executed yet (and thus the store addresses of those stores may not be known and dependencies of the loads on the certain older stores may not be detectable as the loads are executed). Accordingly, hardware to detect scenarios in which a younger load executes prior to an older store on which that younger load is dependent may be required, and then corrective action may be taken in response to the detection. For example, instructions may be purged and refetched or reexecuted in some other suitable fashion. As used herein, a load is "dependent" on a store if the store updates at least one byte of memory accessed by the load, is older than the load, and is younger than any other stores updating that byte. Unfortunately, executing the load out of order improperly and the subsequent corrective actions to achieve correct execution may reduce performance.
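The definition of "dependent" above is stated per byte, so a single load may depend on more than one store. The Python sketch below is a software illustration of that definition under stated assumptions, not a description of any hardware: the `Store` record, its `seq` field (a program-order sequence number in which a smaller value means older), and the function name are all hypothetical.

```python
from typing import NamedTuple


class Store(NamedTuple):
    seq: int   # hypothetical program-order sequence number (smaller = older)
    addr: int  # first byte address written
    size: int  # number of bytes written


def dependent_stores(load_seq, load_addr, load_size, stores):
    """Return the sequence numbers of the stores on which the load depends.

    For every byte the load accesses, the load depends on the youngest store
    that is older than the load and updates that byte.
    """
    deps = set()
    for byte in range(load_addr, load_addr + load_size):
        older_writers = [
            s for s in stores
            if s.seq < load_seq and s.addr <= byte < s.addr + s.size
        ]
        if older_writers:
            deps.add(max(older_writers, key=lambda s: s.seq).seq)
    return deps
```

In this model, a store that is younger than the load never appears in the result, and an older store is excluded for any byte that a younger (but still older than the load) store also updates.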
It is noted that loads, stores, and other instruction operations may be referred to herein as being older or younger than other instruction operations. A first instruction is older than a second instruction if the first instruction precedes the second instruction in program order (i.e. the order of the instructions in the program being executed). A first instruction is younger than a second instruction if the first instruction is subsequent to the second instruction in program order.
The problems outlined above are in large part solved by a processor as described herein. The processor generally may schedule and/or execute younger loads ahead of older stores. Additionally, the processor may detect and take corrective action for scenarios in which an older store interferes with the execution of the younger load. The processor employs a store to load forward (STLF) predictor which may indicate, for dispatching loads, a dependency on a store. The dependency is indicated for a store which, during a previous execution, interfered with the execution of the load. Since a dependency is indicated on the store, the load is prevented from scheduling and/or executing prior to the store. Performance may be increased due to the decreased interference between loads and stores.
The STLF predictor is trained with information for a particular load and store in response to executing the load and store and detecting the interference. Additionally, the STLF predictor may be untrained (e.g. information for a particular load and store may be deleted) if a load is indicated by the STLF predictor as dependent upon a particular store and the dependency does not actually occur. For example, in one embodiment, the STLF predictor is untrained if the load is indicated as dependent upon the particular store but store data is not forwarded from a store queue within the processor when the load executes.
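The training and untraining policy described above may be pictured with the following Python sketch. It is a simplified model under stated assumptions rather than the described circuit: the class name, the use of a dictionary keyed by load PC, and the method names are hypothetical, and the untrain condition shown corresponds only to the one embodiment mentioned above (the load was predicted dependent, but no store data was forwarded from the store queue).

```python
class StlfTrainingModel:
    """Toy model of the train/untrain policy; all names are hypothetical."""

    def __init__(self):
        # Maps a load PC to whatever information identifies the interfering store.
        self.entries = {}

    def train(self, load_pc, store_info):
        """Record that the identified store interfered with this load."""
        self.entries[load_pc] = store_info

    def untrain(self, load_pc):
        """Delete the information recorded for this load, if any."""
        self.entries.pop(load_pc, None)

    def on_load_executed(self, load_pc, predicted_dependent, data_forwarded):
        """Untrain when a predicted dependency did not actually occur."""
        if predicted_dependent and not data_forwarded:
            self.untrain(load_pc)
```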
In one implementation, the STLF predictor records at least a portion of the PC of a store which interferes with the load in a first table indexed by the load PC. A second table maintains a corresponding portion of the store PCs of recently dispatched stores, along with tags identifying the recently dispatched stores. The PC of a dispatching load is used to select a store PC from the first table. The selected store PC is compared to the PCs stored in the second table. If a match is detected, the corresponding tag is read from the second table and used to indicate a dependency for the load.
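The two-table lookup of this implementation may be sketched in Python as follows. The sketch is an assumption-laden software model, not the described hardware: the class and method names, the table sizes, the choice of low-order PC bits as the index and as the stored "portion", and the policy of preferring the most recently dispatched matching store are all hypothetical choices made for illustration.

```python
from collections import deque

PC_PORTION_MASK = 0xFFFF  # hypothetical: which PC bits are kept as the "portion"


class PcBasedStlfPredictor:
    """Toy model of the two-table scheme; sizes and names are assumptions."""

    def __init__(self, index_bits=10, recent_store_count=8):
        self._index_mask = (1 << index_bits) - 1
        # First table: indexed by load PC, holds a portion of the store PC.
        self._load_table = {}
        # Second table: (store PC portion, tag) for recently dispatched stores.
        self._store_table = deque(maxlen=recent_store_count)

    def train(self, load_pc, store_pc):
        """Record the PC portion of a store that interfered with the load."""
        self._load_table[load_pc & self._index_mask] = store_pc & PC_PORTION_MASK

    def dispatch_store(self, store_pc, store_tag):
        """Remember a dispatching store's PC portion and its tag."""
        self._store_table.append((store_pc & PC_PORTION_MASK, store_tag))

    def dispatch_load(self, load_pc):
        """Return the tag of the store the load should wait on, or None."""
        predicted_pc = self._load_table.get(load_pc & self._index_mask)
        if predicted_pc is None:
            return None
        # Assumption: if several recent stores match, use the most recent one.
        for pc_portion, tag in reversed(self._store_table):
            if pc_portion == predicted_pc:
                return tag
        return None
```

Because only portions of the PCs are kept in this model, unrelated loads or stores may alias to the same entry; a mistaken prediction of that kind would be a candidate for untraining.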
In another implementation, the STLF predictor records a difference between the tags assigned to a load and a store which interferes with the load in a first table indexed by the load PC. The PC of the dispatching load is used to select a difference from the table, and the difference is added to the tag assigned to the load. Accordingly, a tag of the store may be generated and a dependency of the load on the store may be indicated.
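The tag-difference scheme of this implementation may likewise be sketched in Python. This is a hedged illustration only: the class and method names are hypothetical, the table is modeled as a dictionary keyed by the full load PC, and the wrap-around of tags modulo a fixed `tag_count` is an assumption that the text above does not state.

```python
class DeltaStlfPredictor:
    """Toy model of the tag-difference scheme; names and sizes are assumptions."""

    def __init__(self, tag_count=64):
        self._tag_count = tag_count  # hypothetical number of distinct tags
        self._deltas = {}            # load PC -> (store tag - load tag)

    def train(self, load_pc, load_tag, store_tag):
        """Record the difference between the interfering store's tag and the load's tag."""
        self._deltas[load_pc] = store_tag - load_tag

    def dispatch_load(self, load_pc, load_tag):
        """Add the recorded difference to the load's tag to generate the store's tag."""
        delta = self._deltas.get(load_pc)
        if delta is None:
            return None
        # Assumption: tags wrap around, so the sum is reduced modulo tag_count.
        return (load_tag + delta) % self._tag_count
```

As an example, if a load tagged 20 was found to be interfered with by a store tagged 17, the recorded difference is -3; when the same load later dispatches with tag 50, the generated store tag is 47.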
Broadly speaking, a processor is contemplated comprising an STLF predictor and an execution pipeline coupled to the STLF predictor. The STLF predictor is coupled to receive an indication of dispatch of a first load memory operation, and is configured to indicate a dependency of the first load memory operation on a first store memory operation responsive to information stored within the STLF predictor indicating that, during a previous execution, the first store memory operation interfered with the first load memory operation. The execution pipeline is configured to inhibit execution of the first load memory operation prior to the first store memory operation responsive to the dependency. The execution pipeline is configured to detect a lack of the dependency during execution of the first load memory operation. The execution pipeline is configured to generate an untrain signal responsive to the lack of dependency. Coupled to receive the untrain signal, the STLF predictor is configured to update the information stored therein to not indicate that the first store memory operation interfered with the first load memory operation during the previous execution. Additionally, a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable.
Moreover, a method is contemplated. A dependency of a first load memory operation on a first store memory operation is indicated responsive to information indicating that, during a previous execution, the first store memory operation interfered with the first load memory operation. Scheduling of the first load memory operation is inhibited prior to scheduling the first store memory operation. A lack of the dependency is detected during execution of the first load memory operation. The information indicating that, during the previous execution, the first store memory operation interfered with the first load memory operation is updated to not indicate that, during the previous execution, the first store memory operation interfered with the first load memory operation. The updating is performed responsive to the detecting of the lack of dependency.