Technical Field
Embodiments described herein relate to computing systems, and more particularly, to methods and mechanisms for reducing the latency of load and store operations in processors.
Description of the Related Art
Processors generally include support for load memory operations and store memory operations to facilitate transfer of data between the processors and memory to which the processors may be coupled. As used herein, a load memory operation is an operation specifying a transfer of data from a main memory to the processor (although the transfer may be completed in cache). A store memory operation is an operation specifying a transfer of data from the processor to memory. Load and store memory operations may be an implicit part of an instruction which includes a memory operation, or may be explicit instructions, in various implementations. Load and store memory operations are more succinctly referred to herein as loads and stores, respectively.
A given load/store specifies the transfer of one or more bytes beginning at a memory address calculated during execution of the load/store. This memory address is referred to as the data address of the load/store. The load/store itself (or the instruction from which the load/store is derived) is located by an instruction address used to fetch the instruction, also referred to as the program counter address (or PC). The data address is typically calculated by adding one or more address operands specified by the load/store to generate an effective address or virtual address, which may optionally be translated through an address translation mechanism to a physical address of a memory location within the memory.
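As an illustration only, the data address calculation described above may be sketched as follows. The page size, the operand names (`base`, `index`, `scale`, `offset`), and the single-level `page_table` dictionary are assumptions chosen for this sketch and are not taken from any particular instruction set or embodiment.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


def effective_address(base, offset=0, index=0, scale=1):
    """Add the address operands to form the effective (virtual) address."""
    return base + index * scale + offset


def translate(virtual_address, page_table):
    """Translate a virtual address to a physical address.

    page_table is a hypothetical mapping of virtual page number to
    physical frame number; real translation mechanisms are more involved.
    """
    virtual_page, page_offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[virtual_page] * PAGE_SIZE + page_offset


# Hypothetical mapping: virtual page 0x10 resides in physical frame 0x2A.
page_table = {0x10: 0x2A}

va = effective_address(base=0x10000, offset=0x18)
pa = translate(va, page_table)
print(hex(va), hex(pa))
```

In this sketch the translation step is optional in the same sense as in the description above: a system without address translation would use the effective address directly as the data address.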
Load and store operations are typically executed on a stage-by-stage basis within a processor pipeline. As processor clock frequencies continue to increase, fewer levels of logic fit within a single clock cycle, and pipelines have become deeper as a result. This deep pipelining trend has made it advantageous to predict the events that may happen in the pipe stages ahead. One example of this technique is latency speculation between an instruction and a younger (in program order) dependent instruction. The program order of instructions is the order in which the instructions would be executed if they were executed one at a time and non-speculatively. The program order is created by the programmer (and/or compiler) of the program being executed. In out-of-order processors, younger dependent instructions may be picked for out-of-order (o-o-o) issue and execution prior to a broadcast of the results of a corresponding older (in program order) instruction. The deep pipelining trend also increases the latency to receive and use load (read) operation result data.
One example of the above instruction dependency and latency speculation is a load-to-load dependency. A younger (in program order) load instruction may be dependent on an older (in program order) load instruction. The older load instruction that produces the result data may be referred to as the producing (or producer) load instruction. The younger instruction dependent on the result data of the producing load instruction may be referred to as the consuming (or consumer) load instruction. When the target register of an older producing load (read) instruction is also an address register (source operand) of a younger consuming load instruction, the occurrence may be referred to as pointer chasing. Linked list traversals typically include frequent pointer chasing.
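The pointer chasing pattern described above may be modeled with a short sketch. The flat word-addressed `memory` list, the two-word node layout, and the `load` helper are hypothetical simplifications introduced for illustration; they stand in for load instructions and are not a description of any processor implementation.

```python
# Flat, word-addressed memory holding a three-node linked list.
# Assumed node layout: [value, address of next node]; address 0 ends the list.
memory = [0] * 16
memory[2], memory[3] = 10, 8   # node at address 2, next node at address 8
memory[8], memory[9] = 20, 5   # node at address 8, next node at address 5
memory[5], memory[6] = 30, 0   # node at address 5, end of list

NEXT_OFFSET = 1  # word offset of the "next" field within a node


def load(address):
    """Model a load: return the word stored at the given data address."""
    return memory[address]


def traverse(head):
    """Walk the list, collecting each node's value."""
    values = []
    pointer = head
    while pointer != 0:
        values.append(load(pointer))
        # Producer load: its result becomes the address operand of the
        # loads in the next iteration (the consumer loads).
        pointer = load(pointer + NEXT_OFFSET)
    return values


print(traverse(2))
```

Each iteration cannot compute its data address until the previous iteration's load of the `next` field has returned, so the loads form a serial dependency chain regardless of how many loads the hardware could otherwise issue in parallel.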
For load (read) instructions, the requested data may be retrieved from a cache line within a data cache. Alternatively, the requested data may be retrieved from a store queue, such as in the case when control logic determines that a load-store dependency exists. Data forwarding of load results to dependent instructions may occur by sending the retrieved data to a reservation station and/or a register file. Afterward, the data may be sent to one or more execution units corresponding to the younger dependent instructions. This data forwarding incurs an appreciable delay. The traversal of one or more linked lists within a software application accumulates this delay, once per dependent load, and may reduce performance. The latency for receiving and using load instruction result data may vary depending on instruction order within the computer program. The traversal of a linked list is one case that may allow an opportunity to decrease the latency to use load instruction result data.
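The choice between the two data sources mentioned above (store queue or data cache) may be sketched as follows. This is a simplified model under stated assumptions: the store queue is a hypothetical list of (address, data) pairs for older stores in program order, a dependency is detected by an exact address match, and partial overlaps and access sizes are ignored.

```python
def load_with_forwarding(address, store_queue, data_cache):
    """Return (data, source) for a load to the given address.

    store_queue: (address, data) pairs for older, not-yet-committed
                 stores, ordered oldest first (hypothetical structure).
    data_cache:  mapping of address to data, standing in for the cache.
    """
    # Search from the youngest older store backward so the load receives
    # the most recent value written to the matching address.
    for store_address, store_data in reversed(store_queue):
        if store_address == address:
            return store_data, "store_queue"
    # No load-store dependency found: read the data cache instead.
    return data_cache[address], "data_cache"


store_queue = [(0x40, 1), (0x80, 2), (0x40, 3)]
data_cache = {0x40: 0, 0xC0: 9}

print(load_with_forwarding(0x40, store_queue, data_cache))
print(load_with_forwarding(0xC0, store_queue, data_cache))
```

In either case the returned data would still need to be forwarded to the dependent instructions, which is the delay the paragraph above identifies.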
In view of the above, methods and mechanisms for reducing the latency of dependent load and store instructions are desired.