1. Technical Field
The present invention relates in general to the field of computers, and in particular to an improved store-to-load forwarding scheme for processor runahead mode operation in a computer system, wherein the scheme does not require a separate runahead cache structure in the processor.
2. Description of the Related Art
Historically, improvements in semiconductor memory latencies have failed to keep pace with corresponding increases in instruction processing speed of processors in computer systems. This is known as the “memory wall” and represents an increasing opportunity cost for execution by a processor of a storage instruction that “misses” the processor cache (i.e., the data being sought is not valid/available in the cache) and which must instead access main system memory directly. In existing technology generations, a last-level processor cache miss often stalls a processor for hundreds of processor cycles which can translate into thousands of missed instruction execution opportunities.
Out-of-order (OoO) execution in a processor has been used to mitigate the “cache miss” effect by speculatively pre-executing more recent independent instructions while waiting for a cache miss operation to complete, that is, while the data is retrieved directly from the system memory. This period of time is quite lengthy and causes inefficiency in processor operation, because the more recent instructions cannot be retired before the miss completes, and therefore they must be buffered within a processor core. A variety of microarchitectural resource and design constraints make scaling a processor buffer for storing the more recent instructions inherently difficult. Therefore, the number of outstanding unretired instructions is typically limited to a number far less than that required to fully compensate for the processor cache miss latency.
The primary benefit of using OoO execution to speculatively pre-fetch and pre-execute instructions after a cache miss, but prior to cache miss completion, is that more recent independent instructions begin pre-executing as early as possible after a detected cache miss. In this way, the latencies of multiple independent cache misses can be overlapped with each other, which leads to improved processor performance. Runahead execution has been proposed as a prefetching mechanism to accomplish this. The functions associated with such a proposed runahead execution scheme are as follows:
1. When a last-level processor cache load miss is detected, the architected state of the processor is checkpointed (i.e., the state of processor registers is copied to a checkpoint location in the system memory) and the processor enters runahead mode.
2. While in runahead mode, which occurs after a cache miss is detected, load instructions that miss the last-level processor cache do not wait for data to return from system memory. Instead, they immediately execute, and their respective result register is marked with a special not-a-value (NAV) bit to indicate the presence of a fictional value, as a place holder in the result register. Instructions that read a result register marked NAV cannot perform a useful computation, and are therefore skipped after they propagate the NAV bit to their respective destination register(s).
3. Although their results are functionally incorrect, load instructions that miss, and the instructions that depend on them, are retired from the processor in the computer system and do not inhibit execution of more recent instructions.
4. When the initial load miss that caused the transition to runahead mode finally returns valid data from memory, the architected processor state is restored from the checkpoint location and execution resumes in the processor with the next instruction after the load miss instruction.
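The four functions listed above can be illustrated with a minimal software sketch. This is a toy model under stated assumptions, not a description of any actual processor: the function `run_ahead`, the `NAV` sentinel object, the tuple-based instruction format, and the register names are all hypothetical, and the checkpoint is modeled as a simple copy of a register dictionary.

```python
NAV = object()  # hypothetical sentinel standing in for the not-a-value (NAV) bit


def run_ahead(regs, cache, miss_dst, program):
    """Toy runahead episode over `program`, the instructions after the miss.

    Returns (restored_regs, prefetches, runahead_regs).
    """
    checkpoint = dict(regs)          # function 1: checkpoint architected registers
    work = dict(regs)
    work[miss_dst] = NAV             # result register of the triggering load miss
    prefetches = []

    for op, dst, *srcs in program:
        vals = [work[s] for s in srcs]
        if any(v is NAV for v in vals):
            work[dst] = NAV          # function 2: propagate NAV, skip the work
        elif op == "load":
            addr = vals[0]
            if addr in cache:
                work[dst] = cache[addr]
            else:
                prefetches.append(addr)  # independent miss: start the fetch
                work[dst] = NAV          # do not wait for system memory
        elif op == "add":
            work[dst] = vals[0] + vals[1]
        # function 3: every instruction "retires" here; nothing stalls

    # function 4: the miss data has returned; restore from the checkpoint
    return dict(checkpoint), prefetches, work


regs = {"r0": 0, "r1": 100, "r2": 200, "r3": 8}
cache = {200: 7}                     # only address 200 is present in the cache
program = [
    ("add", "r5", "r0", "r3"),       # depends on the miss result, becomes NAV
    ("load", "r6", "r5"),            # NAV address, so the load is skipped
    ("load", "r7", "r1"),            # independent miss at address 100
    ("load", "r8", "r2"),            # cache hit
    ("add", "r9", "r8", "r3"),       # valid computation on valid data
]
restored, prefetches, runahead_regs = run_ahead(regs, cache, "r0", program)
```

In this sketch the only lasting effect of the runahead episode is the prefetch list: the register state returned as `restored` equals the state at entry, while the independent miss at address 100 has been uncovered early.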
Because the load miss instruction has not yet returned valid data from memory, instructions executed beyond the initial load miss may operate on non-valid data and produce potentially incorrect results. Execution of the instructions subsequent to the load miss instruction must therefore eventually be “squashed” and restarted at that point. However, execution of instructions beyond the initial load miss instruction may still generate useful pre-fetches by uncovering load misses in those pre-executed instructions that are independent of the initial load miss. Because execution of instructions in runahead mode is not limited by correctness requirements and does not wait for long-latency instructions to complete, instructions are allowed to execute arbitrarily far into the future without stalling the processor. However, any architected processor state updates that occur during runahead mode must be “undone” during the transition back to the normal (non-runahead) mode of operation. Restoring execution of instructions from a previous checkpoint location accomplishes the “undoing” for register write instructions, since the architected processor register file is part of the checkpoint location; however, reversing the effects of system memory write instructions can be difficult.
One proposed solution is simply to drop store instructions executed during runahead mode. While this approach is conceptually simple, more recent load instructions that source data from the dropped store instructions would receive stale data. If that load data contributes toward “cache miss” address generation, the prefetching effect of this technique is diminished. Another proposed solution is to add a small, separate, explicit processor “runahead cache” structure designed to buffer speculative data produced by store instructions during runahead mode. Upon exiting runahead mode, the runahead cache is cleared. However, while effective at propagating values between store instructions and load instructions, a processor runahead cache would consume additional chip area and power in the processor, and is therefore undesirable.
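The difference between the two proposed solutions can be shown with a short sketch. It is illustrative only: the helper names `runahead_store` and `runahead_load`, the dictionary-based memory model, and the example addresses are assumptions introduced here, not part of either proposal.

```python
def runahead_store(addr, value, runahead_cache=None):
    """Buffer the store if a runahead cache exists; otherwise drop it."""
    if runahead_cache is not None:
        runahead_cache[addr] = value
    # With no runahead cache the store has no effect at all.


def runahead_load(addr, memory, runahead_cache=None):
    """Prefer speculative store data; fall back to (possibly stale) memory."""
    if runahead_cache is not None and addr in runahead_cache:
        return runahead_cache[addr]
    return memory[addr]


memory = {0x40: 0x1000}              # location 0x40 holds a pointer value

# First solution: the store is dropped, so the later load sees the old pointer.
runahead_store(0x40, 0x2000)
stale_ptr = runahead_load(0x40, memory)

# Second solution: a separate runahead cache forwards the stored pointer.
rc = {}
runahead_store(0x40, 0x2000, rc)
forwarded_ptr = runahead_load(0x40, memory, rc)

rc.clear()                           # cleared upon exiting runahead mode
```

If the loaded pointer is used to form the address of a later load, the first solution would prefetch from the stale address `0x1000`, whereas the second would prefetch from `0x2000`. In both cases system memory itself is never modified, so nothing needs to be undone there on exit.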
Thus, there is a need for an improved optimization scheme for a processor in a computer system which is effective at propagating values between store instructions and load instructions during processor runahead mode operation but which does not require any additional storage structure(s) in the processor.