1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient identification of instructions dependent on speculative load operations in a processor.
2. Description of the Relevant Art
Modern microprocessors typically have increasing pipeline depth in order to support higher clock frequencies and increased microarchitectural complexity. A modern pick queue, or alternative instruction scheduler, selects multiple dispatched instructions out of program order to enable more instruction level parallelism, which yields higher performance. Also, out-of-order (o-o-o) issue and execution of instructions helps hide instruction latencies. However, despite improved device speed, higher clock frequencies of next-generation processors allow fewer levels of logic to fit within a single clock cycle compared to previous generations. Also, the deep pipelining trend encourages designers to reduce the circuit complexity of individual stages in order to both maintain the high clock frequencies and manage power dissipation.
In addition, the deep pipelining trend has made it advantageous to predict the events that may happen in the pipe stages ahead. One example of this technique is latency speculation between an instruction and a younger (in program order) dependent instruction. These younger dependent instructions may be picked for o-o-o issue and execution prior to a broadcast of the results of a corresponding older (in program order) instruction.
When the older instruction is a load instruction, its dependent instructions may be picked on the assumption that the load instruction hits in the cache. When the load instruction instead misses in the cache, a misprediction of the speculative load occurs. As the pipeline depth increases, the load instruction latency increases and more corresponding younger dependent instructions are scheduled before a load miss is identified. After a load miss, the younger dependent instructions that were picked and issued speculatively may need to be restarted or delayed so that they execute again with correct inputs. This recovery process is called replay.
One approach for recovery includes deallocating the younger dependent instructions from the pick queue, or alternative scheduler, at the time they are speculatively picked, before the hit status of the load instruction is known. Then, if the load instruction misses, all younger instructions are re-fetched. While this approach eliminates storing post-issue instructions in the pick queue, performance may suffer due to the overhead associated with re-fetching the instructions.
A second approach instead maintains storage of all younger (in program order) instructions relative to the load instruction in the pick queue until the hit status of the older load instruction is known. The speculative window for the load instruction begins with the cycle wherein the load instruction is picked and ends with the cycle wherein the load miss to the cache is detected. In the case of a load miss in the cache, there are two replay mechanisms that may be chosen for recovery. In flush replay, all instructions in the speculative window are flushed and re-executed whether or not they are dependent on the load instruction. As the speculative window increases, the number of instructions to re-execute increases. Therefore, flush replay is less desirable for deep pipelines and may reduce the benefit of out-of-order speculation.
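By way of illustration only, the cost of flush replay described above may be sketched in software. The following is a minimal model, not a description of any particular hardware; the names Instr, flush_replay_set, and the cycle numbers are hypothetical, and the sketch assumes that only instructions younger than the load instruction are candidates for the flush.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    seq: int               # program order; a larger value means younger (assumed encoding)
    pick_cycle: int        # cycle in which the instruction was picked
    depends_on_load: bool  # True if the instruction consumes the load result


def flush_replay_set(instrs, load_seq, load_pick_cycle, miss_detect_cycle):
    """Return the seq numbers flushed by flush replay.

    Every younger instruction picked inside the speculative window
    [load_pick_cycle, miss_detect_cycle] is flushed, whether or not it
    depends on the load.
    """
    return [
        i.seq
        for i in instrs
        if i.seq > load_seq
        and load_pick_cycle <= i.pick_cycle <= miss_detect_cycle
    ]


# Hypothetical example: load is seq 0, picked in cycle 10, miss detected in cycle 13.
younger = [
    Instr(seq=1, pick_cycle=11, depends_on_load=True),
    Instr(seq=2, pick_cycle=11, depends_on_load=False),
    Instr(seq=3, pick_cycle=12, depends_on_load=False),
    Instr(seq=4, pick_cycle=13, depends_on_load=True),
    Instr(seq=5, pick_cycle=15, depends_on_load=False),  # picked after the window
]

flushed = flush_replay_set(younger, load_seq=0, load_pick_cycle=10, miss_detect_cycle=13)
# Instructions flushed even though they never consumed the load result.
needless = [i.seq for i in younger if i.seq in flushed and not i.depends_on_load]
```

In this hypothetical window, four instructions are flushed although only two of them depend on the load instruction. Lengthening the window by moving miss_detect_cycle later can only enlarge the flushed set, which mirrors the observation that flush replay becomes less attractive as pipelines deepen.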
The alternative replay mechanism is selective replay, wherein the processor only re-executes the instructions that depend on the older load instruction that missed in the cache. For selective replay, a mechanism is needed to construct the data dependence chain. One method includes broadcasting the corresponding reorder buffer (ROB) entry number of the older load instruction. However, this is a serial process that includes comparators, such as power-consuming content-addressable memory (CAM) circuitry.
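For purposes of illustration, the data dependence chain that selective replay must identify may be sketched as a software walk over the younger instructions. This is only an analogue of the result such a mechanism computes; it does not model the ROB entry broadcast or CAM comparators mentioned above. The function name selective_replay_set, the tuple layout, and the register names are hypothetical.

```python
def selective_replay_set(younger, load_dest):
    """Return seq numbers of instructions that depend, directly or
    transitively, on the load result.

    younger: iterable of (seq, dest, srcs) tuples in program order, where
             dest is a register name (or None) and srcs is a tuple of
             register names. This layout is an assumption for the sketch.
    load_dest: the register written by the load that missed.
    """
    tainted = {load_dest}  # registers currently holding load-derived values
    replay = []
    for seq, dest, srcs in younger:
        if any(src in tainted for src in srcs):
            # The instruction consumed a load-derived value, so it must be
            # replayed and its own result is load-derived as well.
            replay.append(seq)
            if dest is not None:
                tainted.add(dest)
        elif dest in tainted:
            # An independent instruction overwrote a tainted register,
            # so later readers of that register are not dependent.
            tainted.discard(dest)
    return replay


# Hypothetical example: the load that missed writes r1.
window = [
    (1, "r2", ("r1", "r3")),  # direct dependent of the load
    (2, "r4", ("r5", "r6")),  # independent
    (3, "r7", ("r2", "r4")),  # transitive dependent through r2
    (4, "r1", ("r5",)),       # independent; overwrites r1
    (5, "r8", ("r1",)),       # reads the new r1, so not dependent
]

to_replay = selective_replay_set(window, load_dest="r1")
```

Here only instructions 1 and 3 are selected for replay, while a flush of the same window would re-execute all five. The walk is inherently sequential, since each step depends on the set built by the previous steps, which is consistent with the serial nature of the broadcast approach noted above.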
Another method for identifying younger dependent instructions includes the Half-Price architecture, see I. Kim, et al., "Half-Price Architecture," in Proceedings of the 30th International Symposium on Computer Architecture (ISCA-30), June 2003. However, this method includes a shifting matrix for each source operand, wherein this matrix is a separate structure from a dependency matrix that may accompany a pick queue.
Each source operand shifting matrix has a width equal to the instruction issue width of the processor and a depth equal to the number of pipeline stages within the speculative window of the load instruction. Also, the authors point out that the combination of tag elimination with broadcast-based selective instruction replay is not practical to implement. Further, once a load instruction misprediction occurs, a kill vector is broadcast to each instruction, where this vector is compared to the last row of each source operand matrix. The younger dependent instructions are not identified until the broadcast and comparison operations complete.
A third method for identifying younger dependent instructions includes the use of timed queues, see A. Merchant, et al., "Computer processor having a checker," U.S. Pat. No. 6,212,626, April 2001. However, this method inserts a delay into an instruction's execution latency through the use of a queue, rather than re-executing dependent instructions. If there is a load instruction misprediction, the corresponding output will not be set as ready, which will trigger a replay for the younger dependent instructions and their dependent instructions. Therefore, these dependent instructions may not begin execution as early as possible due to possible repeated replay.
In view of the above, methods and mechanisms for efficient identification of instructions dependent on speculative load operations in a processor are desired.