1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficient out-of-order picking of instructions in a processor.
2. Description of the Relevant Art
Modern microprocessors typically have increasing pipeline depth in order to support higher clock frequencies and increased microarchitectural complexity. Despite improved device speed, higher clock frequencies of next-generation processors allow fewer levels of logic to fit within a single clock cycle compared to previous generations. Certain loops of logic, or paths, within a processor may experience difficulty in fitting within a single pipeline stage. One such path is a wide instruction issue selection path that resides within an issue queue (IQ). The length of this logic path may be set by several factors including the size of the IQ, instruction dependencies, instruction latencies, the number and functionality of pipeline stages within a corresponding microarchitecture, and speculative instruction effects such as misprediction and recovery.
A modern IQ selects multiple dispatched instructions out of program order to enable more instruction level parallelism, which yields higher performance. Also, out-of-order (o-o-o) issue and execution of instructions helps hide instruction latencies. However, several instructions do not have single-cycle latencies. These multi-cycle instructions complicate the selection logic within an IQ. In addition, a latency between two instruction types, such as a floating-point arithmetic type and an integer arithmetic type, may not be consistent. For example, five pairings of these instruction types may yield a latency of 12 clock cycles between the producer generating a result and a consumer receiving the result. However, an additional sixth pairing may yield a latency of 13 clock cycles. This latter latency may set a final latency of 13 clock cycles for each of the six pairings, which decreases performance.
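The cost of an inconsistent latency can be illustrated with a minimal sketch. The pairing names and helper functions below are hypothetical and chosen only for illustration. The 12-cycle and 13-cycle values are the ones given in the example above.

```python
# Hypothetical producer/consumer pairings (names are illustrative only).
# Five pairings forward a result in 12 cycles, a sixth needs 13 cycles.
PAIR_LATENCY = {f"pairing_{i}": 12 for i in range(1, 6)}
PAIR_LATENCY["pairing_6"] = 13


def uniform_latency(pair_latency):
    """Latency the scheduler must assume if all pairings share one value."""
    return max(pair_latency.values())


def lost_cycles(pair_latency):
    """Cycles each pairing waits beyond its own latency under the uniform value."""
    worst = uniform_latency(pair_latency)
    return {pair: worst - latency for pair, latency in pair_latency.items()}


if __name__ == "__main__":
    print(uniform_latency(PAIR_LATENCY))
    print(lost_cycles(PAIR_LATENCY))
```

Under this assumption, the single slow pairing costs each of the five faster pairings one cycle on every producer-to-consumer forwarding.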
In addition to the above, one or more instructions may have a nondeterministic latency. For example, within an IQ, a load instruction that misses a cache may have an indeterminate time for generating a result. Within the IQ, a divide operation may have a source-data-dependent latency that is unknown at the time of instruction issue. Therefore, scheduling the issue of younger instructions that depend on these types of instructions with a nondeterministic latency is made more difficult. Additionally, the delay to generate a result from these instructions with a nondeterministic latency may increase as the pipeline depth increases.
By predicting a hit in the cache, a load instruction may be treated as a speculative instruction. In such a case, a known latency may be predicted. Younger instructions, including dependent instructions, may then issue early assuming that the load instruction hits in the cache. However, during a load miss, or misprediction of the speculative load, recovery occurs. During recovery, any younger instructions dependent on the mispredicted load instruction may then re-execute. One approach for recovery includes deallocating the younger dependent instructions from the IQ at the time of their early issue before a load miss. Then after a load miss, all younger instructions are re-fetched. While this approach eliminates storing post-issue instructions in the IQ, performance may suffer due to the overhead associated with re-fetching the instructions.
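The first recovery approach described above may be sketched as follows. This is a simplified software model, not a description of any particular hardware. The function name, its parameters, and the instruction names used with it are hypothetical.

```python
def deallocate_on_issue(younger, dependents, load_hits):
    """Model the first recovery approach for one speculative load.

    younger    -- names of instructions younger than the load, in program order
    dependents -- subset of `younger` that consumes the load result
    load_hits  -- True if the load actually hits in the cache
    Returns (entries left in the IQ after early issue, instructions to re-fetch).
    """
    # Dependents issue early and free their IQ entries immediately.
    remaining_in_iq = [name for name in younger if name not in dependents]
    # On a miss, every younger instruction is fetched again.
    refetch = [] if load_hits else list(younger)
    return remaining_in_iq, refetch


if __name__ == "__main__":
    print(deallocate_on_issue(["add", "mul", "sub"], {"add", "sub"}, load_hits=False))
```

In this model the IQ holds no post-issue entries, but a miss sends every younger instruction back through fetch, which is the overhead noted above.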
A second approach maintains storage of all younger instructions in the IQ until the older load hit status is known. In the case of a load miss, during recovery the younger dependent instructions may subsequently re-issue according to a predetermined policy. As pipeline depth increases, the speculative window of the load instruction increases. Accordingly, the size of the IQ increases as the number of instructions in the IQ waiting to be re-issued increases. These instructions fill a larger portion of the entries of the IQ. Unless a cache miss occurs, these post-issue instructions are not candidates for selection and they add complexity to the issue selection logic.
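The second approach may be sketched with a toy model in which issued entries remain allocated until the hit status of the load is known. The class and method names are hypothetical. The handling of independent instructions on a miss is an assumption made for this sketch, since the re-issue policy is not specified above.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    depends_on_load: bool  # True if the entry consumes the speculative load result
    issued: bool = False


class IssueQueue:
    """Toy IQ that keeps issued entries until the older load's hit status is known."""

    def __init__(self):
        self.entries = []

    def dispatch(self, entry):
        self.entries.append(entry)

    def issue_all_speculatively(self):
        # Issue every waiting entry, assuming the load hits in the cache.
        for entry in self.entries:
            entry.issued = True

    def selectable(self):
        # Post-issue entries occupy storage but are not candidates for selection.
        return [entry.name for entry in self.entries if not entry.issued]

    def resolve_load(self, hit):
        if hit:
            # Hit confirmed: post-issue entries can finally be deallocated.
            self.entries = [e for e in self.entries if not e.issued]
            return
        # Miss: dependents become candidates again. Assumption of this sketch:
        # issued entries that do not depend on the load are deallocated.
        kept = []
        for entry in self.entries:
            if entry.depends_on_load:
                entry.issued = False
                kept.append(entry)
            elif not entry.issued:
                kept.append(entry)
        self.entries = kept
```

Between speculative issue and load resolution, every entry in this model still occupies the queue although none is selectable. This is the occupancy cost that grows with the speculative window.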
In addition to the above, parasitic capacitances and wire route delays continue to increase with each newer processor generation. Therefore, wire delays still limit the dimension of many processor structures such as an IQ. Within an IQ, the delay of a wide o-o-o issue selection path is proportional to the number of entries of the IQ. As stated earlier, higher clock frequencies allow fewer levels of logic to fit within a single clock cycle. In order for a processor to achieve high performance, the IQ needs to supply a sufficient number of instructions to functional units each clock cycle despite the various constraints mentioned above.
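The dependence of the selection path on the number of IQ entries may be illustrated with a software analogy. The sketch below assumes an oldest-first selection policy, which is not stated above, and uses the number of entries scanned as a rough stand-in for the length of the selection path. The function and parameter names are hypothetical.

```python
def pick_oldest_ready(ready, width):
    """Select up to `width` ready entries, oldest first.

    `ready` is a list of booleans ordered from oldest to youngest entry.
    Returns the picked entry indices and the number of entries scanned.
    """
    picked = []
    scanned = 0
    for index, is_ready in enumerate(ready):
        scanned += 1
        if is_ready:
            picked.append(index)
            if len(picked) == width:
                break
    return picked, scanned


if __name__ == "__main__":
    print(pick_oldest_ready([False, True, False, True, True], width=2))
```

In the worst case, when too few entries are ready, the scan covers every entry, so the work grows linearly with the size of the queue.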
In view of the above, efficient methods and mechanisms for out-of-order picking of instructions in a processor are desired.