1. Field of the Invention
This invention relates in general to the field of microelectronics, and more particularly to a power saving mechanism to reduce load replays in an out-of-order processor.
2. Description of the Related Art
Integrated device technologies have advanced exponentially over the past 40 years. In the microprocessor field specifically, starting with 4-bit, single instruction, 10-micrometer devices, advances in semiconductor fabrication technologies have enabled designers to provide increasingly complex devices in terms of architecture and density. In the 1980s and 1990s, so-called pipeline microprocessors and superscalar microprocessors were developed comprising millions of transistors on a single die. And now, 20 years later, 64-bit, 32-nanometer devices are being produced that have billions of transistors on a single die and that comprise multiple microprocessor cores for the processing of data.
In addition to the employment of instruction parallelism in present day multi-core processors, out-of-order execution mechanisms are also prevalent. According to out-of-order execution principles, instructions are queued in reservation stations for execution by execution units, and only those instructions that are waiting on an operand as a result of the execution of older instructions are held up in the reservation stations; instructions that are not waiting on operands are dispatched for execution. Following execution, results are queued and put back into registers in proper order, typically in a processor stage called a retire stage. Hence, the instructions are executed out of the original program order.
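The dispatch rule described above can be illustrated with a minimal sketch. It is not drawn from any particular processor: the `Instr` record, the register names, and the latencies are all hypothetical, and the model assumes registers are renamed so that only true operand dependencies matter.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: str          # register written by this instruction
    srcs: tuple        # registers read by this instruction
    latency: int = 1   # hypothetical execution latency in clock cycles

def simulate(program, ready_regs):
    """Return instruction names in the order they are dispatched."""
    ready = set(ready_regs)
    waiting = list(program)   # reservation station contents, in program order
    in_flight = []            # (finish_cycle, instr) pairs being executed
    dispatched = []
    cycle = 0
    while waiting or in_flight:
        # Results that have finished executing make their operand available.
        for finish, ins in in_flight:
            if finish <= cycle:
                ready.add(ins.dest)
        in_flight = [(f, i) for f, i in in_flight if f > cycle]
        # Dispatch every waiting instruction whose operands are all available.
        for ins in [i for i in waiting if all(s in ready for s in i.srcs)]:
            waiting.remove(ins)
            in_flight.append((cycle + ins.latency, ins))
            dispatched.append(ins.name)
        if waiting and not in_flight:
            raise ValueError("an operand is never produced")
        cycle += 1
    return dispatched

program = [
    Instr("load", dest="r1", srcs=(), latency=3),
    Instr("add", dest="r2", srcs=("r1", "r0")),   # waits on the load
    Instr("mul", dest="r3", srcs=("r4", "r5")),   # independent of the load
]
order = simulate(program, ready_regs={"r0", "r4", "r5"})
print(order)
```

In this sketch the younger `mul` is dispatched ahead of the older `add`, which is held until the load produces `r1`. In-order retirement of results is not modeled.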
Out-of-order execution provides for significant throughput improvement since execution units, which would otherwise be sitting idle, are employed to execute younger instructions while older instructions await their operands. As one skilled in the art will appreciate, however, instructions do not always execute successfully and, when a given instruction does not execute successfully, that instruction and all instructions that are younger than that instruction must be executed again. This concept is known as “replay,” because mechanisms in present day processors essentially stop current execution, back up the machine state to the point just prior to when the instruction executed unsuccessfully, and replay the unsuccessfully executed instruction along with all younger instructions, which may or may not have been dispatched prior to dispatch of the unsuccessfully executed instruction.
Replay, however, is an exceptional case, and the performance impact of replays is very often negligible. Yet, the performance impact of holding instructions in reservation stations until their operands are available is significant, and microprocessor designers have developed acceleration techniques that allow certain instructions to be dispatched when there is a high probability that their operands will become available just prior to execution. Not only are these certain instructions dispatched, but mechanisms are put in place to provide their required operands just in time.
This application addresses one such acceleration technique where younger instructions that require an operand that is assumed with a high probability to be resident in an on-core cache memory are dispatched following a specified number of clock cycles after dispatch of a load instruction whose execution leads to retrieval of the operand from the cache. Accordingly, when the load instruction is dispatched, the younger instructions that are waiting on its operand are stalled in their respective reservation stations until the specified number of clock cycles have transpired, and then the younger instructions are dispatched for execution with high certainty that their required operand will become available.
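The timing behavior of this technique can be sketched as follows. The function name, its parameters, and the cycle counts are hypothetical and chosen only for illustration. The sketch assumes that dependents are released exactly the specified number of cycles after the load is dispatched, and that they must be replayed whenever the load takes longer than that.

```python
def dispatch_dependents(load_latency, specified_cycles, n_dependents):
    """Model one load and its waiting dependents under timed dispatch.

    load_latency     -- cycles the load actually needs to obtain its operand
    specified_cycles -- fixed wait before dependents are dispatched
    n_dependents     -- younger instructions waiting on the load's operand
    """
    # Dependents are released after the fixed wait, whether or not the
    # operand has actually arrived.
    operand_available = load_latency <= specified_cycles
    return {
        "dependents_dispatched_at": specified_cycles,
        "replayed": 0 if operand_available else n_dependents,
    }

# A load that is satisfied within the specified wait: no replay.
hit = dispatch_dependents(load_latency=4, specified_cycles=4, n_dependents=3)
# A load that takes longer than the specified wait: all dependents replay.
miss = dispatch_dependents(load_latency=30, specified_cycles=4, n_dependents=3)
print(hit, miss)
```

The dependents are dispatched at the same cycle in both cases; only the later outcome differs, which is why the technique pays off when most loads complete within the specified wait.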
The performance improvement resulting from utilization of the above noted acceleration technique is so substantial that microprocessor architects typically apply the technique across the board to all load instructions (e.g., loads from I/O, uncacheable loads, loads from interrupt registers, special loads, etc.), even though it is certain that there are a number of load instructions that will take longer than the specified number of cycles to obtain their operand, thus requiring a replay of all younger instructions that were dispatched in anticipation that the operand would be available. The performance improvements resulting from this load acceleration technique more than offset the performance penalties incurred by infrequent replays.
But as multi-core processor technologies continue to advance, designers are now finding that certain processor resources, such as level 2 (L2) caches, interrupt controllers, fuse arrays, etc., which are infrequently accessed, are better suited for placement in a common area of a multi-core processor die rather than being replicated within each of the cores. Hence, resources such as those noted above are shared by the processor cores. As one skilled in the art will appreciate, to load an operand from an off-core resource (say, a fuse array) takes substantially longer than is required to load from an on-core resource (say, an L1 cache). And even though the performance penalty that is incurred as a consequence of having to perform replays of younger instructions that were dispatched under the above acceleration technique is not substantial, it has been observed by the present inventors that the power utilization impact is notable, for a remarkable number of instructions are being executed under conditions where it is virtually certain that they will be replayed. And the initial execution of these instructions essentially wastes power, thus being disadvantageous from the standpoints of battery life, thermal profile, and reliability.
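The power concern can be made concrete with a rough tally of executions. All numbers here are hypothetical, including the latency table and the specified wait, and the sketch counts only the dependents of each load rather than every younger instruction. An execution is counted as wasted when it is performed before the operand is available and must therefore be repeated.

```python
# Hypothetical latencies in clock cycles; not taken from any real device.
SPECIFIED_CYCLES = 4
LATENCY = {"l1_cache": 4, "l2_cache": 20, "fuse_array": 200}

def tally_executions(loads):
    """loads: iterable of (resource, n_dependents) pairs.

    Returns (total, wasted) counts of dependent-instruction executions.
    """
    total = wasted = 0
    for resource, n_dependents in loads:
        total += n_dependents              # first, speculative execution
        if LATENCY[resource] > SPECIFIED_CYCLES:
            wasted += n_dependents         # that execution had no operand
            total += n_dependents          # so the dependents run again
    return total, wasted

# Nine on-core loads and one off-core load, three dependents each.
trace = [("l1_cache", 3)] * 9 + [("fuse_array", 3)]
total, wasted = tally_executions(trace)
print(total, wasted)
```

Under these assumptions, every dependent of an off-core load is executed twice, and the first execution contributes only to power consumption. The wasted count is known in advance for any resource whose latency always exceeds the specified wait.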
Therefore, what is needed is an apparatus and method that enables power to be saved in a processor by reducing the number of replays that are required.
In addition, what is needed is a load replay reduction mechanism in an out-of-order processor that results in power savings for the processor.