1. Field of the Invention
The present invention relates to improving the performance of out-of-order CPU architectures. In particular, it relates to an improved method and system for operating a high-frequency out-of-order processor with increased pipeline length.
2. Description of the Prior Art
The present invention has a quite general scope, which is not limited to a vendor-specific processor architecture, because its key concepts are independent thereof.
Nevertheless, it will be discussed with reference to a specific prior art processor architecture.
The prior art out-of-order processor, in this example an IBM S/390 processor, has as an essential component a so-called Instruction Window Buffer, further referred to herein as IWB.
After the instructions have been fetched by a fetch unit, passed through a decode and branch prediction unit, stored in the instruction queue, and renamed in a renaming unit, they are stored in a part of the IWB called the reservation station. From the reservation station the instructions may be issued to a plurality of instruction execution units, abbreviated herein as IEU, and the speculative results are stored in a temporary register buffer, called the reorder buffer, abbreviated herein as ROB. These speculative results are committed (or retired) in the actual program order, thereby transforming the speculative result into the architectural state within a register file, a so-called Architected Register Array, further abbreviated herein as ARA. In this way it is assured that the out-of-order processor (also referred to herein as an outprocessor) behaves like an in-order processor with respect to its architectural state.
Within the above summarized scheme, “renaming” is the process of allocating a new register in the reorder buffer for every new speculative execution result. Renaming is done to avoid the so-called “write-after-read” and “write-after-write” hazards that would otherwise prevent the out-of-order execution of the instructions. Each time a new register is allocated, a destination tag (the instruction ID) is associated with this register. With the help of this tag the speculative result of the execution is written into the newly allocated register. Later on, the completion process sets the architectural state by writing the speculative data into an architectural register or by setting a flag bit that specifies that the data has become part of the architectural state. In this way, the outprocessor behaves from an architectural point of view as if it executed all instructions in an in-order sequence.
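By way of illustration only, the renaming step described above can be sketched in Python. The class name `Renamer`, its fields, and the register names are hypothetical and are not taken from the described processor; the sketch assumes a simple mapping table and omits freeing and wrap-around of reorder buffer entries.

```python
from dataclasses import dataclass, field


@dataclass
class Renamer:
    """Toy rename stage (hypothetical): maps architectural registers to ROB entries."""
    rob_size: int = 8
    next_tag: int = 0  # next free ROB entry; freeing and wrap-around are omitted
    map_table: dict = field(default_factory=dict)  # arch register -> tag of latest writer

    def rename(self, dest, sources):
        # Sources are looked up first, so an instruction that reads its own
        # destination register still sees the previous producer.
        renamed_sources = [self.map_table.get(s, ("ARA", s)) for s in sources]
        if self.next_tag >= self.rob_size:
            raise RuntimeError("reorder buffer full")
        tag = ("ROB", self.next_tag)  # destination tag of the newly allocated register
        self.next_tag += 1
        self.map_table[dest] = tag
        return tag, renamed_sources


renamer = Renamer()
i0 = renamer.rename("r1", ["r2", "r3"])  # r1 = r2 + r3
i1 = renamer.rename("r2", ["r4", "r5"])  # r2 = r4 + r5, write-after-read on r2
i2 = renamer.rename("r1", ["r2", "r6"])  # r1 = r2 + r6, write-after-write on r1
```

In this sketch each instruction receives its own destination register, so the second write to r2 cannot disturb the value read by the first instruction, and the two writes to r1 land in different reorder buffer entries.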
In a state-of-the-art approach, renaming is done according to the schemes shown in FIG. 1 and FIG. 2. In the upper portion of each figure the pipeline stages are illustrated, whereas in the respective bottom part a structural overview is given. The main difference between the two schemes is whether or not the source data is stored in the issue queue. Consequently, the cycle in which the source data is read from the register file differs.
In particular, the first approach is illustrated in FIG. 1. During renaming 110, the addresses of the registers in which the source data for the instruction resides are assigned. Further, a new register is allocated for each dispatched instruction, in which the speculative result of the instruction will be stored after execution. Next, 110, the instruction is written into the issue queue 160, together with all its control bits (such as the opcode), the source data valid bits (which indicate whether the source data is already available in the register file) and other bits resulting from the renaming process. The wake-up logic 170 of the issue queue will monitor, 120, the results produced by the execution units and will set the source to valid for those instructions that are waiting in the issue queue for the specific result. The select logic 170 will select, commonly in an “oldest-first” manner, those instructions that will be issued to the execution units. Once the select logic has selected the instruction that will be issued, the source address will be sent in the next cycle to the register file and the source data will be read from there, 130. Finally, in the last cycle as shown in FIG. 1, the execution 140 of the instruction is performed in an execution unit 190, thereby calculating the speculative result.
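The wake-up and select behaviour of the FIG. 1 scheme may be sketched as follows. This is a minimal model under stated assumptions: the names `Entry` and `IssueQueue`, the integer tags, and the single-issue select are hypothetical, and the register file read that follows selection is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    age: int          # smaller value means older instruction
    opcode: str
    src_tags: list    # tags of the results this instruction waits for
    src_valid: list   # one "source data valid" bit per source


class IssueQueue:
    """Hypothetical issue queue that holds no source data (FIG. 1 style)."""

    def __init__(self):
        self.entries = []

    def insert(self, entry):
        self.entries.append(entry)

    def wake_up(self, result_tag):
        # Monitor a produced result and mark the matching sources as valid.
        for entry in self.entries:
            for i, tag in enumerate(entry.src_tags):
                if tag == result_tag:
                    entry.src_valid[i] = True

    def select(self):
        # Oldest-first among the entries whose sources are all valid.
        ready = [e for e in self.entries if all(e.src_valid)]
        if not ready:
            return None
        chosen = min(ready, key=lambda e: e.age)
        self.entries = [e for e in self.entries if e is not chosen]
        return chosen


queue = IssueQueue()
queue.insert(Entry(age=0, opcode="add", src_tags=[5, 6], src_valid=[True, False]))
queue.insert(Entry(age=1, opcode="sub", src_tags=[7, 8], src_valid=[True, True]))
first = queue.select()   # the older "add" still waits for tag 6
queue.wake_up(6)
second = queue.select()  # "add" is now ready
```

Here the younger instruction issues first because the older one is still waiting for a result, which is the out-of-order behaviour the issue queue is meant to permit.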
In FIG. 2 the alternative pipeline scheme is shown. The difference is that in this case the data is read from the register file 260 directly after renaming 210, 250. An undefined value is read in case the source data is not yet available. Next, the instruction is inserted, 220, into the issue queue 270, together with its source data read from the register file. It should be noted that the wake-up logic 280 is required, firstly, to set the valid bit of the source data and, secondly, to ensure that the speculative results produced by the execution units 290 are written into the source data fields of the specific instruction that uses the speculative result as an input.
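The additional duty of the wake-up logic in the FIG. 2 scheme may be sketched as follows. The names `CaptureEntry`, `wake_up` and `is_ready` are hypothetical, and `None` is used here merely as a stand-in for the undefined value read when the source data is not yet available.

```python
class CaptureEntry:
    """Hypothetical issue queue entry that carries its source data (FIG. 2 style)."""

    def __init__(self, opcode, src_tags, src_data, src_valid):
        self.opcode = opcode
        self.src_tags = src_tags    # tag of the producer for each source
        self.src_data = src_data    # value read after renaming, or None if undefined
        self.src_valid = src_valid  # one valid bit per source


def wake_up(entries, result_tag, result_value):
    # Capture the result into the waiting source data field and set its valid bit.
    for entry in entries:
        for i, tag in enumerate(entry.src_tags):
            if tag == result_tag and not entry.src_valid[i]:
                entry.src_data[i] = result_value
                entry.src_valid[i] = True


def is_ready(entry):
    return all(entry.src_valid)


add = CaptureEntry("add", src_tags=[None, 4], src_data=[3, None], src_valid=[True, False])
entries = [add]
ready_before = is_ready(add)
wake_up(entries, result_tag=4, result_value=10)
ready_after = is_ready(add)
total = sum(add.src_data)  # the operands are now available without a register file read
```

In contrast to the FIG. 1 sketch, the entry holds its operands once it is woken up, so no register file access is needed between selection and execution.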
Both pipeline models are currently in use. The MIPS R10000, HP PA and the DEC 21264 are examples of processors that use the model shown in FIG. 1. On the other hand, Intel Pentium, PowerPC 604 and HAL SPARC64 are based on the model shown in FIG. 2.
With the increasing number of circuits that fit onto a chip, processor designers enhance the performance of a processor by expanding the number of queue entries, by providing more execution units and, especially, by designing the processor for a much higher frequency. The trend in industry is, in particular, towards very high frequency designs.
For processors with such a very high frequency target, the pipeline schemes shown in FIGS. 1 and 2 are no longer applicable, since the delay of the logic between the pipeline registers is too large. To support a much higher frequency, the pipeline depth has to increase. For example, the pipeline shown in FIG. 3 has been published in “Intel Willamette Processor”, c't Magazin, Vol. 5, 2000, pp. 16. The total pipeline has 20 stages, which is double the number of pipeline stages of its predecessor, the “Intel P6 processor”.
The introduction of a much deeper pipeline has the advantage that the processor can run at a much higher frequency and therefore support a much higher instruction throughput. The drawback is, however, that the number of cycles needed for each instruction to go through the pipeline also increases. Since the performance of the processor, measured in MIPS, is equal to the frequency divided by the cycles per instruction (CPI), the performance gain obtained by introducing a very deep pipeline remains limited.
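The relation between frequency, CPI and performance can be made concrete with a short calculation. The figures below are hypothetical and chosen only to illustrate the trade-off; they are not measurements of any processor named herein.

```python
def mips(frequency_mhz, cpi):
    # Performance in millions of instructions per second: frequency divided by CPI.
    return frequency_mhz / cpi


# Hypothetical figures: the deeper pipeline doubles the frequency
# but raises the cycles per instruction from 1.0 to 1.25.
shallow = mips(frequency_mhz=500.0, cpi=1.0)
deep = mips(frequency_mhz=1000.0, cpi=1.25)
speedup = deep / shallow
```

Under these assumed numbers the frequency doubles while the performance rises only from 500 to 800 MIPS, a factor of 1.6, because part of the frequency gain is consumed by the higher CPI.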
Therefore, techniques that can reduce the pipeline length in performance-critical cases are of great importance for increasing the overall processor performance.