1. Field of the Invention
The present invention relates to performance improvements in out of order CPU architectures. In particular, it relates to an improved method and system for operating a high frequency out of order processor with increased pipeline length.
2. Description and Disadvantages of Prior Art
The present invention has a quite general scope which is not limited to a vendor specific processor architecture because its key concepts are independent therefrom.
Despite this fact, it will be discussed with reference to a specific prior art processor architecture.
Said prior art out of order processor, in this example an IBM S/390 processor, has as an essential component a so-called Instruction Window Buffer, further referred to herein as IWB. After coming from an instruction cache and passing through a decode and branch prediction unit, the instructions are dispatched, still in order. In this out of order processor the instructions are allowed to be executed, and the results written back into the IWB, out of order.
In other words, after the instructions have been fetched by a fetch unit, stored in the instruction queue and renamed in a renaming unit, they are stored in order into a part of the IWB called the reservation station. From the reservation station the instructions may be issued out of order to a plurality of instruction execution units, abbreviated herein as IEU, and the speculative results are stored in a temporary register buffer, called the reorder buffer, abbreviated herein as ROB. These speculative results are committed (or retired) in the actual program order, thereby transforming the speculative result into the architectural state within a register file, a so-called Architected Register Array, further abbreviated herein as ARA. In this way it is assured that the out of order processor, with respect to its architectural state, behaves like an in order processor.
Within the above summarized scheme, "renaming" is the process of allocating a new register in the reorder buffer for every new speculative execution result. Renaming is done to avoid the so-called "write after read" and "write after write" hazards that otherwise would prevent the out of order execution of the instructions. Each time a new register is allocated, a destination tag (the instruction ID) is associated with this register. With the help of this tag the speculative result of the execution is written into the newly allocated register. Later on, the in order completion process sets the architectural state by writing the speculative data into an architectural register or by setting a flag bit that specifies that the data has become part of the architectural state. In this way, the out of order processor behaves from an architectural point of view as if it executes all instructions in an in order sequence.
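By way of illustration only, the renaming step described above may be sketched as follows. The sketch assumes that the destination tag is simply the index of the newly allocated ROB entry; the names `RobEntry`, `Renamer` and `latest` are hypothetical and do not appear in the prior art processor.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    tag: int                      # destination tag (assumed here: the ROB index)
    valid: bool = False           # ON once the speculative result has been written
    data: Optional[int] = None    # speculative result


class Renamer:
    """Hypothetical renaming unit: one new ROB register per speculative result."""

    def __init__(self) -> None:
        self.rob: List[RobEntry] = []
        # logical register -> tag of its newest speculative producer
        self.latest: Dict[str, int] = {}

    def rename(self, dest: str, sources: List[str]) -> Tuple[int, List[Optional[int]]]:
        # Sources are looked up before the new destination is recorded.
        # None means "no dependency found, read the architected register".
        src_tags = [self.latest.get(s) for s in sources]
        # A fresh register is allocated for every result, so a later write to
        # the same logical register cannot disturb earlier readers or writers.
        tag = len(self.rob)
        self.rob.append(RobEntry(tag))
        self.latest[dest] = tag
        return tag, src_tags
```

Renaming the sequence r1 = f(r2, r3), r2 = g(r1), r1 = h(r2) with this sketch yields three distinct destination tags, so the second write to r1 and the write to r2 no longer collide with the earlier instructions.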
In a state of the art approach renaming is done according to the schemes shown in FIG. 1 and FIG. 2. In the upper portion of the figures the pipeline stages are illustrated whereas in the respective bottom part a structural overview is given. The main difference between the two schemes is the storing of source data or not storing of source data, respectively, into the issue queue. Therefore, the cycle in which the source data is read from the register file is different.
In particular, the first approach is illustrated in FIG. 1. During renaming 110 the logical register addresses are assigned with physical register addresses in which the source data for the instruction resides. Further, a new register is allocated in which the speculative result of the instruction will be stored after execution. Next, 110, the instruction is written into the issue queue 160, together with all its control bits (like opcode), source validity (if the source data is already available in the register file) and other bits as resulting from the renaming process. The wake up logic 170 of the issue queue will monitor the results produced by the execution units and will set the source that is dependent on the target result to valid for those instructions that are waiting in the issue queue for the specific result in stage 120. The select logic 170 will select, commonly in an "oldest first" manner, those instructions that will be issued to the execution units when all source data is available (i.e., source valid bits are ON). Once the select logic has selected the instruction that will be issued, the source address will be sent in the next cycle to the register file and the source data will be read from there, 130. Finally, in the last cycle as shown in FIG. 1, the execution 140 of the instruction is performed in an execution unit 190, thereby calculating the speculative result.
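The wake up and select behavior described above may be sketched as follows. This is a simplified illustration only; the names `IssueEntry`, `wake_up` and `select_oldest_ready` are hypothetical, and the age of an entry is assumed to be a plain integer with smaller values denoting older instructions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IssueEntry:
    age: int                 # assumed encoding: smaller value = older instruction
    src_tags: List[int]      # tags of the results each source waits for
    src_valid: List[bool]    # source valid bits


def wake_up(queue: List[IssueEntry], result_tag: int) -> None:
    """Set the valid bit of every source that waits for result_tag."""
    for entry in queue:
        for i, tag in enumerate(entry.src_tags):
            if tag == result_tag:
                entry.src_valid[i] = True


def select_oldest_ready(queue: List[IssueEntry]) -> Optional[IssueEntry]:
    """Pick the oldest entry whose source valid bits are all ON, if any."""
    ready = [e for e in queue if all(e.src_valid)]
    return min(ready, key=lambda e: e.age) if ready else None
```

An entry whose only source waits for tag 7 is skipped by `select_oldest_ready` until `wake_up(queue, 7)` has been called, after which it is selected ahead of any younger ready entry.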
In FIG. 2 the alternative pipeline scheme is shown. The difference is that in this case the data is read from the register file 260 directly after renaming 210, 250 in case the source data is available. In stage 220, the instruction is inserted into the issue queue 270, together with its source data read from the register file. It should be noted that the wake up logic 280 is required, firstly, to set the valid bit of the source data and, secondly, to ensure that the speculative results produced by the execution units 290 are written into the source data fields of the specific instruction that uses the speculative result as an input.
Both pipeline models are currently in use. The MIPS R10000, HP PA 8000 and the DEC 21264 are examples of processors that use the model shown in FIG. 1. On the other hand, Intel Pentium, Power PC 604 and HAL SPARC64 are based on the model shown in FIG. 2.
With the increasing number of circuits that fit onto a chip, processor designers enhance the performance of a processor by expanding the number of queue entries, by providing more execution units and especially, by designing the processor for a much higher frequency. Thereby, the trend in industry is especially towards very high frequency designs.
For processors with such a very high frequency target, the pipeline schemes shown in FIGS. 1 and 2 are no longer applicable, since the logic delay between the pipeline registers becomes too large to support the requested high frequency of operation. To support a much higher frequency the pipeline depth has to increase. For example, the pipeline shown in FIG. 3 has been published in an article entitled "Intel Willamette Processor", c't Magazin, Vol 5, 2000, pp 16-17. The total pipeline has 20 stages, which is double the number of pipeline stages of its predecessor, the Intel P6 processor (Pentium III).
The introduction of a much deeper pipeline has the advantage that the processor can run at a much higher frequency and therefore support a much higher throughput of instructions. The drawback is, however, that the number of cycles needed for each instruction to go through the pipeline also increases. Since the performance of the processor in "MIPS" (millions of instructions per second) is equal to the frequency (in MHz) divided by the cycles per instruction (CPI), the performance gain from introducing a very deep pipeline remains limited.
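The relationship between frequency, CPI and performance can be made concrete with a small calculation. The frequencies and CPI values below are purely hypothetical figures chosen for illustration; they are not measurements of any processor named herein.

```python
def mips(frequency_mhz: float, cpi: float) -> float:
    """Performance in millions of instructions per second: frequency / CPI."""
    return frequency_mhz / cpi


# Hypothetical figures: doubling the frequency with a deeper pipeline,
# while the CPI rises from 1.0 to 1.25 because of the additional stages.
short_pipeline = mips(frequency_mhz=500.0, cpi=1.0)
deep_pipeline = mips(frequency_mhz=1000.0, cpi=1.25)
speedup = deep_pipeline / short_pipeline

print(short_pipeline, deep_pipeline, speedup)
```

Under these assumed figures the frequency doubles but the performance rises only by a factor of 1.6, which is why shortening the pipeline in performance critical cases is attractive.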
Therefore, techniques that can reduce the pipeline length in performance critical cases are of great importance to increase the overall processor performance.
With reference to FIG. 4 the IWB macros are shown schematically. In this processor, the so-called Instruction Window Buffer (IWB) comprises a renaming logic 415, an issue queue referred to herein as reservation station (RS) 418, 420 and, amongst others, a register buffer 425 referred to herein as ReOrder Buffer (ROB) for holding the speculative results. The architectural results are stored in a Register File 430 called Architectural Register Array (ARA). The reservation station, the ARA and the ROB are connected with a multiplexer unit 450.
In FIG. 5 the respective pipeline scheme is shown. The IWB implementation scheme uses the basic pipeline scheme of FIG. 2, where the data is stored in the queue. It is, however, like the processor in ref 1, designed for a much higher frequency. Therefore, the pipeline shown in FIG. 5 has additional cycles in comparison to FIG. 2 to support this frequency target.
The more detailed operation of the FIG. 5 IWB pipeline will now be explained with reference to FIG. 4.
The fetch unit dispatches up to 4 instructions each cycle to the IWB in program order. The IWB pipeline starts with renaming 510 the up to 4 dispatched instructions.
In the next cycle 520, called "read ROB", a plurality of signals RSEL (0 ... 63) addresses the ReOrder Buffer. Each ReOrder Buffer entry comprises: a tag specifying the reorder buffer entry directly or some other unique id, a valid bit, and the speculative result data. Furthermore, some other information may be stored in the ROB, like exception bits.
When the renaming logic has found a dependency for the source operand, the tag, valid bit and data are read from the ROB. In the write RS cycle 530, this information is stored in the Reservation Station (RS). When no dependency was found, the data will be read from the ARA during the "read ROB" cycle 520, and the data, together with the valid bit set to ON, is written for the source operand into the RS.
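The choice between reading the ROB and reading the ARA for a single source operand may be sketched as follows. The dictionary layout and the function name `read_source_operand` are assumptions made for illustration; the actual macros are hardware arrays addressed by the RSEL signals.

```python
from typing import Any, Dict, Optional


def read_source_operand(
    dep_tag: Optional[int],
    rob: Dict[int, Dict[str, Any]],
    ara: Dict[str, Any],
    logical_reg: str,
) -> Dict[str, Any]:
    """Return the fields to be written into the RS in the write RS cycle."""
    if dep_tag is not None:
        # Dependency found by the renaming logic: tag, valid bit and data
        # are taken from the selected ROB entry (data may not be valid yet).
        entry = rob[dep_tag]
        return {"tag": entry["tag"], "valid": entry["valid"], "data": entry["data"]}
    # No dependency: architected data is read from the ARA, valid bit ON.
    return {"tag": None, "valid": True, "data": ara[logical_reg]}
```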
In the "select" cycle 540, the instruction will be selected for issue when it is the oldest instruction that waits for issue and all the source operand data is available. Then, during the issue cycle 550, the data is read out from the RS, and finally, in the EXE1 cycle 560 and EXE2 cycle 570, the execution of the instructions is done.
With reference now to FIG. 6, the renaming steps and the write after read conflict that can occur when all information that has to be written into the RS is read from a ROB entry will be described. Furthermore, the possibility and disadvantages of circumventing this write after read conflict by using longer pipelines will be discussed next below.
In FIG. 4, renaming, i.e., "read dependent data from the ReOrder Buffer (ROB) 425" and "write into the Reservation Station (RS) 420", is shown for a single source operand. It should be noted that each instruction may have several operands, for each of which renaming, read ROB and write RS are done in parallel. For the example given here (see FIG. 6), the source operand is found dependent on the result of a previous instruction in the ROB to which the exemplary tag 5, see reference sign 625, is assigned.
Therefore, this entry is selected by the renaming logic read select output (RSEL), see back to FIG. 4, for read in the next cycle. After the tag 625, valid bit 630 and data 635 have been read out from the ROB, this data is present at the ROB output registers 640 at the end of the cycle. In the next "write RS" cycle this data is written into the source operand fields 645, 650 and 655, respectively, allocated in the RS 420 (see back to FIG. 4) for the new instruction.
The problem that occurs is, however, that it takes a "read ROB" cycle 520 and a "write RS" cycle 530 before the tag can be used by the RS IEU tag compare logic. If the IEU writes data, denoted symbolically as "abcd" in FIG. 6, into the ROB 425 entry that is just being read out in the "read ROB" cycle, then the tag will not be present in the RS 420 yet. Hence the result data from the IEU will be stored in the ROB entry but not in the RS operand field, resulting in a write after read conflict. Therefore, in FIG. 6 the ROB entry with tag=5 will be written with "abcd" and the valid bit is turned ON, but the corresponding RS operand field remains "xxxx" and the valid bit remains OFF. Hence, a data inconsistency exists due to the so-called write after read conflict between ROB 425 and RS 420, which usually leads to deadlock situations that need to be avoided.
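The conflict can be reproduced with a small cycle-by-cycle model. The following is a sketch only, assuming that the ROB entry is sampled into the output registers before the IEU write of the same cycle takes effect; the function name `simulate_war_conflict` and the dictionary layout are hypothetical.

```python
def simulate_war_conflict():
    """Model the "read ROB" and "write RS" cycles for the tag=5 example."""
    rob = {5: {"tag": 5, "valid": False, "data": None}}
    rs_operand = {"tag": None, "valid": False, "data": None}

    # "read ROB" cycle: the entry is sampled into the ROB output registers.
    rob_out = dict(rob[5])

    # Same cycle: the IEU returns the result "abcd" for tag 5.
    ieu_tag, ieu_data = 5, "abcd"
    rob[ieu_tag].update(valid=True, data=ieu_data)
    # RS tag compare: no match, because the tag is not yet in the RS.
    if rs_operand["tag"] == ieu_tag:
        rs_operand.update(valid=True, data=ieu_data)

    # "write RS" cycle: the stale sample is written into the operand fields.
    rs_operand.update(rob_out)
    return rob[5], rs_operand


rob_entry, rs_operand = simulate_war_conflict()
print(rob_entry)    # ROB entry holds "abcd" with the valid bit ON
print(rs_operand)   # RS operand has tag 5 but no data and valid bit OFF
```

Since no later IEU result will carry tag 5 again, the RS operand in this model would wait indefinitely, which corresponds to the deadlock situation mentioned above.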
In processors with a traditional pipeline, see FIG. 1 and FIG. 2, this problem is handled in several ways. The first prior art solution is that the cycle time permits the tag to be written into the RS during the renaming stage. Thereafter the validity bit and data are read from the ROB in the next cycle. The problem then no longer exists, since the tag is already present in the RS and a match with the IEU tag will prioritize the write of data from the IEU over the data read from the ROB.
The second prior art solution is that the IEU writes the data and sets the valid bit for the ROB entry before the read of the ROB starts. In other words, basically a write through cell is used, or the clock cycle is partitioned into phase 1 and phase 2. During phase 1 the write is done and during phase 2 the read ROB/write RS is done. So again, the longer cycle is exploited.
The third prior art solution is that bus snooping is done during "read ROB/RS write", called "read RF/insert" in FIG. 1 and FIG. 2. Here some additional logic compares the read out ROB tag with the IEU tags, and in case of a hit the IEU data will be selected instead of the data read from the ROB. So the cycle time permits this snooping to be done.
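The snooping approach of the third solution may be sketched as follows. This is an illustrative model under the assumption that the IEU result buses of the current cycle can be represented as a mapping from result tag to result data; `read_rob_with_snoop` is a hypothetical name.

```python
from typing import Any, Dict


def read_rob_with_snoop(rob_entry: Dict[str, Any], ieu_results: Dict[int, Any]) -> Dict[str, Any]:
    """Compare the read out ROB tag with the IEU tags; on a hit, select the IEU data."""
    tag = rob_entry["tag"]
    if tag in ieu_results:
        # Hit: bypass the possibly stale ROB data with the IEU result.
        return {"tag": tag, "valid": True, "data": ieu_results[tag]}
    # No hit: pass the ROB contents through unchanged.
    return dict(rob_entry)
```

For the example of FIG. 6, a stale ROB entry with tag 5 and an IEU result for tag 5 of "abcd" yields an RS operand with the valid bit ON and data "abcd", so ROB and RS stay consistent. The additional compare lies in the read path, which is why this approach relies on a sufficiently long cycle time.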
All three of these solutions are used in current processors operating at a lower frequency than that targeted by the present invention, in order to keep the data in the ROB consistent with the data in the RS. For any high frequency design the problem of keeping the ROB data consistent with the RS data of the dependent operands needs to be revisited.
Furthermore, the Instruction Execution Unit (IEU) protocol, which often has a delay between the result tag being available and the result data being available, complicates the problem of keeping the ROB and RS consistent.
With reference to FIG. 7 the reason why the tag and data are available in different cycles is illustrated next below.
When an instruction is issued from the RS, the result tag 715 "res tag" is read out together with the data 720 of the source registers "src1 data" and "src2 data". Furthermore, some other bits are read from the RS, like the opcode bits, that are not shown explicitly in FIG. 7. Hence, the result tag 740 is already available when the execution starts. The result data 780 is available after execution. In the case of a prior art IBM S/390 processor the execution takes 2 cycles, leading to the two cycle delay between "tag valid" and "data valid" as shown in FIG. 7. The valid bit 730 is set to ON when the associated src1 data (or src2 data, respectively) 720 has become available, and it corresponds to reference sign 630, see back to FIG. 6. The tag field 740 in FIG. 7 corresponds to the tag field 625 of FIG. 6. In pipeline stage 760 the result tag is directly valid, since it is directly supplied by the RS, and the first part of the execution of the instruction is done by "exe 1". Next, the second part of the instruction execution is done in stage 770 by the "exe 2" stage, producing the result data at the end of the cycle. This result data is then valid during stage 780.
In case this IEU protocol is supported by the ROB and RS, and the pipeline length is adjusted such that write after read conflicts no longer occur, then the pipeline shown in FIG. 8 results, having stages 810 to 895. In the bottom part of the figure, the points in time or cycle relationships are given in relation to the pipeline stages in the upper part of the figure.
The event "write RS tag" is depicted with reference sign 830, and in this stage the tag for each source register, as read from the ROB, is written into the RS entry for the instruction. This RS tag can be used for comparison with the result tag from an IEU one cycle later. It should be noted that for the event "result tag valid", as depicted with reference sign 835, in a cycle k the tag will not yet be available for compare (it is written into the RS), and therefore it is not recognized that result data 855 of the IEU that corresponds to the result tag 835 has to be written into the RS source data entry in case of a match between the source tag and the result data tag.
Hence the data 855 will only be written into its ROB entry and not into the source data field of the renamed instruction for which the tag was just written into the RS when the result tag 835 was valid. In this longer pipeline, the occurrence of a write after read conflict is prevented by simply performing the transfer of result data from ROB to the RS after the result data 855 has been written into the ROB. This write is done in stage 850, so when reading the result data in the following stage 860 from the ROB and writing it in 880 into the RS the consistency between the ROB and RS data is preserved and the write after read conflict is prevented at the cost of a much longer pipeline as compared to FIG. 5 in which the write after read conflicts may occur.
A similar situation occurs for the valid bit of the source data. The valid bit for source data in the RS is set when a match between the source tag and result tag is found. During "result tag valid" 835 the RS tag for the source is written and therefore still not set to "undefined" during the compare of the result tag by the RS. Hence, in stage 830 the valid bit will be set only in the ROB, based on the "result tag valid" 835.
The setting of the valid bit to ON for the RS source data field is done without conflicts by delaying the read of the valid bit from the ROB 840 ("read ROB V Bit") until the cycle directly after 830, and writing the valid bit into the RS in stage 850. In other words, the consistency between the ROB data and RS data is preserved, again at the cost of a longer pipeline. Such a longer pipeline is very costly from a performance perspective.
In particular, the pipeline depicted in FIG. 8 starts with the renaming cycle 810, "ren". In the next cycle 820, however, only the tag 625, see FIG. 6, is read from the ROB entry, and it is written in the next cycle 830 into the tag field 645 of the RS 420.
When the IEU returns its data in a cycle k as depicted in FIG. 8, the tag is just being written into the RS. As mentioned before, if, due to the short cycle time, it is not possible to compare the tag after the write with the "res tag" 715 of the IEU in the same cycle, then the valid bit 730 will not be set for the source operand in the RS, since the setting of the valid bit is triggered by the match of the tag of the operand with the tag(s) returned by the IEU(s).
To set the valid bit for the source operand in the reservation station, the valid bit is read from the ROB in the next cycle k+1, stage 840, and then written into the RS during stage 850. This setting of the valid bit in the RS could of course also be implemented by adding another tag compare that compares the delayed tag. However, this is very costly in terms of chip area.
The matching tag for a source operand in the RS also triggers the write of data, and therefore the data will also not be written into the RS for the IEU cycle k case. Therefore, the pipeline must wait until the data is written into the ROB and then read the data from there in the "read ROB data" and "write ROB data" cycle. So this solution leads to a very long pipeline between the rename of the instruction and the start of the execution in the "exe 1" cycle.
The pipeline could be shortened by using techniques such as snooping in the ROB as well as in the RS. This, however, could be done only at the cost of frequency, as mentioned before.
It is thus an object of the present invention to reduce the pipeline length despite the conflict situations described above.
This object is achieved by the features stated in the enclosed independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims.
A primary aspect of the present invention involves a method for operating an out of order processor which comprises the steps of: processing said pipeline in a compressed way; providing a separate logic for detecting a dependency conflict associated with an instruction currently to be renamed; setting a conflict flag reflecting the detection result; and continuing the processing dependent on the conflict flag.
Various other objects, features, and attendant advantages of the present invention will become more fully appreciated as the same becomes better understood when considered in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the several views.