The present invention relates to a method of controlling parallel processing at an instruction level and a processor for realizing the controlling method.
As data processing techniques have advanced, it has been proposed to improve the performance of a data processing computer by increasing the processing speed through parallel processing at the instruction level. Existing techniques relating to an instruction level parallel processing control method will be explained in the following with reference to the time charts shown in FIGS. 13-18. An example of a program used in this explanation is shown in FIG. 4. The program includes plural instructions (1e, 1o, 2e, 2o, 3e, 3o, 4e and 4o) executed in the order shown in FIG. 4. In this program, parallel processing is possible for the instructions 1e and 1o; impossible for 1o and 2e; impossible for 2e and 2o; possible for 2o and 3e; impossible for 3e and 3o; possible for 3o and 4e; and possible for 4e and 4o.
In a first existing technique, the possibility of parallel processing is judged at an instruction decoding stage. An operation time chart of the processing is shown in FIG. 13, wherein the abscissa indicates the lapse of time, and one division corresponds to one machine cycle. In the vertical direction, the processing stages of the hardware are indicated in order. More particularly, in the PC stage, instructions are stored into an instruction cache; in the IF stage, instructions are fetched from the instruction cache and stored into an instruction buffer; in the D stage, instructions are decoded and issued; and in the E stage, instructions, such as numerical calculations, are executed by execution units. In the figure, a circle indicates a unit of instructions fetched in a cycle at the stages PC and IF, and a unit of instructions issued in a cycle at the stages D and E. In the following explanation of the operation time chart, at most two instructions are fetched, and at most two instructions are issued, in each cycle. Parallel processing of more than two instructions is carried out in a like manner.
The fetched instruction unit 1 consists of the instructions 1e and 1o. And, as time passes in the order of machine cycles 301, 302, and so on, the processing proceeds in the order of the stages PC, IF, and so on. Then, at the D stage in the machine cycle 303, as to the instructions 1e and 1o, the possibility of parallel processing is judged. Since the parallel processing of the instructions is possible, both the instructions 1e and 1o are issued. Further, in the machine cycle 304, the fetched instruction unit 2 goes into the stage D, and the parallel processing of the instructions 2e and 2o is judged to be impossible. Then, only the instruction 2e is issued. In the figure, a hatched instruction indicates an instruction not to be issued. Further, in the machine cycle 305, the fetched instruction unit 3 goes into the stage D, and the parallel processing of the instructions 2o and 3e is judged to be possible. Then, both the instructions are issued. And, in the machine cycle 306, the fetched instruction unit 4 goes into the stage D, and the parallel processing of the instructions 3o and 4e is judged to be possible. Thus, both instructions are issued.
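The issue sequence described above can be reproduced with a minimal sketch. The program order and the adjacent-pair results are taken from the description of FIG. 4; the function name `issue_at_decode`, the table `CAN_PAIR`, and the greedy two-wide in-order issue model are illustrative assumptions and are not taken from the figures.

```python
# Program order from FIG. 4 and the adjacent-pair results stated in the text.
PROGRAM = ["1e", "1o", "2e", "2o", "3e", "3o", "4e", "4o"]
CAN_PAIR = {
    ("1e", "1o"): True,
    ("1o", "2e"): False,
    ("2e", "2o"): False,
    ("2o", "3e"): True,
    ("3e", "3o"): False,
    ("3o", "4e"): True,
    ("4e", "4o"): True,
}


def issue_at_decode(program, can_pair, first_cycle=303):
    """Return {cycle: [issued instructions]} for a two-wide, in-order issue.

    Assumed model: in each cycle the oldest unissued instruction is issued,
    and the next instruction is issued with it only if the pair is judged
    to be processable in parallel.
    """
    schedule = {}
    i, cycle = 0, first_cycle
    while i < len(program):
        group = [program[i]]
        if i + 1 < len(program) and can_pair[(program[i], program[i + 1])]:
            group.append(program[i + 1])
        schedule[cycle] = group
        i += len(group)
        cycle += 1
    return schedule


schedule = issue_at_decode(PROGRAM, CAN_PAIR)
for cycle, group in schedule.items():
    print(cycle, group)
```

Under these assumptions the sketch issues 1e and 1o in the machine cycle 303, 2e alone in 304, 2o and 3e in 305, and 3o and 4e in 306, which agrees with the sequence described for FIG. 13. The remaining instruction 4o falls in the following cycle, which lies beyond the cycles described above.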
In a second existing technique, a stage of judging the possibility of parallel processing is inserted between the instruction fetching stage IF and the instruction decoding stage D, and an operation time chart of the processing is shown in FIG. 14.
As shown in the figure, the D1 stage of judging the possibility of the parallel processing is added, and the D2 stage becomes the decoding stage. In the machine cycle 403, the fetched instruction unit 1 goes into the stage D1, and the parallel processing of the instructions 1e and 1o is judged to be possible. Thus, both the instructions 1e and 1o are issued. Then, in the machine cycle 404, the fetched instruction unit 2 goes into the stage D1, and the parallel processing of the instructions 2e and 2o is judged to be impossible. Thus, only the instruction 2e is issued. In the same manner, in the machine cycles 405 and 406, the parallel processing of the pair of instructions 2o and 3e and of the pair of instructions 3o and 4e, respectively, is judged to be possible. Thus, each pair of instructions is issued.
In a third existing technique, an instruction buffering stage, which holds fetched instructions until they are issued, is incorporated into the first existing technique. At the decoding stage, the instructions stored in the instruction buffer are checked for conflicts with each other in order to judge the possibility of parallel processing. The operation time chart is shown in FIG. 15.
In the figure, IBR indicates the stage of fetching and storing instructions into the instruction buffer. As shown in the figure, in the machine cycle 1103, the fetched instruction unit 1 goes into the stage D, and the parallel processing of the instructions 1e and 1o is judged to be possible. Thus, the instructions 1e and 1o are issued. Further, in the machine cycle 1104, the fetched instruction unit 2 goes into the stage D, and the parallel processing of the instructions 2e and 2o is judged to be impossible. Thus, only the instruction 2e is issued. Similar operations follow.
In a fourth existing technique, an instruction buffering stage, which holds fetched instructions until they are issued, is incorporated into the second existing technique. A stage for judging the possibility of parallel processing of the instructions stored in the instruction buffer is inserted between the instruction buffering stage IBR and the instruction decoding stage D. The operation time chart is shown in FIG. 16.
As shown in the figure, the D1 stage of judging the possibility of parallel processing is added, and the D2 stage becomes the decoding stage. In the machine cycle 1203, the fetched instruction unit 1 goes into the stage D1, and the parallel processing of the instructions 1e and 1o is judged to be possible. Thus, both the instructions 1e and 1o are issued. Then, in the machine cycle 1204, the fetched instruction unit 2 goes into the stage D1, and the parallel processing of the instructions 2e and 2o is judged to be impossible. Thus, only the instruction 2e is issued. Similar operations follow.
The first to fourth existing techniques adopt the method of judging the possibility of parallel processing after the instructions to be judged are issued from the instruction register unit. And, after the judgement of the possibility of parallel processing, execution of the issued instructions is started.
A fifth existing technique, described in JP-A-130634/1990 and JP-A-214785/1994, checks the fetched instructions for conflicts with each other, in order to judge the possibility of parallel processing of the instructions, before they are written into the instruction cache. That is, the technique determines the possibility of parallel processing of the instructions to be written into the instruction cache and stores the judgement results. Then, when instructions are read out from the instruction cache, the judgement results are read out simultaneously, and the instruction level parallel processing is executed by using the judgement results.
An operation time chart of the fifth existing method is shown in FIG. 18. As shown in the figure, in the machine cycle 502, the fetched instruction unit 1 is read out from the instruction cache, and the judgement result that the parallel processing is possible as to the instructions 1e and 1o is also read out at the same time. Then, in the machine cycle 503, the instructions 1e and 1o are issued together. And, in the machine cycle 503, the fetched instruction unit 2 is read out from the instruction cache, and the judgement result that the parallel processing is impossible as to the instructions 2e and 2o is also read out at the same time. And, only the instruction 2e is issued in the machine cycle 504. Then, in the machine cycle 505, only the instruction 2o is issued.
In the above-mentioned processing, the possibility of parallel processing of the instructions 2o and 3e is not judged. Therefore, the instructions 2o and 3e are never issued together. In the machine cycle 505, the fetched instruction unit 3 is read out from the instruction cache, and the judgement result that the parallel processing is impossible as to the instructions 3e and 3o is also read out at the same time. Thus, only the instruction 3e is issued in the machine cycle 506. Then, in the machine cycle 507, only the instruction 3o is issued.
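The behaviour of the fifth existing technique can be sketched in the same way. The program order and judgement results are those given for FIG. 4; the function names `fill_cache` and `issue_from_cache`, the single stored flag per fetched unit, and the unit width of two instructions are assumptions made only for illustration.

```python
# Program order from FIG. 4; only pairs inside one fetched unit are ever judged.
PROGRAM = ["1e", "1o", "2e", "2o", "3e", "3o", "4e", "4o"]
CAN_PAIR = {
    ("1e", "1o"): True,
    ("2e", "2o"): False,
    ("3e", "3o"): False,
    ("4e", "4o"): True,
}


def fill_cache(program, can_pair, width=2):
    """Judge each fetched unit before it is written, and store the result.

    Returns a list of (unit, flag) entries; the flag is the stored judgement
    result for the instructions of that unit (hypothetical layout).
    """
    lines = []
    for k in range(0, len(program), width):
        unit = program[k:k + width]
        flag = len(unit) == 2 and can_pair[(unit[0], unit[1])]
        lines.append((unit, flag))
    return lines


def issue_from_cache(lines, first_cycle=503):
    """Issue using only the stored flags; no pairing across units."""
    schedule = {}
    cycle = first_cycle
    for unit, flag in lines:
        if flag:
            schedule[cycle] = list(unit)
            cycle += 1
        else:
            for instruction in unit:
                schedule[cycle] = [instruction]
                cycle += 1
    return schedule


schedule = issue_from_cache(fill_cache(PROGRAM, CAN_PAIR))
for cycle, group in schedule.items():
    print(cycle, group)
```

Because the pair 2o and 3e spans two units, it never appears in the stored results, so the sketch issues 1e and 1o in the machine cycle 503 and then 2e, 2o, 3e and 3o one at a time in the machine cycles 504 to 507, as described for FIG. 18.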
However, the above-mentioned existing techniques have the following problems.
The first and third existing techniques have the problem that one machine cycle becomes long and the operating frequency decreases, because a series of processing steps must be completed within one machine cycle: after instructions are set in the instruction register unit, the possibility of parallel processing is determined, and only then are the instructions decoded and executed.
In the second and fourth existing techniques, since an exclusive stage for determining the possibility of parallel processing is provided, the parallel processing is executed without any decrease of the operating frequency. However, the execution speed of a branch instruction decreases due to the addition of the exclusive stage, and the penalty involved in executing the branch instruction increases, which deteriorates the performance of the processing.
The penalty is explained with reference to FIG. 17. Assume that the instruction 1e is a branch instruction. After the instruction 1e is executed and the destination address is determined, the instruction at the destination address is fetched. As shown in the figure, a penalty of three cycles is therefore incurred between execution of the instruction 1e and execution of the instruction at the destination address.
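One way to account for the three-cycle penalty is the following sketch. It assumes, as a simplification not confirmed by FIG. 17, that the instruction at the destination address enters the fetch stage in the cycle immediately after the branch instruction leaves the E stage, and that it then passes through one stage per cycle. The function name `branch_penalty` and the stage lists are hypothetical.

```python
def branch_penalty(stages_after_redirect):
    """Idle cycles between the branch's E stage and the target's E stage.

    stages_after_redirect lists the stages the target instruction passes
    through, one per cycle, starting in the cycle right after the branch
    executes (an assumption) and ending with the E stage.
    """
    if not stages_after_redirect or stages_after_redirect[-1] != "E":
        raise ValueError("the stage list must end with the E stage")
    # The target executes len(stages) cycles after the branch; every cycle
    # before that one is lost.
    return len(stages_after_redirect) - 1


# Pipeline with the added judging stage D1 (second existing technique).
penalty_with_d1 = branch_penalty(["IF", "D1", "D2", "E"])
# Pipeline without the added stage, for comparison under the same assumption.
penalty_without_d1 = branch_penalty(["IF", "D", "E"])
print(penalty_with_d1, penalty_without_d1)
```

Under this assumption the pipeline with the added D1 stage loses three cycles, matching the penalty stated above, while the same model without the added stage loses one cycle fewer. The latter figure is derived from the model only and is not stated in the description.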
The fifth existing technique avoids the above-mentioned problems of the first to fourth existing techniques: because the possibility of parallel processing is judged before instructions are written into the instruction cache, the operating frequency does not decrease, and no exclusive stage for the judging is added. However, the fifth existing technique has the problem that the possibility of parallel processing is judged only for instructions in the same line of the instruction cache. That is, since the possibility of parallel processing is not judged for instructions in different lines of the instruction cache, parallel processing is executed in fewer cases than in the other existing techniques, and the processing speed becomes lower. For example, in the first to fourth existing techniques shown in FIGS. 13-16, it takes four machine cycles to execute the instructions 1e to 3o. On the other hand, in the fifth existing technique, it takes five machine cycles to execute the instructions 1e to 3o, as shown in FIG. 18.