1. Field of the Invention
The present invention relates to a parallel processing apparatus, and particularly to an improvement for efficiently supplying instructions in processors of a superscalar type.
2. Description of the Background Art
In recent years, microprocessors have advanced remarkably, particularly in performance and operation speed. However, the speed of semiconductor memories has not increased enough to follow the increased operation speed of the microprocessors, so that access to the semiconductor memories forms a bottleneck against increasing the speed of the processors. Therefore, parallel processing has been employed for improving the performance of the microprocessors.
As a method for achieving such parallel processing, there is a processing method called superscalar processing. A processor of this superscalar type (hereinafter merely called a "superscalar") has a construction, as shown in FIG. 1, in which a scheduler 200 in the superscalar detects parallelism in an instruction stream and supplies instructions which can be processed in parallel to parallel pipelines P1, P2 and P3. That is, the superscalar is a processing device having the following features.
(1) It simultaneously fetches a plurality of instructions.
(2) It has a plurality of function units (pipelines) and can simultaneously execute a plurality of instructions.
(3) It finds simultaneously executable instructions among the plurality of fetched instructions, and simultaneously dispatches these instructions to the related function units (pipelines).
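The three features listed above can be sketched in software. The following is a minimal illustration only, not the hardware described here; the instruction tuples and the MACHINE_TYPE table mapping mnemonics to function units are hypothetical assumptions.

```python
# Minimal sketch of superscalar dispatch: several instructions are fetched
# at once, and each is dispatched, in order, to the function unit of its
# machine type until two instructions compete for the same unit.

# Hypothetical machine types (mnemonic -> function unit).
MACHINE_TYPE = {"add": "integer", "sub": "integer",
                "load": "memory", "fadd": "float"}

def dispatch(window):
    """Dispatch in order; stop when a function unit is already taken."""
    busy, issued = set(), []
    for op, *_regs in window:
        unit = MACHINE_TYPE[op]
        if unit in busy:          # structural hazard: unit already in use
            break                 # later instructions wait for the next cycle
        busy.add(unit)
        issued.append((op, unit))
    return issued

window = [("add", "R1", "R2", "R3"),
          ("load", "R4", "R2"),
          ("sub", "R5", "R6", "R7")]   # second integer op: must wait
print(dispatch(window))   # -> [('add', 'integer'), ('load', 'memory')]
```

Real hardware performs the same selection combinationally within one clock cycle; the sequential loop here is only for illustration.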
FIG. 2 shows a general construction of a superscalar. In FIG. 2, the superscalar includes a plurality of function units 4, 5, 6 and 7 which perform predetermined functions, respectively, as well as an instruction memory 1 for storing instructions, an instruction fetching (IF) stage 2 for simultaneously fetching a plurality of instructions from the instruction memory 1 and an instruction decoding (ID) stage 3 which receives the instructions fetched by the instruction fetching stage 2. The instruction decoding stage 3 finds simultaneously executable instructions among these instructions and dispatches them to the corresponding function units. The superscalar further includes a data memory 8 for storing results of processing and the like. The instruction memory 1 includes a cache memory and an external memory, and can perform high-speed reading of the instructions in the case of a cache hit, i.e., if the required instructions exist in the cache memory.
The instruction fetching stage 2 supplies an instruction pointer (IF_PC) to the instruction memory 1 to fetch the instructions corresponding to this instruction pointer IF_PC from the instruction memory 1.
The instruction decoding stage 3 includes an instruction decoder and a pipeline sequencer. The instruction decoder receives the fetched instructions from the instruction fetching stage 2 and decodes them. The pipeline sequencer (instruction scheduler) identifies, for example, the machine types of the decoded instructions and simultaneously dispatches the instructions of different machine types to the corresponding function units. The machine type is information representing the function unit by which a particular instruction is processed.
Each of the function units 4-7 has a pipeline configuration, and executes an applied instruction in response to clock signals. In the example shown in FIG. 2, four function units are provided, and parallel processing of up to 4 instructions can be performed. The function units 4 and 5 are integer operational units for performing integer addition and other integer operations, and include executing stages (EX) for executing the integer arithmetic and writing stages (WB) for writing results of the executed processing into data registers.
The function unit 6 is provided for executing access (i.e., loading or storing of data) to the data memory 8, and includes an address forming stage (ADR), an access executing stage (MEM) for the data memory 8 and a writing stage (WB) for writing the data in a data register (not shown). In the writing stage (WB), loading of the data is performed.
The function unit 7 is provided for executing, e.g., a floating-point addition, and includes three executing stages (EX1, EX2 and EX3) and a writing stage (WB) for writing the execution result in a data register. A floating-point number is a number expressed by an exponent and a mantissa, in which the position of the point is not fixed. The execution of a floating-point arithmetic operation requires more cycles than the integer arithmetic and other operations.
In this parallel processing apparatus, each stage has a pipeline configuration, and the operation periods for instruction fetching, instruction decoding, instruction executing and data writing overlap each other. Therefore, an instruction fetched from the instruction memory will be decoded by the instruction decoding stage in the next cycle. An operation will be briefly described below.
The instruction decoding stage 3 supplies an instruction fetch request to the instruction fetching stage 2. The instruction fetching stage 2 supplies the instruction pointer IF_PC to the instruction memory 1 in response to this instruction fetch request and fetches a plurality of instructions corresponding to the instruction pointer IF_PC from the instruction memory 1. These fetched instructions are simultaneously supplied to the instruction decoding stage 3, which in turn simultaneously decodes them. Among the decoded instructions, the instruction decoding stage 3 detects the instructions which do not compete for calculation resources or data registers and thus allow parallel processing, and issues or dispatches these instructions to the corresponding function units, respectively.
The function units supplied with the instructions execute the parallel processing in accordance with the instructions. The processing in the function units 4-7 is executed in a pipeline manner, sequentially through the respective stages shown in FIG. 2. Operations of the instruction fetching stage 2, the instruction decoding stage 3 and the instruction executing stage (function units 4-7) are performed in the pipeline manner and overlap each other when predetermined operations are performed.
The operations of the respective stages in the pipeline manner and the parallel processing by the function units, which are described above, enable high-speed execution of the instructions.
Processors having parallel processing capability are disclosed, for example, in (1) "The i960CA Superscalar Implementation of the 80960 Architecture", by S. McGeady, Proceedings of 35th COMPCON, IEEE 1990, pp. 232-240, and (2) "An IBM Second Generation RISC Processor Architecture", by R. D. Groves, Proceedings of 35th COMPCON, IEEE 1990, pp. 162-170. The above prior art (1) discloses a processor which simultaneously fetches four instructions and can simultaneously execute three instructions by function units of the REG, MEM and CTRL types.
The prior art (2) discloses a RISC processor which simultaneously fetches four instructions. This RISC processor includes a floating-point processor, fixed point processor, branch processor and control unit, and can simultaneously execute the four instructions.
As described above, in the superscalar, a plurality of instructions are fetched and a plurality of instructions are simultaneously executed, so that the processing speed can be increased as compared with ordinary computers. For example, in the construction shown in FIG. 2, when the four simultaneously fetched instructions are executed in parallel by the four function units 4-7, the four instructions can be processed in 4 clock cycles (in the case that the pipelines of the function units 4, 5 and 6 wait until termination of the processing of the function unit 7).
Although the instruction scheduler (the pipeline sequencer included in the instruction decoding stage) schedules the instructions for efficient parallel processing, the instructions which are simultaneously fetched may not be simultaneously executable. As an example, instructions having the following data dependency will be reviewed.
(1) add R1, R2, R3; R2+R3=R1
(2) sub R4, R1, R5; R1-R5=R4
The above instruction (1) serves to add a content of the register R3 to a content of the register R2 and to write a result of the addition in the register R1. Here, the superscalar is a type of RISC processor and has a register file. Operations are executed using registers of the register file, and access to the memory (data memory) is performed only by load and store instructions.
The above instruction (2) serves to subtract a content of the register R5 from a content of the register R1 and to write a result of the subtraction in the register R4. The operation by these instructions (1) and (2) corresponds to the processing of, for example, (x+y-z).
When the instructions (1) and (2) are simultaneously fetched, they commonly use the register R1, and the instruction (2) uses the result of the execution of the instruction (1), so that these instructions cannot be simultaneously executed. If there is such data dependency between instructions, the conditions of issue of the instructions (i.e., the form of issue of the instructions from the instruction decoding stage to the function units) can be as shown in FIG. 3.
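The dependency test just described can be sketched as follows. The tuple encoding (destination, source 1, source 2) is a hypothetical assumption; actual hardware would perform the comparison with register-number comparators rather than software.

```python
def depends(first, second):
    """True if `second` reads the destination of `first` (read-after-write)
    or overwrites it (write-after-write), so the pair cannot issue together."""
    dest1 = first[0]
    dest2, src1, src2 = second
    return dest1 in (src1, src2) or dest1 == dest2

add_insn = ("R1", "R2", "R3")   # add R1, R2, R3 ; R1 = R2 + R3
sub_insn = ("R4", "R1", "R5")   # sub R4, R1, R5 ; R4 = R1 - R5
print(depends(add_insn, sub_insn))   # -> True: hazard on R1
```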
FIG. 3 shows conditions in which only the instructions allowing simultaneous processing are issued. The simultaneously issued instructions are determined in an order from a smaller address to a larger address (from left to right in the Figure).
In FIG. 3, numerals encircled by squares represent instructions which can be issued without mutual data dependency.
In a cycle 1 shown in FIG. 3, the instructions 2, 3 and 4 have data dependency on each other or on the instruction 1. Therefore, the instructions 2, 3 and 4 cannot be issued, and only the instruction 1 is issued.
In a cycle 2, the instruction 4 has the data dependency on the instruction 2 and/or 3, and the instructions 2 and 3 have no mutual data dependency, so that the instructions 2 and 3 are issued.
In a cycle 3, the remaining instruction 4 is issued.
In a cycle 4, four instructions 5-8 which are newly fetched are decoded, and the instructions 5 and 6 having no mutual data dependency are issued.
In a cycle 5, the instructions 7 and 8 do not have mutual data dependency, and thus are issued. In the scheme shown in FIG. 3, fetching of new instructions is delayed until all of the simultaneously fetched instructions 1-4 are issued, so that five cycles are required to issue all of the instructions 1-8. Therefore, in such an instruction supplying and issuing method, empty slots are formed in the pipelines, and the high-speed processing efficiency of the parallel processing apparatus is impaired.
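The five-cycle issue pattern of FIG. 3 can be reproduced by a small simulation. The dependency table below is an assumption chosen to match the cycles described above; the policy is that a new group of four is fetched only after the previous group has issued completely, and in-order issue stops at the first instruction whose dependencies are not yet satisfied.

```python
def run(program, deps, width=4):
    """Issue schedule under the FIG. 3 policy: refetch only when the
    current group is empty; an instruction issues only if every
    instruction it depends on issued in an earlier cycle."""
    done, schedule = set(), []
    group, pos = [], 0
    while pos < len(program) or group:
        if not group:                       # refetch a full group of `width`
            group, pos = program[pos:pos + width], pos + width
        packet = []
        for insn in group:                  # in-order: stop at first stall
            if deps.get(insn, set()) <= done:
                packet.append(insn)
            else:
                break
        schedule.append(packet)
        done |= set(packet)
        group = group[len(packet):]
    return schedule

# Assumed dependencies reproducing FIG. 3: 2 and 3 need 1, 4 needs 2, 7 needs 5.
DEPS = {2: {1}, 3: {1}, 4: {2}, 7: {5}}
print(run(list(range(1, 9)), DEPS))   # -> [[1], [2, 3], [4], [5, 6], [7, 8]]
```

Five cycles are needed for the eight instructions, matching the conditions of FIG. 3.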
Even in this case, if there were no data dependency between the instruction 4 and the instructions 5 and 6 in FIG. 3, and a given number of instructions could be fetched from the instruction memory, an instruction issue scheme shown in FIG. 4 would be allowed. FIG. 4 shows instruction issue conditions in an improved issue scheme.
Referring to FIG. 4, the instruction 1 among the simultaneously fetched four instructions 1-4 is issued in the cycle 1.
In the cycle 2, a new instruction 5 is supplied, and the instructions 2-5 are decoded. Among the four instructions 2-5, the instructions 2 and 3 having no dependency are issued.
In the cycle 3, new instructions 6 and 7 are supplied, and the instructions 4-7 are decoded. In accordance with a result of decoding, instructions 4, 5 and 6 are issued.
In the cycle 4, new instructions 8, 9 and 10 are supplied and the instructions 7-10 are decoded. In accordance with the result of this decoding, the instructions 7 and 8 are issued.
In the instruction issuing scheme shown in FIG. 4, only four cycles are required for issuing the eight instructions 1-8, so that processing can be executed at a higher speed than in the scheme shown in FIG. 3. As one of the methods for achieving the instruction issue scheme shown in FIG. 4, a method shown in FIG. 5 can be contemplated. FIG. 5 shows procedures for supplying instructions by which the instruction issue conditions shown in FIG. 4 can be achieved.
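The improved scheme of FIG. 4 differs from that of FIG. 3 only in that the decode window is topped up every cycle with as many new instructions as were issued. The sketch below makes that single change; the dependency table is again an assumption chosen to reproduce the cycles described above.

```python
def run_sliding(program, deps, width=4):
    """Issue schedule under the FIG. 4 policy: after each cycle the decode
    window is refilled to `width`, so issue never waits for a whole group."""
    done, schedule = set(), []
    window, pos = list(program[:width]), width
    while window:
        packet = []
        for insn in window:                 # in-order: stop at first stall
            if deps.get(insn, set()) <= done:
                packet.append(insn)
            else:
                break
        schedule.append(packet)
        done |= set(packet)
        # drop issued instructions and top the window up from memory
        window = window[len(packet):] + list(program[pos:pos + len(packet)])
        pos += len(packet)
    return schedule

# Assumed dependencies reproducing FIG. 4: 2 needs 1, 4 needs 2, 7 needs 6, 9 needs 8.
DEPS = {2: {1}, 4: {2}, 7: {6}, 9: {8}}
print(run_sliding(list(range(1, 11)), DEPS)[:4])   # -> [[1], [2, 3], [4, 5, 6], [7, 8]]
```

Under these assumed dependencies, the instructions 1-8 now issue in four cycles instead of five.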
In a step (1) shown in FIG. 5, when the instructions 2 and 3 among the instructions 2-5 held by an instruction register, which is employed for holding the instructions, are issued, empty slots are formed in the instruction register.
In a step (2), the content of the instruction register is shifted by a number corresponding to the number of the empty registers. That is, in the step (2) in FIG. 5, the register positions of the instructions 4 and 5 are each shifted leftward by two.
In a step (3), subsequent instructions 6 and 7 are fetched into these empty instruction registers. The steps (1)-(3) shown in FIG. 5 must be executed in one cycle. A construction shown in FIG. 6 may be contemplated for performing the instruction shifting operation shown in FIG. 5.
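The three steps of FIG. 5 amount to a shift-and-refill of the instruction register within one cycle, which can be sketched as follows; the function and the use of None for an unfillable slot are illustrative assumptions, not the hardware itself.

```python
def refill(registers, issued_count, next_insns):
    """Steps (1)-(3): the first `issued_count` instructions have issued,
    the remainder shift leftward, and the freed slots are filled from the
    subsequently fetched instructions."""
    width = len(registers)
    remaining = registers[issued_count:]          # step (2): shift left
    fetched = next_insns[:issued_count]           # step (3): refill
    return remaining + fetched + [None] * (width - len(remaining) - len(fetched))

# Instructions 2 and 3 issue out of [2, 3, 4, 5]; 6 and 7 are fetched in.
print(refill([2, 3, 4, 5], 2, [6, 7]))   # -> [4, 5, 6, 7]
```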
In FIG. 6, an instruction decoding stage includes instruction registers IR1-IR8 for storing the instructions, a barrel shifter BR for shifting the applied instructions and an instruction decoder ID. The instruction registers IR1-IR4 store the instructions fetched by the instruction fetching stage 2 and the instructions shifted by the barrel shifter. The instruction registers IR5-IR8 hold the instructions applied to the instruction decoder ID without decoding them. The barrel shifter BR shifts the instructions from the instruction registers IR5-IR8 in accordance with issued instruction count information from the instruction decoder ID.
Usually, the parallel processing apparatus operates in accordance with two-phase non-overlapping clock signals T and L. An operation will be briefly described below.
Instructions which were fetched in the last cycle are held in the instruction registers IR1-IR4 in response to the clock signal T. The instructions held in the instruction registers IR1-IR4 are applied to and decoded by the instruction decoder ID, and are issued to the function units in accordance with a decoded result.
The instruction registers IR5-IR8 are responsive to the clock signal L to hold the instructions applied to the instruction decoder ID and apply the same to the barrel shifter BR. The barrel shifter BR responds to the issued instruction count information from the instruction decoder ID by shifting the instructions applied from the instruction registers IR5-IR8. The contents of the barrel shifter BR are applied to and held by the instruction registers IR1-IR4 in response to the subsequent clock signal T.
In this case, the barrel shifter is required to complete the shifting operation for the instructions in the period between the clock signal L and the subsequent clock signal T. In the barrel shifter BR, however, the instruction length is long (e.g., 32 bits; the instruction length is fixed in a RISC), so that the instruction shifting operation cannot be performed at a high speed and requires a long period of time. Therefore, it is impossible to apply the instructions to the instruction registers IR1-IR4 to be held therein in response to the subsequent clock signal T, and thus it is impossible to supply the instructions to the instruction decoder ID at a high speed, which impairs the high-speed operability of the parallel processing apparatus.
The instruction memory is required to obtain information of the number of empty registers among the instruction registers IR1-IR4 and to determine the instruction supply number and the instruction supply positions based on this information. The information relating to the number of empty registers among the instruction registers IR1-IR4 is obtained from the issued instruction count information from the instruction decoder ID. However, this issued instruction count information is available only after the decoding operation by the instruction decoder ID. Therefore, it requires a long time for the instruction memory to determine the instruction supply number and the instruction supply positions, and the timing for starting the determining operation is delayed. Therefore, it requires a long period of time to fetch the desired instructions, and an increased period of the clock signals T and L is required for executing the instructions without disturbing the pipeline operation, which impairs the high-speed operability of the parallel processing apparatus.