The present invention relates to CPUs, such as those of minicomputers or microcomputers, and in particular to a parallel processing apparatus and a parallel processing method suitable for high-speed operation.
Various contrivances have hitherto been made to attain higher-speed operation of computers. One of the representative techniques is the "pipeline" technique. In the pipeline technique, rather than starting the next instruction only after processing of the preceding instruction has been completed, each instruction is divided into a plurality of stages. When the first instruction comes to its second stage, processing of the first stage of the next instruction is started. Processing is thus performed in a bucket-brigade manner. Such a method is discussed in detail in Shinji Tomita, "Parallel Computer Structure Review", Shokodo, pp. 25-68. If an n-stage pipeline scheme is used, one instruction is processed at each pipeline stage. As a whole, however, n instructions can be processed simultaneously. Processing of one instruction can be finished every pipeline pitch.
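The throughput stated above can be illustrated by a short calculation. The following Python sketch is illustrative only and assumes an ideal pipeline in which every stage takes exactly one pipeline pitch and no waiting cycles occur; the function names are hypothetical and do not appear in the cited reference.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Pitches needed by an ideal n-stage pipeline with no waiting cycles.

    The first instruction needs num_stages pitches to pass through the
    pipeline; each later instruction finishes one pitch after its predecessor.
    """
    if num_instructions <= 0:
        return 0
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Pitches needed when each instruction starts only after the previous one ends."""
    return max(num_instructions, 0) * num_stages


if __name__ == "__main__":
    for count in (1, 10, 100):
        print(count, unpipelined_cycles(count, 6), pipelined_cycles(count, 6))
```

Under these assumptions, the pipelined count approaches one pitch per instruction as the number of instructions grows, which corresponds to one instruction being finished every pipeline pitch.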
It is well known that the instruction architecture of a computer has a great influence on its processing scheme and processing performance. From the viewpoint of instruction architecture, computers can be classified into CISCs (complex instruction set computers) and RISCs (reduced instruction set computers). In the CISCs, complex instructions are processed by using microinstructions. In the RISCs, by contrast, instructions are narrowed down to simple ones, and higher speed is sought with control using hard-wired logic without using microinstructions. An outline of hardware and pipeline operation of both CISCs and RISCs of the prior art will hereafter be described.
FIG. 2 shows a typical configuration of a computer of CISC type. Numeral 200 denotes a memory interface, 201 a program counter (PC), 202 an instruction cache, 203 an instruction register, 204 an instruction decoder, 205 an address calculation control circuit, 206 a control storage (CS) for storing microinstructions therein, 207 a microinstruction counter, 208 a microinstruction register, 209 a decoder, 210 an MDR (memory data register) which is a register for transmitting/receiving data to/from a memory, 211 an MAR (memory address register) which is a register for indicating an operand address on the memory, 212 an address adder, 213 a register file, and 214 an ALU (arithmetic and logic unit).
An outline of the operation will now be described. An instruction indicated by the PC 201 is taken out from the instruction cache 202 and set into the instruction register 203 via a signal 217. The instruction decoder 204 receives the instruction via a signal 218 and sets the leading address of the microinstruction into the microinstruction counter 207 via a signal 220. In addition, the instruction decoder 204 informs the address calculation control circuit 205 of the address calculation method via a signal 219. The address calculation control circuit 205 performs register readout required for address calculation and control of the address adder 212. The contents of the register required for address calculation are transmitted from the register file 213 to the address adder 212 via buses 226 and 227. On the other hand, microinstructions are read out from the CS 206 every machine cycle, decoded by the decoder 209, and used to control the register file 213. Numeral 224 denotes these control signals. The ALU 214 performs arithmetic operations on data transmitted from registers through buses 228 and 229 and stores the result into the register file 213. The memory interface 200 is a circuit used for communication with the memory, such as instruction fetch and operand fetch.
Pipeline operation of the computer shown in FIG. 2 will now be described by referring to FIGS. 3, 4 and 5. The pipeline comprises six stages. At an IF (instruction fetch) stage, an instruction is read out from the instruction cache 202 and set into the instruction register 203. At a D (decode) stage, instruction decoding is performed by the instruction decoder 204. At an A (address) stage, operand address calculation is performed by the address adder 212. At an OF (operand fetch) stage, the operand at the address specified by the MAR 211 is fetched and set into the MDR 210. Subsequently, at an EX (execution) stage, data are read from the register file 213 and the MDR 210 and transmitted to the ALU 214 to undergo an arithmetic operation. Finally, at a W (write) stage, the result of the arithmetic operation is stored into one register included in the register file 213 through the bus 230.
FIG. 3 shows how add instructions ADD are consecutively processed. The add instruction ADD is one of the basic instructions. One instruction is processed every machine cycle. Both the ALU 214 and the address adder 212 operate in parallel every cycle.
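The overlapped flow of consecutive ADD instructions can be sketched as a stage-occupancy chart. The following Python sketch is illustrative only; it assumes that each of the six stages takes one machine cycle and that no waiting cycles occur, and the names `stage_chart` and `busy_stages` are hypothetical.

```python
CISC_STAGES = ["IF", "D", "A", "OF", "EX", "W"]


def stage_chart(num_instructions, stages=CISC_STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c ('' if none)."""
    total_cycles = len(stages) + num_instructions - 1
    chart = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        # Instruction i enters the first stage in cycle i and advances one stage per cycle.
        for s, name in enumerate(stages):
            row[i + s] = name
        chart.append(row)
    return chart


def busy_stages(chart, cycle):
    """Set of stages that hold an instruction in the given cycle."""
    return {row[cycle] for row in chart if row[cycle]}


if __name__ == "__main__":
    for row in stage_chart(6):
        print(" ".join(f"{cell:>2}" for cell in row))
```

Once the pipeline is full, both the A stage (address adder 212) and the EX stage (ALU 214) are occupied in the same cycle, which corresponds to the two units operating in parallel every cycle.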
FIG. 4 shows how a conditional branch instruction BRAcc is processed. A flag is generated by a TEST instruction. FIG. 4 shows the flow performed when the condition is satisfied. Since flag generation is performed at the EX stage, three waiting cycles are incurred until the instruction at the jump destination is fetched. As the number of pipeline stages is increased, these waiting cycles increase, which is an obstacle to performance enhancement.
FIG. 5 shows an execution flow of a complicated instruction. An instruction 1 is a complicated instruction. Complicated instructions are instructions having a large number of memory accesses, such as string copy. The complicated instruction is typically processed by extending the EX stage over a large number of cycles. The EX stage is controlled by a microinstruction. The microinstruction is accessed once every machine cycle. That is to say, the complicated instruction is processed by reading microinstructions out of the microprogram a plurality of times. At this time, only one instruction can occupy the EX stage, and hence a succeeding instruction (instruction 2 of FIG. 5) is made to wait. During this time, the ALU 214 operates continuously, but the address adder 212 has idle time.
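The waiting caused by a complicated instruction can be sketched with a simplified timing model. The following Python sketch is illustrative only and is not taken from FIG. 5; it assumes that each stage holds one instruction at a time, that every stage other than EX takes one cycle, and that an instruction advances only when the next stage is free. The EX length of four cycles used in the example is a hypothetical value.

```python
STAGES = ["IF", "D", "A", "OF", "EX", "W"]
EX = STAGES.index("EX")
LAST = len(STAGES) - 1


def schedule(ex_cycles):
    """Return enter[i][s]: the cycle in which instruction i enters stage s.

    ex_cycles[i] is the number of cycles instruction i occupies the EX stage.
    """
    enter = []
    for i, ex_len in enumerate(ex_cycles):
        dur = [1] * len(STAGES)
        dur[EX] = ex_len
        row = []
        for s in range(len(STAGES)):
            # Earliest cycle at which this instruction has finished the prior stage.
            ready = row[s - 1] + dur[s - 1] if s > 0 else 0
            freed = 0
            if i > 0:
                prev = enter[i - 1]
                # Stage s becomes free when the previous instruction moves on.
                freed = prev[s + 1] if s < LAST else prev[s] + 1
            row.append(max(ready, freed))
        enter.append(row)
    return enter


def ex_wait(enter, i):
    """Cycles instruction i waits between finishing OF and entering EX."""
    return enter[i][EX] - (enter[i][EX - 1] + 1)


if __name__ == "__main__":
    timing = schedule([4, 1])  # instruction 1 extends EX to 4 cycles
    print(timing, ex_wait(timing, 1))
```

In this model, a complicated instruction that occupies the EX stage for k cycles makes the succeeding instruction wait k - 1 cycles before it can enter the EX stage, during which the earlier stages, including the A stage, cannot accept new work.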
The RISC computer will now be described. FIG. 6 shows a typical configuration of a RISC computer. Numeral 601 denotes a memory interface, 602 a program counter, 603 an instruction cache, 604 a sequencer, 605 an instruction register, 606 a decoder, 607 a register file, 608 an ALU, 609 an MDR, and 610 an MAR.
FIG. 7 shows the processing flow of basic instructions. At the IF (instruction fetch) stage, an instruction specified by the program counter 602 is read out from the instruction cache 603 and set into the instruction register 605. On the basis of an instruction signal 615 and a flag signal 616 supplied from the ALU 608, the sequencer 604 controls the program counter 602. At the R (read) stage, the contents of the registers indicated by the instruction are transferred from the register file 607 to the ALU 608 through buses 618 and 619. At the E (execution) stage, an arithmetic operation is conducted by the ALU 608. Finally, at the W (write) stage, the result of the arithmetic operation is stored into the register file 607 through a bus 620.
In RISC computers, instructions are limited to only basic instructions. Arithmetic operations are limited to those between registers. The only instructions accompanied by operand fetch are a load instruction and a store instruction. Complicated instructions are implemented by combining basic instructions. Further, microinstructions are not used; instead, the contents of the instruction register 605 are directly decoded by the decoder 606 to control the ALU 608 and so on.
FIG. 7 shows the processing flow of arithmetic operations between registers. Since the instruction is simple, the pipeline comprises only four stages.
FIG. 8 shows the processing flow at the time of a conditional branch. Since the number of pipeline stages is smaller than that of a CISC computer, the number of waiting cycles is small. In the example shown in FIG. 8, the number of waiting cycles is only one. In addition, RISC computers generally use the delayed branch scheme to make effective use of this one waiting cycle as well. In this scheme, an ADD instruction succeeding the BRAcc instruction is executed during the waiting cycle as shown in FIG. 9. Since the compiler thus places a useful instruction immediately after the branch instruction, useless waiting cycles can be completely eliminated.
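The delayed branch scheme can be sketched with a small interpreter. The following Python sketch is illustrative only; the instruction names, the tuple encoding and the accumulator are hypothetical, the branch is modeled as always taken (condition satisfied), and a single delay slot is assumed.

```python
def run_delayed_branch(program, max_steps=100):
    """Execute a list of (op, arg) tuples with one branch delay slot.

    Supported ops: ("ADD", n) adds n to an accumulator, ("BRA", target)
    jumps to index target after the following instruction has executed,
    and ("HALT", None) stops execution.
    Returns the accumulator and the list of executed instruction indices.
    """
    acc, pc, pending_target = 0, 0, None
    trace = []
    for _ in range(max_steps):
        op, arg = program[pc]
        trace.append(pc)
        if op == "HALT":
            break
        next_pc = pc + 1
        if pending_target is not None:
            # This instruction sits in the delay slot; the jump takes effect after it.
            next_pc, pending_target = pending_target, None
        if op == "ADD":
            acc += arg
        elif op == "BRA":
            pending_target = arg
        pc = next_pc
    return acc, trace


if __name__ == "__main__":
    program = [
        ("ADD", 1),
        ("BRA", 4),     # taken branch
        ("ADD", 10),    # delay slot: executed during the waiting cycle
        ("ADD", 100),   # skipped by the branch
        ("HALT", None),
    ]
    print(run_delayed_branch(program))
```

In this sketch the instruction at index 2 is executed even though the branch at index 1 is taken, while the instruction at index 3 is skipped, so the cycle following the branch performs useful work.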
However, RISC computers capable of such efficient execution have the drawback that only one instruction can be executed in one machine cycle.
For recent RISC computers, therefore, a scheme as described in U.S. Pat. No. 4,766,566, "Performance Enhancement Scheme For A RISC Type VLSI Processor Using Dual Execution Units For Parallel Instruction Processing" has been devised. In that scheme, a plurality of arithmetic units sharing a register file are provided, and instructions are simplified to reduce the number of pipeline stages. In addition, a plurality of instructions are read out in one machine cycle to control the plurality of arithmetic units.
In actual RISC computers, however, instructions are processed one after another by using a single arithmetic unit. If a plurality of instructions are executed in parallel by using a plurality of arithmetic units, therefore, the same operation cannot be assured. In interrupt processing, for example, m instructions are processed simultaneously, so that an interrupt is accepted by taking m instructions as a unit, and an operation different from that of the successive processing of the prior art results. Further, software such as a debugger having a function of executing instructions one at a time cannot be used, which is another drawback.
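The difference in interrupt behavior can be sketched as follows. The following Python sketch is illustrative only; it assumes that instructions are grouped strictly in program order into units of m instructions and that an interrupt can be accepted only when a unit has completed. The function name is hypothetical.

```python
def interrupt_points(num_instructions, group_size):
    """Counts of completed instructions at which an interrupt could be accepted.

    With group_size == 1 (successive processing) every instruction boundary
    qualifies; with group_size == m only the boundaries between units of m
    instructions qualify.
    """
    points = []
    for done in range(group_size, num_instructions + group_size, group_size):
        points.append(min(done, num_instructions))
    return points


if __name__ == "__main__":
    print(interrupt_points(6, 1))  # successive processing
    print(interrupt_points(6, 2))  # two instructions per unit
```

Under these assumptions, a program observed after an interrupt may be in a state that successive processing could also reach, but the states between the instructions of one unit are never observable, which is why software that steps one instruction at a time behaves differently.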
On the other hand, a scheme which makes the above-described special software unusable but keeps most conventional software usable and allows high-speed execution is sufficiently useful. The most important matter in such a scheme is to solve the problem of how m instructions including a delayed branch instruction, described before with reference to FIG. 9, should be executed in parallel in order to obtain the same execution result as that obtained in the case of successive execution.