1. Field of the Invention
The present invention relates to a method for executing instructions through pipeline processing and to a device therefor. In this method, a subpath is provided that detours around a part of a main path for pipeline processing and takes over some of the processes in the main path. This invention has possible applications in, for example, microprocessors having a coprocessor.
2. Description of the Prior Art
A single-chip RISC (Reduced Instruction Set Computer) is a device for simultaneously realizing high processing performance, low power consumption, and a small mounting area, primarily in specific applications including image processing. Recently, a dedicated arithmetic logic circuit has often been provided in this type of microprocessor to further enhance the arithmetic performance.
One example of this type of microprocessor is the V851 produced by NEC, Ltd. According to NEC Technology Report (Vol. 48, No. 3/1995, pages 42-47), the V851 adopts a pipeline RISC architecture that includes, in addition to an ordinary ALU, a hardware multiplier unit called an MULU for high speed execution of multiplication instructions.
FIG. 1 shows the internal organization of the V851. This diagram is based on the schematic diagram on page 45 of the above document. As shown in the diagram, this microprocessor comprises an instruction memory 100 for storing instructions to be executed; an instruction fetch unit 101 for sequentially reading instructions; an instruction decoder 102 for decoding read instructions; a general-purpose register group 103 to be accessed based on a general register number identified from a decoded result; a first execution unit 106 for receiving one or two source operands read from a general-purpose register via buses 114 and 115 and executing general operations (hereinafter "general execution") according to the decoded result; a memory access unit 108 for reading data necessary for a process from data memory 107 according to the result of the operation execution or writing data of a processed result into data memory 107; and a general-purpose register write unit 109 for receiving data read from data memory 107 via bus 116 and writing them into a predetermined register in general-purpose register group 103.
The above units together constitute a main path for pipeline processing. FIG. 1 also shows a bus 112 leading from memory access unit 108 to the input side of first execution unit 106, and a bus 113 leading from the output side of first execution unit 106 to the input side thereof. These buses 112 and 113 are necessary for achieving "data forwarding" (described later).
In addition to the above, another path (hereinafter "subpath") is provided, detouring around a part of the main path. On this path, a second execution unit 110 (corresponding to MULU) is provided. The second execution unit 110, which is dedicated to multiplication operations, takes over the processes otherwise executed by the first execution unit 106 when a multiplication instruction is decoded. Pipeline processing flows from top to bottom along both the main path and the subpath in FIG. 1.
Pipeline processing is executed through process units each called a stage. Each stage is processed in a constant time period determined according to an operation clock of a microprocessor. Many microprocessors, including the V851, execute instructions by dividing them into five stages outlined below. Each stage is named as follows, but this nomenclature is only for convenience.
1. I stage
Instructions are fetched (read) by instruction fetch unit 101 (see FIG. 1).
2. R stage
Instructions are decoded by instruction decoder 102. A read operation from general-purpose registers is concurrently executed here.
3. A stage
First execution unit 106 (ALU) executes general operations. Multiplication operations are executed by second execution unit 110 (MULU). A memory address is generated for use in the following stage.
4. M stage
Memory access unit 108 accesses data memory 107.
5. W stage
General-purpose register write unit 109 writes data which has been read into a general-purpose register.
For the V851, each of the above stages is generally completed within one cycle (one clock), except for multiplication operations by MULU, which require two cycles to complete.
FIG. 2 shows a state of pipeline processing by a general microprocessor having the organization shown in FIG. 1. Four instructions, LD, LD, MUL, and ST, are necessary to write a product of two data stored in data memory 107 back into the memory 107. In this diagram, an arbitrary instruction XX follows the ST instruction. With the initial instruction, LD R1, (R8), data stored at address R8 in data memory 107 is read and transferred to register R1 in general-purpose register group 103. Similarly, with the following instruction, data stored at address R9 is transferred to register R2. With the third instruction, MUL, data in registers R1 and R2 are multiplied with each other, and the result is stored in register R1. Finally, with the ST instruction, data in register R1 is stored at address R10 in data memory 107.
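The effect of the four-instruction sequence can be sketched functionally as follows. This is an illustrative model only, not the V851 instruction set: it assumes that "(R8)" denotes the memory address held in register R8, and the helper names `ld`, `mul`, and `st`, the addresses, and the data values are hypothetical.

```python
# Illustrative model only. Register names follow the text; the addresses,
# the data values, and the register-indirect reading of "(R8)" are assumptions.
regs = {"R1": 0, "R2": 0, "R8": 0x100, "R9": 0x104, "R10": 0x108}
mem = {0x100: 6, 0x104: 7, 0x108: 0}


def ld(dst, addr_reg):
    """LD dst, (addr_reg): copy memory at the address in addr_reg into dst."""
    regs[dst] = mem[regs[addr_reg]]


def mul(dst, src):
    """MUL dst, src: multiply dst by src and keep the product in dst."""
    regs[dst] = regs[dst] * regs[src]


def st(src, addr_reg):
    """ST: copy src into memory at the address in addr_reg."""
    mem[regs[addr_reg]] = regs[src]


ld("R1", "R8")    # first LD
ld("R2", "R9")    # second LD
mul("R1", "R2")   # product kept in R1
st("R1", "R10")   # product written back to memory

print(mem[0x108])  # 42
```

The model captures only what each instruction does, not when it does it; the timing is the subject of FIG. 2.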
FIG. 2 shows a state where respective instructions are executed through five stages. At M stage of the first instruction, data is read from address R8 in memory. The read data is written into register R1 at the following W stage. Similarly, data is read from address R9 in memory at M stage of the subsequent instruction and written into register R2 at W stage thereof.
The third instruction, MUL, is executed at A stage. For this, the values in registers R1 and R2 must be ready by the start of this stage. Ordinarily, the value in register R2 is not ready until W stage of the second instruction (LD instruction). Here, however, this value is extracted from M stage, the stage immediately preceding W stage, via bus 112 (see FIG. 1), and transferred to A stage of MUL instruction as indicated by the arrow a. This is the data forwarding method. In this example, this arrangement allows the A stage of MUL instruction to start one cycle earlier. (Note that forwarding with bus 113 is not related to this invention.)
MULU initiates execution of a multiplication operation at the start of the MUL instruction A stage, and completes it within two clock cycles by the end of M stage. For the last ST instruction, the data forwarding method is also applied (indicated with the arrow b) and data on the result of the multiplication operation is thereby transferred from the end of M stage of MUL instruction to the beginning of A stage of ST instruction. Subsequently, at M stage, that data is stored at address R10 in memory.
In FIG. 2, A stage of MUL instruction starts after completion of M stage of its immediately preceding LD instruction. Thus, a stage for waiting, denoted with (R), is inserted in between. Because of this insertion, the following ST instruction cannot progress to its R stage, so that a stage for waiting, denoted with (I), is inserted into the execution of ST instruction. Further, the following A stage of ST instruction must wait until completion of M stage of MUL instruction. This requires another (R) stage to be inserted for waiting. This in turn demands another (I) stage to be inserted for waiting into the execution of XX instruction. If a process period for one multiplication operation is defined as running from the beginning of W stage of the initial LD instruction to the beginning of W stage of XX instruction, this microprocessor requires six cycles to complete one multiplication operation. Although there may be other ways to define a process period, the aforementioned definition is natural when the periods following XX instruction are defined similarly. This is because a process period headed by XX instruction can be counted beginning with W stage thereof.
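The six-cycle figure can be checked with a small timing model. The following is a minimal sketch, not a description of the actual V851 control logic. It assumes that each stage holds one instruction at a time, that an instruction may enter a stage in the cycle its predecessor leaves it, and that the results of LD and MUL become forwardable at the end of their M stage. The function `schedule` and the tuple layout are hypothetical, and forwarding via bus 113 is not modeled.

```python
STAGES = ["I", "R", "A", "M", "W"]


def schedule(program):
    """Return, per instruction, the first cycle in which it occupies each stage.

    program: list of (name, dest_register_or_None, source_registers).
    Assumption: every producer here (LD, MUL) delivers its result at the
    end of its M stage, so a consumer may start A stage one cycle later.
    """
    timeline = []
    last_writer = {}  # register -> index of the instruction that last wrote it
    for i, (name, dest, sources) in enumerate(program):
        prev = timeline[i - 1] if i > 0 else None
        # I stage is free once the previous instruction has moved on to R.
        start = {"I": prev["R"] if prev else 1}
        for s in range(1, len(STAGES)):
            stage = STAGES[s]
            t = start[STAGES[s - 1]] + 1
            if prev:
                if stage != "W":
                    # Wait until the previous instruction has left this stage.
                    t = max(t, prev[STAGES[s + 1]])
                else:
                    t = max(t, prev["W"] + 1)
            if stage == "A":
                # Operands must have been forwarded from the producer's M stage.
                for reg in sources:
                    if reg in last_writer:
                        t = max(t, timeline[last_writer[reg]]["M"] + 1)
            start[stage] = t
        timeline.append(start)
        if dest is not None:
            last_writer[dest] = i
    return timeline


program = [
    ("LD", "R1", ["R8"]),
    ("LD", "R2", ["R9"]),
    ("MUL", "R1", ["R1", "R2"]),
    ("ST", None, ["R1", "R10"]),
    ("XX", None, []),
]
timeline = schedule(program)
for (name, _, _), start in zip(program, timeline):
    print(name, start)

# Process period: W stage of the initial LD to W stage of XX.
period = timeline[4]["W"] - timeline[0]["W"]
print(period)  # 6
```

Under these assumptions, MUL enters A stage one cycle later than an unstalled pipeline would allow, and ST a further cycle later, which corresponds to the (R) and (I) wait stages described above.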
A process for writing a product of two data stored in a memory back into the memory is frequently used in general signal processing, including image processing. One of the main objects of a RISC microprocessor is to achieve the utmost performance in specific usages. From a macroscopic point of view, six cycles are consumed for executing four instructions in FIG. 2. Of those six cycles, however, one cycle is only for waiting. This waiting cycle is only necessary because one multiplication operation takes two cycles to complete. Thus, in theory, these six cycles can be reduced to five cycles.
One extra cycle is still needed because the MUL instruction A stage must wait until completion of the M stage of the immediately preceding LD instruction. A programming method is known for solving this problem. In this method, two or more multiplication operations are grouped together. For instance, if an operation requires processes for writing a product of values stored at addresses R10 and R11 of a memory back into the memory, in addition to the processes shown in FIG. 2, all LD instructions are coded prior to other instructions as follows:
______________________________________
LD R1, (R8)
LD R2, (R9)
LD R3, (R10)
LD R4, (R11)
MUL R1, R2
MUL R3, R4
______________________________________
While the third and fourth LD instructions are executed, a wait stage for the initial MUL instruction becomes unnecessary. Similarly, while the initial MUL instruction is executed, a wait stage for the next MUL instruction also becomes unnecessary. However, this method relies on programming and is not usable when an operation includes only one multiplication, as in the case shown in FIG. 2. Thus, not only is such programming troublesome, but the improvement it can achieve is limited. In practice, much improvement cannot be expected through programming.
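The effect of grouping the LD instructions can be illustrated by counting only the wait stages that arise when an instruction immediately follows the LD that produces one of its operands. The sketch below is a simplification under the five-stage model described above: load data is assumed to be forwardable at the end of M stage, so a consumer directly behind the LD needs one wait stage and a consumer at least two instructions later needs none. The function `load_use_waits` is hypothetical, and the two-cycle occupancy of MULU is deliberately not modeled.

```python
def load_use_waits(program):
    """Return one entry per instruction: 1 if it must wait for the LD
    immediately in front of it, otherwise 0.

    program: list of (opcode, dest_register, source_registers).
    Simplification: only the load-to-use dependency on the directly
    preceding instruction is considered.
    """
    waits = []
    for i, (_, _, sources) in enumerate(program):
        wait = 0
        if i > 0:
            prev_op, prev_dest, _ = program[i - 1]
            if prev_op == "LD" and prev_dest in sources:
                wait = 1
        waits.append(wait)
    return waits


# Sequence of FIG. 2 up to the MUL instruction.
single = [
    ("LD", "R1", ("R8",)),
    ("LD", "R2", ("R9",)),
    ("MUL", "R1", ("R1", "R2")),
]

# Grouped sequence: all LD instructions coded first.
grouped = [
    ("LD", "R1", ("R8",)),
    ("LD", "R2", ("R9",)),
    ("LD", "R3", ("R10",)),
    ("LD", "R4", ("R11",)),
    ("MUL", "R1", ("R1", "R2")),
    ("MUL", "R3", ("R3", "R4")),
]

print(sum(load_use_waits(single)))   # 1
print(sum(load_use_waits(grouped)))  # 0
```

In the grouped sequence neither MUL directly follows the LD that supplies its operand, so no load-to-use wait is counted; in the single-multiplication sequence there is no other instruction available to fill that slot.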