1. Field of the Invention
The present invention relates to a parallel processing unit such as a microprocessor employing a data forwarding technique, and an instruction issuing system for the parallel processing unit.
2. Description of the Prior Art
Recent microprocessors try to improve processing efficiency by increasing operation frequencies, which is realized by increasing the number of stages in each pipeline and by issuing instructions to the pipelines at increased pitches. Increasing the instruction issuing pitch, however, lengthens the latency between the issuance of an instruction and the time when the resultant data of the instruction is ready for use by the next instruction. This deteriorates processing efficiency when a given instruction depends on the resultant data of the preceding instruction.
To shorten such a latency and improve processing efficiency, a data forwarding technique is used. This technique writes the resultant data of a given instruction into a data holder and, at the same time, transfers the resultant data to an instruction issuer that issues instructions to be processed next, thereby saving the time otherwise spent writing the data to and reading it back from the data holder.
This technique also employs a data state holder that holds the data dependence of a given instruction. A typical example of the data state holder is a scoreboard. The scoreboard is a register that stores an address of the data holder at which presently processed data is going to be stored. If the address of data required by a given instruction agrees with an address stored in the scoreboard, the given instruction is dependent on presently processed data, and therefore, the given instruction will not be issued until the data in question is completely processed and ready for use.
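The scoreboard check described above can be sketched as a minimal software model. The class and method names below are illustrative only and are not part of the prior-art unit:

```python
class Scoreboard:
    """Minimal model of a scoreboard: it records the register numbers
    whose results are still being computed (in flight)."""

    def __init__(self):
        self.pending = set()  # register numbers awaiting results

    def mark_pending(self, reg):
        # An issued instruction will write its result to register `reg`.
        self.pending.add(reg)

    def mark_ready(self, reg):
        # The result for `reg` has been written back (or forwarded).
        self.pending.discard(reg)

    def can_issue(self, source_regs):
        # An instruction may issue only if none of its source registers
        # is still being produced by a preceding instruction.
        return not any(r in self.pending for r in source_regs)

sb = Scoreboard()
sb.mark_pending(3)           # r3 is being produced by a prior instruction
print(sb.can_issue([1, 3]))  # False: the instruction depends on in-flight r3
sb.mark_ready(3)
print(sb.can_issue([1, 3]))  # True: r3 is now available
```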
High-performance microprocessors employ a parallel processing technique such as a superscalar architecture to simultaneously issue and execute a plurality of instructions, to improve IPC (instructions per clock). They also employ, instead of an in-order instruction issuing technique that issues a stream of instructions in order, an out-of-order instruction issuing technique that issues a stream of instructions out of order if there is no data dependence among the instructions, to improve processing efficiency.
If a first instruction is queued due to data dependence, the in-order instruction issuing technique also queues a second instruction that follows the first instruction. On the other hand, the out-of-order instruction issuing technique issues the second instruction before the first instruction, if the second instruction has no data dependence.
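The difference between the two issuing policies can be sketched as follows. This is a simplified single-pass model under assumed conventions (instruction names, the `pending` register set, and the `policy` parameter are illustrative):

```python
def issue_order(stream, pending, policy):
    """Return the instructions issued in one pass.
    `stream` is a list of (name, source_regs); `pending` is the set of
    register numbers still in flight."""
    issued, stalled = [], []
    for name, srcs in stream:
        blocked = any(r in pending for r in srcs)
        if policy == "in-order":
            # A stalled instruction also blocks everything behind it.
            if blocked or stalled:
                stalled.append(name)
            else:
                issued.append(name)
        else:  # out-of-order: independent instructions bypass the stall
            (stalled if blocked else issued).append(name)
    return issued

stream = [("I1", [3]), ("I2", [1])]              # I1 depends on in-flight r3
print(issue_order(stream, {3}, "in-order"))      # [] -- I2 waits behind I1
print(issue_order(stream, {3}, "out-of-order"))  # ['I2'] -- I2 bypasses I1
```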
Since the out-of-order instruction issuing technique issues instructions without regard to the order of the instructions, an instruction later in the stream may be processed before an earlier one. This may happen even in the in-order instruction issuing technique when processors having different processing periods are used. For example, it will occur when an adder having a single pipeline stage and a multiplier having three pipeline stages are used. If the order in which instructions complete execution differs from the order of the instruction stream, a problem will occur when writing results of the executed instructions into the data holder. In particular, if an exception interrupt occurs, it will be difficult to restore processing conditions. A standard technique to solve this problem is to use a reorder buffer, which rearranges the results of the execution of instructions according to the instruction stream and writes the rearranged results into the data holder. The execution results are also forwarded to an instruction issuer so that they are used by the next instructions.
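The in-order write-back enforced by a reorder buffer can be sketched as follows. This is a minimal model under assumed conventions (instruction tags and the dictionary-based register file are illustrative):

```python
class ReorderBuffer:
    """Minimal reorder-buffer model: results may arrive out of order,
    but are written to the register file in program order."""

    def __init__(self, program):
        self.order = list(program)   # instruction tags in program order
        self.results = {}            # tag -> (register, value)
        self.head = 0                # oldest instruction not yet retired

    def complete(self, tag, reg, value):
        # A processor reports a result, possibly out of order.
        self.results[tag] = (reg, value)

    def retire(self, register_file):
        # Write back results only up to the oldest incomplete instruction,
        # so the register file always reflects a program-order state.
        while self.head < len(self.order) and self.order[self.head] in self.results:
            reg, value = self.results[self.order[self.head]]
            register_file[reg] = value
            self.head += 1

regfile = {}
rob = ReorderBuffer(["I1", "I2"])
rob.complete("I2", 2, 99)   # the later instruction finishes first
rob.retire(regfile)
print(regfile)              # {} -- I2 cannot retire before I1
rob.complete("I1", 1, 7)
rob.retire(regfile)
print(regfile)              # {1: 7, 2: 99}
```

Because the register file is only updated in program order, an exception interrupt can be handled by discarding the unretired entries of the buffer.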
FIG. 1 is a block diagram showing a parallel processing unit according to a prior art employing the techniques mentioned above.
An instruction cache 100 stores instructions, which are executed in parallel by processors 110, 111, and 112.
Each instruction read out of the instruction cache 100 is temporarily stored in a latch 10A. Thereafter, a register number of source data of the instruction is transferred to a register file 104, a scoreboard 105, and a reorder buffer 106. The register file 104 and scoreboard 105 correspond to the data holder and data state holder mentioned above.
The register file 104 stores resultant data provided by the processors 110 to 112. The scoreboard 105 stores register numbers of the register file 104. The reorder buffer 106 rearranges resultant data provided by the processors 110 to 112 according to an instruction stream and writes the rearranged data into the register file 104.
If valid data for a given register number is in the register file 104 and reorder buffer 106, the data is sent to an instruction issuer 107. The instruction issuer 107 issues instructions with source data to the processors 110 to 112 according to the out-of-order instruction issuing technique.
Data from the register file 104 and reorder buffer 106 are stored in an instruction queue (FIG. 2) incorporated in the instruction issuer 107. The instruction queue has, for each instruction, a data field and a validity flag Vi that indicates whether or not data in the data field is valid. If the flag Vi indicates invalidity, the instruction issuer 107 monitors addresses and data coming through a data forwarding path 108.
The processors 110 to 112 simultaneously fetch instructions from the instruction issuer 107, execute them, and send results and their register numbers to the reorder buffer 106 as well as to the instruction queue through the path 108.
This parallel processing unit has a problem in that the load on the instruction issuer 107 becomes heavier as the number of simultaneously issued instructions increases.
This problem will be explained with reference to FIG. 2, which is a block diagram showing the instruction queue incorporated in the instruction issuer 107.
The instruction queue has comparators whose number is at least equal to the number of the processors 110 to 112. Each of the comparators corresponds to source data for an instruction stored in the instruction queue. In FIG. 2, there are three comparators 201, 202, and 203 for the processors 110, 111, and 112. The comparators 201 to 203 compare resultant data addresses, i.e., resultant data register numbers, sent through the path 108 with a source register number of a queued instruction. Based on the comparison results, a select signal generator 204 generates a select signal S204. According to the select signal S204, a selector 205 selects a piece of data among those sent through the path 108. The comparison results are also sent to an OR gate 206, which provides an enable signal EN to set/reset a validity flag Vi of the queued instruction. This arrangement involves a large number of comparators and a long path to select data, which extends the delay time.
Namely, the comparators 201 to 203 compare addresses, i.e., register numbers provided by the processors 110 to 112, with a source data address, i.e., a source data register number of a queued instruction. If one of the register numbers from the processors 110 to 112 agrees with the source data register number, data from the processor that provides the matching register number is fetched from the path 108 through the selector 205 and is used as source data for the queued instruction. Then, the queued instruction is issued to one of the processors 110 to 112. In practice, data forwarding sources to the instruction queue of the instruction issuer 107 are not only the processors 110 to 112 but also the reorder buffer 106 and dummy pipelines. The dummy pipelines are connected to the processors 110 to 112, to reduce the scale of the reorder buffer 106. In this way, there are many data forwarding sources to the instruction queue.
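The comparator-and-selector arrangement of FIG. 2 can be sketched as follows. This is an illustrative software model, not the hardware itself; the function name and the list of (register, value) pairs standing in for the forwarding path 108 are assumptions:

```python
def forward_match(source_reg, forwarded):
    """Compare the queued instruction's source register number against
    each (register, value) pair on the forwarding path, and select the
    matching data. Returns (valid, value), where `valid` plays the role
    of the flag Vi set by the OR of the comparator outputs."""
    matches = [value for reg, value in forwarded if reg == source_reg]
    if matches:                  # at least one comparator matched
        return True, matches[0]  # the selector picks the matching data
    return False, None

# Three forwarding sources, e.g. the results of the processors 110 to 112:
forwarded = [(5, 10), (7, 20), (9, 30)]
print(forward_match(7, forwarded))  # (True, 20)
print(forward_match(4, forwarded))  # (False, None)
```

Each additional forwarding source adds one comparator per queued source operand, which is why the load on the instruction issuer 107 grows with the number of sources.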
If the register file 104 contains 32 words, each comparison operation needs a 5-bit comparator. If a register renaming technique for changing addresses is employed to reduce queuing and improve processing efficiency, each comparator must have a larger number of bits. This increases the load on the instruction issuer 107, and therefore, the instruction issuer 107 must be increased in scale and processing speed.