The present invention generally relates to a microprocessor with a very long instruction word (VLIW), superscalar or out-of-order completion architecture. More particularly, the present invention relates to a program translator and a processor that realize parallel processing down to the level of individual instructions by making efficient use of execution units.
In recent years, various microprocessors, such as VLIW, superscalar and out-of-order completion types, have been developed one after another to execute multiple instructions at a time more rapidly.
Some compilers that designate a VLIW microprocessor as a target define an instruction set and then parallelize the instructions included in the set in such a manner as to satisfy various constraints concerning the availability of execution units of the microprocessor or of instruction slots in a long instruction word.
A program translator of this type is disclosed, for example, in Japanese Laid-Open Publication No. 5-265769.
If a source program shown at the top of FIG. 6 is compiled using a prior art program translator, an instruction set shown in the middle of FIG. 6 is generated from the source program. Next, the instructions included in this instruction set are parallelized to generate a set of long instruction words with a step number of 2 as shown at the bottom of FIG. 6. In the second instruction slot of each long instruction word, a no-operation instruction (NOP) is inserted.
Also, if a program shown in FIG. 25 is executed using a conventional superscalar processor, then the processor executes the instructions in 5 cycles through the pipelining shown in FIG. 34.
Furthermore, if a program shown in FIG. 31 is executed using another conventional processor including a multiplier that can perform multiplication in 3 cycles, then the processor executes the instructions in 7 cycles through the pipelining shown in FIG. 35.
The prior art program translators, however, have various shortcomings. For example, an instruction set generated from a source program cannot always be executed at a high level of parallelism, because the target processor has only a limited number of execution units and therefore imposes constraints. Accordingly, many NOPs must be inserted when the instructions are parallelized, which constitutes a serious obstacle to performance enhancement.
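By way of illustration only, the NOP padding described above may be sketched as follows. The two-slot word format, the binding of each slot to one execution unit, and the instruction names are all assumptions made for this sketch and are not taken from FIG. 6; data dependencies are ignored so that only the execution-unit constraint is modeled.

```python
# Hypothetical machine: each long instruction word has two slots.
# Slot 0 feeds an ALU and slot 1 feeds a memory unit (assumed for illustration).
SLOT_UNITS = ["alu", "mem"]

def pack(instructions):
    """Greedily pack (name, unit) pairs into long instruction words.

    Data dependencies are ignored; only the execution-unit constraint is modeled.
    """
    words = []
    pending = list(instructions)
    while pending:
        word = []
        for unit in SLOT_UNITS:
            # Take the first pending instruction that needs this unit, if any.
            idx = next((i for i, (_, u) in enumerate(pending) if u == unit), None)
            # A slot whose unit has no pending work is padded with a NOP.
            word.append(pending.pop(idx)[0] if idx is not None else "nop")
        words.append(word)
    return words

# Three ALU instructions and one memory instruction (hypothetical program).
program = [("add", "alu"), ("sub", "alu"), ("and", "alu"), ("ld", "mem")]
words = pack(program)
nop_count = sum(slot == "nop" for word in words for slot in word)
```

In this sketch the single ALU slot forces three long instruction words, so two of the six slots hold NOPs even though the program contains only four instructions.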
Also, in the prior art superscalar processor, even if multiple instructions are decoded at a time, only some of these instructions are executable because the available execution units are limited. Thus, the resultant performance is not fully satisfactory, either.
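The effect of limited execution units on a superscalar processor may be sketched as follows. This is a simplified in-order issue model under stated assumptions: the decode width, the unit names, and the unit counts are hypothetical, and the model does not reproduce the pipelining of FIG. 34.

```python
from collections import Counter

def cycles_to_issue(instr_units, decode_width, units_available):
    """Count cycles for in-order issue with a per-cycle limit on each unit type.

    instr_units: the unit type required by each instruction, in program order.
    decode_width: how many instructions are decoded per cycle (assumed).
    units_available: number of execution units of each type (assumed).
    """
    for unit in instr_units:
        if units_available.get(unit, 0) < 1:
            raise ValueError(f"no execution unit of type {unit!r}")
    cycles = 0
    i = 0
    while i < len(instr_units):
        used = Counter()
        issued = 0
        # Issue in order until the decode window is used up or a unit runs out.
        while i < len(instr_units) and issued < decode_width:
            unit = instr_units[i]
            if used[unit] >= units_available[unit]:
                break  # decoded, but no free execution unit in this cycle
            used[unit] += 1
            issued += 1
            i += 1
        cycles += 1
    return cycles

one_alu = cycles_to_issue(["alu"] * 4, decode_width=2, units_available={"alu": 1})
two_alus = cycles_to_issue(["alu"] * 4, decode_width=2, units_available={"alu": 2})
```

With a decode width of two but a single ALU, the second decoded instruction waits every cycle, so four instructions take four cycles in this model; with two ALUs they take two.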
Furthermore, in still another prior art processor, if an execution unit has to perform a sequence of operations each taking several clock cycles to execute, then succeeding operations cannot be started until the preceding operations are completed. As a result, the performance of such a processor is degraded.
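The blocking behavior of such an execution unit may be sketched as follows. The sketch models only the occupancy of a single execution unit, with operation latencies chosen as assumptions; it is not intended to reproduce the cycle count of FIG. 35.

```python
def blocking_schedule(latencies):
    """Return (start cycles, total cycles) for operations on one blocking unit.

    Each operation may start only after the previous one has completed.
    """
    starts = []
    busy_until = 0
    for latency in latencies:
        starts.append(busy_until)  # wait until the unit becomes free
        busy_until += latency      # the unit stays busy for the full latency
    return starts, busy_until

# Three consecutive 3-cycle operations (hypothetical sequence).
starts, total = blocking_schedule([3, 3, 3])
```

Under these assumptions the second and third operations start at cycles 3 and 6 respectively, and the unit is occupied for nine cycles in total.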