1. Field of the Invention
The present invention relates to an instruction conversion apparatus, a processor, a storage medium storing parallel execution codes to which a plurality of instructions have been assigned, and a computer-readable storage medium storing an instruction conversion program that generates such parallel execution codes. In particular, the invention relates to a technique for decreasing the number of execution cycles and improving code efficiency by using parallel processing.
2. Description of the Background Art
In recent years, parallel processing methods have been widely used in the development of microprocessors. Parallel processing refers to the execution of a plurality of instructions in each machine cycle. Examples of classic parallel processing techniques are superscalar methods and VLIW (Very Long Instruction Word) methods.
In superscalar methods, specialized circuitry in the processor dynamically analyzes which instructions can be executed in parallel and then has these instructions executed in parallel. These methods have an advantage in that superscalar processors can be made compatible with serial processing methods. This means that object code that has been generated by a compiler for a serial processor can be executed in its original state by a superscalar processor. A disadvantage of superscalar techniques is that specialized hardware needs to be provided in the processor to dynamically analyze the parallelism of instructions, which leads to an increase in hardware costs. Another disadvantage is that the provision of specialized hardware makes it difficult to raise the operation clock frequency.
In VLIW methods, a plurality of instructions that can be executed in parallel are arranged into an executable code of a fixed length, with the instructions in the same executable code being executed in parallel. For VLIW methods, an “executable code” is a unit of data that is fetched from memory in one cycle or is decoded and executed in one cycle.
For VLIW methods, there is no need during execution for the processor to analyze which instructions can be executed in parallel. This means that little hardware is required, and that raising the operation clock frequency is easy. However, the use of fixed-length instructions leads to the problems described below.
In VLIW executable codes, the number of bits required to define different kinds of instructions varies significantly. For example, instructions that deal with a long constant, such as an address or an immediate value, require a large number of bits, while instructions that perform calculations using registers can be defined using fewer bits. As stated above, VLIW methods deal with executable codes of a fixed length, so that NOP codes need to be inserted to pad out instructions that require only a small number of bits. This increases code size.
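The code-size penalty described above can be illustrated with a small calculation. The following Python sketch is only an illustration: sizes are measured in arbitrary equal-sized chunks, and the slot width of two chunks and the sample instruction mix are assumed values, not figures taken from any actual VLIW processor.

```python
# Illustrative only: the slot width and the instruction mix are assumed values.
SLOT_CHUNKS = 2  # assumed fixed slot width, sized for the longest instruction


def fixed_length_size(instr_chunks):
    """Total size when every instruction is padded out to a full slot."""
    return SLOT_CHUNKS * len(instr_chunks)


def padding_overhead(instr_chunks):
    """Chunks occupied by padding (NOP codes) rather than by useful bits."""
    return fixed_length_size(instr_chunks) - sum(instr_chunks)


# Hypothetical mix: 1 = register operation, 2 = instruction with a long constant.
mix = [2, 1, 1, 2, 1, 1]
print(fixed_length_size(mix), sum(mix), padding_overhead(mix))
```

For this assumed mix, the fixed-length encoding occupies 12 chunks, of which only 8 carry instruction bits, so 4 chunks are padding.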
To solve this problem, a technique that fetches a fixed amount of code from memory in each cycle but decodes and executes a variable amount of code has been proposed in recent years. Hereafter, this technique will be referred to as the “fixed-supply/variable-execution method”.
FIG. 1A shows the instruction supply unit used in the fixed-supply/variable-execution method. Since there is variation in the number of bits needed to define different instructions, two different formats are used. Instructions that require a large number of bits use a first format composed of two units, unit1 and unit2, while instructions that require only a few bits use a second format composed of one unit, unit3. Here, instructions that have a length of one unit are called “short instructions”, while instructions that have a length of two units are called “long instructions”.
While there are both short and long instructions, instructions are supplied three units at a time, with no attention being paid to the differences in types.
FIG. 1B shows the units (hereafter called “packets”) for fetching instructions from memory in each cycle in this fixed-supply/variable-execution method. FIG. 1C, meanwhile, shows the minimum units (hereafter called “execution units”) for decoding and execution by this processor.
During execution, all instructions in an area in FIG. 1B demarcated by parallel processing boundaries are executed in parallel in one cycle. This means that in each cycle, instructions are executed in parallel up to and including the instruction at which the next parallel processing boundary is set, shown in FIG. 1B using shading. Instructions that have been supplied but are not executed are accumulated in an instruction buffer and are executed in a following cycle.
In FIG. 1B, the parallel processing boundary is set at unit6, so that all units from unit1 to unit6 are set as one execution unit. Of these units, unit1˜unit2, unit3˜unit4, and unit5˜unit6 each compose a long instruction, so that these three long instructions are executed in parallel.
The next parallel processing boundary in FIG. 1B is set at unit11, so that all units from unit7 to unit11 are executed in one execution unit. Of these units, unit7˜unit8 compose a long instruction, unit9 composes a short instruction, and unit10˜unit11 compose a long instruction. These three instructions are executed in parallel.
In this method, instructions are supplied using a fixed-length packet, and a suitable number of units is issued in each cycle based on information that is found through static analysis. Using this method, there is absolutely no need to insert the no operation instructions (NOP codes) that are required in conventional VLIW methods with fixed length instructions. As a result, code size can be reduced.
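The grouping of instructions into execution units by parallel processing boundaries can be sketched as follows. This is a minimal illustration and not the actual encoding: the representation of each instruction as a (length, boundary flag) tuple and the function name `execution_units` are assumptions introduced here. The sample data reproduces the layout described for FIG. 1B.

```python
def execution_units(instrs):
    """Group instructions up to and including each parallel processing boundary.

    instrs: list of (length_in_units, ends_group) tuples in program order.
    Returns a list of groups; each group lists (first_unit, last_unit) ranges.
    """
    groups, current, next_unit = [], [], 1
    for length, ends_group in instrs:
        current.append((next_unit, next_unit + length - 1))
        next_unit += length
        if ends_group:  # boundary is set at this instruction
            groups.append(current)
            current = []
    if current:  # trailing instructions with no explicit boundary
        groups.append(current)
    return groups


# Layout described for FIG. 1B: three long instructions, then long, short, long.
fig_1b = [(2, False), (2, False), (2, True),
          (2, False), (1, False), (2, True)]
for group in execution_units(fig_1b):
    print(group)
```

With this input, the first group covers unit1 to unit6 and the second covers unit7 to unit11, with no padding between instructions.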
The following describes the hardware construction of a processor for this fixed-supply/variable-execution method.
FIG. 2 is a block diagram showing the construction of the instruction register and periphery in a processor that is capable of executing three instructions in parallel. The broken lines in FIG. 2 show the control flows. The unit queue in FIG. 2 is a sequence of units. These units are transferred to the instruction registers in the order in which they were supplied from the instruction memory (or similar).
In this construction, the instruction register A 52a and the instruction register B 52b form one pair, as do the instruction register C 52c and the instruction register D 52d, and the instruction register E 52e and the instruction register F 52f. Instructions are always arranged so as to start from one of the instruction register A 52a, the instruction register C 52c, and the instruction register E 52e. Only when an instruction is formed of two linked units is part of the instruction sent to the other instruction register in the pair. As a result, when the unit transferred to the instruction register A 52a is a complete instruction in itself, no unit is transferred to the instruction register B 52b.
The main characteristic of the above processor is that parallel processing can be performed for any combination of short and long instructions.
When three long instructions are to be executed in parallel, the three long instructions will be composed of three pairs unit1˜unit2, unit3˜unit4, and unit5˜unit6 in the unit queue 50. The present processor stores the first long instruction in the pair of the instruction register A 52a˜instruction register B 52b, the second long instruction in the pair of the instruction register C 52c˜instruction register D 52d, and the third long instruction in the pair of the instruction register E 52e˜instruction register F 52f. After being stored in this way, the three long instructions are executed by the first instruction decoder 53a˜third instruction decoder 53c.
When the three instructions to be executed in parallel are the long instruction composed of unit1˜unit2, the short instruction composed of unit3, and the long instruction composed of unit4˜unit5, the present processor stores the first instruction in the pair of the instruction register A 52a˜instruction register B 52b, the second instruction in the instruction register C 52c, and the third instruction in the pair of the instruction register E 52e˜instruction register F 52f. Nothing is stored in the instruction register D 52d. After being stored in this way, the three instructions are executed by the first instruction decoder 53a˜third instruction decoder 53c.
When unit1˜unit2 and unit3˜unit4 in the unit queue 50 compose two long instructions and unit5 composes one short instruction, the present processor stores the first instruction in the pair of the instruction register A 52a˜instruction register B 52b, the second instruction in the pair of the instruction register C 52c˜instruction register D 52d, and the third instruction in the instruction register E 52e. Nothing is stored in the instruction register F 52f. After being stored in this way, the three instructions are executed by the first instruction decoder 53a˜third instruction decoder 53c.
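The storage rules in the three cases above can be summarized in a short sketch. The function `assign_registers` and its dictionary output are hypothetical constructs introduced for illustration, and the sketch assumes that units are consumed contiguously from the head of the unit queue.

```python
PAIRS = [("A", "B"), ("C", "D"), ("E", "F")]  # instruction register pairs


def assign_registers(lengths):
    """Map up to three instructions (each 1 or 2 units) onto register pairs.

    Returns a dict from register name to the 1-based unit-queue position
    stored there, or None when the register is left empty.
    """
    if len(lengths) > len(PAIRS) or any(n not in (1, 2) for n in lengths):
        raise ValueError("expected at most three instructions of 1 or 2 units")
    regs = {name: None for pair in PAIRS for name in pair}
    unit = 1
    for (first, second), length in zip(PAIRS, lengths):
        regs[first] = unit           # every instruction starts in A, C, or E
        if length == 2:
            regs[second] = unit + 1  # second unit of a long instruction
        unit += length
    return regs


print(assign_registers([2, 2, 2]))  # three long instructions
print(assign_registers([2, 2, 1]))  # long, long, short: register F stays empty
```

Note that the unit placed in each register depends on the lengths of all preceding instructions, which is why the destinations cannot be wired in a fixed manner.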
As should be clear from the above description, the instruction register to which each unit in the unit queue is to be transferred is not fixed, nor are the units in the unit queue that are to be transferred to each instruction register. For this reason, the selectors 51a˜51d are provided to determine the destinations of units transferred from the unit queue. These selectors 51a˜51d are controlled in the following way. First, control is performed to determine the output destinations of the selectors 51a and 51b, and the units to be transferred to the instruction register C 52c and the instruction register D 52d are determined. Once the units to be transferred have been determined, information regarding the length of the instruction in the unit transferred to the instruction register C 52c is examined and control is performed as shown by the broken lines in FIG. 2 to determine the output destinations of the selectors 51c and 51d.
While the above processor can decode instructions regardless of the combination of short and long instructions and regardless of how the opcodes are located in the units, the input ports of the first to third instruction decoders 53a˜53c are each two units wide, which makes the overall hardware scale excessively large. In addition, the processor includes selectors that switch the output destinations of the instructions after referring to information regarding the lengths of the instructions in the units that are transferred to the instruction registers, so that the hardware construction becomes increasingly complex as the number of instructions to be executed in parallel increases.
One conventional method for reducing hardware scale is that described for the GMICRO/400 processor in the article “The Approach to Multiple Instruction Execution in the GMICRO/400 Processor” in PROCEEDINGS, The Eighth TRON Project Symposium (International), 1991.
FIG. 3A is a block diagram showing the construction of the instruction register and periphery for the instruction issuing control method used by the GMICRO/400 processor. In FIG. 3A, the broken lines show the control flows. The constant operands 54a˜54b are indicated by the output of the first instruction decoder 53i˜the third instruction decoder 53k. Each instruction decoder decodes an inputted instruction and outputs signals to the execution control unit to control the execution of the instruction, as well as outputting the constant operands indicated in the instruction.
The instruction issuing control method of the GMICRO/400 processor decodes the combination unit1˜unit2 as a single input, and decodes unit2 and unit3 each separately. After the decoding by the first instruction decoder 53i has clarified whether the first instruction is a one-unit instruction or a two-unit instruction, the selector 51g is controlled so that the decoding result of only one of the second instruction decoder 53j and the third instruction decoder 53k is selected and used. As a result, the processor can execute both instructions in either the short instruction-short instruction combination or the short instruction-long instruction combination of FIG. 3B in parallel.
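The decode-then-select control described above can be sketched as follows. This is a simplified model and not the GMICRO/400 circuitry itself: the function `decode_pair` and the predicate `is_long`, which stands in for the length information produced by the first instruction decoder, are hypothetical names introduced for illustration.

```python
def decode_pair(units, is_long):
    """Select two instructions from the first three units of the queue.

    units: sequence of at least three unit values.
    is_long: hypothetical predicate telling whether a unit opens a
             two-unit instruction (stands in for the first decoder).
    Returns (first_instruction, second_instruction) as tuples of units.
    """
    # All three decoders work at once on fixed unit positions.
    first_as_long = (units[0], units[1])  # first decoder sees unit1 and unit2
    second_if_short = (units[1],)         # second decoder sees unit2
    second_if_long = (units[2],)          # third decoder sees unit3
    # The selector keeps only the result that matches the first length.
    if is_long(units[0]):
        return first_as_long, second_if_long
    return (units[0],), second_if_short


starts_long = lambda unit: unit.startswith("L")  # assumed marker for the sketch
print(decode_pair(["L1", "x", "s"], starts_long))
print(decode_pair(["s1", "s2", "s3"], starts_long))
```

In this model the second instruction is always taken from a single unit, since the second and third decoders each receive only one unit.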
As shown in FIG. 3A, the GMICRO/400 decreases the number of instructions that can be executed in parallel from three to two, so that only two decoders are provided (see translator's note 2). The second instruction decoder 53j and the third instruction decoder 53k also have input ports that are only one unit wide, so that hardware reductions can be made.
2 Translator's note: Apparent mistake in the original Japanese. Three decoders are present.
The above processor has a different problem, however, in that despite being equipped with three decoders, only two instructions can be executed in parallel, representing a marked decrease in parallelism when compared with the hardware shown in FIG. 2. The second of the two instructions that can be processed in parallel is also limited to one unit, giving rise to the further restriction that short instruction-long instruction combinations are also prohibited.