1. Field of the Invention
The present invention relates to an instruction processing method and system, and more specifically to an instruction processing method and system capable of increasing the number of instructions which can be simultaneously supplied in a processing unit having a variable length instruction set.
2. Description of Related Art
In general, a VLIW (very long instruction word) system and a superscalar system are known as techniques for elevating the performance of a processing unit. These systems execute instructions in parallel so as to increase the number of instructions which can be executed per clock, thereby realizing a high performance. For example, in the VLIW system, an instruction given to an ordinary processing unit is called a "small instruction", and a group of instructions obtained by linking a predetermined number of "small instructions" under a constant limit is called a "large instruction". The large instruction is read at one time, and the small instructions included in the large instruction are processed in parallel. The semantics of the VLIW system are that the small instructions included in the same large instruction are processed simultaneously.
Referring to FIG. 1, an example of the format of the VLIW instruction is shown. In the example shown in FIG. 1, instructions of 32 bytes are simultaneously read and executed as one instruction 407. Therefore, the instructions of 32 bytes correspond to the large instruction, which is divided into a number of small instructions, each of which has an inherent instruction format. In the shown example, the large instruction is divided into six small instructions. Counted from the left, the six small instructions are a load/store (L/S) instruction 401, a load/store (L/S) instruction 402, an ALU instruction 403, an ALU instruction 404, an immediate (IMM) instruction 405 and a branch instruction 406. It is guaranteed that these six instructions are simultaneously executed. Here, the load/store instruction transfers data between a memory and internal general registers of a processing unit. The ALU instruction executes an arithmetic and logic operation between the internal general registers mentioned above. The IMM instruction sets an operand value to the general register. The branch instruction changes the address of the instruction to be executed next.
In this VLIW system, since the arrangement of instructions and the field length of the small instructions are inherent to each processing unit, compatibility is very low. Namely, since the large instruction is read and processed at one time, a string of small instructions must be arranged in such an order that all the small instructions included in each large instruction can be simultaneously executed unconditionally. If the small instructions cannot be simultaneously executed because of a dependence relation between them, a no-operation instruction is inserted in the large instruction to compensate for the dependence relation, so that the small instructions which remain in the same large instruction can be simultaneously executed. As mentioned above, the VLIW system is characterized in that instruction codes are arranged in advance so as to be executed in parallel. This is called static scheduling.
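The static scheduling described above can be illustrated with a minimal sketch. It is not taken from any particular VLIW compiler: the instruction model (a destination register plus source registers), the function name `pack_bundles` and the uniform slot width are all assumptions made for illustration, and the typed slots of FIG. 1 (L/S, ALU, IMM, branch) are deliberately ignored. The sketch only checks whether an instruction reads or writes a register written earlier in the same large instruction.

```python
from typing import List, Optional, Tuple

# Hypothetical small-instruction model: (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]
NOP: Optional[Instr] = None  # stands in for a no-operation slot


def pack_bundles(program: List[Instr], width: int) -> List[List[Optional[Instr]]]:
    """Greedily pack small instructions, in program order, into large instructions.

    A small instruction opens a new large instruction when it reads or writes
    a register written earlier in the current one, or when the current one is
    full. Unused slots are filled with NOPs.
    """
    bundles: List[List[Optional[Instr]]] = []
    current: List[Optional[Instr]] = []
    written = set()  # registers written by the large instruction being built
    for dest, srcs in program:
        depends = dest in written or any(s in written for s in srcs)
        if depends or len(current) == width:
            # Close the current large instruction, padding it with NOPs.
            bundles.append(current + [NOP] * (width - len(current)))
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        bundles.append(current + [NOP] * (width - len(current)))
    return bundles


program = [
    ("r1", ("r2", "r3")),  # r1 <- r2 op r3
    ("r4", ("r1",)),       # uses r1, so it cannot share a large instruction with the first
    ("r5", ("r6",)),       # independent of the second instruction
]
bundles = pack_bundles(program, width=3)
```

With this input, the first large instruction holds one small instruction and two NOPs, and the second holds the remaining two small instructions and one NOP. The padding shows the cost of static scheduling: every dependence that the compiler cannot schedule around consumes slots in the large instruction.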
On the other hand, the superscalar system is a technique for making it possible to execute instructions in parallel, while ensuring compatibility with the conventional processing unit. In the conventional processing unit, instruction codes are regularly arranged so that instructions are sequentially executed, one by one, in ascending order of address. Therefore, after execution of a preceding instruction has been completed, execution of a next instruction is started. There are cases in which the result of execution of a preceding instruction is utilized in a succeeding instruction. This use of the result of execution of a preceding instruction is the "dependence relation" mentioned above. However, not all the instructions have the above mentioned dependence relation, and even if an instruction having no dependence relation is executed early, no problem occurs.
In the VLIW system as mentioned above, instructions having no dependence relation are arranged in each large instruction at the time of compiling, for the purpose of realizing the parallel processing. In the superscalar system, on the other hand, a string of instructions is fetched and analyzed in the order to be executed (program order), instructions having no dependence relation are selected and picked up, and the instructions which can be processed early or in advance are executed beforehand for the purpose of realizing the parallel processing. This parallel instruction processing system can be exemplified by a system for simultaneously processing a plurality of instructions, which is disclosed by Japanese Patent Application Pre-examination Publication No. JP-A-2-130635. In this disclosed system, a plurality of instructions are simultaneously read out, and the read-out instructions are analyzed by a plurality of decoders, so that only when the decoded instructions have no dependence relation, the decoded instructions are executed in parallel by a plurality of processing means.
Referring to FIG. 2, there is shown a block diagram illustrating one example of the superscalar system which is one step ahead of the above mentioned example. This example is used in the IBM 360/91 machine available from International Business Machines Corporation.
A fetch unit 501 fetches and decodes an instruction supplied from a memory (not shown). On the basis of the decode result, the instructions are sorted according to the kind of processing unit corresponding to each kind of instruction, and if an empty buffer exists in the corresponding reservation station 504 or 508, the fetch unit 501 supplies the instruction through a bus 507 to that reservation station. In addition, if a source operand exists in the register file 502, the source operand is stored in the reservation station together with the instruction. On the other hand, if there is no empty buffer in the reservation stations 504 and 508, the fetch unit 501 supplies no instruction, and stops supplying instructions until an empty buffer becomes available in the reservation stations 504 and 508.
For each of the instructions supplied into the reservation stations, if not all the necessary source operands are available, the common data bus 506 is monitored, and when a necessary source operand appears on the common data bus 506, the value of the necessary source operand is written through a source register bus 503 to the reservation stations 504 and 508.
The instruction stored in each of the reservation stations 504 and 508 is supplied to the corresponding processing unit 505 or 509 when all necessary source operands have become available. When each of the corresponding processing units 505 and 509 has completed execution of the received instruction, the processing unit writes the result of the execution through the common data bus 506 to the register file 502, and simultaneously supplies the same result to any reservation station which may be waiting for the result of the execution.
With the above mentioned mechanism, if not all the necessary source operands are available yet, namely, while the operands are not effective, the execution of the instruction is delayed in the reservation station, with the result that the instructions are executed in the order in which they become executable.
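The behavior of a reservation station described above can be sketched as follows. This is a simplified illustration, not the actual design of the machine of FIG. 2: the class and method names are hypothetical, and the register file is modeled as a dictionary in which a register whose value is still being computed is simply absent. A real implementation tracks pending results with tags rather than by register name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    op: str
    dest: str
    # Source register name -> captured value, or None while still awaited.
    sources: Dict[str, Optional[int]]

    def ready(self) -> bool:
        return all(v is not None for v in self.sources.values())


@dataclass
class ReservationStation:
    capacity: int
    entries: List[Entry] = field(default_factory=list)

    def issue(self, op: str, dest: str, src_regs: List[str],
              register_file: Dict[str, int]) -> bool:
        """Accept an instruction only if an empty buffer exists.

        Operands already present in the register file are captured at once.
        A False return means the fetch unit has to stall.
        """
        if len(self.entries) >= self.capacity:
            return False
        sources = {r: register_file.get(r) for r in src_regs}
        self.entries.append(Entry(op, dest, sources))
        return True

    def snoop(self, reg: str, value: int) -> None:
        """Capture a result that appears on the common data bus."""
        for e in self.entries:
            if reg in e.sources and e.sources[reg] is None:
                e.sources[reg] = value

    def dispatch(self) -> Optional[Entry]:
        """Hand the oldest entry whose operands are all available to the processing unit."""
        for i, e in enumerate(self.entries):
            if e.ready():
                return self.entries.pop(i)
        return None


rf = {"r1": 5}                                 # r2 is still being computed
rs = ReservationStation(capacity=2)
rs.issue("add", "r3", ["r1", "r2"], rf)        # waits for r2
rs.issue("mov", "r4", ["r1"], rf)              # all operands available
stalled = not rs.issue("sub", "r5", ["r1"], rf)  # no empty buffer
first = rs.dispatch()                          # the later "mov" goes first
rs.snoop("r2", 7)                              # r2 appears on the common data bus
second = rs.dispatch()                         # now the "add" can go
```

In this run the `mov`, although issued later, is dispatched before the `add`, which waits in the reservation station until the value of `r2` is broadcast.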
On the other hand, in the prior art CISC (complex instruction set computer) type instruction set, a variable length instruction set is used in order to shorten the static size of an instruction code. This variable length instruction set is an instruction set whose instruction format has a word length which varies from instruction to instruction. Therefore, if instructions are formatted so that an instruction having a high occurrence frequency is short in length and an instruction having a low occurrence frequency is long in length, it is possible to shorten the static code size of a program. Now, a procedure for decoding a plurality of instruction codes included in the variable length instruction set, in order to execute them in parallel, will be described. In the case of realizing a variable length instruction set in the superscalar system, the difficulty is attributable to the instruction fetch mechanism. In the superscalar system, the larger the number of instructions which can be supplied to execution units at one time becomes, the larger the possibility of parallel execution becomes. However, in the case of an architecture having the variable length instruction set, the instruction analyzing procedure becomes complex when a plurality of instructions are fetched.
In the present RISC (reduced instruction set computer) type instruction set, the instruction has a length of 32 bits and is aligned with 4 bytes. Therefore, since the starting position of each instruction is perfectly fixed, a parallel analysis of a plurality of instructions becomes possible by starting the instruction analysis from the fixed starting positions. In the case of the variable length instruction, on the other hand, the position of a first instruction is definitely known; however, in order to find the starting position of a second instruction, it is necessary to analyze the first instruction to know the size of the first instruction. In other words, in the variable length instruction, since the starting position of the second and succeeding instructions is indefinite, it is necessary to decode the instructions in the order of execution. Therefore, the larger the number of instructions to be analyzed becomes, the larger the number of cascaded logic stages for finding the start of the instructions becomes, with the result that the delay time increases. Accordingly, when the system is operated with a fast clock, it becomes difficult to decode many instructions.
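The contrast between the two cases can be made concrete with a short sketch. The encoding used here is purely hypothetical and does not correspond to any real instruction set: the low two bits of the first byte of each instruction are assumed to hold the instruction length minus one. The function names are likewise illustrative.

```python
from typing import Callable, List


def fixed_length_starts(code_len: int, size: int = 4) -> List[int]:
    """Fixed-length case: every start offset is known in advance,
    so each instruction could be decoded independently of the others."""
    return list(range(0, code_len, size))


def variable_length_starts(code: bytes,
                           length_of: Callable[[bytes, int], int]) -> List[int]:
    """Variable-length case: the start of instruction n+1 is known only
    after instruction n has been examined, so the loop is inherently serial."""
    starts = []
    pos = 0
    while pos < len(code):
        starts.append(pos)
        pos += length_of(code, pos)  # depends on the previous iteration
    return starts


def toy_length(code: bytes, pos: int) -> int:
    """Hypothetical encoding: low two bits of the first byte = length - 1."""
    return (code[pos] & 0x03) + 1


# Four instructions of 2, 1, 3 and 4 bytes.
code = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x03, 0x00, 0x00, 0x00])
var_starts = variable_length_starts(code, toy_length)   # [0, 2, 3, 6]
fix_starts = fixed_length_starts(12)                    # [0, 4, 8]
```

Each pass through the loop in `variable_length_starts` corresponds to one cascaded logic stage in hardware, which is why the delay grows with the number of instructions to be decoded in one clock.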
In practice, when the system is operated at a high speed, if the instructions are decoded sequentially, the number of instructions which can be decoded in one clock is greatly restricted. In order to speed up this instruction decoding, there is known a method for clarifying the partitions or boundaries in advance by predecoding the instructions when the instructions are written into a cache. This method was adopted in a microprocessor available from Advanced Micro Devices, Inc. under the tradename "K5". In the K5 microprocessor, when an instruction is fetched from a memory and written into an on-chip cache, the instruction is decoded, and the byte corresponding to the head of the instruction is marked. This method is very effective, but a predecode bit must be added to the cache, and in addition, since a branch destination address is not necessarily aligned with a line of the cache, control becomes very difficult when a branch occurs.
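The idea of marking instruction boundaries at cache fill time can be sketched as follows. This is an illustration of the general predecoding idea only and makes no claim about how the K5 microprocessor implements it. The length encoding (low two bits of the first byte give the length minus one), the function names, and the assumption that a cache line begins on an instruction boundary are all hypothetical.

```python
from typing import Callable, List


def toy_length(code: bytes, pos: int) -> int:
    """Hypothetical encoding: low two bits of the first byte = length - 1."""
    return (code[pos] & 0x03) + 1


def predecode(line: bytes, length_of: Callable[[bytes, int], int]) -> List[bool]:
    """Walk the line once, sequentially, when it is written into the cache,
    and mark the first byte of every instruction.

    Assumes (for this sketch only) that the line begins on an instruction boundary.
    """
    marks = [False] * len(line)
    pos = 0
    while pos < len(line):
        marks[pos] = True
        pos += length_of(line, pos)
    return marks


def starts_from(marks: List[bool], entry: int) -> List[int]:
    """At fetch time, read the instruction starts at or after `entry`
    directly from the stored marks, without re-examining the instructions."""
    return [i for i in range(entry, len(marks)) if marks[i]]


# Four instructions of 2, 1, 3 and 4 bytes.
line = bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x03, 0x00, 0x00, 0x00])
marks = predecode(line, toy_length)
after_branch = starts_from(marks, 3)   # fetch beginning at byte 3 of the line
```

The sequential walk is paid once per cache fill instead of once per fetch, at the cost of one extra mark bit per cached byte. The sketch sidesteps the hard case noted above: it assumes that the line starts on an instruction boundary, which does not hold in general.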
In brief, in order to execute instructions in parallel in the superscalar system, it is necessary to decode a plurality of instructions during one clock. In the conventional RISC processor, since the instruction code is set to a fixed length of 32 bits and aligned with 4 bytes, it is possible to fetch a plurality of instructions at one time and to simultaneously decode the plurality of instructions. However, in a processing system for processing a variable length instruction set, even if many instructions are fetched at one time, since the starting position of each instruction other than the first instruction varies depending on the kinds of the preceding instructions, it is not possible to decode the many fetched instructions in parallel, and therefore, it is necessary to decode the fetched instructions, one by one, in their order. As a result, it is difficult to decode, at a high speed, many instructions which are to be simultaneously executed.