In a conventional computer system, the hardware portion comprises a CPU (Central Processing Unit), memories and the like. The computer system operates by executing instructions. Conventional instruction-set computers include RISC (Reduced Instruction-Set Computer) and CISC (Complex Instruction-Set Computer) machines, and VLIW (Very Long Instruction Word) is becoming an increasingly popular technology in the field of micro-processor design. Compared with RISC and CISC processors, VLIW processors have the advantages of low cost, low energy consumption, simple structure and high processing speed.
The VLIW processors use fixed-length long instructions, each composed of several shorter instructions that can be executed in parallel. In addition, the VLIW processors do not need the many complicated control circuits that Super-scalar processors require to coordinate parallel execution during operation.
Furthermore, the VLIW processors combine more than two instructions into an instruction packet. A compiler schedules the instruction packet in advance so that the VLIW processors can rapidly execute the instructions in parallel. The micro-processors therefore do not need to perform the complicated timing analyses that must be completed in the Super-scalar RISC and CISC processors.
A so-called multiple-issue processor is able to execute multiple instructions in one clock cycle. Multiple-issue processors come in two flavors:
1. Super-scalar processors, which execute a variable number of instructions per clock cycle and may be scheduled either statically or dynamically by a compiling apparatus (i.e. by hardware and/or software), using techniques such as scoreboarding.
2. VLIW (Very Long Instruction Word) processors, which execute a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet. The VLIW processors are inherently statically scheduled.
A VLIW instruction usually consists of several sub-instructions. Each sub-instruction corresponds to a certain functional unit (i.e. module) in the processor and to a set of operations. For example, Hennessy, John L. and David A. Patterson, Computer Architecture: A Quantitative Approach (2nd Edition), Morgan Kaufmann Publishers, Inc. [1996], pp. 285-289, points out that one VLIW instruction includes two integer operations, two floating-point operations, two memory references and a branch.
A VLIW processor uses multiple, independent functional units, and each functional unit is used to execute one sub-instruction of the VLIW instruction. The parallel scheduling of these operations requires complicated compiling schemes and tools.
FIG. 1 shows the relationship between the VLIW instruction and the VLIW processor. As shown in FIG. 1, the VLIW instruction includes four sub-instructions: ADD int a, b; MUL double c, 3.142; READ d, AR0; and BNZ loop, e. These four sub-instructions correspond, respectively, to four functional units in the VLIW processor: an integer functional unit (INT FU), a floating-point functional unit (Float FU), a data memory (Data Memory) and a program memory (Program Memory).
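The slot-to-unit correspondence described for FIG. 1 can be illustrated with a minimal sketch. The class names `SubInstruction` and `VLIWInstruction` and the `dispatch` method are hypothetical and introduced only for illustration; the sub-instruction texts and functional-unit names are those given above, and the slot order is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SubInstruction:
    """One slot of a VLIW instruction (hypothetical illustrative type)."""
    unit: str  # functional unit that executes this slot
    text: str  # sub-instruction text as written in FIG. 1


@dataclass(frozen=True)
class VLIWInstruction:
    """A long instruction made of several sub-instructions (hypothetical type)."""
    slots: Tuple[SubInstruction, ...]

    def dispatch(self) -> Dict[str, str]:
        """Map each functional unit to the sub-instruction it receives."""
        return {slot.unit: slot.text for slot in self.slots}


# The four sub-instructions and functional units named in the text.
word = VLIWInstruction(slots=(
    SubInstruction("INT FU", "ADD int a, b"),
    SubInstruction("Float FU", "MUL double c, 3.142"),
    SubInstruction("Data Memory", "READ d, AR0"),
    SubInstruction("Program Memory", "BNZ loop, e"),
))

for unit, text in word.dispatch().items():
    print(f"{unit:15s} <- {text}")
```

Each functional unit receives exactly one sub-instruction of the long instruction, which is why all four operations can proceed in the same cycle.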
The conventional VLIW compiling apparatus translates each instruction and generates machine code independently, i.e. each instruction corresponds to one VLIW binary instruction of a specific length (e.g. 256 bits). This compiling scheme leads to wasted operation capacity, especially in loop structures.
A loop is one of the basic program structures in both high-level and low-level languages. In most DSP (Digital Signal Processing) style applications, a large number of loops are used for computations such as filtering, correlation, etc. In effect, the loop structure lets processors execute repeating instruction blocks with minimum program memory space.
After the instructions are translated by the conventional compiling method, the loop is expressed as machine (binary) instructions. Each binary instruction occupies 256 bits in program memory. If the loop is repeated K times, the processor needs K cycles to execute the whole loop structure (assuming the whole loop structure repeatedly executes one loop and is a zero-overhead loop). Thus one of the advantages of the conventional compiling method for loops is that it lets the processor execute a much longer repeating loop structure within a limited program memory space.
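The trade-off described above can be made concrete with a short sketch. It assumes a loop body of a single 256-bit binary instruction, one instruction per cycle and zero loop overhead; the function names `loop_cost` and `unrolled_cost` are hypothetical, and the fully unrolled variant is included only as a point of comparison that the text does not itself discuss.

```python
INSTRUCTION_BITS = 256  # example instruction width given in the text


def loop_cost(repeat_count, body_instructions=1):
    """Cycles and program memory (bits) for a zero-overhead loop.

    Assumes one binary instruction is executed per cycle.
    """
    cycles = body_instructions * repeat_count
    memory_bits = body_instructions * INSTRUCTION_BITS
    return cycles, memory_bits


def unrolled_cost(repeat_count, body_instructions=1):
    """Same work with the loop body copied out repeat_count times."""
    cycles = body_instructions * repeat_count
    memory_bits = body_instructions * repeat_count * INSTRUCTION_BITS
    return cycles, memory_bits


K = 1000
print("loop:    ", loop_cost(K))      # K cycles, one instruction stored
print("unrolled:", unrolled_cost(K))  # K cycles, K instructions stored
```

Under these assumptions both variants take K cycles, but the loop form stores the body only once, which is the memory advantage the paragraph refers to.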
For non-VLIW processors, the conventional compiling method can achieve an optimal result for both program memory occupation and loop execution efficiency. For VLIW processors, however, the conventional compiling method cannot guarantee the loop execution efficiency.
It is well known that, because the instruction system of the VLIW processor is complicated, the quality of the code generated by the compiler has a great effect on the processor's operating performance. Furthermore, since a large number of loops are used in VLIW code and the running time of the loop structures accounts for the larger portion of the total running time, the execution efficiency of the loop structure directly affects the operating efficiency of the whole VLIW processor.
When a VLIW loop is compiled with the conventional compiling method, the execution efficiency of the loop structure is low and loop time is wasted, so that it is difficult for the operating efficiency of the whole VLIW processor to satisfy the requirements.
For example, if one loop in a program needs to be repeated M times, 2(M−1) instruction cycles are wasted in the VLIW processor when the loop is compiled with the conventional compiling method. When M is relatively large, this causes a significant reduction in operating performance.
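To give a sense of scale, the sketch below simply tabulates the 2(M−1) figure stated above for a few repeat counts. The function name `wasted_cycles` is hypothetical, and the formula is taken from the text as given, without modelling where the wasted cycles come from.

```python
def wasted_cycles(m):
    """Wasted instruction cycles for a loop repeated m times, per the 2(M-1) figure."""
    if m < 1:
        raise ValueError("the loop must be repeated at least once")
    return 2 * (m - 1)


for m in (1, 10, 100, 1000):
    print(f"M = {m:5d}: {wasted_cycles(m):5d} cycles wasted")
```

The waste grows linearly with M, so loops with large repeat counts, which are typical of DSP-style code, are affected most.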