1. Field of the Invention
The invention relates to the architecture of very long instruction word (VLIW) processors.
2. Related Art
VLIW CPUs can be used in a variety of applications, from supercomputers to workstations and personal computers, or even as dedicated or programmable processors in workstations, personal computers, and video or audio consumer products.
FIG. 1 shows a prior art VLIW arrangement. Instructions are loaded from the Instruction Memory 102 to the Instruction Issue Register IIR. In each clock cycle, a new very long instruction is transmitted from the instruction issue register IIR. This instruction contains an issue slot for each of the functional units (CONTROL, CONST, ALU1, ALU2, MUL, FPU, MEM) in the VLIW CPU. The VLIW machine may contain any useful combination of functional units, the example shown here being only one such combination. There may be more or fewer functional units and there may be functional units of different types, depending on the desired operations. Operands for the functional units are read from a shared, multi-ported register file 101. Results from the functional units are also written to this file.
Each issue slot specifies an operation that is started in the current clock cycle on the corresponding functional unit. FIG. 2 shows a VLIW instruction 201 containing a CONTROL operation, a CONST operation, an ALU1 operation, an ALU2 operation, a MUL operation, an FPU operation, and a MEM operation. In other words, the VLIW instruction contains one issue slot for each functional unit in the VLIW CPU of FIG. 1.
202 shows the contents of the issue slot specifying the CONTROL operation. This issue slot contains a CONTROL opcode and two register specifications, Rsrc1 and Rsrc2, which are source register 1 and source register 2, respectively.
203 shows the contents of the issue slot specifying the CONST operation. This issue slot contains a constant value and a register specification, Rdest, which is the destination register.
204 shows the contents of the issue slots containing the ALU operations. Each of these issue slots contains an ALU opcode and three register specifications, Rsrc1, Rsrc2, and Rdest.
205 shows the contents of the issue slot containing the MUL operation. This issue slot contains a MUL opcode and three register specifications, Rsrc1, Rsrc2, and Rdest.
206 shows the contents of the issue slot containing the FPU operation. This issue slot contains an FPU opcode and three register specifications, Rsrc1, Rsrc2, and Rdest.
207 shows the contents of the issue slot containing the MEM operation. This issue slot includes a MEM opcode and two register specifications: Rsrc1 and either Rsrc2 or Rdest.
Again the contents of these issue slots are exemplary and may be adjusted to any useful configuration of functional units.
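The instruction format of FIG. 2 can be sketched as a small data structure. This is only an illustration under stated assumptions: the names `Slot`, `UNITS`, and `make_instruction`, the opcode strings, and the use of plain integers as register numbers are hypothetical and do not come from the figures. Field widths and encodings are not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# One issue slot per functional unit, in the order listed for FIG. 1.
UNITS = ("CONTROL", "CONST", "ALU1", "ALU2", "MUL", "FPU", "MEM")


@dataclass(frozen=True)
class Slot:
    """One issue slot. Unused fields stay None; an empty slot holds a NOP."""
    unit: str
    opcode: str = "NOP"            # opcode names are illustrative assumptions
    rsrc1: Optional[int] = None    # source register 1
    rsrc2: Optional[int] = None    # source register 2
    rdest: Optional[int] = None    # destination register
    const: Optional[int] = None    # constant value (CONST slot only)


def make_instruction(*ops: Slot) -> Tuple[Slot, ...]:
    """Build one very long instruction, padding unused units with NOP slots."""
    by_unit = {op.unit: op for op in ops}
    if len(by_unit) != len(ops):
        raise ValueError("at most one operation per functional unit")
    unknown = set(by_unit) - set(UNITS)
    if unknown:
        raise ValueError(f"unknown functional unit(s): {sorted(unknown)}")
    return tuple(by_unit.get(unit, Slot(unit)) for unit in UNITS)


# Example: one ALU operation and one constant load; the other five slots are NOPs.
instr = make_instruction(
    Slot("ALU1", "ADD", rsrc1=1, rsrc2=2, rdest=3),
    Slot("CONST", "CONST", const=42, rdest=4),
)
```

In this sketch every instruction always carries seven slots, so an instruction with only two useful operations still occupies the space of seven.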
In most prior art machines, an operation can be started on all functional units in each cycle. An operation started in cycle `i` may complete in one cycle or take several cycles to complete. Completion is evidenced by the writing of the result of the operation in the destination register. For operations without a result (such as `store` operations), completion is the time at which the state change associated with the operation occurs.
Most of the functional units of FIG. 1 are simple, such as the CONST (constant generation) unit. This unit produces a constant that is put into the destination register. The ALU, MUL and FPU units perform arithmetic, logical and shift operations on one or two arguments and produce a single result in the destination register.
The CONTROL and MEM units are somewhat different.
The CONTROL unit determines the sequence in which instructions are issued. If a NOP (No Operation) is issued on the CONTROL unit, instructions will be issued in sequential order from the Instruction Memory. If a CJMPF or CJMPT (Conditional JuMP False and Conditional JuMP True, respectively) operation is issued on the CONTROL unit, the contents of the Rsrc1 register will be interpreted as a truth, i.e. boolean, value, and the contents of the Rsrc2 register will be used as the address from which instruction issue will continue if and only if the specified condition is met. Otherwise, instruction issue will proceed sequentially.
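The behavior of the CONTROL unit described above can be sketched as a next-address function. The sketch makes several assumptions that the text does not state: instruction addresses advance by one per instruction, any nonzero register value counts as true, and the jump takes effect immediately with no delay cycles. The name `next_pc` and the dictionary-based register file are hypothetical.

```python
def next_pc(pc, opcode, regs, rsrc1=None, rsrc2=None):
    """Return the address of the next instruction to issue.

    pc     -- address of the current instruction
    opcode -- "NOP", "CJMPT" or "CJMPF"
    regs   -- mapping from register number to register contents
    """
    if opcode == "NOP":
        return pc + 1                      # sequential issue
    if opcode in ("CJMPT", "CJMPF"):
        condition = bool(regs[rsrc1])      # Rsrc1 interpreted as a truth value
        taken = condition if opcode == "CJMPT" else not condition
        # Rsrc2 holds the jump target; fall through if the condition is not met.
        return regs[rsrc2] if taken else pc + 1
    raise ValueError(f"unsupported CONTROL opcode: {opcode}")


regs = {1: 0, 2: 1, 3: 40}
next_pc(10, "CJMPT", regs, rsrc1=2, rsrc2=3)   # condition true: jump to 40
next_pc(10, "CJMPF", regs, rsrc1=2, rsrc2=3)   # condition true: fall through to 11
```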
The MEM unit performs load and store operations. In other words, it moves data words between the register file and system main memory. A load operation uses the contents of Rsrc1 as the address in main memory of the data word to be loaded, and Rdest identifies the register in which the loaded value is to be stored. A store operation uses the contents of Rsrc1 as the address and the contents of Rsrc2 as the value to be stored. There are, of course, many variants of the load and store operations. Since load instructions do not require Rsrc2 and store instructions do not require Rdest, the issue slot need only contain two register fields.
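The sharing of the second register field between load and store can be sketched as follows. The function name `mem_op`, the opcode strings "LOAD" and "STORE", and the word-addressed dictionary used as main memory are assumptions made for illustration only.

```python
def mem_op(opcode, regs, memory, rsrc1, rsrc2_or_rdest):
    """Execute one MEM operation using only two register fields.

    The second field is read as Rdest for a load and as Rsrc2 for a store.
    """
    address = regs[rsrc1]                       # Rsrc1 always supplies the address
    if opcode == "LOAD":
        regs[rsrc2_or_rdest] = memory[address]  # field acts as Rdest
    elif opcode == "STORE":
        memory[address] = regs[rsrc2_or_rdest]  # field acts as Rsrc2
    else:
        raise ValueError(f"unsupported MEM opcode: {opcode}")


regs = {1: 100, 2: 7, 3: 0}
memory = {100: 55}
mem_op("LOAD", regs, memory, rsrc1=1, rsrc2_or_rdest=3)    # regs[3] receives memory[100]
mem_op("STORE", regs, memory, rsrc1=1, rsrc2_or_rdest=2)   # memory[100] receives regs[2]
```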
In order for a software program to run on a VLIW machine, a "fine grain parallel" or "instruction level parallel" translation must be found. This is done by a compiler that translates a conventional high-level programming language, such as ANSI-C, into instructions for a VLIW machine. Compilers for use in VLIW machines are described in John R. Ellis, BULLDOG: A compiler for VLIW architectures, MIT Press 1985, ISBN 0-262-05034-X.
In order to operate the VLIW of FIG. 1 at its peak processing rate, 1 CONTROL, 1 CONSTANT, 2 INTEGER ALU, 1 INTEGER MULTIPLY, 1 FLOATING POINT, and 1 MEMORY operation must be issued in every cycle. Due to the nature of actual programs expressed in high level languages, it is not possible to find the appropriate mix of operations that will sustain this peak performance. After compiling, the set of operations that could be done in parallel in a given clock cycle is of a type mix that does not match the functional unit types that are available. In some cases, programs go through phases where fewer parallel operations can be found than the number of units in the machine.
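The mismatch between the available operation mix and the fixed set of functional units can be illustrated with a small calculation. This is a simplified sketch, not a scheduler: operations that cannot issue in a cycle are simply dropped rather than deferred, and the per-cycle operation counts in the example are invented for illustration. The names `CAPACITY` and `slot_utilization` are hypothetical.

```python
# Issue slots per operation type for the machine of FIG. 1 (seven in total).
CAPACITY = {"CONTROL": 1, "CONST": 1, "ALU": 2, "MUL": 1, "FPU": 1, "MEM": 1}


def slot_utilization(cycles):
    """Fraction of issue slots filled with useful operations.

    cycles -- list of dicts, one per clock cycle, mapping an operation type
              to the number of operations of that type ready to issue.
    """
    if not cycles:
        raise ValueError("at least one cycle is required")
    total_slots = sum(CAPACITY.values()) * len(cycles)
    used_slots = 0
    for ready in cycles:
        for kind, capacity in CAPACITY.items():
            # A unit type can absorb at most `capacity` operations per cycle.
            used_slots += min(ready.get(kind, 0), capacity)
    return used_slots / total_slots


# Invented example: cycle 0 has three ALU operations ready but only two ALU
# slots; cycle 1 has little parallelism at all.
cycles = [{"ALU": 3, "MEM": 1}, {"ALU": 1, "CONTROL": 1}]
utilization = slot_utilization(cycles)   # 5 of 14 slots filled
```

In this example, four operations are ready in the first cycle, yet only three issue because of the type mismatch, and the remaining slots in both cycles carry NOPs.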
This results in several problems.
First, register file ports are underutilized. The silicon area and power consumption of the central register file are proportional to the total number of ports. Hence it is important that the utilization of such ports be high.
Second, the instruction bandwidth needed to sustain the VLIW CPU at or near its peak performance is high. Empty slots, containing NOP codes, contribute to this bandwidth. The bandwidth translates directly into I-Cache (Instruction Cache) size and cost of the buses and other memory system components.
Third, the size of the code for a program translated for a VLIW is larger than the size of the same program when translated for a RISC-style CPU. Even though the performance of a VLIW is higher than that of a RISC, the cost-performance ratio of a VLIW is less favorable than that of a RISC.