Escalating demands for increased computer processing performance have led to the introduction of highly parallel computer architectures, in which multiple operations are simultaneously executed. Designed to exploit the fine-grained parallelism inherent in high-level language programs, a very long instruction word (VLIW) processor comprises multiple, parallel functional units controlled on a cycle-by-cycle basis by a very long instruction word (e.g., 100 or more bits). Very long instruction words comprise a concatenation of fields or "issue slots," each of which independently specifies the operation of a functional unit. VLIW processors are used in a variety of applications including supercomputers and mainframes, workstations and personal computers, and dedicated processors in audio and video consumer products.
FIG. 1 illustrates a conventional VLIW processor 100. On each machine cycle, very long instruction words are loaded or "issued" from an instruction memory 110 into an instruction issue register 120. Instruction memory 110, which can be a random access memory (RAM) or a read only memory (ROM), is typically pipelined and supplemented by an instruction cache (not shown) to enhance execution throughput. Each instruction word loaded into instruction issue register 120 contains a number of issue slots 121-127, each issue slot 121-127 for controlling a corresponding functional unit 131-137 in the VLIW processor 100. In general, a VLIW processor may comprise any useful combination of functional units, and FIG. 1 depicts one such combination. Depending on the particular implementation, there may be more or fewer functional units, and there may be functional units of different types. Functional units may perform a variety of operations, selected by an operation code (opcode); for example, an arithmetic-logic functional unit can add and subtract two values.
Each issue slot 121-127 within an instruction word loaded into the instruction issue register 120 specifies an operation to be started in the current clock cycle for the corresponding functional unit 131-137. In particular, each issue slot 121-127 typically contains an opcode and operands for the corresponding functional unit 131-137. The opcode is useful for functional units that perform a variety of different operations.
Operands for the functional units 131-137 are read from a shared, multi-ported register file 140, and results from the functional units 131-137 are written into the register file 140.
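By way of illustration only, the issue of one instruction word can be sketched as follows. The sketch is a simplified model under stated assumptions, not a description of any particular processor: the names `Slot`, `ALU_OPS`, and `issue` are hypothetical, only two ALU operations are modeled, every operation is assumed to complete in a single cycle, and all operands are assumed to be read before any result is written.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    """One issue slot: an opcode plus register operands (hypothetical layout)."""
    opcode: str
    ra: int  # index of first source register
    rb: int  # index of second source register
    rd: int  # index of destination register


# Assumed subset of ALU operations, for illustration only.
ALU_OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
}


def issue(word: List[Optional[Slot]], regs: List[int]) -> List[int]:
    """Model one machine cycle: each slot drives its own functional unit."""
    writes = []
    for slot in word:
        if slot is None:  # an empty slot leaves its functional unit idle
            continue
        # Operands come from the shared register file.
        a, b = regs[slot.ra], regs[slot.rb]
        writes.append((slot.rd, ALU_OPS[slot.opcode](a, b)))
    # Results are written back after all operands have been read (assumption).
    for rd, value in writes:
        regs[rd] = value
    return regs


regs = [0, 5, 3, 0, 0]
word = [Slot("ADD", 1, 2, 3), Slot("SUB", 1, 2, 4)]
issue(word, regs)
```

In this model, both operations are started by the same instruction word, and their results land in registers 3 and 4 of the shared register file.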
The term "specification of an operation" as used herein refers to a combination of an opcode, if needed, and operands, if needed, employed to specify an operation of a functional unit. Thus, each issue slot 121-127 contains a specification of an operation to be executed by a corresponding functional unit 131-137.
Referring to FIG. 2, issue slot 121 contains a specification 210 of the operation of the constant generation unit 131, namely, a constant value CONSTANT and a register RD to hold the constant value. The specification 220 of the operation of arithmetic-logic units (ALUs) 132 and 133, contained in respective issue slots 122 and 123, holds an ALU opcode, two source registers RA and RB, and a destination register RD for the operation of the respective ALU. Typical ALU opcodes indicate operations such as addition, subtraction, negation, logical and, logical or, logical exclusive or, logical complementation, and the like. Issue slot 124 holds a specification 240 containing a MUL opcode for multiplication, division, or square root, source registers RA and RB, and a destination register RD for the multiplier unit 134. Issue slot 125 includes specification 250 having an FPU opcode (e.g., addition, subtraction, and comparison), source registers RA and RB, and destination register RD for the floating point unit 135. A data memory unit 136 is controlled by issue slot 126, having a specification 260 including a MEM opcode, indicating a load or store operation, address registers RA and RB, and a data register RD. The jump control unit 137, with reference to specification 270 in the corresponding issue slot 127, uses registers RA and RB to indicate a conditional value and a jump destination address within instruction memory 110; the specification 270 also holds a JMP opcode specifying whether to jump always (unconditional jump), jump if the conditional value register is true, jump if the register is false, or not jump at all (NOP). The contents of these issue slots 121-127, the operation of the functional units 131-137, and the format of specifications 210-270 are to be regarded as exemplary and may be adjusted to suit any useful configuration.
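The field layout of an ALU specification, and the concatenation of issue slots into an instruction word, can be sketched as follows. This is a minimal sketch under assumptions not stated above: the field widths (a 4-bit opcode and 5-bit register indices), the opcode numbering, and the names `pack_alu`, `unpack_alu`, and `concat_slots` are all hypothetical.

```python
# Assumed field widths; the text above does not specify any.
OPCODE_BITS = 4
REG_BITS = 5
SLOT_BITS = OPCODE_BITS + 3 * REG_BITS  # opcode | RA | RB | RD

# Hypothetical opcode numbering for a few of the ALU operations named above.
ALU_OPCODES = {"ADD": 0, "SUB": 1, "AND": 2, "OR": 3, "XOR": 4}


def pack_alu(opcode: str, ra: int, rb: int, rd: int) -> int:
    """Pack an ALU specification into one issue slot, opcode in the high bits."""
    slot = ALU_OPCODES[opcode]
    for reg in (ra, rb, rd):
        if not 0 <= reg < (1 << REG_BITS):
            raise ValueError(f"register index {reg} out of range")
        slot = (slot << REG_BITS) | reg
    return slot


def unpack_alu(slot: int):
    """Recover (opcode, ra, rb, rd) from a packed ALU issue slot."""
    mask = (1 << REG_BITS) - 1
    rd = slot & mask
    rb = (slot >> REG_BITS) & mask
    ra = (slot >> (2 * REG_BITS)) & mask
    code = slot >> (3 * REG_BITS)
    name = next(n for n, c in ALU_OPCODES.items() if c == code)
    return name, ra, rb, rd


def concat_slots(slots, slot_bits: int = SLOT_BITS) -> int:
    """Concatenate equal-width issue slots into one instruction word."""
    word = 0
    for slot in slots:
        word = (word << slot_bits) | slot
    return word


first = pack_alu("ADD", 4, 5, 6)
second = pack_alu("SUB", 1, 2, 3)
word = concat_slots([first, second])
```

Under these assumed widths each ALU slot occupies 19 bits, so the two-slot word above fits in 38 bits. A real instruction word would also carry the differently shaped slots for the other functional units.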
In order for a software program to run on a VLIW machine, a "fine grained parallel" or "instruction level parallel" translation must be found. This is accomplished by a compiler that translates a conventional high-level programming language, such as ANSI-C, into VLIW instructions. Such compilers are described in John R. Ellis, BULLDOG: A Compiler for VLIW Architectures, MIT Press 1985, ISBN 0-262-05034-X. Each functional unit in a conventional VLIW processor is controlled by exactly one issue slot in the instruction word. For example, the VLIW processor 100 depicted in FIG. 1 includes seven issue slots 121-127 corresponding to the seven respective functional units 131-137. Thus, compilers for conventional VLIW processors emit instructions in which a functional unit is controlled by values in a single issue slot.
High performance processors, including VLIW processors, are subject to the so-called "branch delay" problem caused by the latency of instruction memory, which is the time in machine cycles between the transmission of an instruction address to an instruction memory and the receipt of the corresponding instruction word for execution. During sequential execution, the address of the next instruction can be predicted, allowing the instruction memory to be pipelined for an effective instruction memory latency of one machine cycle. For conditional jumps, however, the address of the next instruction cannot be predicted, because the destination address depends on the outcome of evaluating a conditional expression. Consequently, the branch delay represents a number of machine cycles in which the instruction word at the destination address is not available due to the evaluation of the conditional expression and instruction memory latency. As functional units are made even faster, for example, by aggressive pipelining, this latency increases. In a high-speed, pipelined architecture, it is very undesirable to wait for the instruction word at the destination of the jump instruction to be fetched from instruction memory before continuing execution of the instructions.
One solution to the branch delay problem was suggested by P. Y-T. Hsu, "Highly Concurrent Scalar Processing," Thesis, University of Illinois at Urbana-Champaign, 1986, which involves continuing to issue instructions during the branch delay period for all possible program paths in parallel, insofar as allowed by the limited number of functional units. Each operation, however, is "guarded" by a Boolean expression so that only those operations satisfying the Boolean expression or "guard value" are allowed to be performed. These guard values are constructed by the compiler in such a way as to ensure that only those operations on the intended program path are executed.
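The effect of guarding can be illustrated with a short sketch. It is not taken from the cited thesis: the guard encoding (a guard register plus a sense bit), the names `GuardedOp` and `issue_guarded`, and the single-cycle timing are assumptions made only for illustration. The example models the statement "if c then x = a + b, else x = a - b," with operations from both paths issued in the same instruction word on two different ALUs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuardedOp:
    """An operation with a guard: it takes effect only if the guard holds."""
    guard_reg: int     # register holding the condition value (assumed encoding)
    guard_sense: bool  # perform the operation when the condition equals this
    opcode: str
    ra: int
    rb: int
    rd: int


# Assumed subset of ALU operations, for illustration only.
ALU_OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
}


def issue_guarded(word: List[GuardedOp], regs: List[int]) -> List[int]:
    """Issue operations from both program paths; commit only guarded-true ones."""
    writes = []
    for op in word:
        if bool(regs[op.guard_reg]) == op.guard_sense:
            result = ALU_OPS[op.opcode](regs[op.ra], regs[op.rb])
            writes.append((op.rd, result))
    for rd, value in writes:
        regs[rd] = value
    return regs


# Register layout (hypothetical): r0 = c, r1 = a, r2 = b, r3 = x.
word = [
    GuardedOp(0, True, "ADD", 1, 2, 3),   # "condition true" path
    GuardedOp(0, False, "SUB", 1, 2, 3),  # "condition false" path
]
taken = issue_guarded(word, [1, 5, 3, 0])
not_taken = issue_guarded(word, [0, 5, 3, 0])
```

Both operations occupy issue slots regardless of the outcome of the condition, but only the operation on the intended path writes register 3.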
One disadvantage with current VLIW processors is evident when two or more of the parallel program paths in the branch delay period include operations pertaining to the same functional unit, such as a floating point unit 135. Since each functional unit is controlled by exactly one issue slot in the instruction word, operations for the same functional unit along different parallel program paths must be issued in separate instructions. For example, both program paths of a numerical analysis program may employ floating point operations after a conditional branch. In a VLIW processor with one floating point unit 135, only one floating point operation can be issued in each instruction. Thus, the floating point operations along one path, e.g. the "condition true" path, must be issued in different instructions than the floating point operations along the other path, e.g. the "condition false" path. Consequently, extra instructions need to be scheduled by the compiler, increasing both code size and execution time.
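The cost of this restriction can be made concrete with a deliberately simple scheduling sketch. The function `schedule_fpu_ops` and the operation names are hypothetical, and the model ignores data dependencies and operation latencies; it only captures the rule that each floating point unit accepts one operation per instruction.

```python
from typing import List


def schedule_fpu_ops(paths: List[List[str]], fpu_count: int) -> List[List[str]]:
    """Place the floating point operations of all paths into instructions.

    Each instruction can start at most `fpu_count` floating point operations,
    one per floating point unit (dependencies and latencies are ignored).
    """
    pending = [op for path in paths for op in path]
    instructions = []
    while pending:
        instructions.append(pending[:fpu_count])
        pending = pending[fpu_count:]
    return instructions


# Hypothetical floating point operations on the two paths after a branch.
true_path = ["fadd_t", "fsub_t"]
false_path = ["fadd_f", "fcmp_f"]

one_fpu = schedule_fpu_ops([true_path, false_path], fpu_count=1)
two_fpus = schedule_fpu_ops([true_path, false_path], fpu_count=2)
```

With a single floating point unit, the four operations occupy four instructions even though only two of them lie on the path actually taken. With a hypothetical second unit, the same operations fit in two instructions.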
As depicted in FIG. 1, one conventional approach to alleviating this disadvantage is to provide another implementation of the same functional unit. For example, the conventional VLIW processor 100 includes two arithmetic-logic units 132 and 133. While a second arithmetic-logic unit may not be prohibitively expensive, a second floating point unit is expensive to implement on a monolithic semiconductor device in terms of surface area and power consumption. Other complex functional units, such as multipliers, barrel shifters, and even data memory units, are also expensive to duplicate.