1. Field of the Invention
The present invention relates to very long instruction word (VLIW) processors and more particularly to a method and apparatus for a conditional control head instruction for VLIW processors.
2. Description of the Prior Art
Increasing demand for computer-processing performance has been met, at least in part, with computers that are able to employ instruction level parallelism (ILP), meaning that the computer can execute a plurality of instructions simultaneously. A very long instruction word (VLIW) processor achieves this by packing a plurality of operations into a single very long instruction word that is issued as a unit. VLIW processors are employed in super-computers, mainframes, and many other applications where high-performance processing power is required.
Each VLIW comprises a plurality of fields, or “slots”. Each slot is designed to comprise a single, basic instruction comprising an operational code (opcode) and any associated operands. The number of slots in the VLIW is typically architecture-dependent and is determined by the number of functional units (FUs), such as arithmetic and logic units (ALUs) and floating-point units (FPUs), in the machine. Each slot corresponds to a specific ALU or FPU. The ALU performs operations such as addition, subtraction, and multiplication of integers, as well as bit-wise and other Boolean operations. The FPU performs floating-point operations and, due to cost, generally only one is included in a CPU. When the VLIW is executed, each FU executes the operation indicated by the opcode in its corresponding slot.
In a conventional VLIW processor, one VLIW is transferred from memory to a pipeline during each machine cycle. The use of instruction pipelining is well known in the art and has proven extremely effective in increasing throughput when the executing program code merely comprises a series of instructions to be executed sequentially. Each stage in the pipeline performs a dedicated functional step related to the execution of the instruction, such as fetching a value of an operand from memory. During sequential execution, the next instruction to be executed is known and can be transferred to the pipeline one machine cycle after the current instruction was transferred. Therefore, even though each instruction may require several steps to complete, once the pipeline is full, one instruction can be completed with each machine cycle.
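The throughput benefit described above can be illustrated with a small calculation. The following is a minimal sketch, assuming an idealized pipeline with no stalls in which one instruction enters per machine cycle; the function name and parameters are hypothetical and are not taken from the figures.

```python
def pipeline_cycles(num_instructions, num_stages):
    """Cycles needed to complete a purely sequential instruction stream.

    Assumption: an ideal pipeline with no stalls or flushes. The first
    instruction needs num_stages cycles to pass through every stage;
    each later instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions, num_stages):
    """Cycles needed if each instruction must finish before the next starts."""
    return num_instructions * num_stages
```

Under these assumptions, 100 sequential instructions on a 5-stage pipeline complete in 104 cycles, compared with 500 cycles without pipelining, which approaches the one-instruction-per-cycle rate once the pipeline is full.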
However, most program code also comprises conditional instructions, for example an “if” statement, where it is unknown which specific instruction is to be executed next until after the conditional instruction has been completed. Because the next instruction to be executed is unknown, it is difficult or impossible to keep the pipeline full of sequentially required instructions, so each conditional instruction reduces throughput.
Please refer to FIG. 1, which shows a sample program segment with a conditional “if” statement 12 written in the C programming language. FIG. 2 illustrates how a conventional VLIW compiler may generate the assembly code for the program segment in FIG. 1. Note the use of multiple slots in lines 36 and 44, where a double vertical line indicates a slot boundary. In FIG. 2, it is not known if the flow of execution is to be lines 32->34->36->38->40->48 or lines 32->34->42->44->46->48 until after the expression (CMPGT R0, 0) in line 32 has been evaluated. If the condition (CMPGT R0, 0) is true, one set of instructions is needed. If the condition (CMPGT R0, 0) is false, a different set of instructions is needed. This ambiguity is known as a “branch delay” problem. The delay results from possibly having to flush the pipeline and wait for the transfer of the next correct instruction to the pipeline after the instruction in line 12 has been completed, a very undesirable result in high-performance processors.
Different approaches to the branch delay problem have been advanced. One common method is to attempt to predict the most likely instruction to be needed following a conditional statement based on the history of execution of a particular program. For example, if “R0” has always been greater than “0” before, it is predicted that “R0” will again be greater than “0” when line 34 is encountered during program execution. Based on this prediction, the instructions in lines 36, 38, 40, and 48 are loaded into the pipeline immediately after the instruction in line 34. If the prediction turns out to be correct, the branch delay problem has been circumvented. However, if the prediction turns out to be incorrect, the pipeline must be flushed to clear unwanted instructions and the correct instructions in lines 42, 44, 46, and 48 must be transferred to the flushed pipeline. Time is then wasted waiting for the correct instructions to work their way through the pipeline.
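The history-based prediction described above can be sketched in software. The following example uses a two-bit saturating counter, a commonly used prediction scheme chosen here only for illustration; the text above does not specify which predictor the prior art employs, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """History-based predictor using a 2-bit saturating counter.

    Assumption: counter states 0 and 1 predict "condition false",
    states 2 and 3 predict "condition true". The initial state (2) is
    an arbitrary choice for this sketch.
    """

    def __init__(self):
        self.counter = 2

    def predict(self):
        return self.counter >= 2

    def update(self, condition_was_true):
        # Move toward the observed outcome, saturating at 0 and 3.
        if condition_was_true:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Count how often the predictor is wrong over a sequence of outcomes.

    Each misprediction corresponds to one pipeline flush.
    """
    predictor = TwoBitPredictor()
    misses = 0
    for condition_was_true in outcomes:
        if predictor.predict() != condition_was_true:
            misses += 1
        predictor.update(condition_was_true)
    return misses
```

With a condition that has always been true, such as “R0” greater than “0” in the example above, this sketch never mispredicts. A single false outcome after a run of true outcomes costs one flush.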
A second approach to the branch delay problem loads all possible sequences of instructions, each sequence corresponding to one possible result of a conditional expression such as indicated in line 34. Thus, in this method, all of the instructions in lines 36-48 are transferred to the pipeline. Each operation in the VLIW comprises not only the opcode and related operands, but also a flag of one or more bits indicating that the operation belongs to one specific possible program branch. As an example of this approach in FIG. 2, the flag with each operation in lines 36-40 may be set to “1”, meaning branch “1”, and the flag with each operation in lines 42-46 may be set to “0”. If the condition (CMPGT R0, 0) in line 32 turns out to be true, only the instructions in slots that have a flag equal to “1” will be executed. If the condition (CMPGT R0, 0) in line 32 turns out to be false, only the instructions in slots that have a flag equal to “0” will be executed. While this second approach helps to keep the pipeline full, it requires that each slot in each VLIW include extra room for the flag bits. The number of bits required depends on the number of program branches possible at any given time during program execution.
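The flag-based approach described above can be sketched as follows. This is a minimal illustration, assuming that each operation is represented as a (flag, name) pair and that branch identifiers are stored in plain binary encoding; the operation names are placeholders and do not reproduce the instructions of FIG. 2.

```python
def select_operations(operations, condition_true):
    """Return the names of the operations whose flag matches the outcome.

    operations: list of (flag, name) pairs, where flag 1 marks the
    branch taken when the condition is true and flag 0 marks the
    branch taken when it is false (as in the example above).
    """
    active_flag = 1 if condition_true else 0
    return [name for flag, name in operations if flag == active_flag]


def flag_bits_needed(num_branches):
    """Flag bits per slot needed to distinguish num_branches branches.

    Assumption: branch identifiers use plain binary encoding, and at
    least one bit is always reserved.
    """
    return max(1, (num_branches - 1).bit_length())


# Placeholder operations: two belong to branch 1, one to branch 0.
loaded = [(1, "true_op_a"), (1, "true_op_b"), (0, "false_op_a")]
executed_if_true = select_operations(loaded, True)
executed_if_false = select_operations(loaded, False)
```

Under the binary-encoding assumption, two possible branches cost one extra bit in every slot, and three or four possible branches cost two extra bits in every slot, whether or not a given operation is conditional.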
Therefore, the prior art still lacks a satisfactory solution to the branch delay problem. The predictive approach helps only when the prediction is correct, and the flag bits approach requires additional bits to be stored with each operation in each VLIW instruction, bloating program size.