1. Field of the Invention
The invention relates to digital computers having at least some capability to perform two or more operations in parallel, and more particularly to a computer such as a pipelined system which is adapted to initiate at least one branching operation and at least one nonbranching operation in a machine cycle, and which may exhibit a branch delay of at least one machine cycle.
2. Description of the Related Art
It has become increasingly clear that the parallelism present in many ordinary non-numeric programs is fine-grain in that it is limited to a relatively confined portion of the program, i.e., the parallelism is largely local. Parallel processing architectures, on the other hand, are at their best when handling programs with large-scale, coarse-grain parallelism extending over processing steps of significant length, such as is found in many scientific problems and in communications processing. For these reasons, parallel processing architectures have proven to be ill-suited for speeding up the execution of many non-numeric or algorithmic programs.
One highly-parallel computer architecture called Very Long Instruction Word (VLIW) architecture has been proposed by J. A. Fisher in "Very Long Instruction Word Architectures and the ELI-512," Proceedings of the 10th Annual Symposium on Computer Architecture, June 1983, to exploit the modest fine-grain parallelism inherent in ordinary high level language programs. A VLIW machine consists of multiple independent functional units controlled on a cycle-by-cycle basis by a Very Long Instruction Word (100 or more bits). All of the functional units can be arbitrarily pipelined, i.e., they can start a new operation every cycle and take a fixed number of cycles to complete an operation, although the number of cycles for completion can vary from one functional unit to another. The pipeline stages of all units operate in lock step, controlled by a single global clock. The VLIW instruction is the concatenation of a plurality of operation subfields, one for each functional unit to be controlled.
All functional units are connected to a shared multiport register file from which they take their operands and into which they write their results. Any previously computed result can therefore be used as the operand for any functional unit. A VLIW instruction is loaded every cycle. Each functional unit is controlled during that cycle by its own control field which identifies the source and the destination locations in the multiport register file, and the operation to be started. A typical architecture includes a plurality of arithmetic and logic units, a plurality of memory interface units and a branching control unit. All three of these types of functional units are pipelined to maximize the speed of operation. Any type of functional unit can be provided, however, depending on the functions required for the particular application. Barrel shifters, multipliers, and any other functional units can be included provided they have a pipelined organization which permits an operation to be initiated every machine cycle. An "operation" in a VLIW machine is a primitive action taken by a single functional unit under control of the corresponding field of the VLIW instruction. A VLIW "instruction," then, is a concatenation of a plurality of such operation fields, to control the operation of all of the functional units in the architecture in parallel.
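By way of illustration only, the concatenation of operation fields into a single instruction word may be sketched as follows. The field widths, the two-source/one-destination layout, and the names `Operation`, `pack_instruction` and `unpack_instruction` are assumptions made for this sketch and are not taken from the cited references.

```python
from dataclasses import dataclass

# Assumed widths: the text only says the whole word is "100 or more bits".
OPCODE_BITS = 6
REG_BITS = 6                               # assumes a 64-entry register file
FIELD_BITS = OPCODE_BITS + 3 * REG_BITS    # one operation field = 24 bits


@dataclass(frozen=True)
class Operation:
    """Control field for one functional unit (hypothetical layout)."""
    opcode: int
    src1: int
    src2: int
    dest: int


def encode_operation(op):
    """Pack opcode, two source registers and a destination into one field."""
    if not 0 <= op.opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in its field")
    word = op.opcode
    for reg in (op.src1, op.src2, op.dest):
        if not 0 <= reg < (1 << REG_BITS):
            raise ValueError("register number does not fit in its field")
        word = (word << REG_BITS) | reg
    return word


def pack_instruction(ops):
    """Concatenate one operation field per functional unit, first unit highest."""
    word = 0
    for op in ops:
        word = (word << FIELD_BITS) | encode_operation(op)
    return word


def unpack_instruction(word, n_units):
    """Split an instruction word back into its per-unit operation fields."""
    reg_mask = (1 << REG_BITS) - 1
    field_mask = (1 << FIELD_BITS) - 1
    ops = []
    for slot in range(n_units):
        shift = FIELD_BITS * (n_units - 1 - slot)
        field = (word >> shift) & field_mask
        ops.append(Operation(
            opcode=field >> (3 * REG_BITS),
            src1=(field >> (2 * REG_BITS)) & reg_mask,
            src2=(field >> REG_BITS) & reg_mask,
            dest=field & reg_mask,
        ))
    return ops
```

With five functional units, this assumed layout yields a 120-bit instruction word, each 24-bit field being routed to exactly one unit.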
In order to efficiently program highly parallel machines such as VLIW machines, a compiler technique called trace scheduling has been used, as taught by R. P. Colwell et al. in "A VLIW Architecture for a Trace Scheduling Compiler," Proceedings of ASPLOS, 1987. A trace scheduling compiler takes as its input the instructions of a program and an execution profile indicating the likelihood of execution of each different path in the program. The trace scheduling compiler uses these inputs to construct a "trace" of the instruction path most likely to be executed. This trace is then scheduled to execute in parallel as much as possible, using all of the arithmetic and control units available in the VLIW machine.
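As a simplified illustration of trace selection, the following sketch follows the most probable successor from an entry block until the program exits or the path would revisit a block. It is a forward-only approximation under assumed inputs, not the procedure of the cited compiler; the function name `most_likely_trace` and the dictionary-based control flow graph are hypothetical.

```python
def most_likely_trace(successors, entry):
    """Return the list of basic blocks on the likeliest forward path.

    successors maps each block name to a dict {successor: probability},
    standing in for the execution profile described in the text.
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while successors.get(block):
        edges = successors[block]
        # Follow the edge the profile says is taken most often.
        nxt = max(edges, key=edges.get)
        if nxt in seen:
            break  # stop at a loop back-edge rather than revisit a block
        trace.append(nxt)
        seen.add(nxt)
        block = nxt
    return trace


# Hypothetical profile: block A branches to B 90% of the time.
profile = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"D": 1.0},
    "C": {"D": 1.0},
    "D": {},
}
trace = most_likely_trace(profile, "A")
```

The operations of the blocks on the selected trace would then be scheduled together, as far as dependences and the available functional units permit.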
For very high performance pipelined processors, including VLIW machines, a solution to the so-called "branch delay" problem is required. As a system is made faster by aggressive pipelining, the latency of the instruction memory increases. This latency is the time, in machine cycles, between the transmission of an instruction address to the instruction memory and the receipt of that instruction from the instruction memory for execution. For a conditional jump operation, the time required to evaluate the branching conditions must be added to this latency time. The total time is called the "branch delay." The branch delay represents a number of machine cycles following a jump operation during which the execution of instructions is not affected by the outcome of the jump operation. In a high speed pipelined architecture, it is very undesirable to simply wait before continuing the execution of instructions.
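The arithmetic of the branch delay may be sketched as follows. The cycle counts and the number of functional units are illustrative values only, and the helper names are hypothetical.

```python
def branch_delay(memory_latency, condition_eval):
    """Branch delay in machine cycles: instruction memory latency plus
    the time needed to evaluate the branching condition."""
    return memory_latency + condition_eval


def slots_in_delay_window(delay_cycles, functional_units):
    """Operation slots issued during the delay if one operation can be
    started per functional unit per cycle (assumed machine model)."""
    return delay_cycles * functional_units


# Assumed example: 2-cycle instruction memory, 1-cycle condition evaluation,
# 5 functional units.
delay = branch_delay(memory_latency=2, condition_eval=1)
idle_slots = slots_in_delay_window(delay, functional_units=5)
```

Under these assumed figures, simply waiting out each conditional jump would leave 15 operation slots unused.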
One solution to the branch delay problem was suggested in "Highly Concurrent Scalar Processing," by P. Y-T. Hsu, Thesis, University of Illinois at Urbana-Champaign, 1986. His solution involves executing, during the branch delay period, all possible program paths in parallel (insofar as allowed by the limited number of functional units), while "guarding" each operation by means of a Boolean expression. Only the operations satisfying the Boolean expression are permitted to affect the state of the VLIW machine. Such Boolean expressions are constructed by the compiler in such a way as to ensure that only those operations on the intended program path are executed. Hsu proposes to equip each functional unit with extra hardware to evaluate normal form Boolean expressions with a fixed number of factors, for example, three. This would allow the evaluation of expressions of the form a & b & c, in which any factor may also appear in complemented form, where the factors a, b and c are outcomes of three different jump condition evaluations. The hardware to evaluate such expressions is relatively simple. However, the cost of expanding the multiported register file with the read ports necessary to access all of the factors for each functional unit is prohibitive.
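A minimal software model of such guarded execution is sketched below. Each guard is represented as up to three factors, each factor being a jump condition name together with the outcome it requires; this representation, and the names `guard_holds` and `commit_guarded`, are assumptions of the sketch rather than details of Hsu's hardware.

```python
MAX_FACTORS = 3  # fixed number of factors per guard, as in the example above


def guard_holds(guard, outcomes):
    """Evaluate a conjunction of at most MAX_FACTORS factors.

    guard is a sequence of (condition_name, required_outcome) pairs;
    outcomes maps condition names to the evaluated jump conditions.
    An empty guard is treated as always true.
    """
    if len(guard) > MAX_FACTORS:
        raise ValueError("guard has more factors than the hardware evaluates")
    return all(outcomes[name] == wanted for name, wanted in guard)


def commit_guarded(operations, outcomes, registers):
    """Apply only those results whose guard is satisfied.

    operations is a sequence of (guard, destination, value) triples that
    were all started speculatively during the branch delay period.
    """
    for guard, dest, value in operations:
        if guard_holds(guard, outcomes):
            registers[dest] = value
    return registers


# Two operations from different program paths target the same register.
outcomes = {"a": True, "b": False, "c": True}
operations = [
    ((("a", True), ("b", True)), "r1", 10),               # path not taken
    ((("a", True), ("b", False), ("c", True)), "r1", 20),  # intended path
]
registers = commit_guarded(operations, outcomes, {"r1": 0})
```

In this model every guard evaluation reads up to three condition outcomes per operation, which corresponds to the additional register file read ports whose cost is noted above.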
The problem, then, is to maximize the throughput of highly pipelined computer processor architectures, such as VLIW architectures, in the presence of significant branch delays. More particularly, the major problem with the prior art Boolean guard expression solution to the branch delay problem is the high cost of the larger multiported register file.