The fundamental structure of a modern computer includes peripheral devices to communicate information to and from the outside world; such peripheral devices may be keyboards, monitors, tape drives, communication lines coupled to a network, etc. Within the computer is the hardware necessary to receive, process, and deliver this information to and from the outside world, including busses, memory units, input/output (I/O) controllers, storage devices, and at least one central processing unit (CPU). The CPU and other processors execute instructions of computer application programs and direct the operation of all other system components. Processors actually perform primitive operations, such as logical comparisons, arithmetic, and movement of data from one location to another, quickly. What may be perceived by the user as a new or improved capability of a computer system may actually be the processor(s) performing these same simple operations much faster. Continuing improvements to computer systems, therefore, require that these systems be made even faster.
One measure of the overall speed of a computer system, also called the "throughput", is the number of operations a processor performs per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, particularly the clock speed of the processor. If everything runs twice as fast but otherwise works in exactly the same manner, the system then performs a given task in half the time. Computer processors, which years ago were constructed from discrete components, became significantly faster as designers shrank the size and reduced the number of components, so that eventually the entire processor was packaged as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems still exists. Hardware designers have been able to obtain still further improvements in speed by greater integration, by further reducing the size of the circuits, and by various other techniques. Designers, however, know that physical size reductions cannot continue indefinitely, and there are limits to continually increasing processor clock speeds. Attention has therefore been directed to other approaches, including new computer architectures, for further improvements in overall speed of the computer system.
The modest cost of packaging individual processors on integrated circuit chips has made it practicable to improve system speed using multiple processors without changing the clock speed. In addition, the system speed significantly improves by off-loading work from the CPU to slave processors having limited functions. For instance, slave processors routinely execute repetitive and single special purpose programs, such as input/output device communications and control. It is also possible for multiple CPUs to be placed in a single computer system, typically a host-based system which services multiple users simultaneously. Each of the different CPUs can simultaneously execute a different task on behalf of a different user to increase the overall speed of the system. This technique is shown in FIG. 1, which illustrates several processors labeled CPU 1, CPU 2, . . . connected by a communications network and controlled so that more than one processor may execute different tasks at the same time. Each short horizontal line under a task represents an instruction, with many instructions per task. In a real situation there probably would be many more instructions per task than shown in FIG. 1. Each CPU executes one instruction at a time, so with more than one CPU executing instructions simultaneously, the parallel processor saves elapsed time. There is significant overhead, however, to start all the separate tasks in separate processors, to synchronize and communicate among tasks, and to assemble their partial results to generate the overall result. To use this kind of traditional parallel processor on a particular application, a programmer or a sophisticated compiler must break the problem into pieces and set up appropriate communications and controls. If this overhead consumes more time than was saved by parallel execution, the parallel processor approach is limited.
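The trade-off between parallel execution and its overhead can be illustrated with a minimal sketch. The cost model below is an assumption made only for illustration: the work divides evenly among the CPUs, and each additional CPU adds a fixed coordination cost for start-up, synchronization, and assembly of partial results. The function name and the numbers are hypothetical.

```python
def parallel_elapsed_time(total_work, n_cpus, overhead_per_cpu):
    """Elapsed time under an assumed, simplified cost model.

    total_work: time the task would take on one CPU.
    n_cpus: number of CPUs sharing the work evenly.
    overhead_per_cpu: assumed fixed coordination cost added per CPU
        when more than one CPU is used.
    """
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    compute = total_work / n_cpus
    # A single CPU needs no inter-processor coordination in this model.
    overhead = overhead_per_cpu * n_cpus if n_cpus > 1 else 0.0
    return compute + overhead


serial = parallel_elapsed_time(1000, 1, 50)   # no overhead
four = parallel_elapsed_time(1000, 4, 50)     # overhead is repaid
forty = parallel_elapsed_time(1000, 40, 50)   # overhead dominates
```

Under these assumed numbers, four CPUs finish sooner than one, while forty CPUs take longer than the serial run because coordination cost exceeds the time saved.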
This traditional parallelism offers most cycle savings to problems which divide naturally into large pieces which have little need to communicate with each other, such as scientific numerical methods and other highly structured problems. Practicably, however, there is limited application for parallel processing on multiple CPUs for problems which have unpredictable pathways of execution and/or which require extensive sharing and communication among the processors.
Computer architectures of the reduced instruction-set computers (RISC), superscalars, and very long instruction word (VLIW) machines are based on the premise that the simpler the instruction set, the more efficiently it can be implemented by hardware. These architectures have multiple execution units and multiway branching mechanisms for parallel processing of application code. These architectures, moreover, stimulated the development of compiler technology to take advantage of the parallelism available in an application without resorting to special languages that express this parallelism in a highly optimized code. During the compilation process as many decisions as possible are made to free the hardware from making decisions during program execution.
Another approach is a hybrid in which a single CPU has characteristics of both a uniprocessor and a parallel machine to implement fine-grained parallelism. In this approach, a single instruction register and instruction sequence unit execute programs under a single flow of control, but multiple arithmetic/logic units (ALUs) within the CPU can perform multiple primitive operations simultaneously. Rather than relying on hardware to determine which operations can be executed simultaneously, a compiler formats the instructions to specify the parallel operations before execution. A superscalar computer typically executes up to four instructions per processor clock cycle. In addition, extending the instruction word held in the instruction register to specify multiple independent operations to be performed by the different ALUs requires a very long instruction word. The Very Long Instruction Word (VLIW) computer may execute sixteen instructions or more per processor cycle.
Several academic papers suggest that a VLIW architecture can achieve greater parallelism and greater speed than multiple independent processors operating in parallel in many applications. Shown in FIG. 2 is a model of an exemplary VLIW computer having fine-grained parallelism at the level of machine instructions within a task. As can be seen, a typical application program has a single flow of control indicated by the time line along the left of the figure, but primitive operations within that flow are performed in parallel. The VLIW compiler discovers which primitive operations within a program can be performed simultaneously and then compiles the instructions for these operations into a compound instruction, the very long instruction word, hence the vernacular for the computer architecture and for the instruction: VLIW. An automated compiler for a VLIW machine, therefore, does not have to alter program flow, which is something that has been almost impossible to automate in parallel processor machines; the compiler for a VLIW machine has only to determine which primitive operations can be performed in parallel and create the compound instructions executed by the hardware. A well-written compiler, moreover, generates an instruction stream to optimize the useful work of separate hardware units during as many machine clock cycles as possible. A primitive is that portion of a VLIW instruction which controls a separate hardware unit. These separate hardware units within the CPU include arithmetic logic units (ALU), including floating point units which perform exponential arithmetic, register to storage (RS) units which provide a direct path to memory storage outside the CPU, and register to register (RR) units which provide a direct path to another register in the processor. Thus in one cycle, all these separate resources within a VLIW machine can be used, so several basic machine instructions can execute simultaneously.
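The compiler's task of grouping independent primitive operations into compound instructions can be sketched as follows. This is a minimal illustration, not the scheduling method of any particular VLIW compiler: it uses a simple greedy pass, treats all hardware units as interchangeable (ignoring the distinction among ALU, RS, and RR units), and the operation names and the helper `pack_into_vliws` are hypothetical.

```python
def pack_into_vliws(ops, deps, units_per_word):
    """Greedily group operations into compound instructions.

    ops: operation names in original program order.
    deps: maps an operation to the set of operations whose results it needs.
    units_per_word: assumed number of operations one VLIW can hold.

    An operation may enter a word only if everything it depends on
    completed in an earlier word.
    """
    done = set()
    remaining = list(ops)
    words = []
    while remaining:
        ready = [op for op in remaining if deps.get(op, set()) <= done]
        word = ready[:units_per_word]
        if not word:
            raise ValueError("dependencies cannot be satisfied")
        words.append(word)
        done.update(word)
        remaining = [op for op in remaining if op not in word]
    return words


ops = ["load_a", "load_b", "add", "load_c", "mul", "store"]
deps = {
    "add": {"load_a", "load_b"},
    "mul": {"add", "load_c"},
    "store": {"mul"},
}
words = pack_into_vliws(ops, deps, units_per_word=4)
```

In this example the three loads are independent and share the first word, so six sequential operations are issued as four compound instructions. The single flow of control is unchanged; only the grouping of primitives differs.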
The advantage is that a task can be completed in fewer machine cycles than is possible on a traditional uniprocessor; in other words, the "turnaround time" from task initiation to task completion is reduced and its results are available sooner. For transaction processing applications, where each transaction requires a long sequential series of operations, and communication between transaction processing tasks is negligible, this idea has a natural advantage over traditional parallelism.
The size and format of the VLIW present special considerations. The expected size of the instruction word imposes significant burdens on the supporting hardware outside the CPU, such as memory, instruction cache, buses, etc. There are several reasons for the large instruction word in the VLIW design. Recall that a VLIW requires that multiple hardware units operate simultaneously to perform parallel operations. Each of these hardware units requires its own command, which includes an operation code, source and destination designations, etc. Further there must be a mechanism to determine the next instruction to execute. This determination, often called control flow, presents its own peculiarities in any computer but these peculiarities greatly increase in a VLIW computer. Control flow is said to jump to the next instruction when there is no choice or condition which determines the next instruction. Control flow branches to the next instruction when the change of control flow is conditional. By far, conditional branching constitutes the dominant mechanism to change control flow in most computer architectures, including VLIW.
In order to utilize the conditional branching capabilities, the compiler resolves all conditional branch statements into two component parts: instruction(s) which perform a test and set of a condition register; and a branch instruction which tests a condition register previously set. Without violating dependency relationships, the compiler schedules the instructions which test and set the condition registers to their earliest possible execution times. When condition registers that determine a possible path through a branch tree have been set, the compiler may then form a branch conditional instruction in the VLIW testing up to sixteen condition registers. The compiler then schedules as many instructions as possible that lie along the branch path into this VLIW. The above step is repeated until non-dependent instructions on up to any number, preferably six or eight, branch paths plus a sequential path are formed into a single VLIW.
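The two-part treatment of a conditional branch can be sketched as follows. The sketch is an illustration under stated assumptions rather than the mechanism of any particular machine: the sixteen condition registers are modeled as a list of booleans, the branch tree is modeled as a prioritized list in which the first matching entry wins, and all names, including the targets "C" and "D", are hypothetical.

```python
NUM_CONDITION_REGISTERS = 16  # the text mentions testing up to sixteen


def set_condition_registers(a, b, flag):
    """Part one: instructions that test a condition and set a register.

    These can be scheduled early, as soon as their operands are available.
    """
    cr = [False] * NUM_CONDITION_REGISTERS
    cr[0] = (a + b) >= 0      # arithmetic test
    cr[1] = flag is True      # logical test
    return cr


def branch_on_conditions(cr, branch_tree, sequential_target):
    """Part two: a branch that only tests previously set registers.

    branch_tree: list of (required, target) pairs, where required maps a
        condition register index to the value it must hold.
    Returns the first matching target, else the sequential path.
    """
    for required, target in branch_tree:
        if all(cr[i] == expected for i, expected in required.items()):
            return target
    return sequential_target


cr = set_condition_registers(a=3, b=-5, flag=True)
tree = [
    ({0: True}, "C"),             # taken when a + b >= 0
    ({0: False, 1: True}, "D"),   # taken when a + b < 0 and flag is set
]
target = branch_on_conditions(cr, tree, sequential_target="next")
```

Separating the test from the branch lets the tests move upward in the schedule independently, while the branch itself reduces to a lookup over register values that are already known.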
Evaluating the conditions to determine the appropriate branch is traditionally accomplished in a few ways. The first method uses special bits in the executing instruction, called a condition code which can be implicitly or explicitly set. Implicitly set condition codes increase the difficulty of finding when a branch has been decided and the difficulty of scheduling branch delays. This makes it hard to find instructions that can be scheduled between the condition evaluation and the branch, especially when the condition code is set by a large or haphazardly chosen subset of instructions. Many new architectures avoid condition codes or set them explicitly under the control of a bit in the instruction.
Another technique to test branch conditions involves simple equality or inequality tests, e.g., (a) if the result of adding A+B is greater than or equal to zero, then go to C; or (b) if the result of a logical instruction is equal/not equal to TRUE/FALSE, go to D, see e.g., FIG. 10 and its accompanying discussion. Usually simple compare and branch instructions are part of the branch, but for more complex comparisons, such as floating-point comparisons, a condition register may be implemented wherein the result of the comparison is tested with a value stored in a register and a branch is specified depending upon the comparison. In any event, determining the next instruction by evaluating branch conditions is not a trivial undertaking. The problems are compounded n-fold in a VLIW because there may be n branch conditions needed to be evaluated to determine the next VLIW.
Typical operating systems and other types of code have a program structure characterized by sequences of a few instructions separated by branches. For processors with multiple parallel execution units, such as VLIW, the capacity for a compiler to fill the available execution units by manipulating the code is severely restricted if only one branch can be made per VLIW instruction cycle. The problem is to minimize the impact of complex branching structures not only on a compiler's capacity for optimizing code movement, but also on the critical path and cycle time of the VLIW processor hardware. An N-way VLIW or superscalar processor where N is the number of branches and is large, i.e., greater than or equal to sixteen, faces the problem of very likely having to branch almost every cycle with up to eight-way branching possible and three- to four-way branching probable. If the access time to critical registers and caches for instructions and data within the processor complex is taken to be approximately the cycle time plus clock and latch overhead of a superscalar or a VLIW processor, then all current reduced instruction set computer (RISC) architectures already require two cycles latency per branch taken. At two cycles per iteration for resolving and fetching branches, the effective execution rate of the processor is reduced to one-half of what it would have been without branches. An extra cycle or two for missed prediction of branches, moreover, might be common and average branch taken latency could approach three cycles. This is an unacceptable penalty. To further compound the problem of branch prediction in a VLIW machine, a requirement to predict eight simultaneous branches with a total accuracy of ninety-five percent is an impossible task. Hypothetically, in a sixteen-way parallel processor with eight possible branch targets, either an eight-port instruction cache must be implemented which is very expensive, or some type of branch prediction scheme must be used.
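The arithmetic behind these penalties can be made explicit with a short sketch. Two simplifying assumptions are made for illustration only: every VLIW ends in a taken branch, and the eight branch predictions are statistically independent, so their accuracies multiply. The function names are hypothetical.

```python
def effective_rate_fraction(cycles_per_branch_taken):
    """Fraction of the branch-free execution rate that remains when each
    VLIW costs the given number of cycles because of a taken branch."""
    return 1.0 / cycles_per_branch_taken


def per_branch_accuracy_needed(total_accuracy, n_branches):
    """Accuracy each individual prediction must reach so that all
    n_branches independent predictions are right with total_accuracy."""
    return total_accuracy ** (1.0 / n_branches)


half_rate = effective_rate_fraction(2)     # two cycles per branch taken
third_rate = effective_rate_fraction(3)    # with misprediction cycles
needed = per_branch_accuracy_needed(0.95, 8)
```

Under the independence assumption, each of the eight predictions would need to be correct roughly 99.4 percent of the time to reach ninety-five percent overall, which indicates why the text regards the requirement as unattainable.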
As said before, the power of a VLIW machine is to evaluate which primitives can be executed in parallel, execute all of those primitives in parallel, and then select the next VLIW for future processing. It is, therefore, necessary to permit conditional branching to multiple destinations from a single VLIW, a characteristic referred to as N-way branching. All of the branch conditions and destinations, moreover, must be specified in the instruction.
Joseph Fisher and his group at Yale observed that branches in a particular program follow a predictable path a high percentage of the time. Fisher et al. formulated the Extra Long Instruction (ELI), and measured the most common execution paths and designed branching mechanisms to determine if these paths could be executed simultaneously. The trace scheduling techniques Mr. Fisher created, disclosed in U.S. Pat. No. 4,833,599 entitled "Hierarchical Priority Branch Handling for Parallel Execution in a Parallel Processor" to Colwell et al., and Multiflow Corporation's VLIW processor require compile time prediction of the most probable path through a program branch tree. For scientific computing having a high degree of predictability, this approach works reasonably well. If, however, the code would deviate from a predictable path, a high penalty would be paid. Other code structures, such as in commercial applications and operating systems, do not have predictable branching, so the trace scheduling techniques disclosed by Fisher lead to poor utilization of the available parallel resources in the VLIW unit.
Percolation scheduling is another technique to be used for commercial and operating systems where many branches are simultaneously executed. Kemal Ebcioglu created this system where there is no prediction for branch path. FIG. 3 illustrates how percolation moves instructions up in the instruction stream so they are executed as soon as their data are available. The dotted arrows in FIG. 3 show movements of instructions from positions in an original stream of von Neumann execution to final positions as parcels in VLIW instructions. One instruction 300 has been moved above a conditional branch 320; in its original position in the instruction stream the execution of instruction 300 is dependent on the result of a conditional branch 320, but in the resulting VLIW instruction stream, instruction 300 will be executed before the branch condition is tested. This is an example of speculative execution which means that the work is done before it is known whether or not the work is necessary. When a VLIW machine has enough resources to accomplish this speculative work without holding up any other work, then there is a gain whenever the speculative work is later determined to be necessary. With more and more resources, speculative execution becomes more and more powerful. For example, if a program splits into two legs after a conditional branch and there are enough resources to move some instructions from both legs up above the branch point, then speculative execution definitely reduces the time needed to finish this program. See Silberman, Gabriel M. and Ebcioglu, An Architectural Framework for Supporting Heterogeneous Instruction-Set Architectures, IEEE COMPUTER 39-56 (June 1993).
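The effect of moving an instruction above a conditional branch can be sketched with a toy cycle count. The model is an assumption made for illustration: each sequential step costs one cycle, operations placed in the same VLIW share a cycle, and the multiply stands in loosely for instruction 300 and the test for conditional branch 320. The function names are hypothetical.

```python
def run_original(x):
    """Sequential stream: the multiply runs only after the branch resolves."""
    cycles = 0
    cond = x > 0          # test the branch condition
    cycles += 1
    cycles += 1           # resolve the branch
    if cond:
        y = x * 2         # dependent instruction on the taken leg
        cycles += 1
    else:
        y = 0
    return y, cycles


def run_percolated(x):
    """Percolated stream: the multiply is hoisted above the branch."""
    cycles = 0
    cond = x > 0          # same VLIW: condition test ...
    speculative = x * 2   # ... and speculative multiply in parallel
    cycles += 1
    y = speculative if cond else 0   # branch keeps or discards the result
    cycles += 1
    return y, cycles


taken_before = run_original(5)
taken_after = run_percolated(5)
not_taken_before = run_original(-1)
not_taken_after = run_percolated(-1)
```

In this toy model both versions produce the same result on either path. The percolated version saves a cycle when the speculative work turns out to be needed and costs nothing extra when it does not, provided a hardware unit was free to perform it.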