A typical general purpose computer system comprises one of many different architectures. Architecture, as used herein, refers to the instruction set and resources available to a programmer for a particular computer system. Thus, architecture includes instruction formats, instruction semantics, operation definitions, registers, memory addressing modes, address space characteristics, etc. An implementation is a hardware design or system that realizes the operations specified by the architecture. The implementation determines the characteristics of a microprocessor that are most often measured, e.g. price, performance, power consumption, heat dissipation, pin count, operating frequency, etc. Thus, a range of implementations of a particular architecture can be built, but the architecture influences the quality and cost-effectiveness of those implementations. The influence is exerted largely in the trade-offs that must be made to accommodate the complexity associated with the instruction set.
Most architectures try to increase efficiency in their respective implementations by exploiting some form of parallelism. For example, in single instruction multiple data stream (SIMD) architecture implementations, the various processing elements (PEs) can all perform the same operation at the same time, each with its own local (different) data.
One common architecture is the very long instruction word (VLIW) architecture. Although VLIW systems are very similar to SIMD systems, in a VLIW system each PE can perform a different operation, independent of the other PEs. However, the grouping of the sets of operations that PEs can execute together is static. In other words, the choice of which operations can execute simultaneously is made at compile time. Moreover, their execution is synchronous. This means that the PEs process instructions in a lock-step manner. Note that VLIW PEs are sometimes referred to as function units (FUs), because some PEs within a VLIW system may support only certain types of operations.
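The difference between the SIMD and VLIW execution styles described above can be sketched in a few lines of Python. This is only an illustrative model, not part of any described implementation: the function names `simd_step` and `vliw_step`, the use of two-operand operations, and the operand values are all assumptions made for the example.

```python
import operator

def simd_step(op, operands):
    """SIMD model: every PE applies the same operation to its own local data."""
    return [op(a, b) for a, b in operands]

def vliw_step(instruction_word, operands):
    """VLIW model: one instruction word holds a (possibly different) operation
    for each PE; the grouping is fixed ahead of time and all PEs execute their
    slot in the same step (lock-step)."""
    if len(instruction_word) != len(operands):
        raise ValueError("one operation slot is required per PE")
    return [op(a, b) for op, (a, b) in zip(instruction_word, operands)]

# Hypothetical local data for three PEs.
operands = [(1, 2), (3, 4), (5, 6)]

# SIMD: the same add on every PE.
simd_result = simd_step(operator.add, operands)

# VLIW: a statically chosen word with a different operation per PE.
vliw_result = vliw_step([operator.add, operator.mul, operator.sub], operands)
```

In this sketch the list passed to `vliw_step` plays the role of the instruction word chosen at compile time; nothing in the step itself decides which operations run together.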
VLIW processors are wide-issue processors that use static scheduling to orchestrate the execution of a number of parallel processing elements. VLIW processors have constant branch latency. When a branch is executed on one of a multiplicity of processing elements, the effect of the branch occurs simultaneously on all processing elements. That is, if a branch is issued on the t-th cycle and the branch latency is q cycles, all function units begin execution of code at the branch target at cycle t+q.
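The constant-latency rule can be written down directly. The following is a minimal sketch, assuming integer cycle numbers; the function names and the sample values (t = 10, q = 3, five function units) are hypothetical and chosen only for illustration.

```python
def branch_target_cycle(issue_cycle, branch_latency):
    """Cycle at which execution begins at the branch target: t + q."""
    if branch_latency < 1:
        raise ValueError("branch latency must be at least one cycle")
    return issue_cycle + branch_latency

def target_cycle_per_unit(issue_cycle, branch_latency, num_units):
    """With constant branch latency, every function unit sees the branch
    take effect on the same cycle, so the list holds one repeated value."""
    return [branch_target_cycle(issue_cycle, branch_latency)] * num_units

# Hypothetical example: branch issued on cycle 10, latency of 3 cycles, 5 units.
cycles = target_cycle_per_unit(10, 3, 5)
```

Here `cycles` contains the same value for all five units, which is the property that static VLIW scheduling relies on.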
VLIW processors use a program counter to index into an instruction memory to select an instruction that is used to simultaneously control all processing elements. Since the instruction is taken from the instruction memory in a single atomic action, program text that is taken from the instruction memory after a branch is available to all function units on the same program cycle.
Since the processing elements are physically separate from each other, a branch that occurs in an originating processing element may take different amounts of time to reach each of the processing elements. VLIW scheduling, however, requires that the branch take effect at all processing elements at the same time. Thus, the branch command at the closest processing element must be delayed for a time equal to the branch delay to the farthest processing element.
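The delay equalization described above can be illustrated as follows. This is a sketch under stated assumptions: the per-element propagation delays are hypothetical numbers, and `equalizing_delays` is a name introduced only for this example. Each processing element is padded so that its propagation delay plus its padding equals the delay to the farthest element.

```python
def equalizing_delays(propagation_delays):
    """Return the extra wait, in cycles, each processing element needs so that
    the branch takes effect everywhere on the same cycle.

    propagation_delays[i] is the (assumed) number of cycles the branch needs
    to reach processing element i from the originating element.
    """
    if not propagation_delays:
        raise ValueError("at least one processing element is required")
    worst = max(propagation_delays)
    return [worst - d for d in propagation_delays]

# Hypothetical 5-element machine; element 0 originates the branch.
delays = [0, 1, 1, 2, 2]
padding = equalizing_delays(delays)

# Every element now observes the branch after the same total number of cycles.
effective = [d + p for d, p in zip(delays, padding)]
```

With these sample values the originating element, which has no propagation delay, waits the full delay of the farthest element, and the farthest elements wait no extra cycles.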
FIG. 4 depicts a conventional scheduling model that has uniform latency. Basic blocks of program code are labeled block 1 401, block 2 402, and block 3 403. Increasing time is represented by moving downward one row for each clock cycle. This VLIW system is a 5-way VLIW processor and has five processing elements operating on the basic blocks, as represented by the columns. This system has a branch latency of 3 cycles; thus, it takes 2 additional cycles after a branch is encountered for all of the processing elements to receive the branch. As shown in FIG. 4, the first processing element performs a conditional branch B 404 while processing the 6th cycle of basic block 1 401. This conditional branch may lead to processing of block 2 402 or block 3 403, depending upon whether the condition is satisfied or not. For example, block 2 402 may be processed if the condition is not satisfied, i.e. the branch falls through, and block 3 403 may be processed if the condition is met, i.e. the branch is taken. Note that the fall-through block is normally contiguous in memory with the prior block from which the branch was issued.
In any event, the first processing element cannot immediately begin execution on either block 2 or block 3, but rather must wait for 2 cycles until all of the other processing elements are ready to move to the next block. This two cycle branch delay causes a gap between the location of the branch in the code and the actual end of the basic block. This gap is called the branch shadow. Operations within the two cycle branch shadow execute unconditionally, as if the branch had not yet been executed. Useful operations that should execute irrespective of the branch condition, or no-ops 406, may be executed within the branch shadow. However, any operation that should not execute when the branch is taken must appear below this two cycle window.
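The branch shadow in this example can be traced with a small model. The sketch below assumes that rows of block 1 are numbered from 1, that the branch sits in row 6 with a latency of 3 cycles (the values given for FIG. 4), and that block 3 is the taken target and block 2 the fall-through, as in the example above. The function names are hypothetical.

```python
def shadow_rows(branch_row, branch_latency):
    """Rows that execute unconditionally after the branch row (the branch shadow)."""
    return list(range(branch_row + 1, branch_row + branch_latency))

def trace(branch_row, branch_latency, taken):
    """List (block, row) pairs executed up to the first row of the next block.

    Block 1 runs through the branch row and its shadow; only then does
    control move to block 3 (branch taken) or block 2 (fall-through).
    """
    rows = [("block 1", r) for r in range(1, branch_row + branch_latency)]
    next_block = "block 3" if taken else "block 2"
    rows.append((next_block, 1))
    return rows

# Values from the FIG. 4 description: branch in row 6, latency of 3 cycles.
shadow = shadow_rows(6, 3)
taken_trace = trace(6, 3, taken=True)
```

In this model the shadow covers rows 7 and 8 of block 1, and the first row of the next block is the ninth entry of the trace whether or not the branch is taken.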
After waiting, all of the processing elements then move to either block 2 or block 3. For example, the first processing element starts execution at either location 405 or 407, of blocks 2 or 3, respectively. During processing of block 2 or block 3, another branch would be encountered, e.g. branch 408 or 409, which would change the flow of the program to other basic blocks (not shown). Again, because of the three cycle branch latency, the branch originating processing element would wait for two additional cycles before processing the subsequent basic block so that the other processing elements would have received the branch.
A problem with this arrangement is the cycles lost to waiting for the branch latency. For programs with many branches, these lost cycles can greatly reduce the efficiency of the system.
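A rough upper bound on this loss can be estimated with simple arithmetic. The model below is an assumption made for illustration, not a measured result: it supposes that every branch is followed by a shadow of (latency - 1) cycles and that, in the worst case, the scheduler finds no useful operations to place there, so every shadow cycle holds only no-ops. The function name and the sample numbers are hypothetical.

```python
def worst_case_lost_fraction(total_cycles, num_branches, branch_latency):
    """Fraction of cycles spent in branch shadows if no shadow can be filled
    with useful operations (a pessimistic, assumed model)."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if branch_latency < 1:
        raise ValueError("branch latency must be at least one cycle")
    shadow_cycles = num_branches * (branch_latency - 1)
    if shadow_cycles > total_cycles:
        raise ValueError("shadow cycles cannot exceed total cycles")
    return shadow_cycles / total_cycles

# Hypothetical program: 100 cycles, 10 branches, 3-cycle branch latency.
lost = worst_case_lost_fraction(100, 10, 3)
```

With these sample numbers, 20 of the 100 cycles fall in branch shadows. Any useful operations scheduled into the shadows would reduce the actual loss below this bound.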