1. Field of the Invention
The present invention relates to processor architectures, and more particularly to processor architectures amenable to software pipelining techniques.
2. Description of the Related Art
Very Long Instruction Word (VLIW) processor architectures represent a design approach for exploiting instruction level parallelism in which the bulk of instruction scheduling and parallel dispatch is relegated to a compiler. In some ways, VLIW is a natural successor to RISC, furthering a trend toward moving complexity from the hardware to a compiler so as to enable simpler, faster processors. See Gwennap, VLIW: The Wave of the Future, Microprocessor Report, Feb. 14, 1994, pp. 18-21 (discussing VLIW architectures). A VLIW processor design eliminates the need for complex instruction scheduling logic on the chip by shifting scheduling responsibilities to the compiler. Parallelism is determined explicitly by a compiler at a program code level. The compiler generates code which is suitable for parallel execution at multiple execution units and which allows for data and resource dependencies. When data dependencies are dynamic, they are explicitly mapped in the compiled code.
Such compiled code is organized as VLIW instructions encoding a number of independent operations that can be executed in parallel by execution units of the VLIW processor. A typical VLIW instruction horizontally encodes multiple operations as a single, very long instruction; hence the name. These operations can flow directly to associated functional units with a minimum of decoding. NOPs are provided to functional units for which no useful operation is encoded, and a pure VLIW processor has no hardware scheduling interlocks, relying instead on the compiler for scheduling and dependency handling.
The Cydra 5 was an early VLIW implementation. See generally, Beck et al., The Cydra 5 Minisupercomputer: Architecture and Implementation, Journal of Supercomputing, 7, 143-180 (1993) (providing a retrospective on the Cydra 5 architecture). The Cydra 5 was capable of initiating one instruction every clock cycle at each of six functional units. The Cydra 5 supported two instruction formats: MultiOp and UniOp, as shown in FIG. 1. A MultiOp instruction, e.g., MultiOp instruction 110, included seven "containers," one for each of six functional units (e.g., containers 111, 112, 113, 114, 115, and 116) and a seventh (e.g., container 117) for instruction sequencing control. Each of the containers (shown collectively as containers 150) provided storage for an operation encoded in a format similar to that of a conventional load-store RISC instruction. However, in addition, containers 111, 112, 113, 114, 115, and 116 included respective predicate specifiers 121, 122, 123, 124, 125, and 126.
The effectiveness of the MultiOp format was highly dependent on the program, and on how effectively the compiler could extract operations for parallel execution. In particular, performance and capacity of the instruction cache were adversely affected if there was little inherent parallelism to encode, i.e., if the containers of a MultiOp instruction encoded mainly null operations (or NOOPs). The UniOp format, as exemplified by UniOp instruction 180, was provided for such cases and included six containers 181, 182, 183, 184, 185, and 186 for encoding operations (six per UniOp instruction). During each cycle, a single operation from a UniOp instruction was decoded and supplied to a single one of the seven units (i.e., to one of six functional units or to the instruction unit) and NOOPs were supplied to the remaining six units. Successive operations of the UniOp instruction were decoded and supplied during successive cycles. Functional units of the Cydra 5 had the same behavior whether an operation was supplied from a MultiOp instruction or from a UniOp instruction, although, in the case of an operation supplied from a UniOp instruction, the associated (unencoded) predicate was forced to true. A given instruction stream could include arbitrary mixes of MultiOp and UniOp instructions.
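The difference between the two formats can be illustrated with a minimal Python sketch. It is a simplified model rather than a description of the actual Cydra 5 decoder. The function names, the use of strings for operations, and the representation of each UniOp container as a (unit index, operation) pair are assumptions made for illustration only, and predicates are omitted.

```python
NUM_UNITS = 7   # six functional units plus the instruction unit
NOOP = "noop"   # placeholder for a null operation (illustrative encoding)


def expand_multiop(containers):
    """Model one MultiOp instruction: seven containers issued in a single cycle.

    A container holding None stands for an encoded NOOP.
    Returns a list with one entry (one cycle) of seven unit slots.
    """
    if len(containers) != NUM_UNITS:
        raise ValueError("a MultiOp instruction has seven containers")
    return [[op if op is not None else NOOP for op in containers]]


def expand_uniop(operations):
    """Model one UniOp instruction: six operations issued over six cycles.

    Each element is a hypothetical (unit_index, op) pair. In every cycle one
    unit receives the operation and the remaining six units receive NOOPs.
    """
    if len(operations) != 6:
        raise ValueError("a UniOp instruction has six containers")
    cycles = []
    for unit_index, op in operations:
        slots = [NOOP] * NUM_UNITS
        slots[unit_index] = op
        cycles.append(slots)
    return cycles


if __name__ == "__main__":
    for slots in expand_multiop(["ld", None, "add", None, None, "st", "brtop"]):
        print(slots)
    for slots in expand_uniop([(i, f"op{i}") for i in range(6)]):
        print(slots)
```

In this model, code with little parallelism occupies six containers in UniOp form, whereas encoding the same six serial operations as six MultiOp instructions would occupy forty-two containers, most of them NOOPs.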
A major objective of the Cydra 5 architecture was to allow the overlapping of loop iterations without requiring multiple copies of the loop body or complex compensation code. See generally, Dehnert et al., Overlapped Loop Support in the Cydra 5, Proc. 2nd Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Mass., Apr. 3-6, 1989), pp. 26-38. The approach taken by the Cydra 5 was to execute a compiled, overlapped loop schedule of TL cycles, organized as Stage Count (SC) stages where:

SC = TL/II

and where II is an iteration interval. As the execution of the loop progressed, during the first II cycles, the first stage of iteration 1 executed. During the next II cycles, the first stage of iteration 2 and the second stage of iteration 1 executed, and so on until SC different iterations were executing in different stages. In Cydra 5 terminology, the first SC-1 iterations of a loop, i.e., when not all stages were yet executing, were collectively known as the prologue. During the final SC-1 iterations, the opposite process occurred until the last stage of the last iteration was executed. In Cydra 5 terminology, these final SC-1 iterations of a loop, i.e., when again not all stages were executing, were collectively known as the epilogue. The remaining iterations, i.e., when all stages were executing concurrently, were known as the kernel.
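The overlapped schedule described above can be tabulated with a short Python sketch. It is illustrative only. The function names are hypothetical, TL is assumed to be a whole multiple of II, and the number of source iterations is assumed to be at least SC. To avoid ambiguity, each II-cycle trip through the folded schedule is called a "pass" in the code.

```python
def stage_count(tl, ii):
    """Return SC = TL / II, assuming TL is a whole multiple of II."""
    if tl % ii != 0:
        raise ValueError("this sketch assumes TL is a multiple of II")
    return tl // ii


def overlapped_schedule(num_iterations, sc):
    """List, for each pass, the (iteration, stage) pairs that execute.

    Iterations, stages, and passes are numbered from 1. During pass p,
    stage s works on iteration p - s + 1, if that iteration exists.
    """
    passes = []
    for p in range(1, num_iterations + sc):
        active = []
        for stage in range(1, sc + 1):
            iteration = p - stage + 1
            if 1 <= iteration <= num_iterations:
                active.append((iteration, stage))
        passes.append(active)
    return passes


def phase(pass_index, num_iterations, sc):
    """Classify a pass as prologue, kernel, or epilogue."""
    if pass_index < sc:
        return "prologue"   # first SC-1 passes: not all stages running yet
    if pass_index > num_iterations:
        return "epilogue"   # final SC-1 passes: stages draining
    return "kernel"         # all SC stages executing concurrently


if __name__ == "__main__":
    sc = stage_count(tl=12, ii=4)   # hypothetical values, giving SC = 3
    n = 5                           # hypothetical number of source iterations
    for p, active in enumerate(overlapped_schedule(n, sc), start=1):
        print(p, phase(p, n, sc), active)
```

With these hypothetical values, the sketch yields seven passes: two prologue passes, three kernel passes, and two epilogue passes.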
The loop scheduling model described above presented the Cydra 5 with several significant implementation issues, including the following:
1. To have all stages of a loop executing in parallel, while representing the entire loop schedule as a single folded schedule of II cycles, a mechanism for controlling the execution of initial and final iterations when not all stages were executing (i.e., during the prologue and epilogue) was necessary; and
2. Since an iteration of the loop schedule includes disjoint portions from SC different stages, conditional branches around code to be skipped were inadequate and an alternate mechanism for handling of conditional code in loop bodies was necessary.
The Cydra 5 provided a single mechanism to deal with prologue and epilogue control and with conditional code in loop bodies. The mechanism was based on a file of single-bit Iteration Control Registers (ICRs). In addition, the Cydra 5 mechanism relied on a loop counter (LC) register, which kept track of the number of prologue and kernel iterations yet to be executed, an epilogue stage counter (ESC), and a BRanch to TOP of loop (brtop) operation which specified the first instruction of the loop body as its branch target address.
To exploit the Cydra 5 mechanism for prologue and epilogue control, ICR0 was set before loop entry and all other ICRs were cleared. On each successive iteration, an Iteration Control Pointer (ICP) into the ICR file was decremented, effectively shifting the ICR file. The brtop operation set a new logical ICR0 until LC reached zero. Thereafter, the logical ICR0 was cleared during each iteration until the ESC reached zero, indicating the end of the loop body. The compiler made first stage operations conditional on ICR0, second stage operations conditional on ICR1, etc. In this way, only first stage operations executed during the first iteration through the loop, only the first two stages executed during the second iteration, etc. At the end of the kernel, i.e., when LC reached zero, ICR0 was set to zero and first stage operations no longer executed. On each successive iteration, one less stage executed until the ESC reached zero and the loop was complete.
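The prologue and epilogue control described above can be approximated by the following Python sketch. It is a simplified model under stated assumptions, not the exact Cydra 5 semantics: the ICR file size of eight, the initial values LC = N-1 and ESC = SC-1, and the function name are chosen for illustration, and the branch itself is modeled as simply continuing or leaving a Python loop.

```python
def run_loop(num_iterations, sc, file_size=8):
    """Return, for each pass through the loop body, the stages that execute.

    Stage k (numbered from 1) is predicated on logical ICR(k-1), which is
    read from the physical ICR file at offset (ICP + k - 1) modulo its size.
    """
    if sc > file_size:
        raise ValueError("this sketch needs at least SC ICR elements")
    icr = [0] * file_size      # all ICRs cleared ...
    icp = 0
    icr[icp] = 1               # ... except logical ICR0, set before loop entry
    lc = num_iterations - 1    # assumed: prologue/kernel passes still to start
    esc = sc - 1               # assumed: epilogue passes still to run
    trace = []
    while True:
        # Loop body: only stages whose controlling ICR bit is set execute.
        trace.append([k for k in range(1, sc + 1)
                      if icr[(icp + k - 1) % file_size]])
        # brtop: choose the value of the new logical ICR0, or exit the loop.
        if lc > 0:
            lc -= 1
            new_bit = 1        # another iteration enters its first stage
        elif esc > 0:
            esc -= 1
            new_bit = 0        # epilogue: no new iteration starts
        else:
            break              # ESC exhausted: loop complete
        icp = (icp - 1) % file_size   # decrementing ICP shifts the ICR file
        icr[icp] = new_bit
    return trace


if __name__ == "__main__":
    for active in run_loop(num_iterations=5, sc=3):
        print(active)
```

Because the same loop body is executed on every pass and only the ICR bits change, no separate copies of the prologue or epilogue code are needed in this model.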
A number of elements of the ICR file equal to the stage count (SC) were used to provide prologue/epilogue control. Conditional code (including conditional code in the loop body) was handled similarly, using additional elements of the ICR file. See supra, Dehnert et al., Overlapped Loop Support in the Cydra 5, for a more detailed description of Cydra 5 loop control, conditional execution, and brtop operation semantics.