Compilers take computer source code written in a high-level, generic language (such as C, C++, or Java) and translate it into low-level, machine-specific object code. Compiling code for a simple, single-core processor may consist of a relatively straightforward, one-for-one translation of high-level instructions into low-level instructions. For example, accessing data in a C++ class may be compiled into machine-level address-arithmetic and memory-access instructions.
Compiling code for a processor having multiple functional units or support for vector processing, however, may be much more complicated. A typical goal is to run the compiled program as quickly as possible by keeping each core (and/or each processor, execution unit, and pipeline, in accordance with the specific hardware of a given device) as busy as possible. This goal, however, requires that instructions originally written to run in sequence be compiled to run in parallel, and not all instructions are capable of being executed concurrently. If an input for a second instruction depends upon the result of a first instruction, for example, the first and second instructions cannot run in parallel; the second instruction must wait for the first to complete.
A “smart” compiler recognizes instructions capable of being run in parallel and creates machine code tailored to run them in parallel (either explicitly, such as code produced for a very-long-instruction-word (“VLIW”) processor, or implicitly, such as code produced for a superscalar processor). Two broad categories of parallelizable code are (i) instructions exhibiting instruction-level parallelism and (ii) instructions exhibiting data-level parallelism. Instruction-level parallelism refers to two or more instructions that have no dependencies on each other's output and may thus be computed in parallel. Data-level parallelism refers to performing operations on sets (i.e., vectors) of data in which individual operations on members of the sets are not dependent upon the operations involving other members. In order to add two matrices together, for example, the data-level parallelism of the elements in the matrices may be exploited to run some or all of the element-addition instructions in parallel because the element-level addition operations are independent.
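By way of illustration only, the two categories may be sketched in C as follows. The function names matrix_add and ilp_example are hypothetical and do not appear elsewhere in this description; the matrices are assumed to be stored contiguously and not to overlap in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Data-level parallelism: each c[i] depends only on a[i] and b[i], so the
 * element additions are independent of one another and a compiler may run
 * some or all of them in parallel. Assumes a, b, and c do not overlap. */
void matrix_add(const int *a, const int *b, int *c, size_t rows, size_t cols)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < rows * cols; ++i)
        c[i] = a[i] + b[i];
}

/* Instruction-level parallelism: the first two statements do not use each
 * other's results and may be issued together; the third statement needs
 * both results and must wait for them. */
int ilp_example(int w, int x, int y, int z)
{
    int s = w + x;  /* independent of p */
    int p = y * z;  /* independent of s */
    return s - p;   /* depends on both s and p */
}
```

Whether a given compiler actually parallelizes either function depends on the target hardware and on what the compiler can prove about the independence of the operations.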
One way that compilers achieve instruction- and data-level parallelism is by exploiting loops (e.g., for and while loops) written in the source code. Two or more iterations of a loop may be executed in parallel (i.e., “vectorization,” which takes advantage of data-level parallelism) and/or consecutive iterations of a loop may be partially overlapped (i.e., “software pipelining,” which takes advantage of instruction-level parallelism). One powerful algorithm for software pipelining is known as “modulo scheduling.” Regarding vectorization, a for loop (for example) may call for ten iterations; if the instructions executed in each iteration are independent of those of the other iterations, and if the compiler has access to (for example) five processing elements, the compiler may create assembly code that executes two iterations of the loop on each of the five processing elements in parallel. Regarding software pipelining, if, for example, a loop includes two instructions but the first instruction does not depend on the result of the previous iteration's second instruction, the first instruction of the next iteration of the loop may be scheduled to run in parallel with the still-executing second instruction of the current iteration of the loop.
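The ten-iteration vectorization example may be pictured at the source level as follows. This is a minimal sketch only: the constant LANES, the function names, and the multiply-by-k loop body are assumptions chosen for illustration, and the inner loop merely models work that five processing elements would perform at the same time.

```c
#include <assert.h>

#define LANES 5 /* hypothetical number of processing elements */

/* Original scalar loop: every iteration is independent of the others. */
void scale_scalar(const int *in, int *out, int n, int k)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * k;
}

/* Strip-mined form of the same loop. Each pass of the outer loop stands for
 * one vector operation; the inner loop stands for the LANES processing
 * elements working side by side. With n == 10, the outer loop runs twice,
 * so each processing element handles two iterations of the original loop. */
void scale_strip_mined(const int *in, int *out, int n, int k)
{
    assert(n >= 0 && n % LANES == 0); /* leftover iterations not handled here */
    for (int base = 0; base < n; base += LANES)
        for (int lane = 0; lane < LANES; ++lane)
            out[base + lane] = in[base + lane] * k;
}
```

Both functions produce the same results; only the grouping of iterations differs.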
One disadvantage of vectorization and of software pipelining is that they increase the size of the executed code. Vectorization requires code to cope with odd-sized, final iterations of loops (if, e.g., a loop requires eleven iterations and five processing elements are available, the third and final vectorized iteration uses only one of the processing elements). This “partially-filled” final iteration may be more than merely inefficient; many large-scale processor arrays are tuned to expect a steady stream of valid data, and individual processing elements may not be so easily turned off. Software pipelining requires set-up instructions (a “loop prolog”) to prepare the hardware environment before an efficient set of core instructions (a “loop kernel”) may be run, after which further overhead instructions (a “loop epilog”) are needed to tear down the loop and clean up the hardware environment for further instructions. In many cases, this additional, overhead code may be larger than the loop-kernel code itself and, on processors having limited instruction-cache or buffer capacity, may diminish performance. Another disadvantage is poor handling of loops having a variable number of iterations (known as a loop's “trip count”); when the trip count cannot be known at compile time, various tests of the trip count are required at run time, thereby increasing the run time of the program (especially when the trip count turns out to be small).
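The extra code described above may be seen in the following source-level sketch, offered for illustration only. The constant LANES, the function names, and the loop bodies are hypothetical; an actual compiler would emit the corresponding structure in machine code rather than in C.

```c
#include <assert.h>

#define LANES 5 /* hypothetical number of processing elements */

/* Vectorized loop plus the clean-up loop needed for leftover iterations.
 * With n == 11, the first loop covers ten iterations in two full groups and
 * the second loop handles the single remaining iteration. */
void scale_with_remainder(const int *in, int *out, int n, int k)
{
    assert(n >= 0);
    int full = n - n % LANES; /* iterations covered by full groups */
    for (int base = 0; base < full; base += LANES)
        for (int lane = 0; lane < LANES; ++lane)
            out[base + lane] = in[base + lane] * k;
    for (int i = full; i < n; ++i) /* partially filled final group */
        out[i] = in[i] * k;
}

/* Hand-pipelined form of: for each i, t = in[i] * k; out[i] = t + 1;
 * The multiply for iteration i is overlapped with the add-and-store for
 * iteration i - 1, which requires a prolog, a kernel, and an epilog. */
void scale_plus_one_pipelined(const int *in, int *out, int n, int k)
{
    if (n <= 0) /* run-time test of the trip count */
        return;
    int t = in[0] * k;             /* prolog: first stage of iteration 0 */
    for (int i = 1; i < n; ++i) {  /* kernel */
        int t_next = in[i] * k;    /* first stage of iteration i */
        out[i - 1] = t + 1;        /* second stage of iteration i - 1 */
        t = t_next;
    }
    out[n - 1] = t + 1;            /* epilog: second stage of last iteration */
}
```

In this sketch, a trip count of one executes only the prolog and the epilog and never enters the kernel, which illustrates why small trip counts see the overhead but little of the benefit.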
Existing systems that attempt to address these drawbacks may only create further disadvantages. For example, some processors (e.g., vector processors) implement a method of selectively disabling individual processing elements in the final iteration of a loop. Disabling processing elements in the final iteration of the loop, however, does not interact well with software pipelining, which overlaps instructions from various iterations in the loop kernel. Other systems express the set-up, tear-down, and steady state of a loop by storing the loop instructions in a fixed-size buffer and issuing a special loop instruction, but these systems not only place a limit on the size of the loop kernel (based on the size of the fixed buffer), but also cannot deal with more-complicated loops (such as those that require register renaming). Still other systems deal with complicated loops using an intricate set of rotating hardware registers, but these registers take up valuable real estate from other portions of the processor. A need therefore exists for a way to efficiently execute loop kernels of arbitrary size and complexity.