1. Field of the Invention
The invention pertains to the field of computer program compiler design, and in particular hardware resource management and optimization.
2. Background Information
Many modern microprocessors are designed to allow a high degree of instruction level parallelism, meaning that at any given time more than one instruction can be executing concurrently. The extent to which a microprocessor achieves a high degree of parallelism is not strictly attributed to more complex microprocessor designs, or necessarily even more resources in the microprocessor architecture (although both are factors). Rather, the full potential of the instruction level parallelism can be achieved only through management of the available hardware resources within the microprocessor.
Computer programs, such as application programs, like Microsoft Word™, are often written in high-level languages, such as C, C++, and BASIC variants. Because computer programs are written in high-level languages, they are easy for a computer programmer to read and understand. More importantly, programs written in high-level languages are easy to change.
Almost all high-level language programs, and most lower-level language programs, must be compiled before they are executed (although some BASIC programs are interpreted, meaning they are not first compiled, even some of these rely on engines that must themselves be compiled before interpretation can take place). This function is performed by a compiler.
Most compilers translate source code into assembly language instructions (“assembly”), and the assembly language is again broken down by an assembler into a series, or sequence, of binary instructions that are executed by the microprocessor. These instructions are called machine operations. The machine operations are represented by operation codes (also called “op codes”), which are the mnemonic in an operation and the associated operands. Often, the term “compiler” refers to a unit that handles compilation of source code into assembly language instructions, and assembly language instructions into machine operations.
One of the reasons many programs are compiled is that computer programmers try to achieve “code reuse”, which is the ability of source code to be reused on different microprocessors. Because the microcode and acceptable machine operations for different microprocessors vary widely, compilers are often tailored for particular microprocessors. As a consequence, the compilers themselves can vary widely. Some compilers are basically translation units that simply transform incoming source code into outgoing machine operations, while others also perform scheduling and resource management tasks.
As microprocessors become more sophisticated and high-level programming is more common, the need for smarter compilers grows. As is mentioned above, many modern microprocessors can execute instructions in parallel. Compilers leverage this feature by attempting to increase the instruction level parallelism.
One compiler technique to exploit parallelism is instruction scheduling (or just “scheduling”). Scheduling involves ordering instructions for a microprocessor architecture (e.g., pipelined, superscalar, or very long instruction word (“VLIW”)). The goal of this ordering is to maximize the number of function units executing at any given time and to minimize intra- or inter-cycle wait time for resource availability. Some scheduling techniques include filling a delay slot, interspersing floating point instructions with integer instructions, and making adjacent instructions independent.
Another technique to exploit parallelism is resource management. Resource management typically involves re-organizing the instructions scheduled with an instruction scheduler according to resource availability.
FIG. 1A schematically represents how most optimizing compilers work. High-level language instructions 6 are passed into a compiler 8, which schedules and compiles the high-level language instructions 6, with the aid of an instruction scheduler 10, and a separately executed resource management module 12. The compiled instructions 14 (i.e., machine operations) are then passed along to the microprocessor 16. At the microprocessor 16, the instructions 14 are streamed in for execution and are first intercepted by issue and decoding logic 18. The issue logic 18 decodes each instruction to determine where to pass each of the compiled instructions 14—issuing each instruction to a pipeline 20 associated with a particular function unit 22, 24, 26.
In approximately 1999, Intel Corporation introduced aspects of the Itanium™ Processor Family (“IPF”) architecture, which corresponds to a family of parallel microprocessors. The first generation of IPF processor is called Itanium™ and is a “6-wide” processor, meaning it can handle up to six instructions in parallel in a cycle. The six instructions are encoded into two 3-instruction-wide words, each called a “bundle”, an encoding that facilitates parallel processing of the instructions.
The IPF encodes each bundle by organizing the instructions into pre-selected templates. The IPF provides a number of templates that represent certain general instruction patterns. Instructions are broken down into template “syllables” representing different functions or “instruction types”, which are executed by one or more function units, which are in turn classified by function unit (“FU”) type. For example, instructions are broken down into syllables corresponding to memory functions (M), integer functions (I), floating point functions (F), branch functions (B), and instructions involving a long immediate (L). The templates are arrangements of these template syllables (that is, the order of instruction slots in a bundle), such as MMI, MII, MMF, etc. (A list of the template types is available from the “IA-64 Application Developer's Architecture Guide”, Order Number 245188-001, May 1999, and available from Intel Corporation, in Santa Clara, Calif.)
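As a minimal sketch of what it means for a group of instructions to fit a template, the following checks a three-slot arrangement of instruction types against template syllables. The names `TEMPLATES`, `fits`, and `candidate_templates` are hypothetical, only the three templates named above are listed (the full set is in the cited Intel guide), and an empty slot (`None`) is assumed to be fillable with a nop of the required type.

```python
# Hypothetical subset: only the three templates named in the text.
TEMPLATES = ("MMI", "MII", "MMF")


def fits(slots, template):
    """Return True if every occupied slot matches the template syllable.

    `slots` is a list of instruction types ("M", "I", "F", ...) in slot
    order; None marks an empty slot that is assumed to take a nop.
    """
    return len(slots) == len(template) and all(
        s is None or s == t for s, t in zip(slots, template)
    )


def candidate_templates(slots):
    """List every known template that can encode the given slot arrangement."""
    return [t for t in TEMPLATES if fits(slots, t)]


print(candidate_templates(["M", "M", "F"]))   # an exact match
print(candidate_templates(["M", None, "I"]))  # middle slot left for a nop
print(candidate_templates(["M", "I", "M"]))   # no "MIM" template exists
```

Under these assumptions, an M-I-M ordering matches none of the listed templates, so it cannot be encoded as a single bundle without reordering.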
The specific Itanium processor function unit to which an instruction is sent is determined by the instruction's template syllable type and its position within the current set of instructions being issued. The process of sending instructions to functional units is called “dispersal”. The Itanium processor hardware makes no attempt to reorder instructions to avoid stalls or a split issue. Thus, if code optimization is a priority, then the compiler must be careful about the number, type, and order of instructions inside a bundle to avoid unnecessary stalls.
When more than one function unit of a particular type is included in the microprocessor architecture, as is the case in the Itanium Architecture™, which has 2 M-units, 2 I-units, 2 F-units, and 3 B-units, modeling the dispersal rules becomes quite complicated using traditional techniques. FIG. 1B graphically depicts instruction slot to function unit mapping following dispersal rules. (This is further described in the document “Intel Itanium™ Processor Reference Manual for Software Optimization”, Order Number 245473-003, November 2001, and also available from Intel Corporation, which also details the dispersal rules.)
In this paradigm, the compiler 8 is responsible for handling instruction scheduling, as well as instruction bundling and template selection. The microprocessor 16 then dispatches the instructions according to the template selected by the compiler 8. The advantage of this design is simplicity of issue logic 18.
An illustration is in order, for which we turn to TABLES 1 and 2. First, however, some notes on TABLE 1. First, the instructions are numbered, which is only for the purpose of this description. Second, a stop code “;;” or “stop bit” is added to the assembly language to inform the hardware (microprocessor) of a cycle break. Third, assume that a microprocessor has two memory (M) execution units and two ALU (I) units available. Fourth, assume that a microprocessor can execute one bundle of instructions per cycle. Finally, assume that there are only two templates available: MMI and MII.
A traditional instruction scheduler in a typical compiler often uses dependence critical path lengths as the primary cost function to schedule instructions. The instruction bundling and template selection are handled by a post-scheduling bundling module. Consequently, a traditional instruction scheduler may derive a two-cycle schedule as shown in TABLE 1, with instructions 1 (M), 2 (I), and 3 (M) in the first cycle and instructions 4 (M), 5 (I), and 6 (I) in the second cycle.
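To illustrate what a dependence critical path length is, the sketch below computes, for each node of a small dependence graph, the longest latency-weighted path from that node to the end of the graph. The graph, its latencies, and the function name are hypothetical and are not taken from TABLE 1; a scheduler of the kind described would typically favor instructions with the larger value.

```python
from functools import lru_cache

# Hypothetical dependence graph: node -> (latency, successors).
GRAPH = {
    "a": (2, ["c"]),
    "b": (1, ["c"]),
    "c": (1, ["d"]),
    "d": (1, []),
}


def critical_path_lengths(graph):
    """Return the longest latency-weighted path from each node to a sink."""

    @lru_cache(maxsize=None)
    def length(node):
        latency, successors = graph[node]
        return latency + max((length(s) for s in successors), default=0)

    return {node: length(node) for node in graph}


lengths = critical_path_lengths(GRAPH)
# Higher critical path length means higher scheduling priority.
priority_order = sorted(GRAPH, key=lambda n: -lengths[n])
print(lengths, priority_order)
```

Note that this cost function looks only at dependences and latencies; it carries no information about templates or bundle slots.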
TABLE 1
1   ld a = [x]
2   add b = y, e
3   ld y = [f] ;;
4   ld c = [g]
5   add x = h, i
6   add d = j, k ;;
The post-scheduling bundling module in the compiler then tries to encode the instructions in TABLE 1 into IPF instruction bundles with templates. However, when the instructions in TABLE 1 are processed by the bundling module, no “MIM” template can be found for the first cycle. The bundling module may try to re-order the instructions into an “MMI” template (so instructions 1, 3, 2), but this is not possible due to an anti-dependency (write-after-read dependency) on y with respect to instructions 2 and 3.
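The anti-dependency can be checked mechanically. The following sketch, with hypothetical names (`Instr`, `dependence`), records the destination and source registers of the first three instructions of TABLE 1 and classifies the dependence between an earlier and a later instruction. It is a simplification that looks at register names only.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    num: int
    dest: str
    srcs: Tuple[str, ...]


# First cycle of TABLE 1: destination register and registers read.
I1 = Instr(1, "a", ("x",))      # ld a = [x]
I2 = Instr(2, "b", ("y", "e"))  # add b = y, e
I3 = Instr(3, "y", ("f",))      # ld y = [f]


def dependence(earlier: Instr, later: Instr) -> Optional[str]:
    """Classify the register dependence of `later` on `earlier`, if any."""
    if earlier.dest in later.srcs:
        return "RAW"  # later reads what earlier writes
    if later.dest in earlier.srcs:
        return "WAR"  # later overwrites what earlier reads
    if earlier.dest == later.dest:
        return "WAW"  # both write the same register
    return None


print(dependence(I2, I3))  # instruction 3 writes y, which instruction 2 reads
print(dependence(I1, I3))  # no register in common
```

Because instruction 3 overwrites the y that instruction 2 must read first, moving instruction 3 ahead of instruction 2 would change the result of the add.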
Thus, when the bundling module attempts to bundle the instructions into a valid template, instruction 3 is forced into a new cycle. The templates end up looking like: MII (1, 2, nop), MII (or Mxx, where xx represents valid assignments) (3, nop, nop), and MII (4, 5, 6). A cycle is wasted and 3 “nop” (no operation) instructions are issued. The resulting instructions are shown in TABLE 2:
TABLE 2
{ mii:
    ld a = [x]
    add b = y, e
    nop ;;
}
{ mii:
    ld y = [f]
    nop
    nop ;;
}
{ mii:
    ld c = [g]
    add x = h, i
    add d = j, k ;;
}
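The outcome in TABLE 2 can be reproduced with a small sketch of a post-scheduling bundler. This is an illustrative greedy packer written under the assumptions stated for the example (only the MII and MMI templates, instruction order within a cycle preserved); it is not a description of any actual compiler's bundling module, and the names `pack` and `bundle_cycle` are hypothetical.

```python
TEMPLATES = ("MII", "MMI")  # the only two templates assumed in the example


def pack(instrs, template):
    """Fill the template's slots, in order, from (number, type) pairs.

    A slot whose syllable does not match the next pending instruction is
    left as None (a nop). Returns (slots, number_of_instructions_placed).
    """
    slots, placed = [], 0
    for syllable in template:
        if placed < len(instrs) and instrs[placed][1] == syllable:
            slots.append(instrs[placed][0])
            placed += 1
        else:
            slots.append(None)
    return slots, placed


def bundle_cycle(instrs):
    """Greedily split one scheduled cycle into bundles, keeping order."""
    bundles = []
    while instrs:
        # Pick the template that absorbs the most pending instructions;
        # ties go to the template listed first in TEMPLATES.
        template = max(TEMPLATES, key=lambda t: pack(instrs, t)[1])
        slots, placed = pack(instrs, template)
        if placed == 0:
            raise ValueError("no template accepts %r" % (instrs[0],))
        bundles.append((template, slots))
        instrs = instrs[placed:]
    return bundles


# The two-cycle schedule of TABLE 1 as (instruction number, type) pairs.
CYCLES = [
    [(1, "M"), (2, "I"), (3, "M")],
    [(4, "M"), (5, "I"), (6, "I")],
]
bundles = [b for cycle in CYCLES for b in bundle_cycle(cycle)]
nops = sum(slot is None for _, slots in bundles for slot in slots)
for template, slots in bundles:
    print(template, slots)
```

With one bundle issued per cycle, the three resulting bundles take three cycles rather than the two that the scheduler planned, and three slots hold nops.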
Finite state automata techniques have been proposed for resource management. For instance, T. A. Proebsting and C. W. Fraser, in “Detecting Pipeline Structural Hazards Quickly,” Proc. of the 21st Annual ACM Symposium on Principles of Programming Languages, pp. 280–286, January 1994, propose that a 2D lookup table be implemented to model resource contention in execution pipelines. A drawback to the Proebsting et al. approach is the large size of the lookup table, which has an upper bound of s×i entries, where s is the number of states and i is the number of instruction classes (so 86,450 two-byte entries in a system with 6175 states and 14 instruction classes). This, we note, was an improvement over a prior approach, which was a 3D lookup table requiring s×i×c entries, where c is the cycle count (so over 3.1 million two-byte entries in a system with 37 cycles).
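The table sizes quoted above follow from simple multiplication, and the idea of a state-by-instruction-class lookup can be shown on a toy machine. In the sketch below, the size helpers use the figures from the text, while the toy table (one M unit and one I unit, states encoded as a bitmask of busy units) is a hypothetical illustration and not the construction from the cited paper.

```python
BYTES_PER_ENTRY = 2  # two-byte entries, as stated in the text


def entries_2d(states, classes):
    """Upper bound on entries in a state x instruction-class table."""
    return states * classes


def entries_3d(states, classes, cycles):
    """Upper bound on entries in a state x class x cycle table."""
    return states * classes * cycles


size_2d = entries_2d(6175, 14)
size_3d = entries_3d(6175, 14, 37)
print(size_2d, size_2d * BYTES_PER_ENTRY)
print(size_3d, size_3d * BYTES_PER_ENTRY)

# Toy 2D lookup table: rows are states, columns are instruction classes.
HAZARD = -1
CLASSES = ("M", "I")  # hypothetical machine with one unit per class


def build_table():
    """Precompute the next state, or HAZARD, for every (state, class)."""
    table = []
    for state in range(1 << len(CLASSES)):
        row = []
        for c in range(len(CLASSES)):
            bit = 1 << c
            row.append(HAZARD if state & bit else state | bit)
        table.append(row)
    return table


TABLE = build_table()


def issue(state, cls):
    """Single table lookup: next state after issuing `cls`, or HAZARD."""
    return TABLE[state][CLASSES.index(cls)]


print(issue(0, "M"), issue(1, "M"), issue(1, "I"))
```

The appeal of the approach is that each hazard check is one array access; the cost is that the table grows with the product of its dimensions.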
Also exemplary of the state of the art is V. Bala and N. Rubin, “Efficient Instruction Scheduling Using Finite State Automata,” Proc. of the 28th Annual International Symposium on Microarchitecture, pp. 46–56, November 1995, which further describes the problems with past solutions and presents an improvement over the Proebsting et al. approach, while still using the same basic framework.