1. Field of the Invention
The present invention relates generally to the field of data processors, and specifically to methods and apparatus for compiling computer programs into machine-readable form.
2. Description of Related Technology
As is well known in the computer programming arts, a compiler is an entity which compiles or translates high level programming languages (such as C, C++, etc.) into an assembly language or machine readable form for use by a digital processor. A variety of different configurations of and approaches to digital processor code compilers in general are known in the prior art. A treatise on the early development of compilers and other language processing systems is contained in “The History of Language Processor Technology in IBM,” by F. E. Allen in the IBM Journal of Research and Development, Vol. 25, No. 5, September 1981, pages 535-548.
Compiler Approaches Generally
Exemplary prior art compiler approaches are now described.
U.S. Pat. No. 6,367,071 to Cao, et al. issued Apr. 2, 2002 entitled “Compiler optimization techniques for exploiting a zero overhead loop mechanism” discloses compiler loop optimization techniques to take advantage of a zero overhead loop mechanism (ZOLM) in a processor, e.g., a ZOLM in the form of a zero overhead loop buffer (ZOLB). A compiler generates a first set of code, and then applies optimizations to the first set of code so as to generate a second set of code configured to operate efficiently with the ZOLB. The optimizations are designed to increase the number of loops of the first set of code that can be accommodated in the ZOLB, to further reduce the overhead of the loops placed in the ZOLB, and to eliminate redundant loading of the ZOLB. Optimizations for increasing the number of loops that can be accommodated in the ZOLB include, e.g., conditional instructions, loop splitting and function inlining. Optimizations for further reductions in loop overhead include, e.g., loop collapsing and loop interchange. Data flow analysis and loop peeling are disclosed to avoid redundant loading of the ZOLB.
U.S. Pat. No. 6,308,323 to Douniwa issued Oct. 23, 2001 entitled “Apparatus and method for compiling a plurality of instruction sets for a processor and a media for recording the compiling method” discloses compiling a source program for a processor having a plurality of different instruction sets at high speed by selecting an optimum instruction set. The compiling method comprises dividing a source program into a plurality of modules according to a predetermined unit, compiling the respective modules with respective ones of the plurality of different instruction sets, providing data for evaluating an efficient compiling process based upon the module compilations with the respective instruction sets, selecting an optimum instruction set among the plurality of different instruction sets by comparing the evaluation data, and inserting an instruction set changing command at a necessary portion for changing the instruction set.
U.S. Pat. No. 6,305,013 to Miyamoto issued Oct. 16, 2001 entitled “Compiling method for generating target program in accordance with target processor type, compiling device, recording medium with compiling program recorded therein and recording medium with conversion table used in compiling recorded therein” discloses a compiling method of generating a code of a target program operable in a desired target processor, in which an amount of operations required for the code generation is reduced. Specifically, a code generating section comprises a first converting section and a second converting section. The first converting section refers to a first conversion table stored in a first storage device to generate a low-level code from a high-level code, while the second converting section refers to a second conversion table stored in a second storage device to generate an output code from the low-level code. In the second conversion table, output codes indicating the same or similar function are associated to the common low-level code.
U.S. Pat. No. 6,260,189 to Batten, et al. issued Jul. 10, 2001 entitled “Compiler-controlled dynamic instruction dispatch in pipelined processors” discloses techniques for improving the performance of pipelined processors by eliminating unnecessary stalling of instructions. In an illustrative embodiment, a compiler is used to identify pipeline dependencies in a given set of instructions. The compiler then groups the set of instructions into a code block having a field which indicates the types of pipeline dependencies, if any, in the set of instructions. The field may indicate the types of pipeline dependencies by specifying which of a predetermined set of hazards arise in the plurality of instructions when executed on a given pipelined processor. For example, the field may indicate whether the code block includes any Read After Write (RAW) hazards, Write After Write (WAW) hazards or Write After Read (WAR) hazards. The code block may include one or more dynamic scheduling instructions, with each of the dynamic scheduling instructions including a set of instructions for execution in a multi-issue processor.
U.S. Pat. No. 5,946,492 to Bates issued Aug. 31, 1999 entitled “Compiler that reduces call stack size through identification of stackless variables” discloses an optimizing compiler to identify what are referred to herein as stackless variables. A variable is said to be stackless for a given call statement if the calling program does not have a need for the information stored in the variable when the calling program resumes execution after the program that is the subject of the call statement returns control of the processor to the calling program. The decision of whether a variable is stackless or not for a given call statement is made by determining whether the live range of the particular variable spans the location of the call statement in question. If a variable's live range is found to cross the location of the call statement, it is not considered stackless. However, if a variable's live range is not found to cross the location of the call statement, it is considered to be stackless for that particular call statement.
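The live-range test described above can be illustrated with a minimal sketch. It is not taken from the Bates patent; it assumes, purely for illustration, that each variable's live range is a single interval of instruction positions, and the names `is_stackless` and `stackless_variables` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a live range is modeled as (first_definition, last_use),
# both expressed as instruction positions within the calling program.
LiveRange = Tuple[int, int]


def is_stackless(live_range: LiveRange, call_pos: int) -> bool:
    """Return True if the live range does not span the call statement.

    Assumption: a range that merely ends at the call (last use as an
    argument) or begins at it (receives the result) does not cross it.
    """
    start, end = live_range
    return not (start < call_pos < end)


def stackless_variables(ranges: Dict[str, LiveRange], call_pos: int) -> List[str]:
    """List, in sorted order, the variables that are stackless for one call."""
    return sorted(name for name, r in ranges.items() if is_stackless(r, call_pos))


if __name__ == "__main__":
    ranges = {"a": (0, 10), "b": (0, 4), "c": (6, 9)}
    # "a" is live across the call at position 5; "b" and "c" are not.
    print(stackless_variables(ranges, call_pos=5))
```

Under this model, only variables whose values are still needed after the call returns would have to be preserved on the call stack for that call.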
U.S. Pat. No. 5,850,551 to Takayama, et al. issued Dec. 15, 1998 entitled “Compiler and processor for processing loops at high speed” discloses a compiler comprising a loop detecting unit for extracting information of loops, and a high-speed loop applying unit generating a first loop exclusive instruction, placing the instruction immediately before the entry of a loop, generating second loop exclusive instructions, and placing the instruction at each place to branch to the entry of the loop. Application in the context of variable length instructions is also disclosed.
U.S. Pat. No. 5,845,127 to Isozaki issued Dec. 1, 1998 entitled “Language processor and language processing method to generate object programs by compiling source programs” discloses a language processor for source program compiling and object file generation having a function call counter section to count the number of calls for each function during syntax analysis, a function data storage section to store the call count for each function counted by the function call counter section and the code size of each code for each function generated according to the source program analysis results, and a specific space function designation section which refers to the call count and the code size for each function stored in said function data storage section and designates the functions to be placed in the specific area held by the microcomputer in the program space so that the total sum of the function code sizes does not become larger than the size of the specific area with placing priority to the functions with many calls.
U.S. Pat. No. 5,828,884 to Lee, et al. issued Oct. 27, 1998 entitled “Method for compiling a software program and executing on a system which converts data between different endian formats” discloses a method for compiling a software program and executing the program on a data processing system which performs conversion between data formatted in differing endian formats, namely big-endian and little-endian formats, also known as byte swapping. The disclosed compiler generates object offsets to data elements in a source code module, then adds a format base to the object offset to calculate a data aperture offset for each data element, which is then stored in an object code module. The format bases for the data elements are chosen according to the data type of the data element. A base memory address is bound to each data element at runtime, load-time or compile time. The base memory address for each data element is added to the data aperture offset for the data element to calculate a data aperture address for the data element. As the processor executes the program and performs a transfer of a data element between the processor and data storage device, the processor provides the data aperture address of the data element to the byte swapping device. The byte swapping device selectively byte swaps the data element during the transfer according to a relationship between the data aperture address and the data conversion apertures. The compiler generates data conversion aperture values and the processor programs the data conversion aperture values into aperture storage elements in the byte swapping device when loading the program into system memory for execution. The data conversion apertures are chosen based upon the set of data types comprised by the data elements, namely data types which require byte swapping and data types which do not.
U.S. Pat. No. 5,790,877 to Nishiyama, et al. issued Aug. 4, 1998 entitled “Method for controlling a processor for power-saving in a computer for executing a program, compiler medium and processor system” discloses, in a processor system including a plurality of hardware resources, a method for arranging a program to suppress the power consumption by the resources, including the steps of determining which ones of the hardware resources are to be operated and from which instruction cycle to which instruction cycle to execute each instruction of the program; and based on the determination, adding an instruction to lower frequencies of clock signals inputted to the hardware resources and an instruction to restore the frequency at positions adjacent to the beginning and the end of the period during which the hardware resources are not operated and compiling the program. The processor system decodes the compiled program and lowers the frequency of the clock signal inputted to the hardware resources in accordance with the frequency lowering instruction and the frequency restoring instruction detected in the decoding step. The clock signals sent to the hardware resources are stopped by the frequency lowering instruction to the resource of the hardware resources for which the clock frequency may be lowered to zero.
U.S. Pat. No. 5,790,854 to Spielman, et al. issued Aug. 4, 1998 entitled “Efficient stack utilization for compiling and executing nested if-else constructs in a vector data processing system” discloses a computer-implemented method for compiling software code that performs nested conditional constructs in vector data processors. A vector bit stack is provided to record which processing elements were activated and which processing elements were deactivated during execution of a nested conditional construct. Subsequently, when an end of a first nested conditional construct is encountered, a state of the processing elements at a point in time in which the first nested conditional construct was initiated may be popped off of the vector bit stack and a second conditional construct or any other operation may be executed. Therefore, conditional constructs may be executed while ensuring the proper state of the processing elements. The compiler program effectively utilizes the vector bit stack to store prior states of each of the processing elements of the vector data processor such that the processing elements may be efficiently restored to a correct intermediate value.
U.S. Pat. No. 5,752,035 to Trimberger issued May 12, 1998 entitled “Method for compiling and executing programs for reprogrammable instruction set accelerator” discloses a microprocessor having a defined execution unit coupled to internal buses of the processor for execution of a predefined, fixed set of instructions, combined with one or more programmable execution units coupled to the internal buses for execution of a set of program instructions, to provide an on chip reprogrammable instruction set accelerator (RISA). Reprogrammable execution units may be made using field programmable gate array technology having configuration stores. Techniques for translating a computer program into executable code relying on the RISA involve providing a library of defined and programmed instructions, and compiling a program using the library to produce an executable version of the program using both defined and programmed instructions. The executable version can be optimized to conserve configuration resources for the programmable execution unit, or to optimize speed of execution. Thus, seldom used programmed instructions in the final object code can be replaced with segments of defined instructions to conserve configuration resources. Alternatively, specially prepared sets of programmed instructions can be used to compile programs. A variety of versions are formed using separate sets of programmed instructions and the best final version is selected. In addition, commonly used segments of instructions can be synthesized into a programmed instruction dynamically.
U.S. Pat. No. 5,555,417 to Odnert, et al. issued Sep. 10, 1996 entitled “Method and apparatus for compiling computer programs with interprocedural register allocation” discloses optimization techniques implemented by means of a program analyzer used in connection with a program compiler to optimize usage of limited register resources in a processor. The first optimization technique, called interprocedural global variable promotion allows the global variables of a program to be accessed in common registers across a plurality of procedures. Moreover, a single common register can be used for different global variables in distinct regions of a program call graph. This is realized by identifying subgraphs, of the program call graph, called webs, where the variable is used. The second optimization technique, called spill code motion, involves the identification of regions of the call graph, called clusters, that facilitate the movement of spill instructions to procedures which are executed relatively less often. This decreases the overhead of register saves and restores which must be executed for procedure calls.
U.S. Pat. No. 5,450,585 to Johnson issued Sep. 12, 1995 entitled “Compiler with delayed conditional branching” discloses an optimization method and apparatus adapted for use on a compiler for generating machine code optimized for a pipeline processor. A compute-compare-branch sequence in a loop is replaced with a compare-compute-branch sequence. A compute-compare-branch sequence is a sequence of instructions to compute the value of one or more variables, execute a comparison involving the variables, and execute a conditional branch conditioned on the comparison. In the compare-compute-branch sequence, the instructions of the compute-compare-branch sequence are reordered as follows. First, the comparison is executed. In the compare-compute-branch sequence, the comparison involves previously set values of the variables. Second, the computation is executed to compute the current values of the variables. Finally, the conditional branch conditioned on the latter comparison is executed so as to have the effect of executing during the previous execution of the sequence. One or more temporary variables store the previous values of the variables. They are set to the values of the variables at the end of the compare-compute-branch sequence. Before execution of the loop, the temporary variables are set so that the condition will not be met the first time the sequence executes. After execution of the loop, a comparison and a conditional branch are executed. The comparison involves the temporary variables, and the conditional branch is conditioned on the comparison.
U.S. Pat. No. 5,293,631 to Rau, et al. issued Mar. 8, 1994 entitled “Analysis and optimization of array variables in compiler for instruction level parallel processor” discloses a process for optimizing compiler intermediate representation (IR) code, and data structures for implementing the process. The process is embodied in a compiler computer program operating on an electronic computer or data processor with access to a memory storage means such as a random access memory and access to a program mass storage means. The compiler program reads an input source program stored in the program mass storage means and creates a dynamic single assignment intermediate representation of the source program in the memory using pseudo-machine instructions. To create the dynamic single assignment intermediate representation, during compilation, the compiler creates a plurality of virtual registers in the memory for storage of variables defined in the source program. Means are provided to ensure that the same virtual register is never assigned to more than once on any dynamic execution path. An expanded virtual register (EVR) data structure is provided comprising an infinite, linearly ordered set of virtual register elements with a remap function defined upon the EVR. Calling the remap function with an EVR parameter causes an EVR element which was accessible as [n] prior to the remap operation to be accessible as [n+1] after the remap operation. A subscripted reference map comprising a dynamic plurality of map tuples is used. Each map tuple associates the real memory location accessible under a textual name with an EVR element. A compiler can use the map tuple to substitute EVR elements for textual names, eliminating unnecessary load operations from the output intermediate representation.
U.S. Pat. No. 5,287,510 to Hall, et al. issued Feb. 15, 1994 entitled “Method for improving the efficiency of arithmetic code generation in an optimizing compiler using machine independent update instruction generation” discloses a process within an optimizing compiler for transforming code to take advantage of update instructions available on some computer architectures. On architectures which implement some form of autoindexing instructions or addressing modes, this process is intended to improve the code generated for looping constructs which manipulate arrays in memory. The process comprises selecting memory referencing instructions inside loops for conversion to update forms, modifying those instructions to an update form available on a particular processor, and applying an offset compensation to other memory referencing instructions in the loop so as to enable the program to still address the appropriate locations while using the available autoindexing instructions. The improved compiler and compiler process enables the compiler to convert those program instructions that would otherwise convert to autoindexing instructions not supported by the processor to autoindexing instructions that are supported.
U.S. Pat. No. 5,274,818 to Vasilevsky, et al. issued Dec. 28, 1993 entitled “System and method for compiling a fine-grained array based source program onto a course-grained hardware” discloses a parallel vector machine model for building a compiler that exploits three different levels of parallelism found in a variety of parallel processing machines, and in particular, the Connection Machine™ Computer CM-2 system. The fundamental idea behind the parallel vector machine model is to have a target machine that has a collection of thousands of vector processors, each with its own interface to memory, thus allowing a fine-grained array-based source program to be mapped onto coarse-grained hardware made up of the vector processors. In the parallel vector machine model used by CM Fortran 1.0, the FPUs, their registers, and the memory hierarchy are directly exposed to the compiler. Thus, the CM-2 target machine is not 64K simple bit-serial processors. Rather, the target is a machine containing 2K PEs (processing elements), where each PE is both superpipelined and superscalar. The compiler uses data distribution to spread the problem out among the 2K processors. A new compiler phase is used to separate the code that runs on the two types of processors in the CM-2: the parallel PEs, which execute a new RISC-like instruction set called PEAC, and the scalar front end processor, which executes SPARC or VAX assembler code. The pipelines in PEs are filled by using vector processing techniques along the PEAC instruction set. A scheduler overlaps the execution of a number of RISC operations.
U.S. Pat. No. 5,247,668 to Smith, et al. issued Sep. 21, 1993 entitled “Methods of realizing digital signal processors using a programmed compiler” discloses a compiler for a digital signal processor allowing the designer to specify separately function, accuracy and throughput. The compiler employs a word structure having the signal attributes of bits, digits and subwords which all have a direct relationship to the size of the processor and throughput. From a budget of working bits and clock cycles implicit in the specification of accuracy and throughput the compiler is able to choose the optimal word structure for the application. The compiler can also propagate throughout an icon network, used for the specification of function, various estimation attributes such as word growth and quantization noise, which allow the designer to observe the effect of design changes without recourse to simulation.
U.S. Pat. No. 5,088,034 to Ihara, et al. issued Feb. 11, 1992 entitled “Compiling method for determining programs to be executed parallelly by respective processors in a parallel computer which transfer data with a data identifier to other processors” discloses a compiler for generating from a serially processed type source program described in a high level language the object codes to be executed in parallel by a parallel processor system which is composed of a plurality of processors marked with respective identification numbers and in which inter-processor data transfer system for identifying data for transfer by data identifiers is adopted. The serially executed source program is first translated to programs to be executed in parallel. The inter-processor data transfer processing is extracted from the flow of processings involved in executing the programs for parallel execution resulting from the above-mentioned translation, and all the interprocessor data transfer processings are attached with data identifiers such that no overlap takes place.
U.S. Pat. No. 4,965,724 to Utsumi, et al. issued Oct. 23, 1990 entitled “Compiler system using reordering of microoperations to eliminate interlocked instructions for pipelined processing of assembler source program” discloses compiling a source program described with assembler instructions, each of which defines microoperations, into a target program for use in a digital signal processor. If two of the assembler instructions are interlocked with each other and if another assembler instruction which is not associated with the interlocked instructions is present, it is inserted between the interlocked instructions to thereby reorder the microoperations of the source program. Thereafter, the microoperations thus reordered are combined so as not to conflict with each other with regard to the fields of the assembler instructions and resources used by the assembler instructions. Prior to combining the microoperations, it may be determined whether or not a basic block of assembler instructions included in the source program has a loop. If so, a head portion of the basic block forming the loop may then be transferred to a tail portion of the basic block forming the loop.
U.S. Pat. No. 4,827,427 to Hyduke issued May 2, 1989 entitled “Instantaneous incremental compiler for producing logic circuit designs” discloses a computer aided logic design system for instantaneously compiling circuit component entries into a schematic model which provides immediate simulation of each entry or deletion into the electronic circuit schematic. The system includes software for processing logic designs which produces a signal table for storing all inputs and outputs of chips stored in a specification table. The processor also produces a call table that lists all chips from the chips specification table from which chip models can be retrieved and executed. Additionally, a software routine produces a netlist transfer table that specifies the transfer of signals within the signal table produced by software processing, which correspond to the signal distribution in the circuit being designed. After production of the signal table, specification table, call table and netlist transfer table, a software processing routine executes sequential values retrieved from the call table and netlist transfer table to create a second signal table which is compared with the first signal table. The software processing routine continues to execute values retrieved from the call table and netlist transfer table and compare the first and second signal tables until the second signal table being created is identical with the first signal table stored in memory. The software processing means also includes a delay which delays sequential processing until the comparing step for comparing the second signal table with the first signal table reaches a stable state.
Constants and Constant Pools
Constant values are used in all kinds of programs and many programming languages. Since a constant is read-only and may be used many times in a program, constants may be optimized to, inter alia, eliminate any duplicates. The well known “constant pool” is a set of data structures containing data that remains fixed throughout the execution of a program unit. By pooling, or putting all constants together in the same locations, the size of a program can be greatly reduced. This helps eliminate wasted space. In a low level language, a programmer might maintain a constant pool by hand. In a high level language, programming tools are used to maintain a constant pool.
Using a mechanism such as an ID (an index into the constant pool), a program can copy a constant value from the constant pool. When a new value is added to a constant pool, it is given a unique ID.
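As a minimal sketch of the pooling and ID mechanism just described, the following hypothetical `ConstantPool` class stores each distinct constant once and hands back its index as the ID. The class and method names are assumptions made for illustration and do not correspond to any particular compiler or runtime.

```python
class ConstantPool:
    """Illustrative constant pool: each distinct value is stored exactly once."""

    def __init__(self):
        self._entries = []   # constants, indexed by ID
        self._ids = {}       # (type, value) -> ID, used to detect duplicates

    def add(self, value):
        """Return the ID for value, assigning a new unique ID only if needed."""
        # The type is part of the key so that, e.g., 1 and 1.0 stay distinct.
        key = (type(value), value)
        if key not in self._ids:
            self._ids[key] = len(self._entries)
            self._entries.append(value)
        return self._ids[key]

    def get(self, const_id):
        """Copy a constant out of the pool by its ID."""
        return self._entries[const_id]

    def __len__(self):
        return len(self._entries)


if __name__ == "__main__":
    pool = ConstantPool()
    first = pool.add("Hello")
    second = pool.add(3.14)
    again = pool.add("Hello")   # duplicate: reuses the first ID
    print(first, second, again, len(pool))
```

Because a duplicate constant reuses the existing entry, the pool grows only with the number of distinct constants rather than with the number of uses.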
Constant pools may contain, among other things: string constants, information for exception handlers, data type descriptors for various data types, and function call descriptors (metadata describing a called function). For example, a constant pool for a program that prints text, named PrintText, may contain a function descriptor describing an invocation of the function ‘print’. The function descriptors are then followed by a set of constant strings, which represent the text to print.
Various approaches to structuring and accessing constants in RISC processors have been suggested. For example, U.S. Pat. No. 6,282,633 to Killian, et al. (Tensilica) issued Aug. 28, 2001 and entitled “High data density RISC processor” discloses a RISC processor implementing an instruction set which seeks to optimize the relationship between the number of instructions required for execution of a program, clock period and average number of clocks per instruction, as well as the equation S=IS*BI, where S is the size of program instructions in bits, IS is the static number of instructions required to represent the program (not the number required by an execution) and BI is the average number of bits per instruction. This processor is intended to lower both BI and IS with minimal increases in clock period and average number of clocks per instruction. The processor implements a variable-length encoding.
In attempts to lower IS and IE (the number of instructions required to implement a given algorithm), the Tensilica invention uses single instructions that combine the functions of multiple instructions typically found in RISC and other instruction sets. An example of a simple compound instruction is left shift and add/subtract. The Tensilica approach also utilizes a load instruction to reference constants, thereby ostensibly providing lower IS and IE than using a sequence of instructions if the load itself requires only a single instruction. Compilers compatible with processors offered by MIPS Technologies, for example, dedicate one of the 31 general registers to hold a pointer to a constant pool where 4-byte and 8-byte floating point constants are stored. If the area addressed by this register is less than a predetermined size (e.g., 64 KB offset range in loads for MIPS), the constants may be referenced by a single load instruction. For a constant that is referenced once, the 32-bit load instruction plus the 32-bit constant is the same total size as two instruction words. If the constant is referenced twice or more, the constant pool provides smaller total size. The tradeoff is different for other instruction lengths, such as the 24-bit size of the Tensilica approach, where the constant pool plus load is 56 bits vs. 48 bits for a pair of 24-bit instructions.
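The size comparison above can be reproduced with a short calculation. This is a sketch only: it assumes that building a constant inline costs two instructions per reference and that a pooled constant is stored once and costs one load instruction per reference. The function names are hypothetical.

```python
def pool_bits(instr_bits: int, const_bits: int, refs: int) -> int:
    """Total size when the constant is stored once and loaded at each reference."""
    return refs * instr_bits + const_bits


def inline_bits(instr_bits: int, refs: int, instrs_per_const: int = 2) -> int:
    """Total size when each reference builds the constant with an instruction pair.

    Assumption: two instructions are enough to materialize the constant.
    """
    return refs * instrs_per_const * instr_bits


if __name__ == "__main__":
    # 32-bit instructions, 32-bit constant (the MIPS-style case in the text).
    print(pool_bits(32, 32, refs=1), inline_bits(32, refs=1))  # one reference
    print(pool_bits(32, 32, refs=2), inline_bits(32, refs=2))  # two references
    # 24-bit instructions, 32-bit constant (the Tensilica-style case).
    print(pool_bits(24, 32, refs=1), inline_bits(24, refs=1))
```

With 32-bit instructions the two approaches tie at 64 bits for a single reference and the pool is smaller from the second reference onward; with 24-bit instructions a single reference costs 56 bits through the pool against 48 bits inline, matching the figures given above.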
U.S. Pat. No. 6,275,830 to Muthukkaruppan, et al. issued Aug. 14, 2001 and entitled “Compile time variable size paging of constant pools” discloses a method and apparatus for paging data in a computer system. A set of data associated with a program unit is divided into pages such that no item of the set of data spans more than one page. The size of one page may vary from the size of another. When the program unit is compiled, metadata is generated that indicates the division of items into pages. At load time, a page mapping is generated based on the metadata. The page mapping is used to locate an item that belongs to the set of data. Other parts of the program unit, such as byte code, can contain references to items in the constant pool. Each reference specifies the number of the page in which the corresponding item will be stored at runtime, and the offset of that item within the page.
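The idea of dividing a constant pool into pages so that no item spans a page boundary, and of referring to each item by page number and offset, can be sketched as follows. This is an illustrative packing scheme, not the method claimed in the Muthukkaruppan patent; the function name `paginate`, the `page_limit` parameter, and the first-fit-in-order policy are all assumptions.

```python
from typing import List, Tuple


def paginate(item_sizes: List[int], page_limit: int) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Assign each item a (page, offset) reference without splitting any item.

    Assumptions: item sizes are positive, items are placed in order, and an
    item larger than page_limit receives a page of its own, so page sizes vary.
    """
    refs: List[Tuple[int, int]] = []   # (page number, offset within page) per item
    page_sizes: List[int] = []         # bytes used in each page so far
    for size in item_sizes:
        # Open a new page when none exists or the item would cross the limit.
        if not page_sizes or page_sizes[-1] + size > page_limit:
            page_sizes.append(0)
        refs.append((len(page_sizes) - 1, page_sizes[-1]))
        page_sizes[-1] += size
    return refs, page_sizes


if __name__ == "__main__":
    refs, pages = paginate([3, 4, 2, 10, 1], page_limit=8)
    print(refs)
    print(pages)
```

In this sketch the list of page sizes plays the role of the compile-time metadata, from which a loader could build a mapping from page number to the page's actual address.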
“Coloring” and Register Spilling
So-called “optimizing” compilers utilize one or more optimization algorithms such as common sub-expression elimination, moving code out of loops, eliminating dead code, strength reduction, and register assignment to make the code more compact and efficient. Register assignment can be very significant, since operations wherein the operands are obtained from and results return to registers can proceed at a much higher speed than those which require memory or storage device access.
The article “An Overview of the PL.8 Compiler,” by Auslander and Hopkins appearing in the ACM SIGPLAN Notices, Vol. 17, No. 6, June 1982, pages 22-31 describes an optimizing compiler with register assignment. Similarly, “Measurement of Code Improvement Algorithms,” in “Information Processing '80,” J. Cocke and P. W. Markstein, (edited by S. H. Lavington), pages 221-228, North-Holland, Amsterdam, (1980), and “A Program Data Flow Analysis Procedure,” F. E. Allen and J. Cocke, Communications ACM 19, pages 137-147 (1976), both discuss the objectives and concepts involved in the design of optimizing compilers.
U.S. Pat. No. 5,659,754 to Grove, et al. issued Aug. 19, 1997 and entitled “Method and apparatus for an improved optimizing compiler” discloses an optimizing compiler process and apparatus for more accurately and efficiently identifying live variable sets in a portion of a target computer program, so as to more efficiently allocate registers in a computer central processing unit. The process of the invention includes the steps of performing a static single assignment transform to a computer program, including the addition of phi functions to a control flow graph. Basic blocks representing a use of a variable are further added to the control flow graph between the phi functions and definitions of the variables converging at the phi functions. A backward dataflow analysis is then performed to identify the live variable sets. The variables in the argument of phi functions are not included as a use of those variables in this dataflow analysis. The dataflow analysis may be iteratively performed until the live variable sets remain constant between iterations.
Many compilers assume a large number of registers during their optimization procedures. In fact the result of each different computation in the program is conventionally assigned a different register. At this point a register allocation procedure is invoked to assign real registers, from those available in the machine, to these different (symbolic) registers. Conventional approaches use a subset of the real registers for special purposes while the remaining set is assigned locally. Between these assignments, results which are to be preserved are temporarily stored, and variables are redundantly reloaded. These approaches are inefficient in that significant processor cycles are wasted while data is being transferred between storage and registers or, conversely, while data is accessed from and returned to storage directly, bypassing the registers completely.
“Register Allocation Via Coloring,” by G. J. Chaitin et al., appearing in Computer Languages, Vol. 6, pages 47-57, Pergamon Press, Great Britain, 1981, referred to above, describes the basic concepts of register allocation via coloring but utilizes a different approach to the “spilling” problem.
“The 801 Minicomputer,” by George Radin, published in the ACM SIGPLAN Notices, Vol. 17, No. 4, April 1982, pages 39-47, is an overview of an experimental minicomputer which incorporated an optimizing compiler utilizing the concepts of register allocation via coloring described in the above-referenced article by Chaitin.
The foregoing references observed that the register assignment or allocation problem is equivalent to the graph coloring problem, where each symbolic register is a node and the real registers are different colors. When two symbolic registers have the property that there is at least one point in the program at which both of their values must be retained, that property is modeled on the graph as an edge between the two nodes. Thus the register allocation problem is analogous to coloring the graph so that no two nodes connected by an edge are colored the same. This in effect says that each of these two (or more) nodes must be stored in different registers.
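The graph model described above can be illustrated with a minimal sketch. It uses a simple greedy coloring rather than the procedure of any of the cited references, and the names `build_interference`, `greedy_color`, and the symbolic registers `s1` through `s4` are hypothetical.

```python
def build_interference(live_points):
    """live_points: list of sets, each holding the symbolic registers whose
    values must be retained at one program point.
    Returns {node: set(neighbors)} with an edge between every pair of
    registers that are live at the same point."""
    graph = {}
    for live in live_points:
        for r in live:
            graph.setdefault(r, set())
        for a in live:
            for b in live:
                if a != b:
                    graph[a].add(b)
    return graph


def greedy_color(graph, num_regs):
    """Give each node the lowest-numbered color (real register) not used by
    an already-colored neighbor; None means no register was free."""
    color = {}
    for node in sorted(graph):
        taken = {color[n] for n in graph[node]
                 if n in color and color[n] is not None}
        free = [c for c in range(num_regs) if c not in taken]
        color[node] = free[0] if free else None
    return color


# Four symbolic registers whose lifetimes overlap pairwise in a chain.
live_points = [{"s1", "s2"}, {"s2", "s3"}, {"s3", "s4"}]
graph = build_interference(live_points)
coloring = greedy_color(graph, 2)
```

Here two real registers suffice for four symbolic registers, because no more than two values must be retained at any one point.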
However, a potentially significant shortcoming of the register allocation via coloring procedure referenced above regards the “spilling” problem; i.e., the situation where there are more data items to be retained than there are machine registers available. A number of different solutions for the spilling problem have been proffered, the following being exemplary.
U.S. Pat. No. 4,571,678 to Chaitin issued Feb. 18, 1986 and entitled “Register allocation and spilling via graph coloring” discloses an optimizing compiler which receives a high level source language program and produces machine interpretable instructions, including a method for assigning computational data utilized by the program to a limited number of high speed machine registers in a target CPU. Specifically, the patent discloses methods for determining that there are not enough registers available in the CPU to store all of the data required at a given point in time and for determining which data should be stored in the system memory until they are actually needed. These methods utilize a graph reduction and coloring approach in making the “spill” decisions.
U.S. Pat. No. 5,249,295 to Briggs, et al. issued Sep. 28, 1993 entitled “Digital computer register allocation and code spilling using interference graph coloring” discloses a method for allocating internal machine registers in a digital computer for use in storing values defined and referenced by a computer program. The disclosed allocator constructs an interference graph having a node therein for the live range of each value defined by a computer program, and having an edge between every two nodes whose associated live ranges interfere with each other. The allocator models the register allocation process as a graph-coloring problem, such that for a computer having R registers, the allocator iteratively attempts to R-color the interference graph. The interference graph is colored to the extent possible on each iteration before a determination is made that one or more live ranges must be spilled. After spill code has been added to the program to transform spilled live ranges into multiple smaller live ranges, the allocator constructs a new interference graph and the process is repeated.
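The general shape of a graph reduction pass that colors as much of the graph as possible before deciding what to spill can be sketched as follows. This is a greatly simplified illustration and not the method of either patent: the choice of the highest-degree node as spill candidate is an assumed heuristic, the function name `color_or_spill` is hypothetical, and the sketch stops at the spill decision rather than inserting spill code and rebuilding the graph.

```python
def color_or_spill(graph, num_regs):
    """graph: {node: set(neighbors)}, symmetric.
    Returns (coloring, spilled)."""
    work = {n: set(adj) for n, adj in graph.items()}
    stack = []
    while work:
        # Reduce: a node with fewer than num_regs neighbors can always be
        # colored later, so remove it first.
        low = [n for n in sorted(work) if len(work[n]) < num_regs]
        # Otherwise take the highest-degree node as a spill candidate
        # (assumed heuristic), but still push it so coloring is attempted.
        node = low[0] if low else max(sorted(work), key=lambda n: len(work[n]))
        for nb in work.pop(node):
            work[nb].discard(node)
        stack.append(node)

    coloring, spilled = {}, []
    while stack:
        # Reinsert nodes in reverse order of removal and try to color each.
        node = stack.pop()
        taken = {coloring[nb] for nb in graph[node] if nb in coloring}
        free = [c for c in range(num_regs) if c not in taken]
        if free:
            coloring[node] = free[0]
        else:
            spilled.append(node)
    return coloring, spilled


# A triangle a-b-c plus a node d attached to a.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
coloring2, spilled2 = color_or_spill(graph, 2)
coloring3, spilled3 = color_or_spill(graph, 3)
```

With two registers the triangle cannot be colored and `a` is selected for spilling; with three registers nothing is spilled.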
U.S. Pat. No. 5,946,491 to Aizikowitz, et al. issued Aug. 31, 1999 entitled “Register allocation method and apparatus for generating spill code as a function of register pressure compared to dual thresholds” discloses a method and apparatus for minimizing spill code in regions of low register “pressure.” The invention determines the register pressure at various locations in the computer program. When a live range is selected for spilling, spill code is generated to relieve the register pressure in regions of high register pressure, while spill code is avoided in regions of low register pressure. In this manner a minimum amount of spill code is generated, enhancing both the compile time and the run time of the resultant instruction stream.
U.S. Pat. No. 6,090,156 to MacLeod issued Jul. 18, 2000 and entitled “System for local context spilling for graph coloring register allocators” discloses a register allocator for allocating machine registers during compilation of a computer program. The register allocator performs the steps of building an interference graph, reducing the graph using graph coloring techniques, attempting to assign colors (i.e., allocate machine registers to symbolic registers), and generating spill code. The spill code is generated by a local context spiller which processes a basic block on an instruction by instruction basis. The local context spiller attempts to allocate a machine register which is free in the basic block. If the basic block does not have any free machine registers, the local context spiller looks ahead to select a machine register for spilling. The register allocator improves the performance of a compiler by limiting the rebuilding of the interference graph and the number of the graph reduction operations.
However, despite the broad array of prior art compiler and optimization techniques, the prior art lacks the ability to effectively and efficiently handle variable- or mixed-length instruction formats within the instruction stream, including dynamically determining which of several possible forms of a given instruction must be generated, and optimizing the selection of such varying formats based on one or more parameters. Furthermore, prior art techniques of register allocation and spill handling are not optimized for the aforementioned mixed-length ISA environment, and do not take into account register set usage based on the ISA. Many of the smaller instructions are limited to a subset of the general purpose registers. For example, of the “normal” number (e.g., 32) of registers, only a subset (e.g., 8) are available for the smaller or compressed instructions. Although these registers are the same color as the normal registers, there is no current technique for assigning a priority to the subset of the registers. Prior art coloring algorithms, including those of Chaitin, et al. described above, do not consider the actual register being selected. These algorithms are only concerned with edges and interference, and have no heuristic for choosing one machine register over another in the general purpose case (outside of the case where a register is assigned specifically to a GPR of a certain color by other optimizations).
Chaitin and others do address the concept of a register that can have different colors: it is up to the coloring algorithm to determine which color to select based on register pressure and contention. There is no effort to select a specific color based on further compressing the size of the instruction, or reducing the overall size of the compiled function.
Spilling in general is assumed to be to memory locations since there are not enough GPRs to accommodate all of the virtual registers being used by the optimizing compiler. This is the fundamental definition: to spill means to use memory to temporarily hold the results of an instruction because too many registers are live across the span of the specific instruction. The prior art is generally deficient at localizing such spilling.
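The effect of this limitation on code size can be illustrated with a small sketch. All specifics here are assumptions made for illustration: the compact subset is taken to be registers 0 through 7, short and long encodings are taken to be 2 and 4 bytes, an instruction is assumed to qualify for the short encoding only when every operand lies in the subset, and the names `code_size`, `arbitrary`, and `subset_first` are hypothetical.

```python
COMPACT_REGS = set(range(8))    # assumed: r0-r7 usable by short encodings
SHORT_BYTES, LONG_BYTES = 2, 4  # assumed encoding sizes


def code_size(instructions, assignment):
    """instructions: list of tuples of symbolic registers used as operands.
    assignment: {symbolic register: machine register number}.
    Returns the total encoded size in bytes under the assumptions above."""
    total = 0
    for operands in instructions:
        compact = all(assignment[s] in COMPACT_REGS for s in operands)
        total += SHORT_BYTES if compact else LONG_BYTES
    return total


instructions = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]

# Two assignments that are equally valid colorings (all registers distinct).
arbitrary = {"a": 12, "b": 3, "c": 20}
subset_first = {"a": 0, "b": 1, "c": 2}

size_arbitrary = code_size(instructions, arbitrary)
size_subset_first = code_size(instructions, subset_first)
```

Both assignments satisfy the interference constraints equally well, yet under these assumptions the first yields 16 bytes and the second 8 bytes, a difference that a coloring algorithm concerned only with edges and interference cannot distinguish.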
Address Canonicalization
Another area of interest in compiler and instruction set optimization relates to address canonicalization; see, e.g., the “canonical reduction sequence” on pg. 152 of “Principles of Compiler Design” by Aho and Ullman, April 1979. In practice, addresses are canonicalized to the specifics of the machine for which code is being generated. Typical decisions concern base/index/scale operations as well as the size of displacements and the allowed formats (for example, a load instruction may have a base register plus either an immediate offset or an index register with a scaling factor). By generating the same sequence of instructions for the address (no matter how redundant), one hopes to take advantage of global common sub-expression elimination, such as that defined in “Global Optimization by Suppression of Partial Redundancies” by Morel and Renvoise, CACM February 1979; “The Pascal XT Code Generator” by Drechsler and Stadel, SIGPLAN Notices, August 1987; and Cliff Click, “Global code motion/global value numbering”, ACM SIGPLAN Notices, v. 30 n. 6, p. 246-257, June 1995.
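The benefit of reducing every address to one fixed form can be sketched as follows. The term representation, the (base, index, scale, displacement) tuple, and the names `canonicalize` and `count_distinct` are hypothetical, and the sketch assumes at most one base register and one scaled index per address.

```python
def canonicalize(terms):
    """terms: list of ("reg", name), ("scaled", name, factor), or
    ("const", value), in any order.
    Returns (base, index, scale, displacement) in one fixed form."""
    base = index = None
    scale, disp = 1, 0
    for t in terms:
        if t[0] == "const":
            disp += t[1]              # fold all constants into one displacement
        elif t[0] == "scaled":
            index, scale = t[1], t[2]
        else:
            base = t[1]
    return (base, index, scale, disp)


def count_distinct(addresses):
    """Number of different address computations once canonicalized."""
    return len({canonicalize(a) for a in addresses})


# p + i*4 + 8, written two different ways, and p + 8.
addr1 = [("reg", "p"), ("scaled", "i", 4), ("const", 8)]
addr2 = [("const", 4), ("scaled", "i", 4), ("reg", "p"), ("const", 4)]
addr3 = [("reg", "p"), ("const", 8)]
distinct = count_distinct([addr1, addr2, addr3])
```

Because `addr1` and `addr2` reduce to the same tuple, a subsequent common sub-expression elimination pass would see a single computation for both.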
One significant problem associated with prior art canonicalization techniques is that the decisions as to how to canonicalize the necessary address must be made prior to the common sub-expression elimination (unless these very costly algorithms are run more than once, which is not practical). Hence, an improved method for choosing the correct address canonicalization when an instruction set has two or more distinct sets of addressing is needed.