The present invention relates to electronic data processing, and more particularly concerns the compilation of source programs into object programs for execution on processors having multiple different architectures.
Fully optimizing a source program during the traditional compilation process for one specifically targeted computer architecture is difficult. Optimizing a program to run on many processors having vastly differing architectures is, at best, only approximated in the current art. Additionally, optimization techniques that obtain peak performance on one specific architecture can frequently degrade performance on other processors, potentially to a degree that no optimization at all would be better. Today, a single source-code program is compiled to multiple versions of machine-specific executable code. These “executables” contain machine-specific code that has been uniquely optimized for each different processor the compiler can target. Each version of these “executables” requires separate handling, packaging, tracking, distribution, and upgrading. Alternatively, conventional interpreted programs such as Java run a single version of a universally executable code on different host processors using virtual machines. The virtual machines translate the universal executable into target-specific machine code. These interpretive virtual machines see only one “byte-code” instruction at a time, whereas most optimization techniques are global, taking into consideration a context including many or all of the instructions before and/or after the current instruction. The traditional alternative to these interpretive engines requires distributing a program directly in source-code format for compilation and optimization by users. This has serious disadvantages and costs.
Almost all computer programs begin life as source code written in a high-level language such as C++ or BASIC. A compiler then converts the source code into object code comprising low-level operations that can be directly executed in a specific processor engine such as the Intel Corp. Pentium® or the Digital Equipment Corp. Alpha microprocessors.
High-level languages allow programmers to write code without bothering with the details of how the high-level operations will be executed in a specific processor or platform. This feature also makes them largely independent of the architectural details of any particular processor. Such independence is advantageous in that a programmer can write one program for many different processors without having to learn how any of them actually works. The single source program is merely run through a separate compiler for each processor. Because the compiler knows the architecture of the processor it was designed for, it can optimize the object code of the program for that specific architecture. Keeping intermediate calculations in registers (if they are available in the target processor) is much faster than storing them to memory and reloading them. Different processors can vary greatly in the number of registers available. Some processors permit out-of-order instruction execution, and some provide different addressing modes or none at all. Register renaming is permitted in some processors, but not in others. Parallel instruction execution can also differ greatly between different processors. Unrolling loops and pulling loop-invariant loads and calculations out of loops are also known optimization techniques that cannot be employed profitably without knowledge of the available resources (e.g., registers) on each specific target processor. Also, some processors include facilities for instruction-level parallelism (ILP) that must be explicitly scheduled by the compiler, whereas other processors exploit ILP via greedy out-of-order hardware techniques. The most common approach is to compile the same source program separately for many different processors, optimizing a different version for each specific target processor in order to best exploit the available registers and available ILP.
This allows very machine-specific optimization, but at great cost to the developer and software manufacturer. A program developer must produce, distribute, maintain, and upgrade multiple versions of the program. New or improved processors require the development and distribution of additional versions. This versioning problem carries a significant cost in terms of bug fixes, upgrades, and documentation. The versioning “tax” continues for the life of the product and can actually cause a product to fail in the marketplace because these hidden costs erode its profitability.
The versioning problem could be avoided altogether by distributing the original processor-independent source code itself. This presents many problems, however. Different users might customize the source code in unusual and unanticipated ways. Distributing source code requires users to purchase a compiler and learn how to use it. Compilers of the type employed for major programs are large, expensive, and hard to use. Direct interpretation of high-level source code (e.g., BASIC and APL) is also possible, but is slow and wasteful of processor resources. Attempts in the 1970s to design processor architectures for direct hardware execution of high-level languages met with little success. Moreover, any public distribution of source code reveals information that enables unauthorized modification and other undesired uses of the code. For a number of reasons, the great majority of personal-computer application programs are fully compiled and shipped only in the form of machine-specific object code.
Some languages are designed to be parsed to an intermediate language (IL) that contains machine instructions for an abstract virtual machine and that is distributed to all users regardless of the architecture of the target processor they own. Programs in the Java language, for example, are distributed in a tokenized intermediate language called “byte codes.” Each user's computer has an instantiation of the virtual machine that translates each byte code of an IL program individually into a sequence of target-specific hardware instructions in real time as the developer's program is executed. Because different virtual machines are available for different specific processors, a single intermediate-level program can be distributed for many different processor architectures. Current implementations of the idea of one universal instruction set that can run on any target processor or microprocessor have major performance problems. Since the IL is the same for every processor that it runs on, it is too generic to allow the target-specific translators to fully optimize the IL, and thus they fail to create machine instructions that are even close to the quality of code produced by a conventional compiler that generates machine-specific code for each specific target processor. Additionally, optimizations attempted for one target processor often exact a penalty on a different processor so severe that it may nullify the benefit of performing any optimization at all. For example, consider a calculation whose result is used several times in a program. Doing the calculation once and saving the result to a register greatly speeds the performance of a processor that has enough physical registers to hold that result until it is needed again.
A processor having fewer registers may accept the register assignment, but store (i.e., spill) the register contents to memory when the translated machine instructions of the program try to use more registers than are physically implemented. Spilling the contents of registers to memory and then reloading them where the values are required for reuse can be slower than actually recalculating the value. On some architectures, no optimization at all (merely throwing away the result and actually redoing the calculation later) results in code that runs significantly faster than an inappropriate attempt at optimization. This same strategy would be disastrous on a target processor where registers are plentiful.
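By way of illustration only, the trade-off between saving a result and recalculating it can be sketched as a simple cost comparison. The `Target` record, its cycle counts, and the function `should_keep_result` are hypothetical names and values introduced for this sketch; they are not taken from any particular processor or compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical description of a target processor (all values assumed)."""
    name: str
    num_registers: int
    spill_cost: int   # assumed cycles to store a value to memory
    reload_cost: int  # assumed cycles to load the value back


def should_keep_result(target, live_values, recompute_cost, num_reuses):
    """Return True if saving a result is expected to beat recomputing it.

    live_values is the number of other values already competing for registers.
    """
    if live_values < target.num_registers:
        # A register is free, so every reuse of the saved result is cheap.
        return True
    # No free register: the value is spilled once and reloaded at each reuse.
    keep_cost = target.spill_cost + target.reload_cost * num_reuses
    redo_cost = recompute_cost * num_reuses
    return keep_cost < redo_cost


small = Target("small", num_registers=8, spill_cost=3, reload_cost=3)
large = Target("large", num_registers=128, spill_cost=3, reload_cost=3)

# A cheap calculation reused twice while 20 other values are live.
print(should_keep_result(large, 20, recompute_cost=2, num_reuses=2))
print(should_keep_result(small, 20, recompute_cost=2, num_reuses=2))
```

Under these assumed costs, the same program point is best handled differently on the two targets, which is the difficulty a single generic instruction stream cannot resolve on its own.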
Thus, creating a process or method in a system that attains both (1) highly optimized, machine-specific performance and (2) a single distributed version that is executable on many different target processor architectures has remained a significantly unsatisfied goal.
A version of an executable program is created that can be rapidly and highly optimized, in a machine-specific manner, for many different processors with disparate architectural features. This is accomplished by annotating abstract instructions in the one executable with an embedded graph structure that is processor independent. The graph uniquely encodes complex relationships between operands and operations in the one executable program.
The executable program is then distributed to multiple different systems. Every system that can execute this type of program uses a lightweight (i.e., rapid) translator that produces machine-specific code optimized for that system by use of the embedded graph structure. Sets of machine-specific annotations are also provided to specifically identify use of the graph to optimize the program for such machines. The optimized program or code is then executed on that processor's specific hardware.
The embedded graph is in static single assignment form, known in the art as SSA. The static single assignment form has been extended to incorporate complex memory references and machine operations as reaching definition sites for certain uses, and is then embedded in the abstract instructions that form the single executable that is distributed. SSA is used as a particular form because it guarantees that every abstract definition point in a program is unique and that every use incorporated in the graph has one and only one dominating definition. These properties of SSA allow edges in the graph to be easily labeled with instructions for the lightweight translation process on the different target processors in systems. The edges provide information to accomplish three classic types of optimizations: information is provided by the edges and processor-unique annotations to enable global redundancy elimination, loop optimization, and scheduling of machine instructions to exploit parallelism. Further edges provide information to constrain the optimizations. These edges comprise anti-dependence in a region and output dependence in a region.
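The single-definition property of SSA can be illustrated with a minimal sketch, limited to straight-line code with no branches (and therefore no merge points). The instruction format `(dest, op, operands)`, the `_n` version suffix, and the function `to_ssa` are assumptions made for this example; the extended SSA form described above, with memory references and machine operations as definition sites, is not modeled.

```python
def to_ssa(instructions):
    """Rename straight-line (dest, op, operands) code so each dest is defined once.

    Returns the renamed instructions and a list of def-use edges of the form
    (defining_index, using_index, ssa_name).
    """
    version = {}   # variable -> latest version number
    def_site = {}  # SSA name -> index of its defining instruction
    renamed, edges = [], []
    for i, (dest, op, operands) in enumerate(instructions):
        new_operands = []
        for var in operands:
            if var in version:
                name = f"{var}_{version[var]}"
                # Each use points back to exactly one definition.
                edges.append((def_site[name], i, name))
            else:
                name = var  # program input, defined outside this block
            new_operands.append(name)
        version[dest] = version.get(dest, 0) + 1
        new_dest = f"{dest}_{version[dest]}"
        def_site[new_dest] = i
        renamed.append((new_dest, op, tuple(new_operands)))
    return renamed, edges


code = [
    ("x", "add", ("a", "b")),
    ("y", "mul", ("x", "x")),
    ("x", "add", ("x", "y")),  # x is redefined here
    ("z", "sub", ("x", "y")),
]
renamed, edges = to_ssa(code)
for instruction in renamed:
    print(instruction)
```

Because every renamed destination is unique, each edge names one definition and one use, which is what makes it straightforward to attach a label to an edge.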
The target-specific translation process uses the embedded graph in the abstract instructions of the single executable to produce highly efficient code (final hardware instructions) that is unique to the specific architecture the translator resides on. This allows one version of a program to run with peak performance on either a processor with eight registers or a processor with 128 registers.
A development system according to the invention compiles (i.e., converts) a textual source-code program into an intermediate-language (IL) or tokenized form of low-level instructions for an abstract machine, and provides information, in edges of the graph, concerning complex relationships between operands and operators in that IL. The IL program itself is independent of any particular processor or architecture and contains processor-independent constructs, in the form of annotations or information, that are incorporated into the SSA-based graph edges. Processor-specific annotations are also provided, which “identify” the possible, legal, and profitable machine-specific optimizations, encoded in the edges, that should be performed when the IL program is translated for a specific hardware target.
The one IL program, which includes the information-containing graph edges and its processor-specific annotations, is distributed to multiple users who may employ microprocessors each with significantly disparate internal architectures. Each user system includes an optimizing translator (i.e., virtual machine) specific to its own target microprocessor. These translators are specific in that they contain machine-specific knowledge that applies only to that target. Examples of this knowledge would be the number of registers, the machine instruction types, and the techniques that the target employs to exploit instruction-level parallelism.
During translation, the virtual machine examines the annotations and identifies which optimizations are best suited to the particular processor architecture where the virtual machine resides. It then applies the identified optimizations encoded by the edges in the embedded graph so as to highly optimize the final machine program in a global manner, with the goal of obtaining peak performance.
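As a rough sketch of the selection step just described, a translator might walk the annotated edges and apply only those optimizations its target can afford. The `Edge` record, its `registers_needed` field, and the greedy policy in `select_optimizations` are all assumptions made for illustration; the actual encoding of the annotations and the selection policy are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical annotated graph edge between a definition and a use."""
    def_site: int
    use_site: int
    optimization: str      # e.g. "redundancy_elimination"
    registers_needed: int  # assumed extra registers the optimization consumes


def select_optimizations(edges, free_registers):
    """Greedily apply edges in order while the target has registers to spare."""
    applied, skipped = [], []
    for edge in edges:
        if edge.registers_needed <= free_registers:
            free_registers -= edge.registers_needed
            applied.append(edge)
        else:
            skipped.append(edge)
    return applied, skipped


edges = [
    Edge(0, 3, "redundancy_elimination", 1),
    Edge(1, 5, "loop_invariant_hoist", 2),
    Edge(2, 7, "redundancy_elimination", 1),
]

# The same annotated program, translated for two hypothetical register budgets.
for budget in (2, 100):
    applied, skipped = select_optimizations(edges, budget)
    print(budget, [e.optimization for e in applied], len(skipped))
```

The costly analysis that produced the edges is not repeated here; the translator only filters precomputed candidates, which is why this step can be fast.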
Additionally, this process is powerful enough that, if the processor where the virtual machine resides is not specifically identified in the annotations, it is possible for the virtual machine to examine the constructs in the IL, in combination with the annotations, and still identify most profitable machine- or processor-specific target optimizations. Although generation of the graph edges and the processor-specific annotations requires time at the initial compilation stage, use of pre-written annotations allows virtual machines in the users' systems to operate very quickly, often in a just-in-time manner. This is because the edges and annotations compactly and efficiently encode costly global alias information that could take hours of initial compile time to generate. Thus, potentially hours of complex analysis are captured in one single IL program that can then be translated in seconds, optimally, in a processor-specific manner, for many different microprocessors.