This application relates to a method and apparatus for producing executable computer instructions, and particularly to a method and apparatus for producing machine-executable object code from human-readable source code using a "smart recompilation" procedure.
I. General Discussion
A computer program is a list of instructions to be performed by a processor of a computer system. FIG. 1 shows a simple computer system 100 including a memory 10, a processor 20, input/output lines 30, and a bus 35, which connects memory 10 and processor 20. In addition, FIG. 1 shows a machine-executable computer program 40, a human-readable computer program 50, a compiler program 60, a linker program 70, and a compilation library 80, which are stored in respective portions of memory 10. Computer program 40 is a "machine-executable" computer program. Computer program 40 contains instructions that are performed by processor 20, but that are not easily read or understood by a human being. While FIG. 1 shows that memory 10 is a memory internal to computer system 100, it should be understood that in various systems, some or all of the contents of memory 10 may also be stored on external memory devices such as magnetic tape or disks.
In order to perform (or "execute") the instructions in machine-executable computer program 40, processor 20 reads each instruction in machine-executable computer program 40 from memory 10 over bus 35, and performs whatever actions are indicated by that instruction.
As described above, the machine-executable instructions in computer program 40 are written in a format that can be processed by processor 20. Computer programs stored in this format are also called "object programs," "object code," "object code programs," "object files," or "object modules." A machine-executable program includes instructions executable by the processor or data addressable by the instructions, or a combination of instructions and data. Although it is possible for a human being to write a computer program in machine-readable object code format, it is common for human beings to write computer programs in a format that is easier for human beings to read and understand. Thus, computer programs are commonly written in a "high level language," which has some resemblance to human language. Typical high level languages include COBOL, Fortran, C, and Ada. Computer programs written in a high level language are also called "source programs," "source code," "source code programs," "source modules," or "source files." Source programs are commonly translated into machine-executable programs before being executed by a computer processor, such as processor 20.
Translation of source programs to object programs is typically performed by a computer program called a "compiler," such as compiler 60 of FIG. 1. FIG. 2 is a block diagram of a conventional computer system. As shown in FIG. 2, compiler program 60 (executed by a processor not shown in the figure) inputs a source program 202, translates the source program into machine-executable instructions, and outputs an object program 212 containing the machine-executable instructions. As further shown in FIG. 2, compiler 60 may perform this translation between source and object programs for a number of additional source programs 206 and 208 to produce a number of respective object programs 216 and 218.
FIG. 2 also shows linker program 70 (which also is executed by a processor not shown). The purpose of linker 70 is to combine object programs created by compiler 60 to create a single machine-executable program 240. In other words, linker 70 operates to combine multiple machine-executable programs to produce a combined machine-executable program. The linker resolves global identifiers external to and used by a program. For example, a first object program may include a definition of an identifier that is used by other object programs. Such an identifier is called a "global" identifier because it may be used in modules other than that in which it is defined. The linker generates a machine-executable program containing references to physical memory addresses corresponding to global identifiers.
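The resolution of global identifiers described above can be pictured with a small sketch. The following Python model is illustrative only: the representation of an object program as a dictionary, the function name `link`, the identifier names `total` and `report`, and the base address are all assumptions made for this example and are not taken from linker 70 of FIG. 2.

```python
def link(object_programs, base_address=0x1000):
    """Assign an address to every global identifier and resolve each use.

    Each object program is modeled (hypothetically) as a dict with:
      "name":    module name
      "size":    size of the module in bytes
      "defines": {global identifier: offset within the module}
      "uses":    list of global identifiers the module references
    """
    addresses = {}
    next_base = base_address
    # First pass: lay the modules out one after another and record
    # where each global identifier ends up in memory.
    for obj in object_programs:
        for ident, offset in obj["defines"].items():
            if ident in addresses:
                raise ValueError(f"duplicate definition of {ident}")
            addresses[ident] = next_base + offset
        next_base += obj["size"]
    # Second pass: replace each symbolic reference with an address.
    resolved = {}
    for obj in object_programs:
        for ident in obj["uses"]:
            if ident not in addresses:
                raise KeyError(f"unresolved global identifier {ident}")
        resolved[obj["name"]] = {i: addresses[i] for i in obj["uses"]}
    return addresses, resolved


# Hypothetical object programs: "first" defines a global identifier
# that "second" uses.
first = {"name": "first", "size": 64, "defines": {"total": 16}, "uses": []}
second = {"name": "second", "size": 32, "defines": {"report": 0}, "uses": ["total"]}

addresses, resolved = link([first, second])
print(hex(addresses["total"]))   # address at which "total" was placed
print(resolved["second"])        # the reference in "second" now carries that address
```

The two passes mirror the description above: a definition in one object program fixes an address, and every other object program that uses the identifier receives that address.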
While simple computer programs may be written as a single source code program, complex computer programs are usually written as multiple source code programs (called "modules") that are subsequently compiled and linked to form a single machine-executable computer program. Writing multiple source code programs has several advantages, some of which are discussed below. First, it is easier for human beings to comprehend source programs if they are organized as modules. Second, if source programs are organized as modules, it is easier for a large number of people to cooperate in writing very large computer programs. Third, if a source program is organized as modules, each module may be compiled individually. Thus, if one module of a source program is changed, it may be necessary to recompile only the changed module, and not all the other modules in the source program. This third reason becomes very important for very large computer programs involving hundreds, or thousands, of modules. Considerable time and money are saved if ways can be found to avoid recompiling every source module whenever a change is made to any one source module.
The following paragraphs discuss reasons why it is sometimes necessary to recompile many source modules when a single source module is changed. FIG. 3 shows a plurality of source code programs ("modules") having various "dependencies" upon one another. The source code modules are stored in a memory (not shown). In FIG. 3, a first source code module 300 named "ModuleA" includes a declaration of a subroutine "Sub-A" and a declaration of a subroutine "Sub-B". Subroutine Sub-A is a "global" subroutine, which means that it may be called by modules other than the module in which it is declared. A second source code module 302 named "ModuleB" includes a call to the global subroutine "Sub-A" that is included in ModuleA. ModuleB also includes a declaration of a global subroutine named "Sub-C". A third source code module 304 named "ModuleC" calls Sub-C. A fourth source code module 306 named "ModuleD" calls Sub-B of ModuleA.
Because ModuleB calls a subroutine declared in ModuleA, ModuleB is said to "depend" on ModuleA. Similarly, because ModuleC calls a subroutine declared in ModuleB, ModuleC depends on ModuleB and indirectly depends on ModuleA. Because ModuleD calls a subroutine declared in ModuleA, ModuleD also depends on ModuleA.
In some conventional systems, if changes are made to any of the contents of ModuleA, ModuleA must be recompiled. In addition, all modules which depend on ModuleA, i.e., ModuleB, ModuleC, and ModuleD, also must be recompiled whenever ModuleA is recompiled. In conventional systems, all dependent modules must be recompiled in this situation because it is not possible to determine whether a change made to ModuleA is a change that affects a dependent module. For example, if the declaration of subroutine Sub-A in ModuleA is changed, this change will not affect ModuleD. In some conventional systems, however, ModuleD must still be recompiled because it depends on ModuleA.
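The module-level rule just described can be sketched as a traversal of the dependency graph of FIG. 3. The sketch below is a minimal model, assuming the dependencies are held in a dictionary; the names `DEPENDS_ON` and `modules_to_recompile` are hypothetical and chosen for this example only.

```python
from collections import deque

# Direct dependencies of FIG. 3: each module maps to the modules
# whose subroutines it calls.
DEPENDS_ON = {
    "ModuleA": set(),
    "ModuleB": {"ModuleA"},   # calls Sub-A
    "ModuleC": {"ModuleB"},   # calls Sub-C
    "ModuleD": {"ModuleA"},   # calls Sub-B
}


def modules_to_recompile(changed, depends_on):
    """Return the changed module plus every direct or indirect dependent."""
    # Invert the graph so that each module maps to its dependents.
    dependents = {module: set() for module in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)
    # Breadth-first walk outward from the changed module.
    result = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in result:
                result.add(dependent)
                queue.append(dependent)
    return result


print(sorted(modules_to_recompile("ModuleA", DEPENDS_ON)))
# ['ModuleA', 'ModuleB', 'ModuleC', 'ModuleD']
```

Because the graph records only which module depends on which, any change to ModuleA selects ModuleD as well, even when the change is confined to Sub-A.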
The contents of the changed ModuleA can be viewed as a second source program. Compiling the second source program operates to produce a second translation result, and to replace a first translation result with the second translation result.
II. Smart Recompilation
Other conventional systems have modified the dependency concept so that not all dependent modules must be recompiled when changes are made to a module, such as ModuleA. This type of compilation is called "smart recompilation". FIG. 4 shows a plurality of source modules similar to the source modules of FIG. 3. The source code modules are stored in a memory (not shown).
In FIG. 4, each source module is broken into a number of parts called "fragments". Conventional fragmentation will be discussed below in detail. Each dependent module depends on a fragment, not on an entire module. Thus, for example, ModuleB depends on a fragment 310 of ModuleA because ModuleB contains a call to Sub-A, which is declared within fragment 310. ModuleB does not depend on, for example, fragments 309 and 311 of ModuleA because ModuleB does not reference any of the contents of fragments 309 and 311. Similarly, because ModuleC calls a subroutine declared in fragment 313 of ModuleB, ModuleC depends on fragment 313 of ModuleB. ModuleD depends on fragment 311 of ModuleA, but does not depend on fragments 309 or 310.
In conventional smart recompilation systems, if changes are made to any of the contents of a fragment in, for example, ModuleA, ModuleA must be recompiled. In addition, all modules which depend on the changed fragment also must be recompiled. For example, if the declaration of subroutine Sub-A in ModuleA is changed, ModuleA will be recompiled, and ModuleB, which depends on fragment 310, also will be recompiled. Because fragment 313 does not change, ModuleC does not need to be recompiled. ModuleD, which does not depend on fragment 310, also does not need to be recompiled. Thus, in FIG. 4, fewer modules need to be recompiled when a change is made to Sub-A than when a change is made to Sub-A of FIG. 3.
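The fragment-level rule of FIG. 4 can be sketched in the same way. The sketch below is illustrative only: the dictionary `FRAGMENT_DEPENDS_ON` and the function `modules_to_recompile_by_fragment` are hypothetical names, and the sketch assumes, as in the example above, that recompiling a dependent module does not change that module's own fragments.

```python
# Fragment-level dependencies of FIG. 4: each module maps to the
# (module, fragment number) pairs that it references.
FRAGMENT_DEPENDS_ON = {
    "ModuleA": set(),
    "ModuleB": {("ModuleA", 310)},   # calls Sub-A, declared in fragment 310
    "ModuleC": {("ModuleB", 313)},   # calls Sub-C, declared in fragment 313
    "ModuleD": {("ModuleA", 311)},   # calls Sub-B, declared in fragment 311
}


def modules_to_recompile_by_fragment(changed_module, changed_fragments, depends_on):
    """Return the changed module plus each module that references a changed fragment."""
    changed = {(changed_module, fragment) for fragment in changed_fragments}
    result = {changed_module}
    for module, deps in depends_on.items():
        if deps & changed:   # the module references at least one changed fragment
            result.add(module)
    return result


# A change to Sub-A alters only fragment 310 of ModuleA.
print(sorted(modules_to_recompile_by_fragment("ModuleA", {310}, FRAGMENT_DEPENDS_ON)))
# ['ModuleA', 'ModuleB']
```

Compared with the module-level rule, ModuleC and ModuleD drop out of the result because neither references fragment 310.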
FIG. 5 shows a flow chart 500 of a method used for smart recompilation in conventional systems. It should be understood that the steps of flow chart 500 and of all the flow charts discussed herein can be performed by a processor of a data processing system. The steps of flow chart 500 can be performed by, for example, processor 20 of FIG. 1 executing compiler 60 of FIG. 1.
In step 502, processor 20 creates fragments from the source program to be compiled. In conventional smart recompilation systems, the fragments are generated directly from global identifiers in the source program, where each global identifier is in a separate fragment, and the fragments do not contain any information from the code generation phase. In conventional smart recompilation systems, fragments contain only simple, semantic information, such as the names of global identifiers and their types, that is not dependent on code generation.
In step 504, processor 20 compares the newly created fragments to fragments created previously to determine which fragments reference changed global identifiers, i.e., which dependent source programs need to be recompiled. In step 506, processor 20 generates object code for the source module that needs to be recompiled.
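Steps 502 and 504 of flow chart 500 can be sketched as follows. This is a minimal model under stated assumptions, not the implementation of any particular conventional system: a fragment is modeled as a (name, type) pair, the type strings are invented placeholders, and all function names are hypothetical. Step 506 (code generation) is not modeled.

```python
def create_fragments(global_identifiers):
    """Step 502: build one fragment per global identifier.

    A fragment here is only a (name, type) pair, that is, semantic
    information with nothing taken from code generation.
    """
    return {name: (name, type_) for name, type_ in global_identifiers.items()}


def changed_identifiers(old_fragments, new_fragments):
    """Step 504: find identifiers whose fragment was added, removed, or altered."""
    names = old_fragments.keys() | new_fragments.keys()
    return {n for n in names if old_fragments.get(n) != new_fragments.get(n)}


def dependents_to_recompile(changed, references):
    """Step 504: select the modules that reference a changed global identifier."""
    return {module for module, used in references.items() if used & changed}


# Global identifiers of ModuleA before and after an edit to Sub-A.
# The type strings are placeholders invented for this example.
old_globals = {"Sub-A": "procedure (Integer)", "Sub-B": "procedure ()"}
new_globals = {"Sub-A": "procedure (Float)", "Sub-B": "procedure ()"}

# Which global identifiers each dependent module references.
references = {"ModuleB": {"Sub-A"}, "ModuleD": {"Sub-B"}}

changed = changed_identifiers(create_fragments(old_globals), create_fragments(new_globals))
stale = dependents_to_recompile(changed, references)
print(sorted(changed), sorted(stale))
```

In this model a fragment compares equal across compilations whenever the name and type of its global identifier are unchanged, which is the property that conventional smart recompilation relies on.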
III. The "Smart Recompilation" Problem
Each execution of compiler 60 for a specific module is called an "invocation" of compiler 60. Information generated by the compiler that does not change between invocations (unless the source program is changed) is called "semantic" information. In conventional smart recompilation systems, fragments refer only to "semantic" information, such as global identifier names and global identifier types. This semantic information can be derived from the source program before the source program is compiled.
Conventional smart recompilation systems have several disadvantages:
1. Conventional smart recompilation systems require that the compiler produce exactly the same output when faced with the same input. This requirement ensures that fragments will match unless the source program has been changed.
2. Conventional smart recompilation systems have a negative impact on the ability of a compiler to generate optimized code because the fragments are not allowed to interact. Thus, for example, if a first module contains a declaration of a global identifier, a conventional smart compiler cannot look at modules using that global identifier when the compiler is deciding a size for the global identifier.
3. Conventional smart recompilation systems do not work well with languages that contain language constructs requiring code that is visible across module boundaries, e.g., variable length arrays, where a global array is declared to be a size which is defined in a second module.
These disadvantages, while inconvenient for a language like C, have, in the past, made recompilation impractical for a language such as Ada. Existing smart recompilation systems for the Ada language have not used a fragmentation approach. Instead, these other smart recompiling systems for Ada have employed an "incremental approach." The incremental approach exploits the whole programming environment to reduce the size of the smallest compilable construct below the file boundary, e.g., each line of a source module is treated as a separately compilable unit. Then, dependency analysis is performed on these smaller units.
Compilation of certain high level computer languages, such as Ada, tends to generate information that may change between compiler invocations even when corresponding semantic information has not changed. Such information is hereinafter called "invocation specific information."
FIG. 14(a) shows an example of an Ada language construct that makes it impractical to use conventional smart recompilation techniques. FIG. 14(a) shows an example of a compilation problem known as "overloading," i.e., the situation in which multiple variables, procedures, etc. of the same scope have the same name. Source program 1402 includes a procedure named "Example". Procedure Example includes two procedures, both named "P". A first procedure P has an integer parameter named X. A second procedure P has a floating point parameter named X. Conventional smart recompilation systems cannot compile source programs for languages that permit overloading because invocation specific information is used to access these procedures at run time.
Source program 1410 of FIG. 14(b) shows an example of certain programming language constructs that require global information to determine the size of an array. FIG. 14(b) shows a declaration of a type T, which is defined as an array whose bounds are determined at run time by a function call to the function "detbounds". In conventional smart recompilation systems, type T and function detbounds will be associated with separate fragments. In addition, a compiler would probably generate a temporary variable to hold the result of the runtime function call to the function detbounds. This temporary variable must be global, so that modules accessing the global type T can determine the bounds of arrays of type T. In conventional compilers, the location of the temporary variable is not determined until after compile time. A compiler may place this variable at different locations, depending on factors such as the size of the temporary variable, its alignment requirements, the presence of variable declarations in the source program, etc. Thus, the location of the temporary variable is not known at the time that a dependency analysis is conventionally performed. The location of the temporary variable is invocation specific.
Because a change in location of the temporary variable is not known when conventional dependency analysis is performed and because the variable location is invocation specific, it is not possible for a conventional smart recompilation system to determine when a dependency has changed.
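The difficulty can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the storage allocator `layout`, the variable names `temp_T_bounds` and `counter`, the sizes and alignments, and the type string are hypothetical and are not taken from FIG. 14(b) or from any particular compiler.

```python
def semantic_fragment(name, type_):
    """A conventional fragment: name and type only, no code generation data."""
    return (name, type_)


def layout(variables):
    """Hypothetical allocator: place variables in order, honoring alignment.

    Each variable is a (name, size, alignment) triple; the result maps
    each name to its byte offset.
    """
    offsets = {}
    offset = 0
    for name, size, align in variables:
        offset = (offset + align - 1) // align * align   # round up to alignment
        offsets[name] = offset
        offset += size
    return offsets


# First invocation: only the compiler-generated temporary holding the
# run-time bounds of type T needs storage.
fragment_before = semantic_fragment("T", "array bounded by detbounds")
first = layout([("temp_T_bounds", 8, 8)])

# Second invocation: a variable declaration was added to the source
# program. Type T itself is untouched, so its fragment is identical.
fragment_after = semantic_fragment("T", "array bounded by detbounds")
second = layout([("counter", 4, 4), ("temp_T_bounds", 8, 8)])

print(fragment_before == fragment_after)                   # semantic comparison sees no change
print(first["temp_T_bounds"], second["temp_T_bounds"])     # yet the temporary has moved
```

A dependency analysis that compares only the semantic fragments would conclude that modules using type T need not be recompiled, even though those modules would then read the bounds from a location that is no longer correct.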
In general, in conventional smart recompilation systems, a fragment changes only when its semantic information changes. When a compiler also generates invocation specific information, fragmentation and dependency analysis based solely on semantic information is insufficient.
A system for achieving smart recompilation by generating fragments having invocation specific information is described at length later in this application. In such a system, it is desirable to suppress changes in invocation specific information caused by a recompilation, thereby reducing the number of resulting recompilations.