1. Field of the Invention
This invention relates generally to the field of computer aided software engineering. More particularly, the invention relates to an improved architecture for performing distributed software builds.
2. Description of the Related Art
Computer programs are typically built from a set of source files and “include” files, which require linking with any number of software libraries. During the program creation process, modifying any one of the source files requires recompilation of that part of the program followed by relinking. This process may be automated with software engineering tools such as the “Make” utility designed by Stuart Feldman in the mid-1970s. The Make utility works from a file called the “Makefile,” which indicates in a structured manner which source and object files depend on other files. It also defines the commands required to compile and link the files. Each file to build, or step to perform, is called a “target.” Each entry in the Makefile is a rule expressing a target's dependencies and the commands needed to build or make that target. The specific structure of a rule in the Makefile is:
<target file>: list of dependencies
TAB commands to build target
A tree structure indicating dependencies for a series of exemplary source and object files is illustrated in FIG. 1. In the example, the target file a.out is dependent on foo.o and bar.o. In addition, the object file foo.o is dependent on the source file foo.cc and the header file foo.h, and the object file bar.o is dependent on source file bar.cc and foo.h (e.g., foo.cc and bar.cc may contain include statements including the file foo.h).
The Makefile used to specify the hierarchical relationship illustrated in FIG. 1 might read as follows:
a.out: foo.o bar.o
	g++ -Wall -g foo.o bar.o
foo.o: foo.cc foo.h
	g++ -Wall -g -c foo.cc
bar.o: bar.cc foo.h
	g++ -Wall -g -c bar.cc
Thus, during the build process, if the Make utility detects that foo.h has been modified, it will reconstruct foo.o, bar.o and a.out (i.e., because they all depend, either directly or indirectly, on foo.h).
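By way of illustration only, the timestamp comparison described above may be sketched as follows. This is a simplified model and not the Make utility's actual implementation; the function name plan_rebuild, the DEPS table and the integer timestamps are assumptions introduced for the example.

```python
# Dependency table for the example of FIG. 1 (target -> files it depends on).
DEPS = {
    "a.out": ["foo.o", "bar.o"],
    "foo.o": ["foo.cc", "foo.h"],
    "bar.o": ["bar.cc", "foo.h"],
}


def plan_rebuild(goal, mtimes, deps=DEPS):
    """Return the targets that must be rebuilt for `goal`, dependencies first.

    `mtimes` maps file names to modification times (hypothetical integers).
    Every source file must have an entry; a target without one is treated
    as missing and is always rebuilt.
    """
    order = []

    def visit(node):
        # Returns the node's timestamp as it would be after the build:
        # infinity if the node is scheduled for rebuilding.
        if node not in deps:            # a source file is never rebuilt
            return mtimes[node]
        newest = max(visit(d) for d in deps[node])
        if node not in mtimes or newest > mtimes[node]:
            if node not in order:
                order.append(node)
            return float("inf")
        return mtimes[node]

    visit(goal)
    return order


# foo.h (time 5) is newer than both object files (time 2) and a.out (time 3).
mtimes = {"foo.cc": 1, "bar.cc": 1, "foo.h": 5,
          "foo.o": 2, "bar.o": 2, "a.out": 3}
print(plan_rebuild("a.out", mtimes))   # ['foo.o', 'bar.o', 'a.out']
```

If only bar.cc were newer than its object file, the same function would return bar.o and a.out and leave foo.o untouched.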
Typical software projects are far more complex than that represented in FIG. 1. Even a modest-size project can have thousands of files, resulting in an extremely complex dependency structure. In addition, Makefiles may be arranged in a hierarchical structure with higher-level Makefiles invoking lower-level Makefiles to build pieces of the project, adding additional complexity to the build process. The Makefiles are usually supplemented with scripts in a language such as Perl, which invoke Make to produce daily software builds, analyze the output of Make, run automated tests, and so on.
As mentioned above, Make operates incrementally: it only regenerates a target file if one of the files it depends on has changed since the last time the target was generated. Thus, in principle it should be possible to rebuild a very large project quickly if only a few source files have changed. In practice, though, there are many times when large projects must be completely rebuilt. The most important of these times is the “nightly” build: most development projects rebuild from scratch every night (a clean build) to make sure the system is still consistent and to generate production versions for testing and release. In principle, nightly builds could be incremental, but in practice the dependency information in Makefiles isn't perfect, so the only way to guarantee consistency between the sources and the compiled version is to build from scratch. Thus, nightly builds are virtually always clean builds. Engineering builds (those for the personal use of individual developers) are often incremental, but if a widely used header file is modified then most of the project may need to be recompiled. Furthermore, integration points (where developers update their personal workspaces with all the recent changes to the shared repository) typically result in massive recompilation.
Because of the size of modern software projects, clean builds can take a long time. Out of 30 commercial software development teams recently surveyed, only 5 had clean build times of less than two hours. More than half had build times in the 5-10 hour range, and a few reported build times of 40 hours or more. Furthermore, most organizations support multiple platforms and versions, which adds a multiplicative factor to the above times.
Long build times have a high cost for companies where software development is mission-critical. They affect not only engineering productivity and release schedules, but also software quality and overall corporate agility. When a developer makes a change to source code it typically takes at least a full day (one nightly build) before the developer can tell whether the change caused a problem.
There have been numerous attempts to improve the performance of Make over the last two decades. They fall into two general classes: “faster” approaches that execute pieces of the build in parallel, and “smarter” approaches that avoid work entirely.
The -j switch in Gmake is an example of the “faster” approach. When this switch is specified, Gmake uses the dependency information in the Makefiles to identify jobs that don't depend on each other and runs several of them concurrently. For example, “-j 4” asks Gmake to keep 4 separate jobs (pieces of the build) running at any given time. Even on a uniprocessor this provides a modest performance improvement by overlapping computation in one job with I/O in another; when run on multiprocessor machines, additional speedup can be obtained. The parallel approach offers a high potential for performance improvement because there are relatively few dependencies between files in a build. In principle, almost every source file in a project could be compiled simultaneously.
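The scheduling idea behind a parallel build may be sketched as follows. This is a simplified model only, not Gmake's actual scheduler: it assumes every job takes one unit of time, so the build proceeds in discrete "waves" of at most a given number of jobs. The function name schedule and the wave model are assumptions made for the example; graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Dependency table for the example of FIG. 1 (target -> files it depends on).
DEPS = {
    "a.out": ["foo.o", "bar.o"],
    "foo.o": ["foo.cc", "foo.h"],
    "bar.o": ["bar.cc", "foo.h"],
}


def schedule(deps, jobs):
    """Group targets into waves of at most `jobs` mutually independent jobs."""
    # Keep only target-to-target edges; source files need no build step.
    graph = {t: [d for d in ds if d in deps] for t, ds in deps.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()

    waves, ready = [], []
    while sorter.is_active():
        # Targets whose dependencies have all been built become ready.
        ready = sorted([*ready, *sorter.get_ready()])
        batch, ready = ready[:jobs], ready[jobs:]
        waves.append(batch)
        sorter.done(*batch)   # model: the whole wave finishes together
    return waves


print(schedule(DEPS, jobs=4))   # [['bar.o', 'foo.o'], ['a.out']]
print(schedule(DEPS, jobs=1))   # [['bar.o'], ['foo.o'], ['a.out']]
```

With four job slots the two compilations run side by side and only the link step waits; with one slot the build degenerates to a sequential order.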
Unfortunately, the dependency information in Makefiles is rarely perfect, especially in large projects with hierarchical Makefiles. As a result, parallel builds tend to reorder the build steps in ways that break the build. For example, a library might be used to link an application before the library has been regenerated, so the resulting application does not accurately reflect the state of the library's sources. Bugs like these are very difficult to track down (the source looks good, but the application doesn't behave correctly). Some organizations have attempted to maintain enough dependency information in Makefiles to enable robust parallel builds, but most do their production builds sequentially to be safe.
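The library example above may be illustrated with a small check that compares a build schedule against the true dependencies. The target names app and libutil.a, the function name ordering_violations and the representation of a schedule as a list of waves are all hypothetical; the sketch assumes a Makefile that omits the dependency of app on libutil.a.

```python
def ordering_violations(waves, true_deps):
    """Return (target, dependency) pairs where the target is built in the
    same wave as, or an earlier wave than, something it truly depends on."""
    wave_of = {t: i for i, wave in enumerate(waves) for t in wave}
    return sorted(
        (target, dep)
        for target, dep_list in true_deps.items()
        for dep in dep_list
        if target in wave_of and dep in wave_of
        and wave_of[dep] >= wave_of[target]
    )


# What the application really needs (the Makefile fails to declare this).
TRUE_DEPS = {"app": ["libutil.a"]}

# A sequential build follows Makefile order, which happens to be safe.
sequential = [["libutil.a"], ["app"]]
# A parallel build sees no declared edge and starts both jobs at once.
parallel = [["app", "libutil.a"]]

print(ordering_violations(sequential, TRUE_DEPS))   # []
print(ordering_violations(parallel, TRUE_DEPS))     # [('app', 'libutil.a')]
```

The sequential build succeeds only because of the incidental order of rules, which is why the missing dependency goes unnoticed until the build is parallelized.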
In addition to out-of-order problems, multiprocessor scalability limits parallel build speed. Multiprocessor servers typically have only 2-8 CPUs, which limits the potential speedup. Larger-scale multiprocessors may have as many as 32 or 64 CPUs, but these machines are quite expensive ($30K per CPU or more, compared to $1-2K per CPU for workstations and small servers). In addition, bottlenecks within the operating system may prevent an application from taking full advantage of large-scale multiprocessors.
A variation of the parallel build approach is distributed builds, where builds are run in parallel using a cluster of independent machines instead of a multiprocessor. This approach solves the scalability and cost issues with a multiprocessor, but still suffers from out-of-order issues. In addition, distributed builds can be impacted by a variety of distributed-system issues. For example, the high overhead of invoking tasks on remote machines can limit performance; clocks on the machines must be carefully synchronized, or file timestamps will be inconsistent and future builds may fail (a target may appear to be up-to-date even when it isn't); reliability drops as the cluster size increases because of the lack of recovery mechanisms; and cluster nodes typically use a network file system to access files, which can be considerably slower than accessing files locally on a single build machine. Furthermore, reliability issues in the network file system can affect build reliability.
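The clock synchronization problem may be illustrated with a short sketch. The times, the 60-second skew and the function names stamp and appears_up_to_date are hypothetical values chosen for the example; the comparison mirrors the rule that a target is out of date when a file it depends on carries a newer timestamp.

```python
def stamp(true_time, clock_skew):
    """Timestamp a machine writes for an event at `true_time`,
    given that its clock is off by `clock_skew` seconds."""
    return true_time + clock_skew


def appears_up_to_date(target_mtime, dep_mtimes):
    """A target looks current if no dependency has a newer timestamp."""
    return all(m <= target_mtime for m in dep_mtimes)


# Machine B (accurate clock) builds the target at true time 1000.
target_mtime = stamp(1000, clock_skew=0)
# Machine A, whose clock runs 60 seconds slow, edits the source 30 seconds later.
source_mtime = stamp(1030, clock_skew=-60)

# The source is really newer, yet its timestamp (970) is older than the target's.
print(appears_up_to_date(target_mtime, [source_mtime]))              # True
# With synchronized clocks the stale target is detected.
print(appears_up_to_date(target_mtime, [stamp(1030, clock_skew=0)]))  # False
```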
The second general approach for improving build performance is to reduce the amount of work that must be done, either by doing better incremental builds or by sharing results between independent builds. One example of this approach is the “wink-in” facility in Rational Software's ClearMake™ product. In ClearMake, generated files such as object files are stored in a version control system, along with information about how they were generated. When a build requires a new version of a generated file, ClearMake checks to see if that version has already been generated by some other build; if so, the existing file is used instead of creating a new version. This approach can potentially provide significant improvements when several developers each update their private workspaces with the latest sources from the central repository, or in nightly builds where little has changed.
However, ClearMake depends on the system's ability to capture every piece of state that could possibly affect the contents of a generated file. This includes the versions of files that the target file depends on, the exact commands used to generate the target, environment variables that supply additional arguments to the command, system header files, and so on. All of these pieces of state must be considered when deciding whether a previously-generated file can be used instead of regenerating the file. Even something as subtle as the user ID or the time of day could potentially influence the value of a generated file. If a significant factor is not considered, the system will use an incorrect substitute file. In our discussions with software development organizations, we found several groups that have considered the ClearMake approach, but none that are using it for production builds.
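The sensitivity of result sharing to untracked state may be illustrated as follows. This sketch is not ClearMake's mechanism; it assumes a generic scheme in which a generated file is reused when a key computed from the command, the input contents and a chosen set of environment variables matches. The function name build_key, the parameter tracked_env and the sample values are hypothetical.

```python
import hashlib
import json


def build_key(command, inputs, env, tracked_env):
    """Digest of the state assumed to determine a generated file.

    `inputs` maps file names to their contents (bytes); only the
    environment variables named in `tracked_env` are included.
    """
    state = {
        "command": command,
        "inputs": {name: hashlib.sha256(data).hexdigest()
                   for name, data in sorted(inputs.items())},
        "env": {name: env.get(name, "") for name in sorted(tracked_env)},
    }
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


command = "g++ $CXXFLAGS -c foo.cc"
inputs = {"foo.cc": b"int main() { return 0; }\n", "foo.h": b""}
debug_env = {"CXXFLAGS": "-g"}
release_env = {"CXXFLAGS": "-O2"}

# Tracking CXXFLAGS keeps the two builds apart.
tracked = ["CXXFLAGS"]
print(build_key(command, inputs, debug_env, tracked)
      == build_key(command, inputs, release_env, tracked))   # False

# Ignoring it makes them indistinguishable, so a debug object file
# could be substituted into a release build.
print(build_key(command, inputs, debug_env, [])
      == build_key(command, inputs, release_env, []))        # True
```

Any factor left out of the key behaves like the ignored variable here: two different results collapse to one entry, and the wrong file is reused without any error being reported.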
In summary, each of the approaches described above offers the potential for speeding up builds, but each makes the build process more brittle by increasing the risk that a build will fail or that it will be inconsistent with the sources. Of the 30 commercial software development teams surveyed, none had achieved more than a 5-10× speedup reliably enough to use for production builds, and only a very few had achieved even a 5× speedup. Most organizations run their builds completely sequentially or with only a small speedup, in order to keep the process as reliable as possible.