This specification relates to static analysis of computer software source code. Static analysis refers to techniques for analyzing computer software source code without executing the source code as a computer software program.
Source code is typically maintained by developers in a code base using a version control system. Version control systems generally maintain multiple revisions of the source code in the code base, each revision being referred to as a snapshot. Each snapshot is a representation of the source code of the code base as the source code existed at a particular point in time. A snapshot may be thought of as including all the source code as of a particular point in time, although all the source code need not be explicitly stored for every snapshot.
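As an illustration of how a snapshot can represent all the source code at a point in time without storing all of it explicitly, the following is a minimal sketch in which each snapshot records only the files that changed. The names `Snapshot` and `materialize` are hypothetical and are not taken from any particular version control system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Snapshot:
    """One revision, stored as a delta against the previous revision."""
    changed: Dict[str, str] = field(default_factory=dict)  # path -> new contents
    deleted: List[str] = field(default_factory=list)       # paths removed


def materialize(history: List[Snapshot], index: int) -> Dict[str, str]:
    """Rebuild the full source tree as of history[index] by replaying deltas."""
    tree: Dict[str, str] = {}
    for snap in history[: index + 1]:
        tree.update(snap.changed)
        for path in snap.deleted:
            tree.pop(path, None)
    return tree


# Hypothetical three-snapshot history used only for illustration.
history = [
    Snapshot(changed={"a.c": "int a;\n", "b.c": "int b;\n"}),
    Snapshot(changed={"a.c": "int a = 1;\n"}),
    Snapshot(deleted=["b.c"]),
]
```

Calling `materialize(history, 2)` yields the complete tree for the third snapshot even though that snapshot itself stores only a deletion.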
Static analysis results can be used for a variety of practical applications, which include attributing source code contributions and generating data that characterizes large-scale trends in code bases. Attributing source code contributions means attributing changes introduced by a snapshot to a particular developer entity responsible for committing the snapshot. A developer entity can be a single developer or a group of multiple developers. For example, a developer entity can be a lone developer, developers on a team, developers within a department of an organization, or any other appropriate group of developers. Each developer is typically a human, although a developer can also be a software program, e.g., a “robot,” that writes source code. Static analysis systems can compute sophisticated metrics of source code contributions and present visualizations of such information. For example, a static analysis system can generate a lines-of-code graph that illustrates net lines of code contributed to a code base during a particular time period.
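The net-lines-of-code metric mentioned above can be sketched as follows. This is a simplified illustration, assuming each commit is available as a tuple of developer entity, file contents before, and file contents after. The function names and the sample commits are hypothetical, and a real system would likely operate on whole snapshots and not on single files.

```python
import difflib
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_changes(before: str, after: str) -> Tuple[int, int]:
    """Return (lines added, lines removed) between two versions of a file."""
    old, new = before.splitlines(), after.splitlines()
    added = removed = 0
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return added, removed


def net_lines_by_entity(
    commits: Iterable[Tuple[str, str, str]]
) -> Dict[str, int]:
    """Attribute net lines of code (added minus removed) to each developer entity."""
    totals: Dict[str, int] = defaultdict(int)
    for entity, before, after in commits:
        added, removed = count_changes(before, after)
        totals[entity] += added - removed
    return dict(totals)


# Hypothetical commits: (developer entity, contents before, contents after).
commits = [
    ("alice", "", "x\ny\nz\n"),
    ("team-b", "x\ny\nz\n", "x\nz\n"),
]
```

Grouping the same totals by time period instead of by entity would give the data points for a lines-of-code graph.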
Source code in a code base is typically compiled in a build environment by a build system. The build environment can include an operating system; a file system; executable files, e.g., build scripts and utilities, interpreters, compilers, and source code generators; environment variables, e.g., variables that indicate a path to file system directories that contain source code files or executable files; and other configuration files for building source code in the code base.
Accurate analysis of source code in a software project often requires a static analysis system to instrument the build system used to build the software project. Instrumenting the build system allows the static analysis system to perform an instrumented build of the software project, during which the static analysis system can trace the build process by intercepting calls by the build system to compilers.
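One way to picture intercepting compiler calls is a wrapper that records each invocation before delegating to the real compiler. The sketch below is an assumption made for illustration only. The text does not say which interception mechanism is used, and the names `record_call`, `traced_compile`, and `SOURCE_SUFFIXES` are hypothetical.

```python
import json
import os
import subprocess
from typing import List

# Illustrative, not exhaustive: suffixes treated as source files.
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".java")


def record_call(compiler: str, args: List[str], log_path: str) -> dict:
    """Append one traced compiler invocation to a JSON-lines log."""
    entry = {
        "compiler": compiler,
        "args": args,
        "cwd": os.getcwd(),
        "sources": [a for a in args if a.endswith(SOURCE_SUFFIXES)],
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def traced_compile(real_compiler: str, args: List[str], log_path: str) -> int:
    """Record the invocation, then run the real compiler and return its exit code."""
    record_call(real_compiler, args, log_path)
    return subprocess.call([real_compiler, *args])
```

In such a setup the build system would be pointed at the wrapper in place of the compiler, so that every compilation passes through `traced_compile` while the build otherwise runs unchanged.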
There are a variety of reasons that performing an instrumented build for a software project can result in more accurate analysis of source code in the project. As one example, tracing a build process allows the static analysis system to identify precisely the source code that is built for the software project without having to emulate the actions of the build system. A variety of build system mechanisms make this information difficult to obtain without tracing the build. For example, build system preprocessors can make arbitrary textual substitutions in existing source code files before a compiler is called. Preprocessors can also generate temporary source code files that are compiled and then deleted by the build system when compilation is complete. In addition, some build utilities, e.g., the “make” utility on Linux and Unix operating systems, can be programmed to copy source code files from one place to another during the build process. For example, a build utility can copy a file from one location to another for compilation because another source code file may include or depend on the copied file. The copied file may then be deleted by the build system after compilation is complete. Furthermore, source code generators can generate source code at build time that does not exist before the build process is started. In all of these situations, merely having read access to the source code files in a file system is insufficient for a static analysis system to extract all the source code that is actually built by a build system.
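The limitation of read access alone can be illustrated with a toy example. In the sketch below, a hypothetical build step generates a source file, compiles it, and deletes it. Scanning the directory before and after the build never sees the generated file, whereas a trace of the compile calls does. The function `toy_build` and the file names are assumptions made for illustration, and the compile step is a callback that stands in for a real compiler.

```python
import os
import tempfile
from typing import Callable, List, Set, Tuple


def list_sources(root: str) -> Set[str]:
    """What a scanner with read access sees: the .c files currently on disk."""
    return {name for name in os.listdir(root) if name.endswith(".c")}


def toy_build(root: str, compile_fn: Callable[[str], None]) -> None:
    """Hypothetical build: compile main.c plus a generated file that is then deleted."""
    compile_fn(os.path.join(root, "main.c"))
    generated = os.path.join(root, "gen.c")
    with open(generated, "w", encoding="utf-8") as f:
        f.write("int generated(void) { return 42; }\n")
    compile_fn(generated)
    os.remove(generated)  # the temporary source file disappears after compilation


def demo() -> Tuple[Set[str], Set[str], List[str]]:
    """Return (files seen before the build, files seen after, files traced during)."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "main.c"), "w", encoding="utf-8") as f:
            f.write("int main(void) { return 0; }\n")
        before = list_sources(root)
        traced: List[str] = []
        toy_build(root, lambda path: traced.append(os.path.basename(path)))
        after = list_sources(root)
    return before, after, traced
```

Here the file system scans report only `main.c` both before and after the build, while the traced list also contains `gen.c`.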
While performing instrumented builds for software projects can result in more accurate analysis of source code, the need to actually perform the build is a scalability bottleneck for large-scale static analysis systems. This is mostly because some manual configuration and labor are usually required to set up the build environment, identify build commands, and launch build scripts. Project documentation may specify, in a human-readable way, what dependencies and build commands are required. However, automatically analyzing and understanding such natural language instructions is not feasible with current natural language processing (NLP) technology. Such manual configuration is unsuitable for a static analysis system that seeks to build thousands or tens of thousands of software projects automatically.
This bottleneck becomes worse for static analysis systems that perform analysis by comparing multiple snapshots within a single project, e.g., to attribute the introduction and removal of source code defects. In these cases, the static analysis system needs to perform instrumented builds for many, possibly thousands of, individual snapshots of a single software project. Manually configuring all of these builds across thousands of software projects is simply not feasible for a scalable static analysis system.