This specification relates to static analysis of computer software source code.
Static analysis refers to techniques for analyzing computer software source code without executing the source code as a computer software program.
Source code is typically maintained by developers in a code base using a version control system. Version control systems generally maintain multiple revisions of the source code in the code base, each revision being referred to as a snapshot. Each snapshot is a view of the source code of files of the code base as the files existed at a particular point in time. A snapshot may be thought of as including all the source code as of that point in time.
Source code in a code base is typically compiled in a build environment by a build system. The build environment includes an operating system; a file system; executable files, e.g., compilers; environment variables, e.g., variables that indicate a path to file system directories that contain source code files or executable files; and other configuration files for building source code in the code base.
Some build systems include multiple computers, which may be connected over a network. Build systems having multiple computers will be referred to as distributed build systems, which perform distributed builds. In a distributed build, the multiple computers of the distributed build system cooperate to build all of the source code files in a project.
A static analysis system generates analysis artifacts for source code files in a build system. An analysis artifact is a collection of data generated by a source code extractor or another static analysis tool, as opposed to an object file or an executable generated by the build utility or a compiler of a build system. Analysis artifacts can be stored as files of a file system or stored in any appropriate data repository, e.g., as records in a database.
The analysis artifacts generated by a static analysis system typically include various properties of the source code in the source code files, e.g., information that describes relationships between source code constructs in the snapshot, such as relationships between variables, functions, and classes. An analysis artifact can also include information identifying various characteristic segments of source code having a particular attribute. Such attributes associated with segments of source code may be used in multiple ways. For example, one kind of attribute may indicate how many lines of code are represented by associated segments of code. Another kind of attribute may measure the number of function points, or the cyclomatic complexity, or many other metrics familiar to those skilled in the art, associating each source segment with its value of the metric under consideration. Yet another kind of attribute may consist of text describing a problem or issue discovered within a particular segment of source code. Finally, an attribute may be used to provide navigational information, by, for example, associating each code segment that corresponds to a use of a variable or function with the definition of the corresponding variable or function, thus allowing a developer to easily view the definition. Many other kinds of attributes are possible.
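For illustration only, the kinds of attributes described above might be represented as in the following minimal sketch. The class names (`Segment`, `Attribute`, `AnalysisArtifact`), the attribute kind strings, the file paths, and the sample values are all hypothetical and are not taken from any particular static analysis system.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Segment:
    """A contiguous range of lines in one source file (hypothetical layout)."""
    file_path: str
    start_line: int
    end_line: int  # inclusive


@dataclass(frozen=True)
class Attribute:
    """Associates one segment with a value of some kind of attribute."""
    kind: str  # e.g. "lines_of_code", "cyclomatic_complexity", "warning", "definition"
    segment: Segment
    value: Union[int, str, Segment]  # metric value, warning text, or definition site


@dataclass
class AnalysisArtifact:
    """A collection of attributes generated for one source code file."""
    source_file: str
    attributes: List[Attribute] = field(default_factory=list)

    def of_kind(self, kind: str) -> List[Attribute]:
        """Return every attribute of the given kind, in insertion order."""
        return [a for a in self.attributes if a.kind == kind]


# A function body, one use of a helper inside it, and the helper's definition.
function_body = Segment("src/main.c", 10, 24)
helper_use = Segment("src/main.c", 18, 18)
helper_definition = Segment("src/util.c", 3, 9)

artifact = AnalysisArtifact("src/main.c", [
    Attribute("lines_of_code", function_body, 15),
    Attribute("cyclomatic_complexity", function_body, 4),
    Attribute("warning", helper_use, "return value of helper is ignored"),
    Attribute("definition", helper_use, helper_definition),  # navigational link
])
```

In this sketch, a navigation interface could follow the single `"definition"` attribute from the use at line 18 of `src/main.c` to the definition in `src/util.c`, while a metrics view could read the `"lines_of_code"` and `"cyclomatic_complexity"` values for the same function segment.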
The analysis artifacts generated in this manner can then be presented to the user by the static analysis system, optionally in an aggregated fashion. The static analysis system may use the analysis artifacts to display overarching metrics like the number of lines of code or function points that exist in the code base. The static analysis system may also display warnings to the user, such warnings pertaining to particular segments of source code, or display statistics about the number and kind of warnings that have been detected. The static analysis system may also provide an interface for navigating the code base using the information contained in the analysis artifacts.
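As a rough sketch of such aggregation, the following example totals a lines-of-code metric and tallies warnings by kind. The flattened record layout `(file_path, kind, value)`, the function names, and the sample data are assumptions made for illustration, not the format of any particular system.

```python
from collections import Counter

# Hypothetical flattened attribute records: (file_path, kind, value).
records = [
    ("src/main.c", "lines_of_code", 120),
    ("src/util.c", "lines_of_code", 80),
    ("src/main.c", "warning", "unused variable"),
    ("src/util.c", "warning", "unused variable"),
    ("src/util.c", "warning", "possible null dereference"),
]


def total_lines_of_code(records):
    """Sum the lines-of-code metric over every record of that kind."""
    return sum(value for _, kind, value in records if kind == "lines_of_code")


def warnings_by_kind(records):
    """Count how many times each warning text was reported."""
    return Counter(value for _, kind, value in records if kind == "warning")


loc_total = total_lines_of_code(records)
warning_counts = warnings_by_kind(records)
```

Note that both aggregates simply trust the records they are given, so any file that is represented more than once in the records contributes more than once to the totals.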
The files in a build system are typically identified, and distinguished from one another, by file paths. In some situations, a static analysis system might generate multiple analysis artifacts for the same build system file, e.g., when the file occurs at multiple file paths because the build system copied it, or because the build system is distributed and different file paths are used on the different computers that form part of the distributed build system. In such a situation, the static analysis system performs redundant work. Furthermore, because the identical analysis artifacts were generated for files having differing file paths, the properties in the artifacts may be double counted. As a result, a database populated with properties of the analysis artifacts may double count properties of some files in the build system. Worse, where the analysis artifacts provide information about navigation or other attributes that pertain to multiple files, the different file paths may cause such attributes to be misinterpreted or displayed incorrectly. For example, a developer attempting to navigate to the definition of a variable may be shown an error page instead.
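The double-counting problem can be illustrated with a small sketch in which analysis artifacts are keyed purely by file path. The paths, the file contents, and the function names below are hypothetical; the example only demonstrates the problem and does not attempt to solve it.

```python
# One source file whose contents appear at two paths, e.g. because the
# build copied it into an output directory (hypothetical paths).
source = "int add(int a, int b) {\n    return a + b;\n}\n"
build_files = {
    "/build/src/add.c": source,
    "/build/out/copy/add.c": source,  # byte-for-byte copy of the same file
}


def count_lines(text):
    """Count the lines in a piece of source text."""
    return len(text.splitlines())


def analyze_by_path(files):
    """Generate one artifact per file path, unaware that contents repeat."""
    return {
        path: {"lines_of_code": count_lines(text)}
        for path, text in files.items()
    }


artifacts = analyze_by_path(build_files)

# Aggregating over path-keyed artifacts counts the copied file twice.
reported_total = sum(a["lines_of_code"] for a in artifacts.values())
actual_total = count_lines(source)
```

Here the analysis runs twice on identical contents, and `reported_total` is twice `actual_total` even though the project contains only one distinct file.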