This specification relates to static analysis of computer software source code. Static analysis refers to techniques for analyzing computer software source code without executing the source code as a computer software program.
Source code is typically maintained by developers in a code base using a version control system. Version control systems generally maintain multiple revisions of the source code in the code base, each revision being referred to as a snapshot. Each snapshot includes the source code of files of the code base as the files existed at a particular point in time.
Snapshots stored in a version control system can be represented as a directed, acyclic revision graph. Each node in the revision graph represents a commit of the source code. A commit represents a snapshot as well as other pertinent information about the snapshot, such as the author of the snapshot and data about ancestor commits of the node in the revision graph. A directed edge from a first node to a second node in the revision graph indicates that a commit represented by the first node immediately precedes a commit represented by the second node, i.e., that no intervening commits exist in the version control system.
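The revision graph described above can be sketched as a small data structure. This is a minimal illustration only: the `Commit` class, its field names, the helper functions, and the four-commit history are all hypothetical and are not taken from any particular version control system.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    """Hypothetical node of a revision graph."""
    commit_id: str
    author: str
    snapshot: dict        # path -> file contents at this point in time
    parents: tuple = ()   # ids of the immediately preceding commits


def edges(commits):
    """Yield one directed (parent_id, child_id) edge per parent link."""
    for commit in commits.values():
        for parent_id in commit.parents:
            yield (parent_id, commit.commit_id)


def ancestors(commits, commit_id):
    """Return the ids of all commits reachable by following parent links."""
    seen = set()
    stack = list(commits[commit_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(commits[current].parents)
    return seen


# Illustrative history: c1 is the root, c2 and c3 diverge, c4 has two parents.
commits = {
    "c1": Commit("c1", "alice", {"src/app.py": "v1"}),
    "c2": Commit("c2", "bob", {"src/app.py": "v2"}, ("c1",)),
    "c3": Commit("c3", "alice", {"src/app.py": "v3"}, ("c1",)),
    "c4": Commit("c4", "alice", {"src/app.py": "v4"}, ("c2", "c3")),
}

edge_list = sorted(edges(commits))
c4_ancestors = ancestors(commits, "c4")
```

Storing only parent ids on each node keeps the graph acyclic by construction, provided a commit only ever refers to commits that already exist.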
Static analysis can be performed on a code base, which may be referred to as a project. The project generally includes a collection of source code files organized in a particular way, e.g., arranged in a hierarchical directory structure, with each source code file in the project having a respective path.
Static analysis techniques include techniques for attributing changes to a code base to a particular source. The source can be a particular snapshot where the change occurred, or the source can be a particular developer entity that introduced the change, e.g., a developer or a team of developers. Common source code contributions that can be attributed by a static analysis system include lines-of-code metrics, e.g., lines of code added, lines of code deleted, net lines of code added, lines of code modified, or some combination of these. For example, churn is a lines-of-code metric that is a count of lines of code added, deleted, or modified. Source code contributions can also include violation metrics, which measure relative numbers of coding defects introduced or removed, e.g., the introduction of coding defects, the removal of coding defects, net introductions of coding defects, or some combination of these. A coding defect is a segment of source code that violates one or more coding standards. A data element that represents a coding defect may be referred to as a violation.
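As a rough illustration of the lines-of-code metrics named above, the following sketch compares two versions of one file using Python's standard `difflib` module. The function name `line_metrics` and the rule for counting a line as "modified" (old and new lines paired up inside a replaced region) are assumptions made for this example; a real static analysis system may define these counts differently.

```python
import difflib


def line_metrics(old_lines, new_lines):
    """Count added, deleted, and modified lines between two file versions."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = deleted = modified = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_old, n_new = i2 - i1, j2 - j1
        if tag == "insert":
            added += n_new
        elif tag == "delete":
            deleted += n_old
        elif tag == "replace":
            # Assumption: pair old and new lines one-to-one as "modified";
            # any surplus on either side counts as added or deleted.
            paired = min(n_old, n_new)
            modified += paired
            added += n_new - paired
            deleted += n_old - paired
    return {
        "added": added,
        "deleted": deleted,
        "modified": modified,
        "net_added": added - deleted,
        "churn": added + deleted + modified,
    }


old_version = ["def f():", "    return 1", "print(f())"]
new_version = ["def f():", "    return 2", "print(f())", "print('done')"]
metrics = line_metrics(old_version, new_version)
```

Here one line is changed in place and one line is appended, so the example yields one modified line, one added line, and a churn of two.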
Branching is the process of making a copy of a snapshot of the code base that is developed independently. Thus, subsequent modifications on the new branch do not affect later commits on the previous branch. Merging is the process of incorporating two branches into a single branch. Branching and merging processes allow parallel development to occur along multiple versions of the code base. Developers working in parallel on different branches can create new features in the branches, and the developed features can then be merged back together at a later time. Branches that are used to create such new features may thus be referred to as feature branches.
Attributing source code contributions and correctly interpreting the attributions is difficult for real-world code bases that have multiple branches. In particular, branching and merging can introduce situations in which some developers get credit or blame for work that was actually introduced by others.
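One way such misattribution can arise is sketched below. The scenario is an assumption chosen for illustration and is not a mechanism described in this specification: a naive scheme credits each commit's author with every line that is absent from the commit's first parent, so the author of a merge commit is credited with lines written on the other branch. The history, the author names, and the function `naive_lines_added` are all hypothetical, and the set-based comparison is deliberately crude.

```python
from collections import Counter

# Hypothetical history: (commit id, author, first parent id, lines in snapshot).
history = [
    ("c1", "alice", None, ["a"]),
    ("c2", "bob", "c1", ["a", "b1", "b2"]),         # feature branch commit
    ("c3", "alice", "c1", ["a", "m"]),              # commit on the original branch
    ("c4", "alice", "c3", ["a", "m", "b1", "b2"]),  # merge of c2 into c3's branch
]


def naive_lines_added(history):
    """Credit each author with lines not present in the first parent."""
    snapshots = {commit_id: lines for commit_id, _, _, lines in history}
    credit = Counter()
    for commit_id, author, parent_id, lines in history:
        before = set(snapshots[parent_id]) if parent_id else set()
        credit[author] += sum(1 for line in lines if line not in before)
    return credit


credit = naive_lines_added(history)
```

In this sketch alice is credited with four added lines although she wrote only two of them ("a" and "m"); the lines "b1" and "b2" that bob wrote are counted once for bob on the feature branch and again for alice at the merge.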
In addition, not all branches in a code base have the same importance. For example, branches for abandoned software features have relatively little importance, while branches having final versions of commercially valuable software products have much greater importance.