This specification relates to static analysis of computer software source code.
Static analysis refers to techniques for analyzing computer software source code without executing the source code as a computer software program. Source code is typically maintained by developers in a code base using a version control system. The code base includes one or more revisions of the source code.
Compilers and interpreters distinguish source code elements from one another by their names. Some source code elements have names that unambiguously identify them. For example, within a single compilation, classes in Java are uniquely identified by their fully qualified names. If any two source code elements in the same compilation have the same fully qualified name, the Java compiler raises an error.
However, static analysis systems often encounter source code elements that have identical names but that are actually different source code elements. Compilers and interpreters raise naming errors only within a single compilation or interpretation, but a static analysis system can analyze source code elements from multiple different compilations or interpretations.
When a static analysis system assigns the same name to source code elements that are actually different, undesirable things can happen. Properties from the different source code elements can be conflated or merged. For example, if the static analysis system counts lines of source code for a particular Java class that has the same name as another Java class, the class may be reported as having a number of lines of code that is the sum of the lines of code of the individual classes. In addition, some attributes that are implicitly understood to be unique may actually have multiple values, e.g., a file path for a particular source code element or the first statement of a method.
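The conflation described above can be illustrated with a minimal sketch. It assumes a hypothetical analysis store that keys every property by fully qualified name alone; the class name, file paths, and line counts are invented for illustration and do not come from any particular static analysis system.

```python
from collections import defaultdict

# Hypothetical observations: (fully qualified name, file path, lines of code).
# Two unrelated classes happen to share the same fully qualified name.
observations = [
    ("com.example.util.Helper", "app/src/com/example/util/Helper.java", 120),
    ("com.example.util.Helper", "tests/src/com/example/util/Helper.java", 45),
]


def loc_by_name(observations):
    """Sum lines of code per name, as a name-keyed store would."""
    totals = defaultdict(int)
    for name, _path, loc in observations:
        totals[name] += loc
    return dict(totals)


def paths_by_name(observations):
    """Collect the file paths recorded for each name."""
    paths = defaultdict(set)
    for name, path, _loc in observations:
        paths[name].add(path)
    return {name: sorted(found) for name, found in paths.items()}


conflated_loc = loc_by_name(observations)
conflated_paths = paths_by_name(observations)
```

Under these assumptions, the single name ends up with a line count that belongs to neither class, and an attribute expected to be unique (the file path) holds two values.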
The following scenarios illustrate common situations in which a static analysis system can encounter different source code elements having the same name.
For example, the code base may contain, in different files, source code elements having identical names that are never involved in the same compilation. This commonly occurs in testing suites when different test classes simply happen to have the same name.
In addition, the same source code element defined in a single file can be involved in multiple compilations with different compiler settings or environment variables, which can affect the semantics of the source code element. In this situation, each encounter with the source code element should be considered an encounter with a different source code element, even though the name is the same in each encounter.
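One way to picture this scenario is an identity that includes the compilation context in addition to the name. The sketch below is illustrative only: `ElementKey`, `make_key`, the file path, and the `DEBUG` setting are hypothetical names chosen for the example, not a described implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementKey:
    """Hypothetical identity for a source code element."""
    name: str
    path: str
    compiler_settings: frozenset  # (setting, value) pairs of the compilation


def make_key(name, path, settings):
    # Freeze the settings so the key is hashable and order-independent.
    return ElementKey(name, path, frozenset(settings.items()))


# The same function in the same file, compiled under two configurations.
debug = make_key("parse", "src/parser.c", {"DEBUG": "1"})
release = make_key("parse", "src/parser.c", {"DEBUG": "0"})
release_again = make_key("parse", "src/parser.c", {"DEBUG": "0"})

distinct_elements = {debug, release, release_again}
```

Here the debug and release encounters are treated as different elements, while two encounters under identical settings collapse to one.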
Some build systems may also modify the text of the source code during a build process, which can also affect properties of a source code element. In other words, a source code element in a later compilation may be properly considered a different source code element due to changes made by the build system after a previous compilation of the source code element.
Conversely, undesirable things can also happen when a static analysis system assigns different names to source code elements that are actually the same. For example, dependencies can be missed and data flow may not be properly tracked. This can happen when different representations of the same source code element are encountered during a build process. A first compilation can compile a source code element to generate a compiled representation of the source code element. Later in the build process, a second compilation can load or use the compiled representation of the source code element. In many situations, these different representations should be considered to be the same.
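The following sketch illustrates the opposite failure under stated assumptions. It supposes a hypothetical `produced_from` table that records which source file each compiled artifact was generated from; the paths and the class name are invented for the example.

```python
# Hypothetical record of which source file each compiled artifact came from.
produced_from = {
    "build/classes/com/example/Foo.class": "src/com/example/Foo.java",
}


def naive_identity(name, path):
    """Identity that treats every representation as a separate element."""
    return (name, path)


def canonical_identity(name, path):
    """Identity that maps a compiled representation back to its source."""
    return (name, produced_from.get(path, path))


source_seen = ("com.example.Foo", "src/com/example/Foo.java")
compiled_seen = ("com.example.Foo", "build/classes/com/example/Foo.class")

split = naive_identity(*source_seen) != naive_identity(*compiled_seen)
unified = canonical_identity(*source_seen) == canonical_identity(*compiled_seen)
```

With the naive identity, the element is split in two, so a dependency on the compiled representation would not be attributed to the source element. Mapping both representations to one identity avoids that split in this simplified setting.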
A further complication is that whether two source code elements having the same name should be considered the same is often application specific. For example, the same source code element can be defined in multiple files that are identical copies of the same library. In some applications, e.g., violation finding, these copies should be considered identical. In other applications, e.g., dependency analysis, these copies should be considered distinct.
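The application-specific nature of identity can be sketched by giving each application its own key function. This is a minimal illustration, assuming content hashing as the notion of "identical copy"; the function names, paths, and library class are hypothetical.

```python
import hashlib


def content_digest(text):
    """Digest of the source text, used to recognize identical copies."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two identical copies of the same library class in different locations.
copies = [
    {"name": "lib.Json", "path": "service_a/vendor/lib/Json.java",
     "text": "class Json {}"},
    {"name": "lib.Json", "path": "service_b/vendor/lib/Json.java",
     "text": "class Json {}"},
]


def violation_key(element):
    # Violation finding: identical copies count as one element.
    return (element["name"], content_digest(element["text"]))


def dependency_key(element):
    # Dependency analysis: each copy is its own element.
    return (element["name"], element["path"])


violation_elements = {violation_key(e) for e in copies}
dependency_elements = {dependency_key(e) for e in copies}
```

In this sketch the violation-finding view contains one element and the dependency-analysis view contains two, even though both are built from the same observations.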