Static program analysis is the analysis of computer programs that is performed at compile-time without requiring execution of the programs. For example, some types of inter-procedural static analysis, run over a given computer program, generate a call graph that maps calling relationships between program variables that are present in the program. Call graphs are fundamental to many applications, such as bug detection, compiler optimization, and program understanding tools.
More specifically, a call graph is a control flow graph that represents calling relationships between methods, functions, and/or procedures in a computer program embodied in particular computer code. A call graph for a particular computer program comprises nodes representing program variables referred to in the code, and edges representing calling relationships between the program variables. A program variable may be an object instantiated in the computer program, a call site at which a particular call is made in the program, or any kind of subroutine of the program.
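As a minimal sketch of the structure described above, a call graph can be held as a mapping from each caller to the set of its callees. The class name `CallGraph`, its methods, and the subroutine names (`main`, `parse`, `report`, `read_token`) are hypothetical and chosen only for illustration; nodes here are restricted to subroutines for brevity.

```python
from collections import defaultdict


class CallGraph:
    """Directed graph: nodes are subroutine names, edges are caller -> callee."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_call(self, caller, callee):
        """Record that `caller` may call `callee`."""
        self.edges[caller].add(callee)
        self.edges[callee]  # touch the key so the callee also exists as a node

    def callees(self, caller):
        """Return the direct callees of `caller` in sorted order."""
        return sorted(self.edges.get(caller, ()))

    def reachable_from(self, entry):
        """Return every node reachable from `entry`, including `entry`."""
        seen, stack = set(), [entry]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return seen


# Hypothetical program: main calls parse and report; parse calls read_token.
graph = CallGraph()
graph.add_call("main", "parse")
graph.add_call("main", "report")
graph.add_call("parse", "read_token")
```

A client such as a bug detector would typically query this structure through reachability, for example `graph.reachable_from("main")`.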
Many static analysis frameworks provide whole program analysis to generate call graphs, which works very well on procedural-based computer programs. However, unlike procedural-based programs, object-oriented programs may include dynamic dispatches, which prevent determining the exact calling relationships that will be present for the programs at run-time.
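The difficulty can be illustrated with a small, hypothetical example. The classes `Shape`, `Square`, and `Circle` and the helpers `describe` and `dispatch_target` are assumptions made for this sketch. The single call site `shape.area()` has no fixed target: the method that runs depends on the run-time type of the receiver, which a static analysis cannot always determine.

```python
class Shape:
    def area(self):
        return 0.0


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14159 * self.radius ** 2


def describe(shape):
    # One call site, several possible targets: the callee is selected
    # at run-time from the dynamic type of `shape`.
    return shape.area()


def dispatch_target(shape):
    """Name of the method that `shape.area()` resolves to at run-time."""
    return type(shape).area.__qualname__
```

In a procedural program, the equivalent call would name exactly one function, so the corresponding call graph edge would be known without running the program.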
This indeterminacy in object-oriented programs has been a significant hurdle for practical use of object-oriented call graphs in real-world tools. In order to compensate for the lack of information regarding dynamic dispatches, static analysis techniques generally make overly-pessimistic assumptions about call graph edges. In other words, in order to cover all possible calling relationships that may result from dynamic dispatches in object-oriented programs, many call graph techniques over-approximate calling relationships between program variables, for example, by converging all possible data-flow or control-flow facts in the program into its call graph.
Moreover, whole program analysis may not be available for analyzing libraries included in the code of a computer program being analyzed, because libraries depend on an application environment that is not available prior to run-time. Further, the increasing size of modern object-oriented libraries means that a highly-inclusive call graph modeling a very large object-oriented library may contain more information than could feasibly be used by a real-world application.
A call graph that over-approximates calling relationships in the subject program is not a precise representation of the modeled program at run-time, given that highly-inclusive call graphs may include some call relationships that would never occur in actual runs of the program. This reduction in precision is sub-optimal for many applications that utilize call graphs, such as a bug detection algorithm that only requires information about those call relationships that are highly likely to be present at run-time.
In order to accommodate applications that require that every call relationship in a call graph be a call that certainly occurs at run-time, static analysis techniques may under-estimate the call relationships in the modeled program. Specifically, such techniques omit from the call graph any call relationship that is not sure to be present at run-time. Often, such under-inclusive call graphs do not completely represent the modeled program at run-time because some of the omitted call relationships will actually be present at run-time. In this way, applications that rely on a call graph that under-estimates call relationships in the modeled program do not have access to information about all call relationships that will occur in the program at run-time.
A number of static analysis techniques attempt to mitigate the issues that arise when constructing call graphs that model object-oriented programming code. For example, Class Hierarchy Analysis (CHA) builds a class hierarchy from the subject computer code that may be used as a basis for building a call graph. Specifically, the resulting class hierarchy can be used to look up the subtypes or supertypes of a given type in the modeled program. However, CHA does not take into account functions or instances of objects within object-oriented programs.
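The class-hierarchy lookup can be sketched as follows, under stated assumptions: the hierarchy is given as a hypothetical `PARENT` map (class to superclass) and a hypothetical `DECLARES` map (class to the methods it defines itself), and a virtual call is resolved against every subtype of the receiver's declared type. This is an illustrative reading of CHA-based call resolution, not a specific framework's implementation.

```python
# Hypothetical hierarchy: Disk extends Circle, which extends Shape.
PARENT = {"Shape": None, "Square": "Shape", "Circle": "Shape", "Disk": "Circle"}
# Methods each class declares itself (Disk inherits area from Circle).
DECLARES = {"Shape": {"area"}, "Square": {"area"}, "Circle": {"area"}, "Disk": set()}


def supertypes(cls):
    """Return `cls` followed by its superclasses, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def subtypes(cls):
    """Return `cls` and every class that transitively inherits from it."""
    return {c for c in PARENT if cls in supertypes(c)}


def resolve(cls, method):
    """Find the implementation of `method` that a receiver of type `cls` runs."""
    for c in supertypes(cls):
        if method in DECLARES[c]:
            return f"{c}.{method}"
    raise LookupError(f"{cls} has no method {method}")


def cha_targets(declared_type, method):
    """All implementations a call on `declared_type` could dispatch to."""
    return {resolve(c, method) for c in subtypes(declared_type)}
```

Because the result depends only on the hierarchy, a call on a `Shape` receiver yields an edge to every `area` implementation, whether or not the corresponding class is ever instantiated.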
Rapid Type Analysis (RTA) refines CHA by pruning methods in a CHA-based call graph that can never be reached, namely methods whose enclosing class is never instantiated in the program. RTA is strictly more powerful than CHA, and is still very fast and simple. However, RTA does not work with dynamic dispatches found in many object-oriented programs.
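The pruning step can be sketched by comparing a CHA-style resolution with one restricted to instantiated classes. The maps `PARENT` and `DECLARES`, the helper names, and the set of instantiated classes are all hypothetical; the sketch assumes the common formulation in which only instantiated classes are considered as possible receiver types.

```python
# Hypothetical hierarchy: Disk extends Circle, which extends Shape.
PARENT = {"Shape": None, "Square": "Shape", "Circle": "Shape", "Disk": "Circle"}
# Methods each class declares itself (Disk inherits area from Circle).
DECLARES = {"Shape": {"area"}, "Square": {"area"}, "Circle": {"area"}, "Disk": set()}


def supertypes(cls):
    """Return `cls` followed by its superclasses, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def resolve(cls, method):
    """Find the implementation of `method` that a receiver of type `cls` runs."""
    for c in supertypes(cls):
        if method in DECLARES[c]:
            return f"{c}.{method}"
    raise LookupError(f"{cls} has no method {method}")


def cha_targets(declared_type, method):
    """CHA: consider every subtype of the declared type as a receiver."""
    return {resolve(c, method) for c in PARENT if declared_type in supertypes(c)}


def rta_targets(declared_type, method, instantiated):
    """RTA-style: consider only subtypes that the program instantiates."""
    return {
        resolve(c, method)
        for c in instantiated
        if declared_type in supertypes(c)
    }


# Assumed result of scanning the program for object creation sites.
instantiated = {"Square", "Disk"}
```

With these inputs, `rta_targets("Shape", "area", instantiated)` drops `Shape.area`, since no plain `Shape` is created, but keeps `Circle.area`, because an instantiated `Disk` inherits it.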
Variable Type Analysis (VTA) further refines the principles behind RTA. Specifically, RTA collects all objects that can be created in the whole program and uses that information to prune the call graph edges. VTA goes a step further by collecting all variables that are instantiated in the whole program being analyzed and uses that information to prune the call graph edges, providing more precise information than is available from RTA. Like CHA and RTA, VTA is not field-sensitive. Also, VTA handles dynamic dispatches, which makes the technique useful for object-oriented programs. However, VTA does not address the problem of over- or under-approximation of calling relationships described above.
Furthermore, Control Flow Analysis of Order k (k-CFA) was initially formulated for functional languages, but has since evolved to support object-oriented languages. It is a points-to analysis with k-call-site-sensitivity, field-sensitivity, context-sensitive heap, and on-the-fly call graph construction, where k limits the length of the call string that records the one or more methods from which another method was called. ZCWL is an algorithm that essentially performs a k-CFA analysis in which k is the maximum call depth in the original call graph after merging strongly connected components (SCCs). Because k is different for each program, the number of contexts is much more variable than in the other variations of context sensitivity. ZCWL is also memory intensive and, for large object-oriented programs, can fail to complete due to insufficient available memory. Moreover, as with VTA, k-CFA and ZCWL do not address the problem of over- or under-approximation of calling relationships described above.
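The role of k can be illustrated with a small sketch of call-string truncation. The function name `push_call_site` and the call-site labels are hypothetical, and the sketch shows only how a context is bounded to the k most recent call sites, not a full points-to analysis.

```python
def push_call_site(context, call_site, k):
    """Extend a call-string context, keeping only the k most recent call sites.

    `context` is a tuple of call-site labels, oldest first.
    """
    if k == 0:
        # Context-insensitive case: every call shares the empty context.
        return ()
    return (context + (call_site,))[-k:]


# Two hypothetical call sites invoking the same method.
from_site_a = push_call_site((), "A.java:10", k=1)
from_site_b = push_call_site((), "B.java:42", k=1)
```

With k = 1 the two invocations are analyzed under distinct contexts, whereas with k = 0 they collapse into one; larger values of k distinguish more contexts at the cost of more analysis states.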
Analysis of incomplete program code, or of object-oriented code that includes dynamic dispatches, runs into problems that are inherently undecidable. Thus, no analysis algorithm can return both a precise and correct object-oriented call graph, where a precise call graph includes only those call relationships that occur during run-time, and a correct call graph includes all call relationships that occur during run-time. As a result, applications must use either precise (smaller) call graphs or correct (larger) call graphs, neither of which may fully answer the needs of the applications.
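The two properties can be stated as set relations between the edges an analysis reports and the edges that occur at run-time. The sketch below assumes, purely for illustration, that the run-time edge set is known; the function names and the example edges are hypothetical.

```python
def is_precise(static_edges, runtime_edges):
    """Precise: every reported edge actually occurs at run-time."""
    return static_edges <= runtime_edges


def is_correct(static_edges, runtime_edges):
    """Correct: every edge that occurs at run-time is reported."""
    return runtime_edges <= static_edges


# Hypothetical ground truth and two hypothetical analysis results.
runtime = {("main", "Square.area"), ("main", "Circle.area")}
over_approx = runtime | {("main", "Shape.area")}  # larger graph
under_approx = {("main", "Square.area")}          # smaller graph
```

Here `over_approx` is correct but not precise, and `under_approx` is precise but not correct; only a graph equal to `runtime` would satisfy both.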
As indicated above, type-based techniques such as CHA, RTA, and VTA can build an imprecise call graph, which is typically useful when it is beneficial to quickly compile a call graph that scales to large programs. Points-to-based techniques such as k-CFA and ZCWL can build a more precise call graph at the cost of scalability. However, in addition to failing to address the problem of over- or under-approximation of calling relationships, none of these techniques addresses the open-world problem, in which the analysis is performed on incomplete program code (such as libraries) that can interact with unknown code.
Thus, it would be beneficial to construct more precise call graphs that can handle incomplete program code, taking into account dynamic dispatching in object-oriented programs, without losing information about less-likely call relationships.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Further, it should not be assumed that any of the approaches described in this section are well-understood, routine, or conventional merely by virtue of their inclusion in this section.