This invention relates generally to static analysis of program code and, more specifically, relates to data flow analysis.
This section is intended to provide a background or context to the invention disclosed below. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived, implemented or described. Therefore, unless otherwise explicitly indicated herein, what is described in this section is not prior art to the description in this application and is not admitted to be prior art by inclusion in this section.
Taint analysis comprises searching for flows of data from untrusted points of input (the sources) to sensitive consumers (the sinks). In a static version of taint analysis, a program is examined without executing the code making up the program. Instead, a model of the program is created. Such a model can include the flows of data, typically represented using a flow graph, which is a representation of all paths that might be traversed through a program during execution of the program. These data flows are potential security issues unless each data flow passes through an operation (such as a sanitizer) that renders the data safe. Given a call graph G, a static taint analysis algorithm typically comprises two stages:
1) G is traversed to find sources, sinks and sanitizers in the code:
    Sources are either values obtained through field-read instructions or values returned from calls to certain methods, called source methods;
    Sinks can be either fields of certain objects or parameters of given methods, called sink methods; and
    Sanitizers are only methods.
2) An inter-procedural data-flow analysis is performed starting at the sources to determine if there are tainted flows that reach sinks without having been intercepted by sanitizers. The analysis is seeded at the variables defined by source constructs. That is, the field-read instructions and source methods are seeded with tainted values and the tainted values are followed via data flow analysis to determine the flow of the taint.
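By way of illustration only, the two stages described above can be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the graph is a hypothetical toy data-flow model in which each node is a program construct and each edge is a possible flow, and the names FLOW_GRAPH, SOURCE_METHODS, SINK_METHODS, SANITIZER_METHODS, find_constructs and find_tainted_flows are invented for this example. A real analysis would operate inter-procedurally over a call graph rather than over a flat dictionary.

```python
from collections import deque

# Hypothetical toy model (assumption): node -> nodes that data may flow to.
FLOW_GRAPH = {
    "getParameter": ["name"],
    "name": ["encode", "query"],
    "encode": ["safe_name"],
    "safe_name": ["render"],
    "query": ["executeQuery"],
    "executeQuery": [],
    "render": [],
}

# Hypothetical classification rules (assumption), standing in for a rule set.
SOURCE_METHODS = {"getParameter"}
SINK_METHODS = {"executeQuery", "render"}
SANITIZER_METHODS = {"encode"}


def find_constructs(graph):
    """Stage 1: traverse the graph and classify sources, sinks and sanitizers."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    return (
        nodes & SOURCE_METHODS,
        nodes & SINK_METHODS,
        nodes & SANITIZER_METHODS,
    )


def find_tainted_flows(graph):
    """Stage 2: seed taint at each source and follow it with a worklist.

    Flows that enter a sanitizer are not followed any further. Returns a
    list of (source, sink) pairs reached by unsanitized data.
    """
    sources, sinks, sanitizers = find_constructs(graph)
    findings = []
    for source in sorted(sources):
        tainted = {source}
        worklist = deque([source])
        while worklist:
            node = worklist.popleft()
            if node in sinks:
                findings.append((source, node))
            for successor in graph.get(node, []):
                if successor in sanitizers or successor in tainted:
                    continue  # sanitized, or already visited
                tainted.add(successor)
                worklist.append(successor)
    return findings


if __name__ == "__main__":
    for source, sink in find_tainted_flows(FLOW_GRAPH):
        print(f"tainted flow: {source} -> {sink}")
```

In this toy model the flow into executeQuery is reported because it bypasses the sanitizer, whereas the flow into render is not reported because the only path to it passes through encode.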
While such analysis is beneficial, conventional analyses of this kind still have problems. One problem involves aliasing, where, in one example, multiple fields of multiple objects refer to the same value. Aliasing may also involve relations in the heap, i.e., multiple local names for the same object. As is known, a heap is an area of memory used by a program for dynamic memory allocation. In terms of taint analysis, the model used to emulate a running program would also emulate the heap for that program. Aliasing in the heap is problematic because, if an object having multiple local names is tainted, the object should be treated as tainted when accessed through any of those names. However, many taint analysis tools either do not consider aliasing in the heap or cannot handle it.
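The aliasing problem described above can be illustrated with a minimal sketch. The AbstractHeap class and its methods are hypothetical names invented for this example, and the sketch assumes a deliberately simple heap model in which each local name points to exactly one abstract object. It is not the approach of any particular tool. Taint is recorded on the abstract object rather than on the name, so every alias of a tainted object is seen as tainted.

```python
class AbstractHeap:
    """Hypothetical toy heap model: local names point to abstract objects."""

    def __init__(self):
        self.points_to = {}           # local name -> abstract object id
        self.tainted_objects = set()  # ids of objects holding tainted data
        self._next_id = 0

    def allocate(self, name):
        """Model `name = new Object()`: bind the name to a fresh object."""
        obj = self._next_id
        self._next_id += 1
        self.points_to[name] = obj
        return obj

    def assign(self, target, source):
        """Model `target = source`: both names now alias the same object."""
        self.points_to[target] = self.points_to[source]

    def taint(self, name):
        """Mark the object behind `name` as tainted, not the name itself."""
        self.tainted_objects.add(self.points_to[name])

    def is_tainted(self, name):
        return self.points_to[name] in self.tainted_objects

    def aliases(self, name):
        """Return all local names that refer to the same object as `name`."""
        obj = self.points_to[name]
        return sorted(n for n, o in self.points_to.items() if o == obj)


heap = AbstractHeap()
heap.allocate("a")
heap.assign("b", "a")   # b is a second local name for a's object
heap.allocate("c")      # c is an unrelated object
heap.taint("a")

# An analysis that tracks taint per name would miss the alias b.
naive_tainted_names = {"a"}
```

Here heap.is_tainted("b") holds because b and a name the same object, while the name-based set naive_tainted_names contains only a and therefore misses the flow through b.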