1. Technical Field
The present disclosure relates to information technology, and, more particularly, to source code analysis.
2. Discussion of Related Art
In computer science, formal static analysis involves the automatic extraction of information about the possible executions of computer programs. A conventional approach to static source code security analysis is to model integrity and confidentiality violations as graph-reachability problems, i.e., questions of whether a path leads from one node to another in a graph. In security analysis, the source node represents a statement reading untrusted user input, and the sink node represents a statement executing a security-sensitive operation (e.g., database access). That is, source vertices are the control locations within the program where untrusted data from the user is read, and sink vertices are the locations where security-sensitive operations are performed. There are also locations in the application that are considered “sanitizers”: flows crossing through these locations are endorsed (i.e., sanitized or validated), either universally or for particular kinds of vulnerabilities, such that the user input changes status from untrusted to trusted, having been either checked positively (validated) or modified to contain only legal content (sanitized).
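The reachability formulation above can be illustrated with a minimal sketch. The graph, the node names (`read_param`, `run_query`, and so on), and the function `tainted_sinks` are hypothetical and chosen only for illustration; a real analysis would derive the graph from the program's control and data flow.

```python
from collections import deque


def tainted_sinks(graph, sources, sinks, sanitizers):
    """Return the sinks reachable from a source along a sanitizer-free path."""
    # Start from every source that is not itself a sanitizer.
    seen = {s for s in sources if s not in sanitizers}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for succ in graph.get(node, ()):
            # A flow that crosses a sanitizer is endorsed, so stop following it.
            if succ in sanitizers or succ in seen:
                continue
            seen.add(succ)
            queue.append(succ)
    return sorted(seen & set(sinks))


# Hypothetical flow graph: one flow reaches the database unchecked,
# the other passes through a sanitizer before being rendered.
FLOW = {
    "read_param": ["concat", "escape"],
    "concat": ["run_query"],
    "escape": ["render_page"],
}

report = tainted_sinks(
    FLOW,
    sources={"read_param"},
    sinks={"run_query", "render_page"},
    sanitizers={"escape"},
)
print(report)  # ['run_query']
```

Only `run_query` is reported, because the sole path to `render_page` crosses the sanitizer `escape`.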
Static source code security analysis holds the promise of finding all vulnerabilities in an application because the analysis simultaneously models all possible execution paths within the application and, because of over-approximation, additional paths that cannot occur in practice.
In practice, it is highly challenging to fulfill this soundness requirement when applying static security analysis to modern, real-world applications, e.g., web applications whose code base is on the scale of 10^6 lines of code (LOC), excluding library code.
Applying standard analysis techniques to code of this scale is at best extremely expensive, and, at worst, the analysis crashes before completing the scan. This has led to several ideas on how to scale the analysis.
A simple and popular solution is to impose bounds on the analysis budget by allowing the analysis to scan only a small neighborhood around each source, ignoring certain libraries or virtual-call resolutions, and constraining the size of the application's call graph. While bounds often yield a scalable analysis, they create several problems. First, the analysis is no longer predictable: a small change in the code may cause the analysis to exceed a bound. Second, and more importantly, bounds are inherently unsound, because any flow that lies outside the bounded scope goes unexamined.
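Both problems with bounds can be seen in a small sketch. It assumes, purely for illustration, that the bound is a limit on the number of edges followed from a source; the function `reachable_within` and the graph `CHAIN` are hypothetical.

```python
def reachable_within(graph, source, bound):
    """Nodes reachable from source in at most `bound` edges (None = no bound)."""
    frontier, seen, depth = {source}, {source}, 0
    while frontier and (bound is None or depth < bound):
        # Advance one edge and keep only nodes not visited before.
        frontier = {s for n in frontier for s in graph.get(n, ())} - seen
        seen |= frontier
        depth += 1
    return seen


# Hypothetical flow in which the sink lies four edges from the source.
CHAIN = {"source": ["a"], "a": ["b"], "b": ["c"], "c": ["sink"]}

found_unbounded = "sink" in reachable_within(CHAIN, "source", None)
found_bounded = "sink" in reachable_within(CHAIN, "source", 3)
print(found_unbounded, found_bounded)  # True False
```

With a bound of three edges the vulnerable flow is silently missed, which is the unsoundness. The unpredictability follows from the same example: a bound of four edges finds the flow, but adding a single intermediate statement to the program would push it out of scope again.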
Another common solution is to use synthetic models for large libraries, which represent the library's behavior simplistically. This avoids the need to scan large amounts of code, but soundness again becomes a concern.
Another approach is modular analysis, in which each method is analyzed independently to produce a general summary of that method. Later, when a client of that method is analyzed, the analysis can reuse the summary without having to reanalyze the method. While elegant and attractive, the modular approach is challenged by several fundamental problems. First, it is not clear how to construct a sound summary for a method manipulating pointer-based data structures. Summaries are valid only under the analysis scope under which they were built. If existing classes are modified or new classes are introduced, previously constructed summaries may have to be invalidated and recomputed, thereby canceling out the advantages of modularity. Second, modular summaries are often imprecise due to the need to simultaneously account for all possible behaviors of the summarized method.
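The idea of computing a summary once and reusing it at each call site can be sketched as follows. The sketch makes strong simplifying assumptions that are not part of the text above: methods are straight-line sequences of assignments over named variables, the result is held in a variable called `ret`, and there is no heap or pointer manipulation, which is precisely the case the text identifies as hard. All names (`METHODS`, `summarize`, `call_is_tainted`) are hypothetical.

```python
def summarize(params, body):
    """Return the set of parameters whose value may flow to 'ret'."""
    # Map each variable to the parameters it may depend on.
    flows = {p: {p} for p in params}
    for target, operands in body:  # straight-line code: one pass suffices
        flows[target] = set().union(*(flows.get(o, set()) for o in operands))
    return flows.get("ret", set())


# Hypothetical methods: name -> (parameters, list of (target, operands)).
METHODS = {
    "join": (["prefix", "user"], [("tmp", ["prefix", "user"]), ("ret", ["tmp"])]),
    "constant": (["user"], [("ret", [])]),
}

SUMMARIES = {}  # cache: each method is analyzed at most once


def summary_of(name):
    if name not in SUMMARIES:
        params, body = METHODS[name]
        SUMMARIES[name] = summarize(params, body)
    return SUMMARIES[name]


def call_is_tainted(name, tainted_args):
    """True if a call may return tainted data, given its tainted parameters."""
    return bool(summary_of(name) & tainted_args)


print(call_is_tainted("join", {"user"}))      # True
print(call_is_tainted("constant", {"user"}))  # False
```

The cache also makes the invalidation problem concrete: if the body of `join` changes, its entry in `SUMMARIES` is stale and must be recomputed, along with any results derived from it.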
As such, there is a need for a method and apparatus for carrying out static source code security analysis in a scalable and efficient manner.