Field of the Invention
The disclosed invention relates generally to a system, apparatus and method for code-level statistical analysis, and more particularly, but not by way of limitation, to a system, apparatus and method for usable code-level statistical analysis with applications in malware detection.
Description of the Related Art
Code-level analysis is a growing need. Both the size and the complexity of modern software systems are constantly growing. Challenges include use of complex frameworks and third-party libraries, obfuscation (for IP protection), dependence on the environment (e.g., cloud VMs, physical devices or containers), etc. In parallel, the properties of interest for analysis are often nontrivial, which adds yet another dimension of complexity. As an example, the landscape of security threats has become rich and diversified, with many new application-level threats and threat categories discovered every year, which creates a difficult challenge for security verification. All these different challenges due to software and property complexity have led to the point where automation is absolutely essential.
These same sources of complexity now also pose a challenge to classical forms of static analysis, such as abstract interpretation, which create a bottom-up model of the program's semantics via a fixpoint process. Often, constructs like exception handlers and reflection lead the analysis to an overly conservative solution, which limits its practical value.
In light of the challenges faced by traditional static program analysis, there has recently been a trend of combining static analysis with machine-learning and statistical-learning techniques so as to empirically overcome the noise introduced by specific code patterns and constructs. This has proven extremely effective, pushing the precision of static program analysis to another level.
What is lost in this evolution (from traditional code analysis to analysis that also involves machine learning) is the ability to relate the response that the analysis gives to the query at hand back to code-level artifacts. In the past, when a property was determined to be violated, the analysis could generate a so-called code-level counterexample, so that the user could reason about the problem (deciding whether it is a true warning and, if so, how to address it). With statistical analysis, different aspects of the program are abstracted away as feature vectors, and so the report, while more precise, is also completely opaque.
Therefore, there is a need for a code-level statistical analysis that is more efficient and usable in malware detection.