The growth of computer networking has brought with it an explosion in the number of malicious software attacks, commonly referred to as malware attacks. Malware, short for malicious software, is software that is designed for hostile or intrusive purposes. For example, malware may be designed with the intent of gathering information, denying or disrupting operations, accessing resources without authorization, or other abusive purposes. Types of malware attacks may include viruses, worms, trojan horses, spyware, adware, botnets, rootkits, and distributed denial-of-service (DDoS) attacks. Malware developers typically distribute their software via the Internet, often clandestinely. As Internet use continues to grow around the world, malware developers have more incentives than ever for releasing this software.
Malware is often installed on a computer when its browser communicates with a malicious web page that exploits browser vulnerabilities. These malicious web pages may use an “exploit” or “exploit kit” to download and run an executable program of an attacker's choosing on a visitor's computer without any interaction from the user. That is, flaws in either the browser or automatically launched external programs and extensions can allow a malicious web page to install malware automatically when a user visits the malicious web page, such that the user may not be aware of the installation.
Often the web page accessed by a user is an “innocent” web page that has been compromised to reference a malicious web page. Attackers compromise the “innocent” web page in order to drive traffic to their malicious web page, which is used for hosting malicious code, such as malicious executables or browser exploit code. Attackers find ways to inject Hypertext Markup Language (HTML) code into an innocent web page, which causes the visitor's browser to request and process content from the malicious web page without the visitor's knowledge. For example, injected HTML may be an iframe or script HTML element, which are elements that do not require user interaction with the “innocent” web page to download malicious content. Attackers can use search engines to find web pages that are vulnerable. This often leads to mass injection campaigns where a large number of vulnerable web pages are injected with code to reference a few malicious web pages.
When a browser accesses a web page, a web server sends HTML data to the browser. The browser parses the HTML data and builds a Document Object Model (DOM) tree. The browser then requests additional resources referenced in the HTML, such as JavaScript files. Element nodes of the DOM are accessible from JavaScript. JavaScript is usually used to write functions that are embedded in the HTML of a web page and that interact with the DOM of the web page. When the JavaScript is parsed and executed, the nodes of the DOM can be changed based on the execution. The browser then renders the web page based on the DOM tree and information for each node of the DOM tree. In addition to the elements of the DOM, the browser provides other objects that represent various features of the browser, each of which has its own methods that can be called. The browser also has its own compiled code that implements any function or method called on DOM elements or JavaScript objects.
As an example, assume that a programmer has written JavaScript code to create a new div element using a method of the document object, such as document.createElement. Assume that the programmer has also written JavaScript code to append this new element to the body element of the DOM using the document.body.appendChild method. When this code is executed, the new div element will be appended to the body element of the DOM. Similarly, an attacker can write JavaScript code that, if successfully injected into the HTML of a web page, can cause an element, such as an iframe, to be appended to the body element of the DOM. This iframe could have height and width values of zero, and reference a malicious web page. As a result, a user's browser accessing a web page with the injected code would also access the malicious web page, without the malicious web page being displayed. Malicious exploit code could then be downloaded to the user's computer, without any interaction from the user other than accessing the web page with the injected code.
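The two cases described above can be sketched in JavaScript as follows. This is a minimal illustration under stated assumptions, not code taken from any actual attack: the function names appendDiv and appendHiddenIframe, the stub returned by makeStubDocument, and the placeholder URL are all hypothetical and introduced here only for illustration. In a browser, the real document object would be passed in; the stub exists only so that the sketch can run outside a browser.

```javascript
// Benign case: create a div and append it to the body element of the DOM.
function appendDiv(doc) {
  const div = doc.createElement("div");
  doc.body.appendChild(div);
  return div;
}

// Injected case (hypothetical): append an iframe with zero height and width
// that references another page, so nothing is visibly rendered.
function appendHiddenIframe(doc, url) {
  const frame = doc.createElement("iframe");
  frame.width = "0";
  frame.height = "0";
  frame.src = url;
  doc.body.appendChild(frame);
  return frame;
}

// Assumption: a tiny stand-in for the browser's document object that models
// only createElement and body.appendChild.
function makeStubDocument() {
  const body = {
    tagName: "BODY",
    children: [],
    appendChild(node) {
      this.children.push(node);
      return node;
    },
  };
  return {
    body,
    createElement(tag) {
      return { tagName: tag.toUpperCase(), children: [] };
    },
  };
}

const doc = makeStubDocument();
appendDiv(doc);
appendHiddenIframe(doc, "http://malicious.example/");
// Lists the tag names of the elements now attached to the stub body.
console.log(doc.body.children.map((n) => n.tagName));
```

Both functions use the same two calls, which suggests why the mere presence of createElement or appendChild in a script does not by itself distinguish benign code from injected code.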
Often multiple malware exploits or tasks are implemented in tandem, causing a computer to download, store, and then execute a malware executable, also referred to as a binary module. In many cases, a successful exploit results in the automatic installation of a malware binary module, often called a “drive-by download.” The installed malware may enable a malware attacker to gain remote control over the compromised computer system and, in some cases, enable a malware attacker to steal sensitive information, send out spam, or install more malicious executable modules over time.
Malware propagation wastes valuable resources, such as system user time, system administrator resources, network bandwidth, disk space, and central processing unit (CPU) cycles. Malware can also corrupt data files such that the originals may not be recoverable. Additionally, malware can cause the compromised computer to transmit confidential data (e.g., banking information, passwords, etc.) to the malware attacker.
Previous attempts to identify these types of malicious code have focused on matching certain expressions or strings in the code. Attackers are aware of these attempts and have taken further steps to hide their code from this type of matching. For example, attackers often obfuscate the injected code, so that it is unreadable by humans without analysis, and so that it cannot be matched by simple expression or string matching. Attackers also spread scripts out over multiple files that are requested by the browser, and then concatenate the pieces together when the code is executed.
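The limitation of string matching can be illustrated with a short JavaScript sketch. The signature, the sample scripts, the placeholder URL, and the helper name naiveMatch are hypothetical and chosen only for illustration; actual signatures and actual obfuscation are typically far more elaborate.

```javascript
// Assumption: a simple signature that a string-matching scanner might look for.
const SIGNATURE = "<iframe";

// Naive detection: report a match only if the signature appears verbatim.
function naiveMatch(source, signature) {
  return source.includes(signature);
}

// A script that writes the iframe tag directly; the signature appears verbatim.
const plainScript =
  'document.write("<iframe width=0 height=0 src=http://malicious.example/>");';

// The same tag split into pieces that are joined only at run time.
// Each array element stands for a fragment delivered in a separate file.
const fragments = [
  'var a = "<ifr";',
  'var b = "ame width=0 height=0 src=http://malicious.example/>";',
  "document.write(a + b);",
];
const splitScript = fragments.join("\n");

// The value the split script would pass to document.write when executed.
const runtimeValue =
  "<ifr" + "ame width=0 height=0 src=http://malicious.example/>";

console.log(naiveMatch(plainScript, SIGNATURE));  // true
console.log(naiveMatch(splitScript, SIGNATURE));  // false
console.log(naiveMatch(runtimeValue, SIGNATURE)); // true
```

In this sketch the signature is absent from the split source text but present in the string that results from execution, which is the gap that matching on source text alone leaves open.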
Accordingly, an improved approach to identifying malicious code is needed. The disclosed embodiments address one or more of the problems set forth above.