Documents created and executed by various applications, including, for example, document rendering applications such as Microsoft® Word® and Adobe® Acrobat®, contain not only simple binary content interpreted by the document rendering applications, but can also include, as part of the documents themselves, software necessary to interpret data in the documents. Because of their ability to contain and execute software, such documents can be considered complex code injection platforms. The injected code can be of various types, such as, for example, macros (e.g., scripts written in Microsoft® Visual Basic®) and JavaScript® (e.g., embedded in Adobe PDF® files).
While the ability to embed software into documents provides various advantages to users, it can also be used by attackers to launch attacks on digital data processing devices. In some cases, malicious code may attack upon execution. In other cases, embedded malicious code can lie dormant for use in a future multi-partite attack. For example, one type of attack embeds malicious code in the padding areas of the binary file format of a document, or replaces normal textual data with malicious code.
One issue in inhibiting such attacks is that it can be difficult for a user or a system to determine whether code embedded in a document is, for example, useful and friendly or harmful and malicious. For example, software can be injected into a document as obfuscated encoded code (e.g., code represented as image data that, when decoded and rendered at runtime, can be executed to perform malicious activities). In some cases, attackers may even entice a user to launch embedded malicious code. For example, as illustrated in FIG. 20, embedded malicious object 2010 has the message “CLICK HERE” displayed below it. A user who follows these instructions launches an attack on the user's own system. In some cases, a parsed document in the Object Linking and Embedding (OLE) structured storage format, which contains nodes and directories, can harbor various exploits, such as buffer overflows or vulnerabilities in other applications. For example, FIG. 22 illustrates an example of the internal structure of a parsed document in OLE format, where attackers may craft data that exploits vulnerabilities to redirect the execution of Microsoft® Word® to a particular location and execute arbitrary embedded malicious code, such as code in the “1Table” sector.
In some cases, attackers may obfuscate or shape the attacking code so that it appears to be the same as, for example, benign code surrounding it. Code, including benign code, tends to have a high entropy statistical distribution, so some attackers, for example, may inject malicious code into benign code in an attempt to avoid detection. FIG. 21 illustrates an uninfected Microsoft® Word® document 2111 and the same document 2112 embedded with a known malicious code sample (in this case, the malicious code is known as “Slammer”). A Symantec® anti-virus scanner has been installed and is running on this system; however, it does not detect the embedded malicious code even though Slammer is a known attacker. In addition, there is no discernible change to the appearance of the document that may, for example, make a user suspicious that malicious code is present.
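By way of illustration only, the notion of a "high entropy statistical distribution" can be made concrete with a short Python sketch that computes the Shannon entropy of byte values, which ranges from 0 bits per byte (a single repeated value) to 8 bits per byte (all 256 values equally frequent). The function names `byte_entropy` and `windowed_entropy` and the 256-byte window size are assumptions introduced for this example; they are not part of any method described herein. The sketch suggests why entropy alone may be a weak signal: when injected code and the surrounding benign code both score high, a per-window entropy profile may show little difference between them.

```python
import math
from collections import Counter
from typing import List


def byte_entropy(data: bytes) -> float:
    """Return the Shannon entropy of the byte values in `data`, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value 0..255
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(data: bytes, window: int = 256) -> List[float]:
    """Return the entropy of each consecutive, non-overlapping window of `data`.

    The window size is an illustrative assumption; the final window may be
    shorter than `window` bytes.
    """
    return [byte_entropy(data[i:i + window]) for i in range(0, len(data), window)]


if __name__ == "__main__":
    padding = b"\x00" * 256           # stand-in for an unused padding area
    text = b"The quick brown fox jumps over the lazy dog. " * 6
    for name, sample in (("padding", padding), ("text", text)):
        print(name, round(byte_entropy(sample), 3))
```

In such a sketch, a region filled with a single repeated value scores zero, ordinary text scores somewhere in between, and compressed, encrypted, or executable content tends to score closer to the upper bound.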
Generally speaking, embedding malicious code within documents is a convenient way to attack a digital data processing device. Such attacks can be targeted and difficult to stop due to the number of document-exchange vectors and particular vulnerabilities in word processing programs. Moreover, detecting malicious code embedded in a document is increasingly difficult due to the complexity of modern document formats.
Accordingly, it is desirable to provide methods, media, and systems that overcome these and other deficiencies of the prior art.