Malware is a general term for malicious software. Underlying all such software is some malicious intent or purpose. For example, some malware is designed to vandalize by causing data loss or damage to equipment. Other malware is designed to steal data. Still other malware is designed to subject users to forced advertising.
Malware is often categorized by the way it carries out the underlying malicious intent or purpose. Infectious malware, for example, is so named because of the way in which it spreads from one system to another system, much like a virus. In fact, certain infectious malware programs are referred to as viruses. Concealed malware is often disguised as something that is not harmful or malicious.
The two most common vehicles that attackers have used to deliver malware have been emails and programs intentionally or unintentionally downloaded from the Internet. Notably, attackers often use electronic documents to deliver malware. Such documents may include Portable Document Format (PDF) documents, Microsoft® Word documents, Hypertext Markup Language (HTML) documents and others. The more sophisticated and complex the document format, the better suited it is to delivering malware.
In a broad sense, electronic documents are composed of data and metadata. The data is essentially that part of the document that is perceived by the party viewing the document. It is the intended content—the text, the images, the graphics. In contrast, the metadata provides information about the data (e.g., the purpose of the data, when the data was created, the format of the data, the source of the data). If, for example, the data is a digital image embedded in the document, the metadata associated with the digital image may convey the size of the image, the color of the image and/or the resolution of the image. While the metadata is processed in order to render the data, it is generally invisible to the user.
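By way of illustration only, the distinction between data and metadata may be sketched as follows. The `EmbeddedImage` class and the metadata keys are hypothetical names chosen for this sketch; actual document formats such as PDF encode this information in their own ways.

```python
from dataclasses import dataclass, field


@dataclass
class EmbeddedImage:
    """Hypothetical model of a digital image embedded in a document."""
    pixels: bytes                                  # data: what the viewer perceives
    metadata: dict = field(default_factory=dict)   # metadata: describes the data


def describe(image: EmbeddedImage) -> str:
    """Summarize the metadata a renderer would consult before drawing."""
    width = image.metadata.get("width")
    height = image.metadata.get("height")
    color = image.metadata.get("color_space", "unknown")
    return f"{width}x{height} {color}"


image = EmbeddedImage(
    pixels=bytes(12),  # a 2x2 RGB image at 3 bytes per pixel
    metadata={"width": 2, "height": 2, "color_space": "RGB", "dpi": 72},
)
summary = describe(image)
```

In this sketch the renderer reads `metadata` to interpret `pixels`, yet only the rendered pixels are ever shown to the user.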
Prior methods and systems for identifying malware associated with or embedded in electronic documents have focused on pattern matching, that is, searching the data for certain predefined byte sequences. The byte sequences may represent, for example, certain predefined words that are indicative of malicious intent, such as sexually explicit words. When these prior methods or systems identify one or more predefined byte sequences in a given document, the document may be discarded before the malicious intent can be realized or carried out.
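A minimal sketch of such a byte-sequence search is given below. The signature list and the function names `find_signatures` and `should_discard` are assumptions made for illustration and are not taken from any particular prior system.

```python
# Hypothetical predefined byte sequences; a real system would use its own list.
SIGNATURES = [b"EXPLICIT", b"EVIL-MACRO"]


def find_signatures(data: bytes, signatures=SIGNATURES) -> list:
    """Return every predefined byte sequence that occurs in the data."""
    return [sig for sig in signatures if sig in data]


def should_discard(data: bytes) -> bool:
    """Discard the document if at least one predefined sequence is found."""
    return bool(find_signatures(data))


clean = b"Minutes of the quarterly meeting"
suspect = b"Click here: EVIL-MACRO start"
```

Here `should_discard(suspect)` evaluates to true and `should_discard(clean)` to false, because the search inspects nothing beyond the literal bytes of the data.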
There are two significant deficiencies with pattern matching. First, a document that contains one of the predefined byte sequences may not be malicious. In this instance, the pattern matching process may identify a document as containing malicious intent when, in fact, it contains none. This result may be referred to as a false positive. False positives are undesirable because they may lead to the discarding of perfectly safe, malware-free electronic documents. Second, the associated electronic document may not contain any byte sequences indicative of malicious intent. Nevertheless, the electronic document may still contain malware embedded in the metadata. In this case, the aforementioned pattern matching process would completely miss the malware because the process only looks at the data. This, in turn, may lead to a false negative result, where the malware embedded in the metadata of the electronic document goes undetected and the malicious intent is fulfilled.
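The two deficiencies may be illustrated with the following sketch, in which a scanner inspects only the data portion of a document. The `Document` type, the `SIGNATURE` value and the `is_malicious` ground-truth flag are hypothetical and exist only to make both outcomes visible.

```python
from typing import NamedTuple


class Document(NamedTuple):
    data: bytes         # content perceived by the viewer
    metadata: bytes     # information about the data, not shown to the viewer
    is_malicious: bool  # ground truth, known only for this illustration


SIGNATURE = b"FLAGGED"  # hypothetical predefined byte sequence


def flagged(doc: Document) -> bool:
    """Pattern matching that looks at the data only, never the metadata."""
    return SIGNATURE in doc.data


def outcome(doc: Document) -> str:
    """Compare the scanner's verdict with the ground truth."""
    hit = flagged(doc)
    if hit and not doc.is_malicious:
        return "false positive"
    if not hit and doc.is_malicious:
        return "false negative"
    return "true positive" if hit else "true negative"


# A harmless report that happens to contain the predefined sequence.
benign = Document(b"report quoting the word FLAGGED", b"author=alice", False)
# A document whose payload sits in the metadata, out of the scanner's view.
hidden = Document(b"quarterly figures", b"payload-in-metadata", True)
```

Under these assumptions, `outcome(benign)` yields a false positive and `outcome(hidden)` yields a false negative.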
This issue is exacerbated by the proliferation of malware creation programs, which allow attackers to augment certain file types with malicious payloads. Pattern matching efforts are largely defeated by this approach because of the large number of malicious files that these programs produce.
What is needed is a method and/or a system that identifies malware embedded in electronic documents using a balanced approach, such that the discarding of otherwise malware-free documents is minimized and the discarding of documents containing malware is maximized.