Some modern cyber attacks are conducted by distributing digital documents that contain embedded malware. Common file formats used for malware distribution include the Portable Document Format (PDF) and the Microsoft Word formats (DOC, DOCX). When an unsuspecting user opens a malicious document, the malware embedded therein executes and compromises the user's system. Since system compromise is undesirable, methodologies and tools are needed for classifying documents as either malicious or benign and for determining their disposition.
One approach for classifying a document is to check for anomalies in static features extracted from the document. Another approach, such as that employed by antivirus scanners, is to test the document against byte-signatures derived from previously seen malicious documents. Yet another approach works by monitoring the run-time behavior of a document viewer for unexpected actions as it renders the document. All of these approaches for malicious document detection are trained on, or seeded with, characterizations of previously encountered malicious and/or benign documents. For instance, traditional antivirus systems rely on curated databases of byte-signatures to detect malicious documents, and machine learning approaches rely on models trained using features (weighted byte n-grams, dynamic execution artifacts, etc.) extracted from a corpus containing malicious and/or benign documents. This reliance on prior characterizations results in inefficiencies, for instance unwieldy corpus sizes and unnecessary training.
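
By way of illustration only, the byte-signature approach and the byte n-gram features mentioned above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular antivirus product or of the approaches referenced herein: the signature names, the byte patterns, and the function names match_signatures and byte_ngrams are hypothetical, and the patterns shown can also occur in benign documents.

```python
from collections import Counter

# Hypothetical signature database (name -> byte pattern). These values are
# illustrative assumptions only and are not real antivirus signatures.
SIGNATURES = {
    "example-open-action": b"/OpenAction",
    "example-embedded-exe": b"MZ\x90\x00",
}


def match_signatures(data, signatures=SIGNATURES):
    """Return the names of all signatures whose byte pattern occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


def byte_ngrams(data, n=2):
    """Count overlapping byte n-grams, one kind of static feature for a model."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


if __name__ == "__main__":
    sample = b"%PDF-1.4 /OpenAction << /S /JavaScript >>"
    print(match_signatures(sample))         # names of matching signatures
    print(byte_ngrams(sample).most_common(3))  # three most frequent 2-grams
```

Both functions depend on material prepared in advance: match_signatures needs a curated pattern database, and n-gram counts are only useful once a model has been trained on counts from a labeled corpus. This is the dependence on previously encountered documents noted above.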