Information rendered digitally as ASCII characters usually contains high levels of redundancy. Examples demonstrating such redundancy include measurements of entropy in the English language by Claude Shannon and others that indicate that each 7-bit ASCII character carries roughly one bit of information. (See, e.g., Claude E. Shannon, Prediction And Entropy Of Printed English. Bell System Technical Journal, pp. 50-64, 1951). One manifestation of this redundancy is the tendency of certain ASCII characters to follow others in specific sequences. These tendencies are measurable in all forms of highly structured ASCII data files, including XML or spreadsheet data rendered as ASCII characters in an ASCII data file.
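The tendency of certain characters to follow others can be quantified as a conditional entropy: the average uncertainty, in bits, about the next character once the previous character is known. The following is a minimal sketch and not a method taken from the cited work; the function name `conditional_entropy` is an assumption made for illustration, and estimates computed from short samples are biased low.

```python
from collections import Counter
from math import log2


def conditional_entropy(text):
    """Estimate H(next char | previous char) in bits from bigram counts.

    Hypothetical helper for illustration only. A low value means the
    previous character strongly predicts the next one.
    """
    pairs = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    firsts = Counter(text[:-1])            # counts of each pair's first character
    total = sum(pairs.values())
    # Sum over pairs of p(a, b) * log2(p(a) / p(a, b)).
    return sum((n / total) * log2(firsts[a] / n) for (a, _), n in pairs.items())


if __name__ == "__main__":
    # A strictly alternating string is fully predictable from one character back.
    print(conditional_entropy("abababab"))
    # Here each character is followed by either symbol equally often.
    print(conditional_entropy("aabba"))
```

Applied to a large body of English text, such an estimate is typically well below the 7 bits available per ASCII character, which is one way to observe the redundancy described above.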
When binary data is rendered as ASCII characters, the apparent randomness among the characters increases. Example methods of rendering binary data as a string of 7-bit ASCII characters include Base64 and UUencode encoding. One example of binary data that may be rendered as ASCII characters within an ASCII file is malicious executable code, or malware. Malware can be a computer virus, worm, Trojan horse, spyware, adware, etc. Ordinarily, malware is hidden within executable files. It is customary, when data is transferred from one network domain to another, to scan the data for executable malware because such malware could threaten the integrity of data in the destination network. File types with complex binary formats, such as Microsoft Office documents and PDF files, are considered high-risk formats because many methods are available to embed potentially malicious executable code within files in such formats.
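The increase in apparent randomness can be illustrated by comparing the per-character Shannon entropy of ordinary English text with that of Base64-encoded binary data. This is a rough sketch under stated assumptions: the helper `char_entropy`, the sample sentence, and the use of seeded pseudo-random bytes as a stand-in for binary content are all illustrative choices, not part of any described method.

```python
import base64
import random
from collections import Counter
from math import log2


def char_entropy(text):
    """Zeroth-order Shannon entropy of the characters in text, in bits."""
    counts = Counter(text)
    total = len(text)
    return sum((n / total) * log2(total / n) for n in counts.values())


# Illustrative English sample (assumption: any ordinary prose behaves similarly).
english = (
    "Files containing only 7-bit ASCII content are considered low risk, "
    "because the content can easily be constrained to specific formats."
)

# Stand-in for binary content: 3000 pseudo-random bytes from a fixed seed.
rng = random.Random(0)
binary = bytes(rng.randrange(256) for _ in range(3000))
encoded = base64.b64encode(binary).decode("ascii")

if __name__ == "__main__":
    print(f"English text:   {char_entropy(english):.2f} bits/char")
    print(f"Base64 payload: {char_entropy(encoded):.2f} bits/char")
```

Because Base64 draws almost uniformly from a 64-symbol alphabet when the underlying bytes look random, its per-character entropy approaches 6 bits, while English prose sits noticeably lower. Compressed or encrypted binary content tends to behave like the random stand-in used here; other binary content may be less uniform.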
Files containing only 7-bit ASCII content are considered low risk, because the content can easily be constrained to specific formats that may be verified with data filtering software. For this reason, ASCII text files are widely used to transfer information in high-security environments. However, in certain cases malware may be hidden within an ASCII data file. For example, it is possible to embed executable code in 7-bit ASCII using encoding methods such as Base64 or UUencode, as is routinely done to attach binary files to emails. Before invocation, the encoded executable must be decoded back to its native form. Although encoded executable code cannot be invoked directly in its encoded form, it still presents a threat that must be mitigated in high-security environments. In such environments, embedded binary code must be detected before it can be removed or quarantined.
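The embedding described above can be sketched with the Python standard library. The payload below is a harmless, hypothetical stand-in (a few header-like bytes, not a working executable), and the surrounding text is invented for illustration; the point is only that the encoded form consists entirely of 7-bit ASCII characters and must be decoded before the original bytes are recovered.

```python
import base64

# Hypothetical stand-in for binary content; not a real executable.
payload = b"\x7fELF\x02\x01\x01\x00" + bytes(range(200, 256))

# Encoding yields bytes drawn only from the Base64 alphabet (7-bit ASCII).
encoded = base64.b64encode(payload)

# The encoded text can sit inside an otherwise ordinary ASCII file.
ascii_file = b"Quarterly report\n" + encoded + b"\nEnd of report\n"

# The native form is only recovered by an explicit decoding step.
decoded = base64.b64decode(encoded)

if __name__ == "__main__":
    print(all(byte < 128 for byte in ascii_file))  # every byte is 7-bit ASCII
    print(decoded == payload)                      # round trip restores the bytes
```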
If the ASCII file is highly structured, it is possible to write a data filter to parse the characters into defined fields whose string contents conform to acceptable rules. Such filters are known to provide a high level of security, but are also complicated and tend to be difficult to configure and maintain.
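A field-level filter of the kind described above might look like the following sketch. The three-field, comma-separated record layout, the field names, and the rules in `FIELD_RULES` are all hypothetical; a real filter would need rules written for the specific file format, which is part of why such filters are laborious to configure and maintain.

```python
import re

# Hypothetical rules for a three-field, comma-separated record.
FIELD_RULES = [
    ("part_id", re.compile(r"[A-Z]{2}\d{4}")),
    ("quantity", re.compile(r"\d{1,6}")),
    ("description", re.compile(r"[A-Za-z0-9 .-]{1,40}")),
]


def record_is_acceptable(line):
    """Return True only if every field fully matches its rule."""
    fields = line.split(",")
    if len(fields) != len(FIELD_RULES):
        return False
    return all(
        rule.fullmatch(value) is not None
        for (_, rule), value in zip(FIELD_RULES, fields)
    )


if __name__ == "__main__":
    print(record_is_acceptable("AB1234,10,Hex bolt"))
    # A Base64-looking string fails here because '/' and '=' are not permitted.
    print(record_is_acceptable("AB1234,10,TVqQAAMAAAAEAAAA//8AALgAAAA="))
```

Note that a loosely written rule can still admit encoded data: a Base64 fragment consisting only of letters and digits would satisfy the hypothetical `description` rule above, so the strength of the filter depends on how tightly each field is constrained.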
As a result, it is desirable to have a method and system for identifying binary data rendered as ASCII characters within an ASCII file to assist in the identification of and protection from malware hidden as binary data within the file.