Computer systems are open to attack because conventional software applications can mishandle malformed documents. Application software may be written in such a way that it properly handles documents that it has created, but it may be induced to mishandle a malformed document designed to achieve this: such an application may for example exhibit unexpected behaviour, such as interpreting the malformed document's data as code that the application executes.
It is known to defend against attacks implemented by malformed documents by checking incoming documents that a sensitive computer system receives, possibly from a potential attacker. Checking ascertains that an incoming document is correctly formed and consists of constructs that vulnerable applications running on the computer system are able to handle properly. Documents found to contain malformed constructs are blocked so they do not reach vulnerable applications.
Full content checking by a full checker may involve performing a complete check of the file's data against the file's format specification: for example, to ensure that an Adobe PDF document fully meets the PDF file format's specification, every byte of the file is compared with that specification. Alternatively, only a partial check against a file's format specification may be performed, but it can still be referred to as a full content check: for example, a PDF document may be checked to ensure that it has correct main structures of pages etc., but without checking every byte that makes up a page description. A full checker may therefore perform a complete or partial check of a file's data against the relevant file format specification, and then it may also enforce some additional constraints: for example, there may be an additional check to make sure a PDF document does not contain any JavaScript code. In order to impose an additional constraint where a partial check is conducted, the partial check must cover appropriate parts of the relevant file format specification in enough detail to enforce the constraint.
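By way of illustration only, a partial check combined with an additional constraint might be sketched as follows. The function name `partial_pdf_check` and the choice of checks are assumptions made for this sketch and are not taken from any particular checker. The sketch is deliberately naive: it scans raw bytes for the `/JavaScript` and `/JS` keys and would miss keys hidden inside compressed streams or written with escaped characters, which is exactly why a real partial check must parse enough of the format to enforce the constraint.

```python
import re

PDF_HEADER = b"%PDF-"
# Keys that commonly introduce JavaScript actions in a PDF dictionary.
JS_KEYS = re.compile(rb"/(JavaScript|JS)\b")


def partial_pdf_check(data: bytes) -> list[str]:
    """Hypothetical partial check: main structure plus a 'no JavaScript' constraint.

    Returns a list of problems; an empty list means the checks passed.
    """
    problems = []
    # Partial structural check: header at the start, end-of-file marker near the end.
    if not data.startswith(PDF_HEADER):
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker")
    # Additional constraint (naive byte scan, assumption of this sketch).
    if JS_KEYS.search(data):
        problems.append("contains a JavaScript key")
    return problems
```

A document for which the returned list is non-empty would be blocked rather than passed on.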
Document checking pays regard to a document's file format. Most file formats are characterised by a Characteristic pattern in a document's first few bytes: these can be referred to as Characteristic Header formats. When an application handles a document as a file with a Characteristic Header file format, it opens the file and examines the file's first few bytes in order to determine the file's likely format. If the format determined in this way is one that the application is configured for, the application then proceeds to deal with the data in an appropriate way. Otherwise, i.e. if the format is not appropriate, the application stops trying to deal with the data and reports an error to its user.
For example, upon opening a file, Microsoft Word looks at the file's first few bytes in order to determine whether the file is in native Word 97 format, Word XML format, Rich Text format or plain text. It then proceeds to interpret the file's other data appropriately for that format.
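The examination of a file's first few bytes can be sketched as below. This is a minimal illustration, not a description of how Microsoft Word or any other application is implemented; the table `MAGIC` and the function `sniff_header` are hypothetical names, and only a handful of well-known leading byte patterns are listed.

```python
from typing import Optional

# (leading bytes, format label) pairs; the labels are arbitrary names for this sketch.
MAGIC = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # OLE compound file, the container of Word 97 documents
    (b"{\\rtf", "rtf"),                             # Rich Text Format
    (b"%PDF-", "pdf"),                              # Adobe PDF
    (b"\xff\xd8\xff", "jpeg"),                      # JPEG/JFIF image
]


def sniff_header(data: bytes) -> Optional[str]:
    """Return the label of the first Characteristic pattern found at offset 0, if any."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    return None
```

Only a fixed, small number of bytes is inspected, so the cost of this determination does not grow with the size of the file.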
For Characteristic Header file formats, content checkers are known which examine a file's first few bytes in order to deduce how the file's data will be treated by software applications and so apply checks appropriate for those applications.
Some file formats (referred to as “Headerless”) do not begin with Characteristic patterns, i.e. in the first few bytes. A software application might be required to search throughout a file having a Headerless file format in order to find a Characteristic pattern that indicates the start of data relevant to the application; if so, the application would ignore all data before the Characteristic pattern. The Zip archive format for Zip files is an example of a Headerless file format that is in widespread use: a Zip file may have redundant data before and after its Zip data. Applications such as WinZip ignore data found at the start of a Zip file and instead search through the file for some characteristic bytes indicating that Zip data is present.
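A search of this kind can be sketched as follows. The byte values are the Zip format's local file header and end of central directory signatures; the function name `find_zip_signatures` is an assumption of this sketch, and a real application would go on to parse the records found rather than merely locate them.

```python
ZIP_LOCAL_HEADER = b"PK\x03\x04"        # signature at the start of each archived entry
ZIP_END_OF_CENTRAL_DIR = b"PK\x05\x06"  # signature of the record near the end of the archive


def find_zip_signatures(data: bytes) -> dict[str, int]:
    """Locate Zip signatures anywhere in the data; -1 means 'not found'.

    Unlike a header check, both searches may have to examine every byte.
    """
    return {
        "first_local_header": data.find(ZIP_LOCAL_HEADER),
        "last_end_of_central_dir": data.rfind(ZIP_END_OF_CENTRAL_DIR),
    }
```

For data consisting of 40 redundant bytes followed by Zip data, `first_local_header` would be 40, showing that everything before that offset is simply skipped.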
Headerless file formats present conventional content checkers with a problem. A content checker that intercepts a file passed to a sensitive computer system must determine the file's format in order to apply appropriate checks. If the computer system uses only Characteristic Header file formats, this would be a quick process as only the first few bytes of the file would need to be inspected; but if Headerless file formats are in use, then a content checker will need to search through all data in a file to ascertain whether or not the file contains any such format's Characteristic pattern: this can be time consuming.
A common means of speeding up the process of checking files with Headerless file formats is to rely on a file's name extension to determine its format. For example, a file with a name ending in “.zip” is considered to be a Zip file and so an application will open it as such. A content checker could use the same strategy. Having ascertained the file's type from its file extension, the content checker could proceed to check that the file's format complies with an appropriate format specification. In the case of a Zip file, the content checker would then search for a Characteristic pattern of Zip archive data and check that this pattern is correct. Files in other formats are checked against their respective format specifications: this avoids wasting time searching for Zip data.
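The extension-based strategy amounts to a simple lookup, sketched here under the assumption of a hypothetical table `CHECKER_BY_EXTENSION` that maps name extensions to format labels.

```python
import os
from typing import Optional

# Hypothetical mapping from file name extension to the format specification to check against.
CHECKER_BY_EXTENSION = {
    ".zip": "zip",
    ".pdf": "pdf",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
}


def format_from_name(filename: str) -> Optional[str]:
    """Choose a format from the file name alone; the file's bytes are never consulted."""
    ext = os.path.splitext(filename)[1].lower()
    return CHECKER_BY_EXTENSION.get(ext)
```

The lookup is fast because it never reads the file, but for the same reason its result depends entirely on the name the file happens to carry.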
Unfortunately, the strategy of relying on a file name extension is a poor one for a content checker. This is because it is not difficult to change the file name extension after the file has been checked, and it is possible for a file to conform to both the specification of a Characteristic Header format and a Headerless file format, or even to both of two different Headerless file formats: such files are referred to as “polymorphic”. A polymorphic file can be opened with equal success by applications that handle the file's different file formats. For example, it is possible for a file to be a valid JPEG/JFIF image file and a valid Zip archive file: such a file starts with the Characteristic pattern of a JPEG/JFIF file and contains Zip archive data within the body of the image file. For this reason a content checker needs to search for characteristic bytes indicating the presence of a Headerless file format even though a file has a valid Characteristic Header format. This means a simple approach to content checking will be relatively slow.
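The following sketch illustrates the polymorphic case using Python's standard `zipfile` module. The helper names `make_polymorphic_sample` and `matching_formats` are assumptions of this sketch. Note that the sample begins with the JPEG/JFIF leading bytes only and is not a complete, decodable image; it is intended to show that a header check and a Zip search can both succeed on the same data, not to reproduce a genuine attack file.

```python
import io
import zipfile

JPEG_MAGIC = b"\xff\xd8\xff"


def make_polymorphic_sample() -> bytes:
    """Build data that starts with JPEG leading bytes and also contains a Zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("payload.txt", "hello")
    # Placeholder prefix: JPEG magic plus padding, standing in for real image data.
    fake_jpeg_prefix = JPEG_MAGIC + b"\xe0" + b"\x00" * 16
    return fake_jpeg_prefix + buf.getvalue()


def matching_formats(data: bytes) -> list[str]:
    """Report every format the data appears to match, not just the first one."""
    formats = []
    if data.startswith(JPEG_MAGIC):  # Characteristic Header check
        formats.append("jpeg")
    if zipfile.is_zipfile(io.BytesIO(data)):  # Headerless check: looks for the Zip end record
        formats.append("zip")
    return formats
```

For the sample data, `matching_formats` reports both formats, and `zipfile.ZipFile` can list the archived entry despite the leading image bytes. A checker that stopped after the header check would have treated the data as an image only.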
A malformed Headerless format file that starts with redundant data will normally be blocked by a conventional content checker of the kind which searches a file's contents in their entirety to find a format's Characteristic pattern, unless the file is polymorphic. On recognising a Characteristic pattern, a conventional content checker checks a file's content, and if it is acceptable, passes the file on for processing by a sensitive computer system which it is protecting. However, this does not provide a check for a possible additional format which the file also matches. Consequently, an application running on the computer system may receive a file that has been checked only against the format indicated by the recognised Characteristic pattern, but may then open that file as if it were in a different format: this results in the application interpreting the file's data in a way that has not been checked and which may be damaging.