When digitally encoded documents (for example in Microsoft Word, PowerPoint, Excel or other data formats) are passed between systems, content-checking is often applied to ensure that such documents do not represent a security threat: for example, that they do not leak sensitive information from the source system, that they do not carry attacks against the destination system, or both.
Content-checking is a well-known technique found in the many mainstream anti-virus and Data Leakage Protection products that are commercially available, as well as in more specialist products such as those provided by Compucat. Such content-checkers are generally able to check a document and any other documents that may be embedded or contained within it. For example, where a picture is embedded in a spreadsheet, the content-checker will check the spreadsheet, then extract the picture and check that as well.
A known problem with content-checking is that the data structures used in common documents are complex and may be applied recursively and nested to several levels. This complexity is often introduced in order to minimise the space needed to represent a document or to make it easier for the application program to load or modify the document.
As a result of this complexity, known content-checkers are themselves complex. Typically such content-checkers are implemented in software, since nested and recursive data structures are difficult to handle in hardware logic. The situation is made worse by the need to handle embedded content, potentially of a document type distinct from that of the enclosing document (for example a spreadsheet embedded within a slide presentation document), and to support the modification of document structures by the content-checker itself. As a result it is difficult to produce an implementation that is both highly trusted and high performance.
There are two known strategies for handling embedded data in a content-checker: the first is to recursively invoke a content-checker when embedded data is encountered; the other is to report embedded content to a controlling framework for later checking.
The recursive strategy is relatively straightforward to implement in software and is fast, because the embedded data can be processed as soon as it is encountered. It is also straightforward to allow such a checker to modify the data, since the checker for the outer “containing” document is in a position to replace the original embedded data directly with a modified version in its output stream. This technique is used in mainstream content-checking products directed to removing viruses from documents.
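By way of illustration only, the recursive strategy might be sketched as follows. The `Document` type, the `BLOCKED` word list and the `check_recursive` function are hypothetical names introduced for this sketch and do not correspond to any particular product; a real checker would parse an actual document format rather than a simple in-memory tree.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a parsed document and its embedded content."""
    kind: str
    text: str
    embedded: Tuple["Document", ...] = ()


# Assumed policy for the sketch: these words must not leave the source system.
BLOCKED = ("SECRET",)


def check_recursive(doc: Document) -> Document:
    """Check one document and, recursively, everything embedded within it.

    Returns a modified copy in which blocked words are redacted.
    """
    clean_text = doc.text
    for word in BLOCKED:
        clean_text = clean_text.replace(word, "[REDACTED]")

    # Embedded data is checked as soon as it is encountered, and the outer
    # checker places the modified version directly into its own output.
    clean_children = tuple(check_recursive(child) for child in doc.embedded)
    return Document(doc.kind, clean_text, clean_children)


if __name__ == "__main__":
    slide = Document(
        "presentation",
        "Q3 results",
        (Document("spreadsheet", "SECRET margin 40%", (Document("picture", "logo"),)),),
    )
    print(check_recursive(slide))
```

Because the outer call builds the output document itself, the location of each embedded item within its enclosing document is never lost, which is what makes modification straightforward in this strategy.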
The alternative strategy of reporting embedded content to a controlling framework, which then schedules it for independent content-checking, has the advantage that the embedded data can be checked using a completely separate checker, so that any faults arising in one checker are contained. However, it has two disadvantages: moving data in and out of the framework incurs a high overhead, and modification of data is difficult because the context of its location within its enclosing document is lost.
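The framework strategy might be sketched, again purely for illustration, as a queue of independently checked items. All names here (`Document`, `CHECKERS`, `run_framework` and the individual checker functions) are hypothetical. The sketch uses exception handling as a stand-in for fault containment; a real framework would more likely run each checker in a separate process or on separate hardware.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for a parsed document and its embedded content."""
    kind: str
    text: str
    embedded: Tuple["Document", ...] = ()


def check_text(doc: Document) -> bool:
    # Assumed policy for the sketch: reject anything containing "SECRET".
    return "SECRET" not in doc.text


def check_picture(doc: Document) -> bool:
    # Simulates a checker that fails on malformed input.
    if doc.text.startswith("corrupt"):
        raise ValueError("picture decoder fault")
    return True


CHECKERS: Dict[str, Callable[[Document], bool]] = {
    "presentation": check_text,
    "spreadsheet": check_text,
    "picture": check_picture,
}


def run_framework(root: Document) -> List[Tuple[str, bool]]:
    """Check the root, then each item of embedded content reported to the queue."""
    queue = deque([root])
    results = []
    while queue:
        doc = queue.popleft()
        checker = CHECKERS.get(doc.kind)
        try:
            ok = checker is not None and checker(doc)
        except Exception:
            ok = False  # a fault in one checker does not affect the others
        results.append((doc.kind, ok))
        # Embedded content is handed back to the framework for later checking.
        # Only the item itself is queued, not its position in the parent.
        queue.extend(doc.embedded)
    return results
```

Note that each queued item carries no record of where it sat within its enclosing document, which illustrates why a checker in this arrangement cannot easily write a modified version back into place.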
In both strategies, the content-checkers work on the original document data. They check that the complex data structures are valid and that the information they carry is acceptable.
A related technique is transcoding, in which a document is translated from one format to another as it passes from one system to another. For example, a JPEG image might be converted to BMP. The purpose of this is to destroy any hidden information that might be encoded in the original document's data structures and to ensure that the delivered document is in a normal form that will be safely handled by the recipient application. An example of this approach is disclosed in patent publication WO 2005/085971A1 entitled “Threat mitigation in computer networks”.
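The effect of transcoding can be illustrated with a minimal sketch. Decoding JPEG requires a third-party library, so this sketch substitutes the much simpler binary PPM (P6) format as the input and writes an uncompressed 24-bit BMP; that substitution is an assumption made for brevity and is not the method of the cited publication. The function names `parse_ppm` and `transcode_ppm_to_bmp` are hypothetical.

```python
import struct
from typing import Tuple


def parse_ppm(data: bytes) -> Tuple[int, int, bytes]:
    """Parse a binary PPM (P6, maxval 255); return (width, height, RGB bytes)."""
    tokens = []
    pos = 0
    while len(tokens) < 4:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":  # header comment: skipped, never copied
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            raise ValueError("truncated PPM header")
        tokens.append(data[start:pos].decode("ascii"))
    pos += 1  # the single whitespace byte that follows maxval
    magic, width, height, maxval = tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3])
    if magic != "P6" or maxval != 255:
        raise ValueError("only P6 with maxval 255 is supported in this sketch")
    pixels = data[pos:pos + width * height * 3]  # trailing bytes are dropped
    if len(pixels) != width * height * 3:
        raise ValueError("truncated pixel data")
    return width, height, pixels


def transcode_ppm_to_bmp(data: bytes) -> bytes:
    """Rebuild the image as a 24-bit BMP containing only the pixel values."""
    width, height, pixels = parse_ppm(data)
    stride = width * 3
    padding = b"\x00" * (-stride % 4)  # BMP rows are padded to 4 bytes
    rows = []
    for y in reversed(range(height)):  # BMP stores rows bottom-up
        row = pixels[y * stride:(y + 1) * stride]
        bgr = bytearray(row)
        bgr[0::3] = row[2::3]  # swap red and blue: RGB -> BGR
        bgr[2::3] = row[0::3]
        rows.append(bytes(bgr) + padding)
    body = b"".join(rows)
    file_header = struct.pack("<2sIHHI", b"BM", 54 + len(body), 0, 0, 54)
    info_header = struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24, 0,
                              len(body), 2835, 2835, 0, 0)
    return file_header + info_header + body
```

In this sketch, a comment in the PPM header and any bytes appended after the pixel data are not carried into the output, because the BMP is constructed afresh from the decoded pixel values alone. Information hidden in the pixel values themselves would survive such a lossless conversion.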