Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.
In broad terms, there are two main techniques for reading an electronic document. The first is to use the native application that generated the document. Such an application understands the file format, encoding, compression, and so on used in the document, and is able to use this knowledge to process the document thereby to provide the intended rendered output. The second technique is to open the document as raw encoded text using an application other than the native application. This extracts textual information (i.e. a stream of characters) from the document, but not in a meaningful manner. Often, the extracted textual information is substantially or entirely devoid of human language.
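By way of illustration only, the second technique can be sketched as follows. The sketch assumes a hypothetical document consisting of a four-byte tag (`DOC1`) followed by zlib-compressed UTF-8 text; the tag, the function name `extract_printable_runs`, and the minimum run length are assumptions made for this example and do not describe any particular application.

```python
import re
import zlib
from typing import List


def extract_printable_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII that are at least min_len bytes long.

    This treats the document as a raw byte stream and knows nothing about
    its file format, encoding, or compression.
    """
    pattern = rb"[\x20-\x7e]{" + str(min_len).encode("ascii") + rb",}"
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Hypothetical document: a 4-byte tag followed by zlib-compressed text.
sentence = "The quick brown fox jumps over the lazy dog. "
document = b"DOC1" + zlib.compress((sentence * 20).encode("utf-8"))

# Raw extraction yields characters, but not the human-language content.
runs = extract_printable_runs(document)

# An application that knows the (assumed) format recovers the text.
recovered = zlib.decompress(document[4:]).decode("utf-8")
```

In this sketch the raw runs contain the tag and incidental printable bytes from the compressed stream, whereas `recovered` holds the original sentences, which mirrors the difference between the two techniques.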
Some applications use a hybrid approach, which first extracts raw encoded text, then identifies the document format, and then applies a set of stored rules for processing that document format thereby to provide a rendered output. Often this rendered output is not as sophisticated as the intended rendered output (as would be provided by the native application), but is sufficient for viewing and/or searching purposes. The hybrid approach fails, however, for unknown document formats.
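A minimal sketch of such a hybrid approach is given below, for illustration only. It assumes that the document format can be identified from a four-byte signature at the start of the byte stream; the signatures `TXT0` and `ZTX1`, the `RULES` table, and the `UnknownFormatError` exception are hypothetical names introduced for this example and are not taken from any existing application.

```python
import zlib
from typing import Callable, Dict


class UnknownFormatError(ValueError):
    """Raised when no stored rule matches the document's signature."""


def _render_plain(body: bytes) -> str:
    # Assumed format: the body is uncompressed UTF-8 text.
    return body.decode("utf-8", errors="replace")


def _render_compressed(body: bytes) -> str:
    # Assumed format: the body is zlib-compressed UTF-8 text.
    return zlib.decompress(body).decode("utf-8", errors="replace")


# Stored rules, keyed by hypothetical 4-byte format signatures.
RULES: Dict[bytes, Callable[[bytes], str]] = {
    b"TXT0": _render_plain,
    b"ZTX1": _render_compressed,
}


def render(document: bytes) -> str:
    """Identify the format from the raw bytes, then apply its stored rule."""
    signature, body = document[:4], document[4:]
    rule = RULES.get(signature)
    if rule is None:
        raise UnknownFormatError(f"no stored rule for signature {signature!r}")
    return rule(body)
```

A document beginning with a signature absent from `RULES` causes `render` to raise `UnknownFormatError`, which corresponds to the failure of the hybrid approach for unknown document formats.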
There is a need in the art for improved systems and methods for processing unknown document formats.