Using modern word processing program modules to create and edit electronic files, or electronic documents, is often convenient and efficient. However, under certain circumstances, files may become damaged or corrupted. Damaged files are often unreadable by the application program module that created them. Thus, the time invested in creating the file is lost unless some of the file can be salvaged.
There are many different causes of file damage, or file corruption. One cause is a communication error, i.e., garbled transmission of a file via a modem or network. Another cause of file damage is a disk error, i.e., a failure of the storage media on which the file resides. A bug in an application program module that creates a file may also damage the file. Still another cause of file damage is a failure in an operating system while a user is working with a file. Thus, document corruption is a common problem that is difficult to avoid.
When a file is damaged, users want to retrieve the undamaged data, i.e., the undamaged bytes, from the file. Users are annoyed if they cannot retrieve any data from a damaged file because the file must be completely reconstructed. In a word processing electronic file, the most important data contained in the file is almost always the actual text of the document. The formatting and the non-textual elements are usually less important. Thus, many different converters have been developed to retrieve undamaged text from a damaged file. Unfortunately, these converters are often incompatible with modern file formats, which are described below.
Files created by modern application programs often have complex file formats. Modern file formats typically contain intricate, interconnected data structures. For example, consider the file format of the "WORD 8.0" program, a word processing program module marketed by Microsoft Corporation of Redmond, Washington. The "WORD 8.0" program has a file format comprising both single byte ASCII characters and Unicode characters. Unicode is a worldwide character encoding standard that uses two bytes to identify a character by defining one two-byte value to represent the same character worldwide. Thus, modern file formats, such as the "WORD 8.0" file format, are often quite complex and may contain both single byte characters and multiple byte characters.
The complexity of modern electronic files has some interesting ramifications with regard to damaged files. Should a modern file become damaged, there is a high probability that it will be unreadable by the program that created it. In contrast, corruption of a file stored in a simpler file format is unlikely to cause the file to become unreadable. For example, consider a file stored in the relatively simple file format known as plain ASCII text. If data in a plain ASCII text file becomes damaged, a text editor may be used to read the file and a user may then correct the damaged portions of the file. Thus, the user may extract and salvage uncorrupted portions of the damaged file. However, modern electronic files are often unreadable when damaged because there is a scarcity of applications for extracting data from a modern file format. For example, once again consider the file format of the "WORD 8.0" program. Currently, only one converter and the "WORD 8.0" program itself can read the "WORD 8.0" file format. In contrast, if a plain ASCII text file becomes corrupted so that it is unreadable by a particular converter, there are many other converters that may be used to attempt to retrieve the undamaged data. Unfortunately, the converters designed to recover ASCII text cannot extract text from damaged documents with multiple-byte text.
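The limitation described above can be illustrated with a minimal sketch. It assumes a hypothetical converter that salvages text by scanning for runs of printable single byte ASCII characters, and it assumes the two-byte text is stored as little-endian 16-bit values. The function name, the sample strings, and the byte layout of the damaged buffer are illustrative assumptions only and do not represent the "WORD 8.0" file format or any actual converter.

```python
import re


def extract_ascii_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable single byte ASCII characters.

    Hypothetical stand-in for a converter designed to recover ASCII text.
    """
    pattern = re.compile(rb"[ -~]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Illustrative buffer: damaged bytes surrounding one single byte text
# region and one two-byte text region (assumed little-endian, 16 bits
# per character).
single_byte_text = "Title page".encode("ascii")
two_byte_text = "Body text".encode("utf-16-le")
damaged = (
    b"\xff\x00\x13" + single_byte_text +
    b"\x00\x07\xfe\x01" + two_byte_text +
    b"\x02\xff"
)

recovered = extract_ascii_runs(damaged)
print(recovered)
```

In this sketch the scan returns only the single byte region. In the two-byte region every printable byte is followed by a zero byte, so no run reaches the minimum length and that text is missed, even though none of its bytes are damaged.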
Therefore, if a modern application program file becomes damaged, it is likely that the damaged file will be unreadable by the creating application program. It is also likely that there is no external converter to read the file. Thus, in many cases, there is no way for a user to recover undamaged data contained within a damaged file.
There is a need in the art for a method and system for recovering text from a damaged file with a modern, complex file format. There is a further need in the art for a method and system for recovering text from a damaged electronic file comprising single byte characters and multiple byte characters.