The present invention relates generally to electronic textual documents, and more particularly to translating an electronic textual document having a first format into an electronic target document having a second format.
Electronic textual documents exist in many different formats, ranging from plain ASCII text to proprietary editor and viewer formats. In recent years, greater interest and importance have been attached to storing documents in formats that are public or that may be translated into public formats. One particular class of public formats is known as Standard Generalized Markup Language (SGML), a standards-based tagging methodology that makes a document independent of platform and application while allowing information such as formatting, indexing, and linked information to remain within the document. SGML works by embedding SGML-compliant codes, known as tags, in the document; the tags are then used to build the document into its final formatted form. These standards-based tagging methodologies are gaining in use, especially in the publishing industry. However, there exists a vast amount of material, both electronic documents and paper documents that can be scanned and processed with optical character recognition, in non-standard formats that cannot be readily translated into SGML-compliant formats.
Since the value of a document depends upon its accessibility, further value is added when the document can be displayed in different environments using different viewers, which sometimes require different formats. Thus, there is a need to be able to translate documents from one format or tagging scheme to another. Currently, several types of document viewing and editing software provide internal or external translators that convert from their own format to an industry-standard format and to the formats of other products. Typically, these translators are written in low-level languages such as C or C++, or are built using LEX and YACC to construct parsers. LEX is a tool for building lexical analyzers, which identify the next token in the character stream being processed. YACC is a tool for creating rule-based parsers, which receive the stream of tokens from the lexical analyzer, identify the patterns in it, and ensure that the syntax is legal. Once such a parser has been written to understand a particular format, code may then be written to output the information in the target format. Coding these translators is labor intensive and time consuming. Therefore, there is a need for an easy-to-use approach that translates documents without requiring a great deal of time and specialized skill to write the translation code.
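To make the conventional approach concrete, the following is a minimal sketch in Python of the three stages described above: a lexical analyzer that produces tokens, a parser that checks the token stream against a grammar, and output code that writes the target format. It does not use LEX or YACC themselves. The source format shown (elements written as "@name{text}"), the function names, and the escaping of the element text are all assumptions made for illustration and do not correspond to any particular product or to the invention.

```python
import re
from html import escape
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Hypothetical source format: elements written as "@name{text}".
TOKEN_RE = re.compile(
    r"@(?P<COMMAND>[A-Za-z]+)|(?P<LBRACE>\{)|(?P<RBRACE>\})|(?P<TEXT>[^@{}]+)"
)


def tokenize(source: str) -> Iterator[Token]:
    """Lexical analysis: yield the next token in the character stream."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup
        yield kind, match.group(kind)
        pos = match.end()


def parse(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Parsing: accept only COMMAND LBRACE [TEXT] RBRACE sequences."""
    elements = []
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "TEXT" and text.strip() == "":
            i += 1  # whitespace between elements is ignored
            continue
        if kind != "COMMAND":
            raise SyntaxError(f"expected a command, found {kind}")
        name = text
        if i + 1 >= len(tokens) or tokens[i + 1][0] != "LBRACE":
            raise SyntaxError(f"expected '{{' after @{name}")
        i += 2
        body = ""
        if i < len(tokens) and tokens[i][0] == "TEXT":
            body = tokens[i][1]
            i += 1
        if i >= len(tokens) or tokens[i][0] != "RBRACE":
            raise SyntaxError(f"expected '}}' to close @{name}")
        i += 1
        elements.append((name, body))
    return elements


def emit(elements: List[Tuple[str, str]]) -> str:
    """Output: write each element as an SGML-style start tag, text, end tag."""
    return "\n".join(
        f"<{name}>{escape(body, quote=False)}</{name}>" for name, body in elements
    )


def translate(source: str) -> str:
    return emit(parse(list(tokenize(source))))


if __name__ == "__main__":
    print(translate("@title{Hello}\n@para{A & B}"))
```

Even for this two-rule grammar, the tokenizer, the parser, and the output code must each be written and kept consistent by hand, which illustrates why translators built this way are labor intensive.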
In addition to the translation problem described above, there often exists a need to restructure a document. If the document is in an SGML-compatible format, restructuring may be done by simply editing the document type definition (DTD). However, if the document is not in a standard format, restructuring it may be very difficult. Thus, if an easy-to-use translator were available, this problem could be overcome by treating the restructuring as a translation problem.