This invention relates to conversion of electronic documents between different formats.
A significant number of written documents are created in word processing applications on computers. The Microsoft® Word and Corel® WordPerfect® programs are examples of two common word processing applications. In addition to allowing users to enter text in a document, conventional word processing programs allow users to place lists, tables, images, textboxes, equations and other types of objects in the document.
When a user creates a document in a word processing application, the application usually records information about the logical elements in the document. Each logical element has a logical type (for example, a header, a paragraph, a table cell, or an image) and associated content (for example, a string of characters or image data) having a visual appearance (for example, a certain font, size, or color).
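As an illustration only, a logical element of the kind described above could be modeled as a record with a logical type, associated content, and a visual appearance. The following is a minimal sketch; the names LogicalType, Appearance, and LogicalElement and the default values are hypothetical and do not reflect the internal representation of any particular word processing application.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Union


class LogicalType(Enum):
    """Hypothetical set of logical types, taken from the examples in the text."""
    HEADER = "header"
    PARAGRAPH = "paragraph"
    TABLE_CELL = "table_cell"
    IMAGE = "image"


@dataclass
class Appearance:
    """Visual appearance of the content; the defaults are illustrative assumptions."""
    font: str = "Times New Roman"
    size_pt: float = 12.0
    color: str = "black"


@dataclass
class LogicalElement:
    """One logical element: its type, its content, and how the content looks."""
    logical_type: LogicalType
    content: Union[str, bytes]  # a string of characters or raw image data
    appearance: Appearance = field(default_factory=Appearance)


# A header and a paragraph, as a word processing application might record them.
heading = LogicalElement(LogicalType.HEADER, "Background", Appearance(size_pt=16.0))
body = LogicalElement(LogicalType.PARAGRAPH, "This invention relates to ...")
```

In this sketch the logical type is stored separately from the content and the appearance, which is the distinction the remainder of this background relies on.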
Often the author of a document may want other people to review the document. A convenient way of distributing the document to the reviewers is to distribute the document in electronic form, for example as an e-mail attachment. To ensure that the readers can open the document on their computers, the author may choose to convert the document into a final format before attaching it to the e-mail so the document can be read without having to use a particular word processing application. Such a final format document contains all the information necessary to display or print the document on most computers, that is, the associated content and the visual appearance of the source document's logical elements, but the logical types are typically ignored in the document conversion process. One example of a final format is the portable document format (PDF). The conversion of a source document into a PDF document is typically made by “printing” the source document from a word processing application, for example by using a printer driver that can generate a PDF document, such as an Adobe® Acrobat PDFWriter printer driver, or by using a PostScript printer driver to produce a PostScript document and then converting the PostScript document to PDF using a conversion program such as Adobe Acrobat Distiller.
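The loss of logical types during conversion can be illustrated schematically. The following sketch is not an actual printer driver or PDF writer; SourceElement, DrawnText, and to_final_format are hypothetical names used only to show that content and appearance are carried over while the logical type is discarded.

```python
from typing import List, NamedTuple


class SourceElement(NamedTuple):
    """Hypothetical source-document element with its logical type."""
    logical_type: str  # for example "header" or "paragraph"
    content: str
    font: str
    size_pt: float


class DrawnText(NamedTuple):
    """Hypothetical final-format item: only what is needed to display or print."""
    content: str
    font: str
    size_pt: float


def to_final_format(elements: List[SourceElement]) -> List[DrawnText]:
    """Keep content and appearance; drop the logical type, as in a 'print' conversion."""
    return [DrawnText(e.content, e.font, e.size_pt) for e in elements]


source = [
    SourceElement("header", "Background", "Helvetica", 16.0),
    SourceElement("paragraph", "This invention relates to ...", "Times", 12.0),
]
final = to_final_format(source)
```

After the conversion in this sketch, nothing in final records which item was a header and which was a paragraph; only the text, font, and size remain.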
The absence of logical element information in the converted document limits the usefulness of the converted document. For example, the converted document is not as easily accessible as the source document, especially for visually impaired users. Visually impaired users typically need the logical element information to find different paragraphs, sentences, tables and other elements in the document when using text-to-speech converters to read a document. Also, it is difficult or impossible to recreate a document containing the same information as the source document from the final format document, because the converted document contains no logical element information. Finally, it is difficult or impossible to reflow the content of a converted document to fit a particular size of paper, display device, or display frame. The source document can easily be reflowed in the word processing application, but if a user would like to reflow the final format document, the reflow tools would have to guess where paragraphs, lists, tables, and other logical elements begin and end. Reflowing is further described in the commonly-owned U.S. patent application Ser. No. 09/635,999 entitled “Text Reflow In A Structured Document,” filed on Aug. 9, 2000.
Attempts have been made to overcome the problems outlined above. One suggested solution is to insert marks, such as PDFmarks, in the source document. Documentation about PDFmarks is available from Adobe Systems Incorporated (“Adobe”) of San Jose, Calif., in Adobe Technical Note #5189, copyright 1993-1999. The PDFmarks identify the “boundaries” of logical elements in the source document and are carried through the conversion process into the PDF document. However, the user has to perform the extra step of inserting the PDFmarks manually or automatically in the source document before converting the document, which can be tedious, time consuming, and error-prone, especially if the document is large. The PDFmarks can be inserted directly into the source document. For example, in Word it is possible to insert fields and choose a “Print fields” command to print the content of the fields and the logical elements. If the printer is a PostScript printer, commands can be passed to the printer using the inserted fields. The PDFmarks are PostScript operators that can be used to support PDF features through PostScript. Using the operators, it is possible to create, delete, or modify PDF objects when a PostScript file is converted to a PDF file, which can be done using a conversion program such as Adobe Acrobat Distiller. The PDFmark method also has problems with accurately representing complex and nested logical elements of the source document in the PDF document, as well as elements that span pages. The problems arise because boundaries of some logical elements, such as paragraphs, may be intermingled with other logical elements, such as figures or other floating objects, or may overlap the boundaries of the page having the logical element.