Modification of electronic documents is known in the art. While PDF (Portable Document Format) documents were intended to be in a final presentation form, several plug-in tools are available for “touching up” the contents of PDF documents. Plug-in tools, however, do not facilitate major modifications of PDF documents. Some tools convert the contents of PDF documents into an editable form, such as a word processing format. These tools may work well when the documents contain non-overlapping logical components on a page. The logical components may include bodies of text, graphical illustrations, and images. If a page contains overlapping components, such as a body of text atop an image, the use of these tools tends to result in errors. Other tools require extensive user interaction and are often cumbersome to use for modifying documents.
PDF documents preserve the look and feel of the original documents by describing low-level structural objects or primitives, such as characters, lines, curves, and images, together with associated style attributes such as font, color, stroke, and fill, as known in the field of document processing. However, most PDF documents are untagged and do not contain basic high-level logical structure information, such as words, text lines, and paragraphs for text, or charts, logos, and figures for graphical illustrations. As a result, the layout or the content of the document cannot be easily reused, re-edited, or modified. Reusing the layout of a PDF document may be desirable in variable data printing, as used, for example, in direct marketing or in preparing travel brochures. In this situation, a logical component may have to be replaced or modified. Images or figure illustrations may be replaced, and bodies of text may be modified, in revising the documents. The reuse and modification of contents is often desirable in document conversion as well.
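By way of illustration only, the following minimal Python sketch models an untagged page as a flat list of low-level primitives with style attributes and shows how a body of text placed atop an image appears as overlapping bounding boxes. The names BBox, Primitive, and overlapping_pairs, and the coordinate values, are hypothetical and are not taken from the PDF format or from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in page coordinates (hypothetical model)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "BBox") -> bool:
        # Boxes that merely touch along an edge are not counted as overlapping.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass
class Primitive:
    """A low-level object: a character, line, curve, or image, plus style."""
    kind: str
    bbox: BBox
    style: dict = field(default_factory=dict)


def overlapping_pairs(page):
    """Return pairs of primitives of different kinds whose boxes overlap."""
    pairs = []
    for i, a in enumerate(page):
        for b in page[i + 1:]:
            if a.kind != b.kind and a.bbox.overlaps(b.bbox):
                pairs.append((a, b))
    return pairs


# An untagged page is only a flat list of primitives. Nothing records that
# the first two characters form a word, or that they sit atop the image.
page = [
    Primitive("image", BBox(0, 0, 200, 100)),
    Primitive("char", BBox(10, 10, 18, 22), {"font": "Helvetica", "size": 12}),
    Primitive("char", BBox(18, 10, 26, 22), {"font": "Helvetica", "size": 12}),
    Primitive("char", BBox(300, 10, 308, 22), {"font": "Helvetica", "size": 12}),
]

for a, b in overlapping_pairs(page):
    print(a.kind, "overlaps", b.kind)
```

In this sketch the two characters drawn over the image are reported as overlapping it, while the character outside the image is not. The flat list carries no grouping into words, text lines, or paragraphs, which is the missing logical structure described above.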
At least some embodiments provide improved methods and apparatus for determining the logical components of a document.