In addition to producing physical renderings of digital documents (e.g., paper prints), exchanging and archiving the digital documents themselves play an increasingly important role in business as well as private communications. To facilitate exchange and provide universal access regardless of computer system and application, general page description languages are used instead of native word processor formats for exchanging digital documents. To reuse the text content of digital documents for archiving, indexing, searching, editing, and other purposes not related to producing a visual rendering of the page, it is desirable to identify the logical (reading) order, the semantic units (words of natural languages), and the correct semantics of the text.
Digital documents described in page description languages, such as the Portable Document Format (PDF), PostScript, and PCL, sometimes include redundant text which does not contribute to the semantics of a page, but only creates certain visual effects. Shadow text effects are usually achieved by placing two or more copies of the actual (semantic) text on top of each other, each copy offset from the others by a small displacement. Because each layer of text is painted in an opaque color, the majority of the text in lower layers is obscured, while the visible remainders create a shadow effect.
Similarly, word processing applications sometimes support a feature for creating artificial bold text. In order to create a bold text appearance even if a bold font is not available, the same text is placed on the page several times in the same color. A very small displacement (relative to the font size) between the copies simulates a bold text appearance.
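The two effects described above can be illustrated with a small model of placed text. The following Python sketch is for illustration only and is not the method of the invention: the `Fragment` record, the `max_rel_offset` threshold of 0.1, and the rule of keeping the copy painted last are all assumptions made for this example. It flags a fragment as redundant when a later fragment carries the same text at the same font size and lies within a small displacement relative to that font size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical record for one piece of text placed on a page."""
    text: str
    x: float          # horizontal position in page units
    y: float          # vertical position in page units
    font_size: float  # font size in the same units
    color: str


def is_near_duplicate(a, b, max_rel_offset=0.1):
    """True if a and b show the same text at (almost) the same place.

    max_rel_offset is an assumed threshold: the largest displacement,
    as a fraction of the font size, still treated as "the same place".
    """
    if a.text != b.text or a.font_size != b.font_size:
        return False
    limit = max_rel_offset * a.font_size
    return abs(a.x - b.x) <= limit and abs(a.y - b.y) <= limit


def find_redundant(fragments, max_rel_offset=0.1):
    """Return indices of fragments that a later fragment near-duplicates.

    Assumption: fragments are listed in paint order, so the last copy
    lies on top and is kept as the semantic text.
    """
    redundant = []
    for i, lower in enumerate(fragments):
        if any(is_near_duplicate(lower, upper, max_rel_offset)
               for upper in fragments[i + 1:]):
            redundant.append(i)
    return redundant


example = [
    Fragment("Title", 101.0, 699.0, 24.0, "gray"),   # shadow layer
    Fragment("Title", 100.0, 700.0, 24.0, "black"),  # actual text on top
    Fragment("Note", 100.0, 650.0, 12.0, "black"),   # artificial bold, copy 1
    Fragment("Note", 100.3, 650.0, 12.0, "black"),   # artificial bold, copy 2
    Fragment("Note", 100.0, 600.0, 12.0, "black"),   # genuine repeat elsewhere
]

print(find_redundant(example))  # [0, 2]
```

The last fragment repeats the word "Note" far from the other copies, so it is treated as genuine page content rather than as a visual artifact.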
Shadow simulation, artificial bold text, and similar visual artifacts create severe problems when the text contents are not only rendered visually but must also be reused, e.g., for searching or editing the text. The redundant text, which contributes only to the visual appearance, severely impairs such applications because text is processed that does not semantically belong to the page contents.
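A minimal Python sketch of this problem follows. The page contents, the tuple layout `(text, x, y)`, and the helper `count_hits` are hypothetical and chosen only to show how a naive extraction that takes every placed text fragment at face value reports a word twice when the word was placed twice to simulate bold text.

```python
def count_hits(fragments, term):
    """Count fragments containing the term, as a naive search might."""
    return sum(1 for text, _x, _y in fragments if term in text)


# Hypothetical page: "Invoice" is placed twice, 0.3 units apart,
# to simulate bold text; the reader sees the word only once.
page = [
    ("Invoice", 100.0, 700.0),
    ("Invoice", 100.3, 700.0),
    ("Total: 42", 100.0, 650.0),
]

# Naive extraction concatenates every fragment in document order.
naive_text = " ".join(text for text, _x, _y in page)

print(naive_text)                   # Invoice Invoice Total: 42
print(count_hits(page, "Invoice"))  # 2
```

The extracted text and the hit count both reflect the redundant copy, although the page semantically contains the word once.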
It is an object of the present invention to provide a method of identifying fragments of text in digital documents which do not contribute to the semantics of a page, but which create visual artifacts only. Removing such redundant fragments enhances the accuracy of all processes which rely on the text semantics, such as searching, editing, or converting to other formats.