In addition to producing physical renderings of digital documents (e.g., paper prints), exchanging and archiving the digital documents themselves play an increasing role in business as well as private communications. To facilitate exchange and provide universal access regardless of computer system and application, general page description languages are used instead of native word processor formats for exchanging digital documents. To reuse the text content of digital documents for archiving, indexing, searching, editing, and other purposes which are not related to producing a visual rendering of the page, it is desirable to identify the logical (reading) order of the text as well as its semantic units (words of natural languages).
Page description languages, such as the Portable Document Format (PDF), PostScript, and PCL, provide the semantics of individual text characters as well as their position on the page. However, they generally do not convey information about words and other semantic units. The fragments comprising the text on a page may contain individual characters, syllables, words, lines, or an arbitrary mixture thereof, without any explicit marks designating the start or end of a word.
To make matters worse, the ordering of text fragments on the page may be different from the logical (reading) order. There are no rules for the order in which portions of text are placed on the page. For example, a page containing two columns of text might be produced by creating the first line in the left column, followed by the first line of the right column, the second line of the left column, the second line of the right column, etc. However, logical order requires all text in the left column to be processed before the text of the right column is processed. Extracting text from such documents by simply replaying the instructions of the page description language, and storing the characters instead of rendering them on a visual page, generally provides undesirable results since the logical structure of the text is lost.
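The two-column situation described above can be illustrated with a minimal sketch. The fragment coordinates, the text strings, and the helper names replay_order and column_order are invented for illustration, and the column boundary is assumed to be known in advance, which is not the case in a real document.

```python
# Hypothetical fragments in emission order: (x, y, text). Coordinates follow the
# PDF convention (origin at the lower left); all values are invented.
fragments = [
    (50, 700, "Left line 1"),
    (320, 700, "Right line 1"),
    (50, 686, "Left line 2"),
    (320, 686, "Right line 2"),
]

def replay_order(frags):
    """Text as obtained by replaying the page description verbatim."""
    return [text for _, _, text in frags]

def column_order(frags, column_boundary_x):
    """Reading order, assuming two columns split at a known x coordinate."""
    def key(frag):
        x, y, _ = frag
        column = 0 if x < column_boundary_x else 1
        return (column, -y, x)  # left column first, then top to bottom
    return [text for _, _, text in sorted(frags, key=key)]

naive = replay_order(fragments)
logical = column_order(fragments, column_boundary_x=300)
```

Here naive interleaves the lines of the two columns, while logical lists the whole left column before the right column. The sketch only demonstrates the problem; it does not detect columns.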
In the following description, the terms “character” and “glyph” are used; it is important to distinguish the two concepts. “Characters” are the smallest units which convey information in a language. Common examples are the letters of the Latin alphabet, Chinese ideographs, and Japanese syllables. Characters have a meaning; they are semantic entities. The Unicode standard encodes characters. “Glyphs” are different graphical variants which represent one or more particular characters. Glyphs have an appearance; they are representational entities. Fonts are used to produce visual representations of glyphs (see description of glyph metrics below). There is no one-to-one relationship between characters and glyphs. For example, a ligature is a single glyph which corresponds to two or more separate characters.
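The missing one-to-one relationship can be shown with a small sketch. The glyph names and the mapping table GLYPH_TO_CHARS are hypothetical and are not taken from any particular font or file format.

```python
# Hypothetical glyph names mapped to the characters they represent.
# "f_f_i" stands for a single ligature glyph covering three characters.
GLYPH_TO_CHARS = {"o": "o", "f_f_i": "ffi", "c": "c", "e": "e"}

# Glyphs as they might be placed on the page for the word "office".
glyphs = ["o", "f_f_i", "c", "e"]

# Mapping glyphs back to characters yields more characters than glyphs.
text = "".join(GLYPH_TO_CHARS[g] for g in glyphs)
glyph_count = len(glyphs)
char_count = len(text)
```

Four glyphs on the page correspond to six characters of text, so counting or indexing glyphs is not equivalent to counting or indexing characters.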
Page description languages such as PDF offer a variety of operators for placing text on the page. The order of text and grouping of glyphs into fragments is completely up to the application creating the PDF. The PDF file format neither mandates nor guarantees any particular ordering of the text contents comprised in a page. Although PDF guarantees the faithfulness of the final visual representation of the page, a certain visual result may be achieved by many different combinations of page marking operators.
Some of these combinations may be related to the logical (reading) order of the text, while others may disturb or even invert the logical order. As an example, the text “This time” shown in FIG. 8 could be created by the following sequence in the page description, which happens to contain the text in its logical order and even includes a space character between the words. In the PDF context, the numbers are x-y coordinates interpreted by the Td operator, whereas each text string in parentheses is interpreted by the Tj operator:
50 700 Td (This )Tj (time)Tj
In this exemplary situation, extracting text and identifying word boundaries would be a trivial task. However, the exact same visual result could also be created by the following exemplary sequence in the page description:
102 700 Td (time)Tj -52 0 Td (This)Tj
Although the visual page will look the same, the words comprising the text appear in inverted order in the page description, and a space character between the words is no longer present.
In the next exemplary combination there are no longer any words which could be identified directly in the page description, although the visual output is still exactly the same:
134 700 Td (e)Tj -20 0 Td (m)Tj -5.32 0 Td (i)Tj -6.68 0 Td (t)Tj -18.664 0 Td (s)Tj -5.336 0 Td (i)Tj -13.336 0 Td (h)Tj -14.664 0 Td (T)Tj
Not only are there no longer any identifiable words, but the reading order of the characters comprising the two words is actually inverted.
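The effect of the two inverted sequences can be reproduced with a minimal sketch that accumulates the Td offsets and records where each Tj fragment starts. The function names place_fragments and by_position are hypothetical. The sketch covers only the Td and Tj operators, writes negative numbers with an ASCII hyphen, and assumes that every Tj directly follows its own Td, so that no glyph widths are required.

```python
import re

# Matches either "<tx> <ty> Td" or "(<text>)Tj".
TOKEN = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td|\(([^)]*)\)\s*Tj")

def place_fragments(content):
    """Return (x, y, text) for each Tj, in the order the stream emits them.

    Simplification: every Tj is assumed to directly follow its own Td, so the
    fragment starts at the current line origin and no glyph widths are needed.
    """
    x = y = 0.0
    placed = []
    for m in TOKEN.finditer(content):
        if m.group(1) is not None:   # Td: move the line origin by (tx, ty)
            x += float(m.group(1))
            y += float(m.group(2))
        else:                        # Tj: record text at the line origin
            placed.append((x, y, m.group(3)))
    return placed

def by_position(placed):
    """Sort fragments top to bottom, then left to right."""
    return [text for _, _, text in sorted(placed, key=lambda f: (-f[1], f[0]))]

words_inverted = "102 700 Td (time)Tj -52 0 Td (This)Tj"
glyphs_inverted = ("134 700 Td (e)Tj -20 0 Td (m)Tj -5.32 0 Td (i)Tj "
                   "-6.68 0 Td (t)Tj -18.664 0 Td (s)Tj -5.336 0 Td (i)Tj "
                   "-13.336 0 Td (h)Tj -14.664 0 Td (T)Tj")

emitted = "".join(text for _, _, text in place_fragments(glyphs_inverted))
recovered_words = by_position(place_fragments(words_inverted))
recovered_glyphs = "".join(by_position(place_fragments(glyphs_inverted)))
```

Sorting by position restores the left-to-right order of the fragments, but for the glyph-by-glyph sequence the result is the unbroken string "Thistime". Locating the word boundary would additionally require glyph widths, which this sketch does not model.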
To make matters worse, arbitrary mixtures of such output ordering schemes may be present in an electronic document. While it may seem pointless to create such output, it is actually very common due to the characteristics of the software creating the output. For example, a product may first create all text fragments on the page which are printed in one font, and then proceed with all text fragments in the second font, etc. Another product may proceed through all text fragments depending on their color (first all black fragments, then all red fragments, etc.). Still another product could produce output according to the order in which a human typed and edited the text: a word which has been added to the first line after many lines have been typed may appear later in the page description than the remainder of the first line.
EP 0 702 322 B1 describes a system and method of identifying words in a portable electronic format by the steps of: receiving a text segment from a page of a document having multiple text segments and associated position data including x and y coordinates for each text segment, creating a text object for each text segment, entering the text objects into a linked list, and identifying words from the linked list through analysis of the text objects for word breaks and through analysis of gaps between text objects using the associated position data. However, the method described therein gives poor results in many cases, for the following reasons. Documents with a more sophisticated structure, such as documents with multiple columns, are not processed correctly because words belonging to one column are not grouped in the correct semantic order, which leads to wrong results. A further problem of the method described in the above-mentioned document is its inability to identify spaced-out characters as one associated semantic unit.
It is therefore an object of the present invention to provide a fast, reliable and error-proof method of reconstructing the semantic contents of a page, and a method of grouping characters into semantic units (words) to facilitate processing the textual contents of digital documents for editing, searching and similar tasks.