This invention relates generally to converting documents, and more particularly to converting documents that are formatted with a mark-up language.
PostScript and its variant Portable Document Format (PDF) are standard mark-up languages for formatting documents produced by word processing software programs. With a mark-up language, it is possible to exactly reproduce text, graphics, and bit maps (generally “text”) on a printed page or display screen. As an advantage, formatted documents are easily communicated and processed by many different types of output devices.
In formatted document files, text fragments and formatting commands for rendering the text are interleaved. The formatted documents are processed by interpreters. An interpreter reads the formatted file to “execute” the commands so that the location of the dots of ink on the page or the pixels on a screen can be exactly determined. The interpreter does not deal with words or sentences as such, but with the more fundamental document components such as characters, lines, and graphics.
An excerpt from a PostScript formatted document may include the following commands and text:
“%!PS-Adobe-2.0 . . . 16b(Re)o(ad)f(b)q(et)o(we)o(en)I(the lines!).”
In PostScript, text fragments are enclosed in parentheses, and the commands are interspersed among the text. A text fragment can be a single character, a sequence of characters, a word, or parts of multiple words delimited by, perhaps, blanks and punctuation marks. As shown in the example above, words may often be split over several fragments so that the beginnings and ends of the words themselves are difficult to discern.
The commands between the text fragments move the cursor to new positions on the page or new coordinates on the display, usually to modify the spacing between the letters and lines. Word separators, such as space characters visible in plain text, are usually not indicated in the formatted text; instead, explicit cursor movement commands are used. Hence, word separators only become apparent as more white space when the text is rendered.
The general problem of determining where words start and end, i.e., word ordering, is difficult. PostScript does not require that characters be rendered in a left-to-right order on lines, and a top-to-bottom order on the page or display. Indeed, the characters may be rendered in any order and at arbitrary positions.
Therefore, the only completely reliable way to identify words in a formatted document is to interpret the commands down to the character level, and to record the position and orientation of the characters as they are rendered. Then, characters that are close enough together on the same line, according to some threshold, and taking the character's font and size into consideration, are assumed to be in the same word. Those characters which are farther apart than the threshold are assigned to different words.
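The threshold test described above can be illustrated with a minimal Python sketch. It is not taken from any particular interpreter: the `Glyph` record, the `group_into_words` function, and the `gap_factor` threshold (a fraction of the font size) are hypothetical names and values chosen for illustration, and the sketch assumes that characters on the same line share the same baseline coordinate.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    """One rendered character, as an interpreter might record it (hypothetical)."""
    char: str
    x: float      # left edge of the character
    y: float      # baseline; PostScript y grows upward
    width: float  # advance width at the current font and size
    size: float   # font size in points

def group_into_words(glyphs, gap_factor=0.25):
    """Group glyphs into words using an assumed threshold of gap_factor * font size."""
    # Characters may be rendered in any order, so restore reading order first:
    # top to bottom, then left to right.
    ordered = sorted(glyphs, key=lambda g: (-g.y, g.x))
    words, current, prev = [], "", None
    for g in ordered:
        if prev is not None:
            same_line = abs(g.y - prev.y) < 0.5 * prev.size
            gap = g.x - (prev.x + prev.width)
            # A new line or a gap wider than the threshold ends the current word.
            if not same_line or gap > gap_factor * prev.size:
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words

# Glyphs deliberately listed out of reading order.
glyphs = [
    Glyph("t", 29, 100, 3, 10), Glyph("a", 12, 100, 5, 10),
    Glyph("R", 0, 100, 7, 10),  Glyph("i", 26, 100, 3, 10),
    Glyph("d", 17, 100, 5, 10), Glyph("e", 7, 100, 5, 10),
    Glyph("k", 5, 88, 5, 10),   Glyph("o", 0, 88, 5, 10),
]
print(group_into_words(glyphs))  # ['Read', 'it', 'ok']
```

Even in this simplified form, every character must be positioned before any word can be recognized, which is the source of the cost discussed next.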
Finding the correct position of each character is particularly useful when rendering text for reading, since tabs, line spacing, centering, and other visual formatting attributes facilitate comprehension of the text. As is evident, exactly locating words in formatted text can be computationally more expensive than simply rendering the text for reading.
This becomes a problem if it is desired to automatically process formatted documents in order to create, for example, an index of the words. On the World Wide Web (the “Web”), many documents are available in PostScript (or PDF) formats. This allows users of the Web to exactly reproduce graphically rich documents as they were originally authored.
In order to locate documents of interest on the Web, it is common to use a search engine such as AltaVista (tm) or Lycos (tm). With a search engine, the user specifies one or more key words. The search engine then attempts to locate all documents that include the specified key words. For this purpose, the exact location of the words on the page is of minimal interest; only their relative order matters.
Some known techniques for indexing formatted documents, such as by using the PostScript interpreter Ghostscript, perform a total interpretation of the formatting commands, and apply some other heuristic to recover word delineations. This takes time.
A simple sampling of the Web would seem to indicate that the Web contains hundreds of thousands of formatted documents having a minimum total projected size of some 40 Gigabytes. With traditional formatted document parsing techniques, which can process about 400 bytes per second, it would take about 1200 days to index the bulk of the current PostScript formatted Web documents. Given the rapid growth of the Web, indexing the Web using known techniques would be a formidable task.
We provide a high-speed computer implemented method for converting a formatted document to an ordered list of words. Our method can, on average, convert formatted Web documents about fifty times faster than known methods.
According to our method, the formatted document is first partitioned into first and second data structures stored in a memory of a computer by separately identifying text and code fragments of the formatted document. The first data structure stores the text fragments, and the second data structure stores the code fragments of the formatted document.
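The partitioning step can be sketched in Python as follows. This is only an illustrative sketch under simplifying assumptions, not the claimed implementation: the function name `partition` is hypothetical, the two data structures are plain lists, and PostScript comments, hexadecimal strings, and the exact meaning of backslash escapes are ignored.

```python
def partition(ps):
    """Split PostScript-like source into (text_fragments, code_fragments).

    code_fragments has one more entry than text_fragments: the code before
    the first string, between each pair of strings, and after the last one.
    """
    texts, codes = [], []
    code, i, n = [], 0, len(ps)
    while i < n:
        if ps[i] != "(":
            code.append(ps[i])
            i += 1
            continue
        # An opening parenthesis ends the current code fragment.
        codes.append("".join(code))
        code = []
        depth, i, text = 1, i + 1, []
        while i < n and depth > 0:
            c = ps[i]
            if c == "\\" and i + 1 < n:
                text.append(ps[i + 1])  # simplified: keep the escaped character
                i += 2
                continue
            if c == "(":
                depth += 1              # balanced parentheses may nest
            elif c == ")":
                depth -= 1
                if depth == 0:
                    i += 1
                    break
            text.append(c)
            i += 1
        texts.append("".join(text))
    codes.append("".join(code))
    return texts, codes

texts, codes = partition("16b(Re)o(ad)f(b)q(et)o(we)o(en)I(the lines!)")
print(texts)  # ['Re', 'ad', 'b', 'et', 'we', 'en', 'the lines!']
print(codes)  # ['16b', 'o', 'f', 'q', 'o', 'o', 'I', '']
```

Keeping the two lists aligned by position preserves the original interleaving, so each code fragment can later be associated with the gap between its neighboring text fragments.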
Adjacent text fragments are locally concatenated and matched against a word dictionary to form a list of possible words. This list contains every possible word that could be made from the text fragments contained in the document. A best ordered word list is formed by choosing a set of words that includes all of the text fragments and contains the fewest words.
In one aspect of the invention, we organize the text and code fragments as arcs and nodes of a graph. The nodes represent the code fragments, or equivalently the gaps between text fragments. In addition, the nodes define all places where a word might begin or end. An arc between two nodes represents the possibility of concatenating the intervening text fragments into a single word. The best possible word list is the one which can be graphically represented by the shortest chain of arcs starting at the first node and ending at the last node, and where each arc ends at a node where the next arc begins. This corresponds to a covering of the text fragments with the smallest number of words, each word defined by one arc. In the case where there are multiple best ordered lists, we select the one with the highest minimum weight. The weight of an arc is determined by the number of times the word defined by the arc is used in a large corpus of documents.
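One way to picture the shortest-chain selection is the following Python sketch, which uses a simple dynamic program over the nodes. It is offered as an illustration under stated assumptions rather than as the method's actual implementation: the function name `best_word_list`, the `word_counts` dictionary (mapping lower-case words to corpus counts), and the sample counts are all hypothetical, and the sketch returns `None` when no chain of dictionary words covers every fragment.

```python
def best_word_list(fragments, word_counts):
    """Cover all fragments with the fewest dictionary words.

    Node k is the gap before fragment k; an arc (i, j) exists when
    fragments[i:j] concatenate to a word in word_counts. Ties in chain
    length are broken by the highest minimum arc weight.
    """
    n = len(fragments)
    # best[j] = (arcs used, minimum arc weight, previous node) for node j
    best = [None] * (n + 1)
    best[0] = (0, float("inf"), None)
    for j in range(1, n + 1):
        for i in range(j):
            if best[i] is None:
                continue
            weight = word_counts.get("".join(fragments[i:j]).lower())
            if weight is None:
                continue  # no arc from node i to node j
            cand = (best[i][0] + 1, min(best[i][1], weight), i)
            # Prefer fewer arcs, then a larger minimum weight.
            if best[j] is None or (cand[0], -cand[1]) < (best[j][0], -best[j][1]):
                best[j] = cand
    if best[n] is None:
        return None
    # Walk the chain backwards from the last node to recover the words.
    words, j = [], n
    while j > 0:
        i = best[j][2]
        words.append("".join(fragments[i:j]))
        j = i
    return words[::-1]

counts = {"read": 50, "between": 40, "bet": 5, "we": 90}  # made-up corpus counts
print(best_word_list(["Re", "ad", "b", "et", "we", "en"], counts))
# ['Read', 'between']
```

In this example the chain through "bet" and "we" cannot reach the last node because "en" is not in the assumed dictionary, so the two-arc chain is the only covering.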
In another aspect of the invention, the best possible word list is used to annotate the code fragments to show whether they represent a word break or not. Because code fragments recur frequently in documents, this accumulation of local information allows for a global determination to be made whether a particular code fragment is more likely to bind adjacent text fragments into a word, or to separate them. The global determination is used to correct occasional errors in the local matching.
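The accumulation of local annotations into a global determination might be sketched as a simple vote count, as in the Python fragment below. The majority rule, the function names `accumulate` and `classify`, and the choice to treat ties as "join" are assumptions made for illustration; the source does not specify how the global determination is computed.

```python
from collections import defaultdict

def accumulate(votes, codes, is_break):
    """Record, for each code fragment between two text fragments, whether
    the best word list placed a word break at that gap."""
    for code, brk in zip(codes, is_break):
        votes[code][1 if brk else 0] += 1

def classify(votes):
    """Global determination by majority vote (ties count as 'join')."""
    return {code: ("break" if breaks > joins else "join")
            for code, (joins, breaks) in votes.items()}

votes = defaultdict(lambda: [0, 0])  # code fragment -> [join count, break count]

# Gaps between the fragments Re|ad|b|et|we|en, annotated from the
# word list "Read", "between": only the gap after "ad" is a break.
accumulate(votes, ["o", "f", "q", "o", "o"], [False, True, False, False, False])

# A later, erroneous local match claims that "o" separates two words.
accumulate(votes, ["o"], [True])

print(classify(votes))  # {'o': 'join', 'f': 'break', 'q': 'join'}
```

Here the single erroneous annotation of "o" is outvoted by the three occurrences in which it joined fragments, which is the kind of correction the global determination is meant to provide.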