1. Field of the Invention
The invention relates to improvements in methods and apparatuses for document image processing, and more particularly to improvements for recognizing and determining the frequency of phrases in a document without first decoding the words or images or referring to an external lexical reference.
2. Background
In computer-based electronic document processing, an operator often desires to know the frequency with which some or all of the words occur in the document(s) being processed. For example, Salton & McGill, Introduction to Modern Information Retrieval, Chapter 2, pp. 30, 36, McGraw-Hill, Inc., 1983, indicates that in information retrieval contexts, the frequency of use of a given term may correlate with the importance of that term relative to the information content of the document. Word frequency information can thus be useful for automatic document summarization and/or annotation. Word frequency information can also be used in locating, indexing, filing, sorting, or retrieving documents.
Another use for knowledge of word frequency is in text editing. For example, one text processing device has been proposed for preventing the frequent use of the same words in a text by categorizing and displaying frequently occurring words of the document. A list of selected words and the number of occurrences of each word is formulated for a given text location in a portion of the text, and the designated word and its location are displayed on a CRT.
An extension of this thesis is that knowledge of the frequency of sequences of words in reading order in a document, i.e., phrases, also is useful, for example, for automatic document summarization. Phrase frequency information can also be used in locating, indexing, filing, sorting, or retrieving documents.
Heretofore, though, word frequency determinations have been performed on electronic texts whose contents have been converted to a machine-readable form, such as by decoding using some form of optical character recognition (OCR). In such decoding, bit-mapped word unit images, or in some cases a number of characters within the word unit images, are deciphered and converted to coded representations of the images based on reference to an external character library. The decoded words or character strings are then compared with dictionary terms in an associated lexicon. Disadvantages of such optical character recognition techniques are that the intermediate optical character recognition step introduces a greater possibility of computational error and requires substantial time for processing, slowing the overall word unit identification process.
REFERENCES
European Patent Application No. 0-402-064 to Sakai et al. describes a text processing device in a computer system for counting the occurrence of words in a text and displaying a list of repetitive words on a CRT. The list includes the selected words together with their number of occurrences and their locations in the text. In a case where word repetition is undesirable, an operator may substitute synonyms or otherwise alter the text by using search, display, and editing actions.
European Patent Application No. 0-364-179 to Hawley describes a method and apparatus for extracting key words from text stored in a machine-readable format. The frequency of occurrence of each word in a file, as compared to the frequency of occurrence of other words in the file, is calculated. If the calculated frequency exceeds by a predetermined threshold the frequency of occurrence of that same word in a reference domain appropriate to the file, then the word is selected as a key word for that file.
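By way of illustration only, the key word selection summarized above might be sketched as follows. This is not taken from the Hawley application: the function name `select_key_words`, the representation of the reference domain as a word-to-frequency mapping, and the reading of "exceeds by a predetermined threshold" as a difference of relative frequencies (rather than, say, a ratio) are all assumptions made for the example.

```python
from collections import Counter


def select_key_words(file_words, reference_freq, threshold):
    """Return, in sorted order, the words whose relative frequency in the
    file exceeds their reference-domain frequency by more than `threshold`.

    Assumption: "exceeds by a threshold" is read as a simple difference
    between relative frequencies; a ratio test would be an equally
    plausible reading of the prose description.
    """
    counts = Counter(file_words)
    total = sum(counts.values())
    key_words = []
    for word, count in counts.items():
        file_freq = count / total
        # Words absent from the reference domain are treated as frequency 0.
        if file_freq - reference_freq.get(word, 0.0) > threshold:
            key_words.append(word)
    return sorted(key_words)


# Hypothetical data: a six-word "file" and an invented reference domain.
words = ["image", "image", "image", "the", "the", "word"]
reference = {"the": 0.30, "image": 0.01, "word": 0.15}
selected = select_key_words(words, reference, threshold=0.1)
```

In this hypothetical data only "image" is markedly more frequent in the file than in the reference domain, so it alone is selected; "the" is frequent in both and is therefore rejected.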
European Patent Application No. 0-364-180 to Hawley describes a method and apparatus for automatically indexing and retrieving files in a large computer file system. Key words are automatically extracted from files to be indexed and used as the entries in an index file. Each file having one of the index entries as a key word is associated in the index with that key word. If a file is to be retrieved, and its content, but not its name or location, is known, its key words are entered and its identifying information will be displayed (along with that of other files having that key word), facilitating its retrieval.
Concurrently filed U.S. patent application Ser. No. 07/795,173, now U.S. Pat. No. 5,325,444 to Cass et al., and entitled "Method and Apparatus for Determining the Frequency of Words in a Document Without Document Image Decoding," which application is incorporated herein by reference, describes methods and apparatus for determining word frequency in an undecoded document image based on segmentation of the document image into image units and comparing image characteristics of selected image units with image characteristics of other selected image units to determine equivalence classes of image units. The invention described herein extends this image based word frequency methodology to determination of phrase frequencies without document image decoding.
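As a rough illustration of how phrase frequencies could follow from equivalence classes, suppose each image unit has already been assigned an integer equivalence class label and that the labels are listed in reading order. Both the labels and the function name `phrase_frequencies` below are hypothetical; the sketch shows only the counting step and does not represent the segmentation or image comparison described in the referenced application.

```python
from collections import Counter


def phrase_frequencies(class_ids, phrase_length):
    """Count every run of `phrase_length` consecutive equivalence class
    labels, taken in reading order.

    Assumption: `class_ids` holds one label per image unit, and image
    units judged equivalent share a label. No image unit is decoded.
    """
    # Slide a window of the requested length over the label sequence.
    windows = zip(*(class_ids[i:] for i in range(phrase_length)))
    return Counter(windows)


# Hypothetical label sequence for five image units in reading order.
labels = [3, 7, 3, 7, 2]
pair_counts = phrase_frequencies(labels, phrase_length=2)
```

Here the label pair (3, 7) occurs twice, which would indicate a repeated two-unit phrase even though the words themselves remain undecoded.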