A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office records, but otherwise reserves all copyright rights whatsoever.
1. Cross-References to Related Applications
The following concurrently filed and related U.S. applications are hereby cross-referenced and incorporated by reference in their entirety.
"Method for Determining Boundaries of Words in Text" to Huttenlocher et al., U.S. patent application Ser. No. 07/794,392.
"Detecting Function Words Without Converting a Document to Character Codes" to Bloomberg et al., U.S. patent application Ser. No. 07/794,190.
"A Method of Deriving Wordshapes for Subsequent Comparison" to Huttenlocher et al., U.S. patent application Ser. No. 07/794,391.
"Method and Apparatus for Determining the Frequency of Words in a Document Without Document Image Decoding" to Cass et al., U.S. patent application Ser. No. 07/795,173.
"Optical Word Recognition by Examination of Word Shape" to Huttenlocher et al., U.S. patent application Ser. No. 07/796,119, Published European Application No. 0543592, published May 26, 1993.
"A Method and Apparatus for Automatic Modification of Selected Semantically Significant Image Segments Within a Document Without Document Image Decoding" to Huttenlocher et al., U.S. patent application Ser. No. 07/795,174.
"Method for Comparing Word Shapes" to Huttenlocher et al., U.S. patent application Ser. No. 07/795,169.
"Method and Apparatus for Determining the Frequency of Phrase in a Document Without Document Image Decoding" to Withgott et al., U.S. patent application Ser. No. 07/794,555 now U.S. Pat. No. 5,369,714.
2. Field of the Invention
This invention relates to improvements in methods and apparatuses for automatic document processing, and more particularly to improvements in methods and apparatuses for recognizing semantically significant words, characters, images, or image segments in a document image without first decoding the document image, and for automatically creating a summary version of the document contents.
3. Background
It has long been a goal in computer-based electronic document processing to be able to identify, access, and extract, easily and reliably, information contained in electronically encoded data representing documents, and to summarize and characterize the information contained in a document or corpus of documents which has been electronically stored. For example, to facilitate review and evaluation of the information content of a document or corpus of documents to determine its relevance to a particular user's needs, it is desirable to be able to identify the semantically most significant portions of a document, in terms of the information they contain, and to present those portions in a manner which facilitates the user's recognition and appreciation of the document contents. However, the problem of identifying the significant portions within a document is particularly difficult when dealing with images of the documents (bitmap image data), rather than with coded representations thereof (e.g., coded representations of text such as ASCII). As opposed to ASCII text files, which permit users to perform operations such as Boolean keyword searches in order to locate text of interest, electronic documents which have been produced by scanning an original without decoding, yielding document images, are difficult to evaluate without exhaustive viewing of each document image, or without hand-crafting a summary of the document for search purposes. Of course, document viewing or creation of a document summary requires extensive human effort.
On the other hand, current image recognition methods, particularly those involving textual material, generally involve dividing an image segment to be analyzed into individual characters which are then deciphered or decoded and matched to characters in a character library. One general class of such methods includes optical character recognition (OCR) techniques. Typically, OCR techniques enable a word to be recognized only after each of the individual characters of the word has been decoded, and a corresponding word image retrieved from a library.
Moreover, optical character recognition decoding operations generally require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing, especially with regard to word recognition. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and identified in a decision making process as a distinct character in a predetermined set of characters. Further, the image quality of the original document and noise inherent in the generation of a scanned image contribute to uncertainty regarding the actual appearance of the bitmap for a character. Most character identifying processes assume that a character is an independent set of connected pixels. When this assumption fails due to the quality of the image, identification also fails.
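The connected-pixel assumption described above can be illustrated with a minimal sketch. The function name `count_components`, the choice of 4-connectivity, and the two toy bitmaps are assumptions made for illustration only and are not taken from any cited method. The sketch counts connected groups of black pixels and shows how a single noise pixel bridging two characters causes them to be counted as one.

```python
from collections import deque

def count_components(bitmap):
    """Count 4-connected groups of 1-pixels in a binary bitmap (list of rows)."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and bitmap[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

# Two hypothetical "characters" separated by a blank column.
clean = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
# The same bitmap with one noise pixel bridging the gap.
noisy = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
]

print(count_components(clean), count_components(noisy))
```

In the clean bitmap the two characters are separate components; in the noisy bitmap they merge into one, which is the situation in which a per-character identification process would fail.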
4. References
European patent application number 0-361-464 by Doi describes a method and apparatus for producing an abstract of a document with correct meaning precisely indicative of the content of the document. The method includes listing hint words, which are preselected words indicative of the presence of significant phrases that can reflect the content of the document, searching the document for all the hint words, extracting the sentences of the document in which any one of the listed hint words is found by the search, and producing an abstract of the document by juxtaposing the extracted sentences. Where the number of hint words produces a lengthy excerpt, a morphological language analysis of the abstracted sentences is performed to delete unnecessary phrases and focus on the phrases using the hint words as the right part of speech according to a dictionary containing the hint words.
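The extraction step of the hint-word approach can be sketched as follows, operating on already decoded text. This is a minimal illustration only: the function names, the regular-expression sentence splitter, and the example hint words are assumptions, and the morphological analysis stage described in the reference is not modeled.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def hint_word_abstract(text, hint_words):
    """Juxtapose, in document order, every sentence containing a hint word."""
    hints = {w.lower() for w in hint_words}
    selected = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if tokens & hints:  # at least one hint word occurs in this sentence
            selected.append(sentence)
    return " ".join(selected)

# Hypothetical document and hint-word list.
document = (
    "This paper proposes a new scanner. "
    "The weather was mild. "
    "In conclusion, the scanner is faster."
)
print(hint_word_abstract(document, ["proposes", "conclusion"]))
```

Note that this sketch presupposes character-coded text, which is exactly what is unavailable when only a document image exists.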
"A Business Intelligence System" by Luhn, IBM Journal, October 1958 describes a system which in part, auto-abstracts a document, by ascertaining the most frequently occurring words (significant words) and analyzes all sentences in the text containing such words. A relative value of the sentence significance is then established by a formula which reflects the number of significant words contained in a sentence and the proximity of these words to each other within the sentence. Several sentences which rank highest in value of significance are then extracted from the text to constitute the auto-abstract.