Cross-References to Related Applications
The following concurrently filed and related U.S. patent applications are hereby cross referenced and incorporated by reference in their entirety.
"Method for Determining Boundaries of Words in Text" to Huttenlocher et al., U.S. patent application No. 07/794,392.
"Detecting Function Words Without Converting A Document to Character Codes" to Bloomberg et al., U.S. patent application No. 07/794,190.
"A Method of Deriving Wordshapes for Subsequent Comparison" to Huttenlocher et al., U.S. patent application No. 07/794,391.
"Method and Apparatus for Determining the Frequency of Words in a Document without Document Image Decoding" to Cass et al., U.S. patent application No. 07/795,173.
"Optical Word Recognition By Examination of Word Shape" to Huttenlocher et al., U.S. patent application No. 07/796,119.
"Method for Comparing Word Shapes" to Huttenlocher et al., U.S. patent application No. 07/795,169.
"Method and Apparatus for Determining the Frequency of Phrases in a Document Without Document Image Decoding" to Withgott et al., U.S. patent application No. 07/794,555.
1. Field of the Invention
This invention relates to improvements in methods and apparatuses for document image processing, and more particularly to improvements in methods and apparatuses for recognizing semantically significant portions of a document image and modifying the document image to emphasize the recognized portions without first decoding the document or otherwise understanding the information content thereof.
2. Background and References
It has long been the goal in computer-based electronic document processing to be able, easily and reliably, to identify, access and extract information contained in electronically encoded data representing documents; and to summarize and characterize the information contained in a document or corpus of documents which has been electronically stored. For example, to facilitate review and evaluation of the information content of a document or corpus of documents to determine its relevance to a particular user's needs, it is desirable to be able to identify the semantically most significant portions of a document, in terms of the information they contain; and to be able to present those portions in a manner which facilitates the user's recognition and appreciation of the document contents. However, the problem of identifying the significant portions within a document is particularly difficult when dealing with images of the documents (bitmap image data), rather than with code representations thereof (e.g., coded representations of text such as ASCII). As opposed to ASCII text files, which permit users to perform operations such as Boolean algebraic key word searches in order to locate text of interest, electronic documents which have been produced by scanning an original without decoding to produce document images are difficult to evaluate without exhaustive viewing of each document image, or without hand-crafting a summary of the document for search purposes. Of course, document viewing or creation of a document summary requires extensive human effort.
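By way of illustration only, the Boolean key word search that coded text permits can be sketched as follows. This is a minimal sketch and forms no part of the disclosed method; the function name `matches`, its parameters, and the sample file names and sentences are hypothetical. It is meant to show that such a search operates on character codes, which a scanned bitmap image does not provide.

```python
import re

def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if the coded text satisfies a simple Boolean key word query.

    all_of  : every listed word must appear (AND)
    any_of  : at least one listed word must appear, if any are given (OR)
    none_of : no listed word may appear (NOT)
    """
    # Tokenizing is only possible because the text is stored as character codes.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    if not all(w in words for w in all_of):
        return False
    if any_of and not any(w in words for w in any_of):
        return False
    return not any(w in words for w in none_of)

# Hypothetical coded (ASCII) documents.
documents = {
    "a.txt": "Scanned images of documents are hard to search.",
    "b.txt": "ASCII text supports key word searches.",
}

# Query: text AND (key OR scanned) AND NOT images
hits = [name for name, body in documents.items()
        if matches(body, all_of=("text",), any_of=("key", "scanned"),
                   none_of=("images",))]
```

No comparable operation is available for an undecoded document image, since a bitmap carries no tokens against which the query words could be tested.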
On the other hand, current image recognition methods, particularly those involving textual material, generally involve dividing an image segment to be analyzed into individual characters which are then deciphered or decoded and matched to characters in a character library. One general class of such methods includes optical character recognition (OCR) techniques. Typically, OCR techniques enable a word to be recognized only after each of the individual characters of the word has been decoded, and a corresponding word image retrieved from a library.
Moreover, optical character recognition decoding operations generally require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing, especially with regard to word recognition. Each bitmap of a character must be distinguished from its neighbors, its appearance must be analyzed, and it must be identified in a decision-making process as a distinct character in a predetermined set of characters. Further, the image quality of the original document and noise inherent in the generation of a scanned image contribute to uncertainty regarding the actual appearance of the bitmap for a character. Most character identifying processes assume that a character is an independent set of connected pixels. When this assumption fails due to the quality of the scanned image, identification also fails.
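By way of illustration only, the connected-pixel assumption noted above, and the manner in which scan quality defeats it, can be sketched as follows. This is a minimal sketch using 4-connected component counting on toy bitmaps; it is not taken from any particular OCR system, and the function name `count_components` and the sample bitmaps are hypothetical.

```python
from collections import deque

def count_components(bitmap):
    """Count 4-connected groups of '#' pixels in a list of equal-length strings."""
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != "#" or seen[r][c]:
                continue
            # New, unvisited ink pixel: flood-fill the group it belongs to.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and bitmap[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

# Two toy glyphs separated by a blank column: one component per character.
clean = ["##.##",
         "#..#.",
         "##.##"]

# A single noise pixel bridges the gap, so the two glyphs merge into one.
touching = ["#####",
            "#..#.",
            "##.##"]

# A dropped pixel splits the left glyph, so two characters yield three pieces.
broken = ["##.##",
          "...#.",
          "##.##"]

counts = [count_components(b) for b in (clean, touching, broken)]
```

In this sketch the clean bitmap yields two components, the touching bitmap one, and the broken bitmap three, so a process that equates one component with one character would find the correct number of characters only in the clean case.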
Further, one way of presenting selected portions of a scanned document image to the user is to emphasize those portions in some fashion in the document image. Heretofore, though, substantial modification of the appearance of a text image required relatively involved procedures.