1. Cross References to Related Applications
The following concurrently filed U.S. patent applications are hereby cross-referenced and incorporated herein by reference in their entirety.
"Rapid Detection of Page Orientation", U.S. patent application Ser. No. 07/794,551 to Dasari et al., filed Nov. 19, 1991, now U.S. Pat. No. 5,276,742 issued Jan. 4, 1994.
"Method for Determining Boundaries of Words in Text", U.S. patent application Ser. No. 07/794,392 to Huttenlocher et al., filed Nov. 19, 1991, now U.S. Pat. No. 5,321,770 issued Jun. 14, 1994.
"A Method of Deriving Wordshapes for Subsequent Comparison", U.S. patent application Ser. No. 07/794,391 to Huttenlocher et al., filed Nov. 19, 1991.
2. Field of the Invention
This invention relates to improvements in methods and devices for scanned document image processing, and more particularly to improvements in methods and devices for detecting function words in a scanned document without first converting the scanned document to character codes.
3. Discussion of Related Art
A common problem in computer-based document processing is the separation of content words from function words for applications such as document retrieval and browsing. Function words include determiners, prepositions, particles, and other words that play a largely grammatical role, as opposed to words such as nouns and verbs that convey topic information. Distinguishing these categories is important for methods that rely on word frequency because, although function words are the most frequently occurring lexical items in a language, they modify, rather than determine, the contents of a document.
Typically, function words can be isolated using a stop-list, which is merely a list of predetermined function words. However, a problem with distinguishing function words arises in computer-based document processing applications that operate on image data (instead of on character code representations of text, e.g., ASCII), because a stop-list lookup presupposes that the characters of each word are known. For instance, the length of a word, which can be measured from the image, is not by itself determinative of function or non-function word status.
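By way of illustration only, a stop-list lookup on character-coded text can be sketched as follows. The stop-list contents, the function name content_word_counts, and the punctuation handling are assumptions made for this sketch and are not taken from any cited reference. The sketch also shows why the approach does not carry over directly to image data: the lookup compares character strings, which are unavailable until the image has been converted to character codes.

```python
from collections import Counter

# Hypothetical, heavily abbreviated stop-list; a practical list would be far longer.
STOP_LIST = {"the", "a", "an", "of", "in", "on", "to", "and", "or", "is"}

def content_word_counts(text: str) -> Counter:
    """Count word frequencies after discarding words found in the stop-list."""
    # Split on whitespace, strip surrounding punctuation, and fold case.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_LIST)

counts = content_word_counts("The scan of the page is in the scanner.")
print(counts.most_common())
```

In this example the frequent words "the", "of", "is", and "in" are discarded, so that only the content words remain to be counted.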
Reed U.S. Pat. No. 2,905,927 describes a method and apparatus for recognizing words where three scans are employed to determine the characteristics (i.e., pattern) of the word to be identified. An upper scan obtains information indicating the number and position of full-height symbols while a lower scan derives information indicative of symbols extending below the base line. A center scan acquires information relating to the number of symbols in the word and the symbol spacing.
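The three-scan idea can be loosely illustrated in software, although the following is not the Reed apparatus and is offered only as a rough sketch. It assumes a word image given as rows of 0/1 pixels (1 denoting ink) and assumes that the rows for the upper, center, and lower scans have already been chosen; the names ink_runs and three_scan_profile are hypothetical. Each scan reports the column spans of ink along its row. A single scan line may cross one symbol more than once, so the number of runs only approximates the number of symbols.

```python
def ink_runs(row):
    """Return (start, end) column spans of consecutive ink pixels (1s) in a row."""
    runs, start = [], None
    for x, px in enumerate(row):
        if px and start is None:
            start = x                      # a run of ink begins
        elif not px and start is not None:
            runs.append((start, x))        # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, len(row)))     # run extends to the right edge
    return runs

def three_scan_profile(image, upper_row, center_row, lower_row):
    """Collect ink runs along three assumed scan rows of a binary word image."""
    return {
        "upper": ink_runs(image[upper_row]),    # full-height symbols
        "center": ink_runs(image[center_row]),  # approximate symbol count and spacing
        "lower": ink_runs(image[lower_row]),    # symbols extending below the base line
    }

# Toy image: first symbol has an ascender, last symbol has a descender.
toy_image = [
    [1, 0, 0, 0, 0, 0, 0],  # upper zone
    [1, 1, 0, 1, 1, 0, 1],  # center zone
    [0, 0, 0, 0, 0, 0, 1],  # lower zone
]
profile = three_scan_profile(toy_image, upper_row=0, center_row=1, lower_row=2)
print(profile)
```

For the toy image, the upper scan finds one run at the left, the lower scan finds one run at the right, and the center scan finds three runs separated by gaps.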