1. Field of the Invention
The present invention relates to a method and apparatus for analyzing character strings generated by optical character recognition of handwritten character strings.
2. Description of Related Art
One of the primary uses of optical character recognition (OCR) is to replace keypunching or hand entry of information from forms that were filled out by hand. Much of the information on these forms consists of words or character strings chosen from a list that is either explicitly defined for, or implicitly known by, the person filling out the form.
One example of such a list is the list of diseases that is explicitly stated on an insurance application form or implicitly known to the person completing it. Another example comprises much of the information on the United States Census Form. One particular example from the United States Census Form is the ethnic background section, especially the implicit list of Native American Indian tribes.
When trying to identify words read from forms that have been filled out by hand, problems beyond the normal spelling errors occur, and the error rate is much greater than for OCR of machine-printed characters. Even when performing optical character recognition on reasonably clearly machine-printed character strings, an OCR system will create insertion, deletion, substitution and segmentation errors. These normal OCR errors are compounded by normal handwriting errors, which include poorly formed letters, non-standard orientations, poor spacing between letters, and the normal variety in the types of pens and pencils used for writing.
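The four OCR error classes named above can be illustrated with a minimal sketch. The helper functions, the sample word "hernia", and the specific misreadings (such as "rn" read as "m") are hypothetical and chosen only for illustration; they are not taken from any particular OCR system.

```python
def substitute(s, i, ch):
    """Substitution error: the character at index i is misread as ch."""
    return s[:i] + ch + s[i + 1:]


def insert(s, i, ch):
    """Insertion error: a spurious character ch appears before index i."""
    return s[:i] + ch + s[i:]


def delete(s, i):
    """Deletion error: the character at index i is dropped."""
    return s[:i] + s[i + 1:]


def split_segment(s, i, parts):
    """Segmentation error: one character is read as several characters."""
    return s[:i] + parts + s[i + 1:]


def merge_segment(s, i, n, ch):
    """Segmentation error: n adjacent characters are read as one character."""
    return s[:i] + ch + s[i + n:]


word = "hernia"  # hypothetical entry in a list of diseases
examples = {
    "substitution": substitute(word, 0, "b"),
    "insertion": insert(word, 4, "l"),
    "deletion": delete(word, 1),
    "segmentation (split)": split_segment(word, 0, "li"),
    "segmentation (merge)": merge_segment(word, 2, 2, "m"),
}

for kind, corrupted in examples.items():
    print(f"{kind}: {word} -> {corrupted}")
```

Note that insertion, deletion and segmentation errors all change the length of the string, so every character after the error is shifted relative to the intended word.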
Conventional word identification methods are quite sensitive to deletion, insertion and segmentation errors at various locations in character strings. Examples of such prior art methods are the spell-checking methods implemented in many word processors. Moreover, the various methods developed for checking spelling in word processing and other applications are oriented towards identifying misspellings caused by human typographical and cognitive errors.
For example, U.S. Pat. Nos. 4,730,269 and 4,580,241 to Kucera et al. disclose a method for transforming a misspelled word into a word skeleton by replacing letters with a general phonetic equivalent. Such a system is useless in attempting to correct OCR-generated misspellings, as OCR errors have no relationship to the cognitive human errors discoverable by the phonetic skeleton scheme of Kucera et al.
Another example is U.S. Pat. No. 4,903,206 to Itoh et al., which discloses a method for ensuring that the correct character string for a misspelled character string is in a selected list of possible correct character strings chosen from a larger dictionary. The method of Itoh et al. assumes (correctly for typographical and cognitive errors) that characters having the lowest frequency of use have the highest probability of being correct. Such an assumption makes the method useless in correcting OCR-generated errors, as the likelihood of a character being incorrectly included or excluded from an OCR-generated character string is dependent upon the way an individual prints.
These methods can identify any number of possible words to replace the misspelled word when the misspelling is caused by typing or cognitive errors. However, few of these methods can positively identify the correct word even when the spelling errors are rather minor, and they have great difficulty with common OCR errors.