In pattern recognition techniques, including optical character recognition (OCR), digital images are processed to recognize features that may convey information about the image. In a character recognition system, for example, a digital image is processed to identify characters or character-like information: shapes are classified and mapped to known characters. A significant problem in such systems is that detection must tolerate substantial variation in a pattern from occurrence to occurrence while still providing high-confidence results. For example, with reference to FIG. 1, many occurrences of the characters labeled A and B, or C and D, are virtually impossible to distinguish. Detection is therefore ambiguous, since either possibility may be correct. The font or print quality may further improve or reduce the likelihood of correctly identifying the character represented in the image.
It is well known that providing "context," or external information, to a pattern identification process increases the confidence of correct detection or decreases ambiguity. For example, in an OCR process, checking a recognized string of characters against a dictionary of words may increase the confidence that each character has been detected correctly. Additionally, if the dictionary check shows that the string closely matches a word except for a difference of one or two characters, the OCR process may return the word corrected in accordance with the dictionary check. Additional examples of context include commonly noted letter combinations, word spelling, parts of speech, and comparison to known a priori information. See, for example, U.S. Pat. No. 4,876,731 to Loris et al.
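The dictionary check described above can be illustrated with a minimal sketch. It is not taken from any cited reference: the function names, the small word list, and the choice to count position-by-position differences between equal-length strings are all assumptions made for illustration.

```python
# Hypothetical word list standing in for a real dictionary.
WORDS = {"form", "field", "scanner"}

def char_differences(a, b):
    """Count positions at which two equal-length strings differ.

    Returns None when the lengths differ, since this simple sketch
    does not handle inserted or deleted characters.
    """
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))

def dictionary_check(recognized, dictionary=WORDS, max_diff=2):
    """Return a dictionary word for the recognized string when one is close enough.

    An exact match is returned as-is. Otherwise the word with the fewest
    differing characters (at most max_diff) is returned. If no word
    qualifies, the recognized string is returned unchanged.
    """
    word = recognized.lower()
    if word in dictionary:
        return word
    best, best_diff = None, max_diff + 1
    for candidate in sorted(dictionary):  # sorted for deterministic tie-breaking
        diff = char_differences(word, candidate)
        if diff is not None and diff < best_diff:
            best, best_diff = candidate, diff
    return best if best is not None else recognized

corrected = dictionary_check("f0rm")
```

Here the recognized string "f0rm" differs from the dictionary word "form" in a single character, so the check returns the corrected word. A string with no close match is passed through unchanged.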
Forms processing uses the spatial position of detected information as context. See, for example, U.S. Pat. No. 4,949,392 to Barski et al. For the purposes of this discussion, a structured document is defined as a commonly used hard copy document that requests entry of symbols, typically but not exclusively alphanumerics, at specific spatial locations on the hard copy document. Particular examples include tax forms, job application forms, and insurance forms. Also included, however, are other highly structured documents such as business letters, memos, facsimiles, specialized reports, scientific papers, and legal papers, which by custom or specification follow defined formats. To derive information from these structured documents, one or more hard copy sheets are scanned by a digital input scanner to derive a digital image representing the hard copy document. In one standard forms processing arrangement, a form template identifies data entry fields by location (commonly a set of x-y coordinates) on the page. The template may further define the contextual clues that may be used to check and constrain the OCR results. For example, it may be known that the required entry in a data entry field at position x,y on the sheet is a United States Social Security Number. Only numeric characters can appear in such an entry, and the format for such an entry should be 3 digits-2 digits-4 digits. Referring back to the example of FIG. 1, the letter "O" (labeled A) and the numeral "0" (labeled B) could clearly be distinguished in such a context.
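The way a template can constrain OCR results for a field may be sketched as follows. This is an illustrative sketch only: the template layout, the field name, the coordinates, and the function name are hypothetical, and the only disambiguation rule applied is the letter "O" versus numeral "0" case discussed above.

```python
import re

# Format for a United States Social Security Number: 3 digits-2 digits-4 digits.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

# In a numeric-only field, a recognized letter "O" is read as the numeral "0".
LETTER_TO_DIGIT = str.maketrans({"O": "0", "o": "0"})

# Hypothetical template: each data entry field has a page position and constraints.
TEMPLATE = {
    "ssn": {"position": (120, 340), "pattern": SSN_PATTERN, "numeric": True},
}

def constrain_field(field_name, raw_text, template=TEMPLATE):
    """Apply the template's constraints to raw OCR text for one field.

    Returns the constrained text when it matches the field's format,
    or None when the text cannot be reconciled with the template.
    """
    field = template[field_name]
    text = raw_text.translate(LETTER_TO_DIGIT) if field["numeric"] else raw_text
    return text if field["pattern"].fullmatch(text) else None

result = constrain_field("ssn", "12O-45-67O9")
```

In this sketch the raw text "12O-45-67O9" is resolved to "120-45-6709" because the field is known to be numeric, while text that does not fit the 3-2-4 format is rejected.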
While the accuracy of OCR can be dramatically increased by using a form template, such an arrangement is not perfect. A primary problem is that structured documents change over time. Even if the information requested remains the same (and it rarely does), the physical arrangement of data entry fields on a hard copy document is changed regularly for a variety of reasons. To use a template-based OCR system, the template must be updated each time the structured document is changed. In some complex systems, a library of structured documents may be created, with an added structured document detection function that distinguishes between instances of structured documents. Such processing is problematic.
In U.S. patent application Ser. No. 07/814,552 (also published as WO 941957, Sep. 1, 1994) entitled "Software Product for Categorizing Strings in Character Recognition" by Kaplan, Shuchatowitz and Mullins (hereinafter, Kaplan et al.), filed Dec. 30, 1991, and assigned to the same assignee as the present application, it is proposed that recognition can be based on lexical class. Generally, a character recognition process operates on the basis of character strings, directing to a classifying processor a matrix of possible strings, accompanied by indicators of correctness. The classifying processor "compares" each possible string to preprogrammed class or grammar rules to determine whether the string conforms to one or more sets of rules. One or more class identifiers are then attached to the data representing a possible character string, according to whether the string meets the rules. Commonly, this process will eliminate invalid strings. If no known set of rules is met, the returned data will indicate such a failure. Lexical classification has many advantages, including increased recognition accuracy for numbers and other non-word strings.
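The general idea of lexical classification can be sketched as below. This is not the method of Kaplan et al.: the class names, the regular-expression rules, and the representation of candidates as (string, score) pairs are assumptions chosen only to illustrate attaching class identifiers and eliminating strings that meet no rule set.

```python
import re

# Hypothetical class rules; each lexical class is a pattern a whole string must match.
CLASS_RULES = {
    "DATE": re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    "CURRENCY": re.compile(r"\$\d+(\.\d{2})?"),
    "WORD": re.compile(r"[A-Za-z]+"),
}

def classify(candidates):
    """Attach class identifiers to each candidate string.

    candidates is a list of (string, score) pairs, where the score stands in
    for the recognizer's indicator of correctness. Strings that satisfy no
    rule set are dropped. An empty result indicates that no candidate met
    any known set of rules.
    """
    results = []
    for text, score in candidates:
        classes = [name for name, rule in CLASS_RULES.items() if rule.fullmatch(text)]
        if classes:
            results.append((text, score, classes))
    return results

classified = classify([("$12.50", 0.6), ("S12.5O", 0.7)])
```

In this example the candidate "S12.5O" carries the higher score but conforms to none of the rules, so only "$12.50" survives, tagged with the CURRENCY class. This illustrates how class rules can improve accuracy on non-word strings where a dictionary offers no help.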
Also of interest are the following references. U.S. Pat. No. 4,654,875 to Srihari describes a method of linguistic string analysis based on the likelihood of certain letters occurring serially. U.S. Pat. No. 5,159,667 to Borrey et al. recognizes global document features based on a comparison to known document types. U.S. Pat. No. 5,052,043 to Gaborski shows a neural net system for use in OCR systems. U.S. Pat. No. 4,750,122 to Kaji et al. shows segmentation of text into words on which a dictionary search may be made. U.S. Pat. No. 5,251,294 to Abelow teaches accessing available sources of information, extracting components, labeling the components, and forming them into discrete units called contexts.
References disclosed herein are incorporated by reference for their teachings.