State-of-the-art optical character recognition machines typically have two principal character misrecognition modes: substitution and segmentation. Substitution manifests itself in two ways. The first is character substitution, where the recognition unit has captured the video information of a single character, but the features required for alphabetic determination are aliased as another character. Logically, this can only occur if there is some degree of similarity in the shape of the alphabetic characters involved. Examples of such character combinations are: B, D; D, O; O, C; l, i; etc. The second form of substitution is the character reject. As in character substitution, the recognition unit captures a single character. However, the character is rejected either because the recognition logic cannot relate the isolated features to any character or because those features satisfy more than one set of alphabetic determination criteria.
In the prior art, apparatus for selecting the correct form of a garbled input word misread by an OCR has been limited to correcting errors in the substitution misrecognition mode. To improve the performance of an optical character reader, the prior art discloses the use of conditional probabilities, for simple substitution of one character for another or for character rejection, to calculate a total conditional probability that the input OCR word was misread, given that a predetermined dictionary word was actually scanned by the OCR. But the prior art deals only with the simple substitution of confusion pairs occupying the same corresponding location in the OCR word and in the dictionary word, so the OCR word and the dictionary word must be of the same length. The prior art neither recognizes nor addresses the problem of the optical character reader's segmentation misrecognition mode.
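The position-by-position scoring attributed to the prior art can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the probability values are invented, the names `CONFUSION`, `char_probability`, `word_probability` and `best_candidate` are hypothetical, `*` stands in for a character reject, and the per-position probabilities are simply multiplied on the assumption that misreads at different positions are independent. The sketch shows only why such a scheme requires the OCR word and the dictionary word to have the same length.

```python
from math import prod

REJECT = "*"  # hypothetical symbol emitted for a character reject

# Hypothetical P(OCR character | true character) for a few confusion pairs.
# The values are invented for illustration only.
CONFUSION = {
    ("D", "O"): 0.05,
    ("O", "D"): 0.04,
    ("B", "D"): 0.03,
    ("O", "C"): 0.03,
    ("l", "i"): 0.06,
}
P_CORRECT = 0.90  # assumed probability that a character is read correctly
P_REJECT = 0.02   # assumed probability that a character is rejected
P_OTHER = 0.001   # assumed probability of any unlisted substitution


def char_probability(true_char, ocr_char):
    """Probability of reading ocr_char when true_char was scanned."""
    if ocr_char == true_char:
        return P_CORRECT
    if ocr_char == REJECT:
        return P_REJECT
    return CONFUSION.get((true_char, ocr_char), P_OTHER)


def word_probability(dictionary_word, ocr_word):
    """Total conditional probability of ocr_word given dictionary_word.

    Characters are compared at the same position only, so words of
    different length cannot be scored at all; None is returned for them.
    """
    if len(dictionary_word) != len(ocr_word):
        return None
    return prod(char_probability(t, o) for t, o in zip(dictionary_word, ocr_word))


def best_candidate(ocr_word, dictionary):
    """Return the dictionary word with the highest score, or None."""
    scored = [(word_probability(w, ocr_word), w) for w in dictionary]
    scored = [(p, w) for p, w in scored if p is not None]
    return max(scored)[1] if scored else None
```

Under these assumed values, `best_candidate("OOG", ["DOG", "COG", "DOGS"])` selects `DOG`, while `word_probability("corn", "com")` returns `None`: a misread that changes the number of characters falls outside what a same-length comparison can score.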
Segmentation misrecognition differs from simple substitution in that its independent events correspond to groupings of at least two characters. Nominally there are three types of segmentation error: horizontal splitting segmentation, concatenation segmentation, and crowding segmentation. What all three types have in common is that they are generated by the improper delineation of the beginning and end points of a character. Segmentation errors occur quite frequently in OCR output streams and constitute a substantial impediment to accuracy in text processing applications.
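How improper delineation of character beginning and end points can change the number of characters reported may be illustrated with a small sketch. The text does not say how an OCR machine locates character boundaries; the rule used here (a character ends wherever a scan column contains no ink) is an assumption chosen for simplicity, and the function name `segment` and the ink counts are hypothetical. The two failure cases shown are offered as one plausible reading of splitting and concatenation; crowding is not illustrated.

```python
def segment(ink_per_column):
    """Return (start, end) column spans, cutting at every column with no ink.

    ink_per_column holds the amount of ink found in each vertical scan
    column of a line of text. end is exclusive.
    """
    spans = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink > 0 and start is None:
            start = x                   # a character begins here
        elif ink == 0 and start is not None:
            spans.append((start, x))    # a character ends here
            start = None
    if start is not None:               # character runs to the edge
        spans.append((start, len(ink_per_column)))
    return spans


# Two well-separated characters: two spans are found.
clean = [0, 3, 4, 3, 0, 0, 2, 5, 2, 0]

# The same two characters touching (no blank column between them):
# a single span is found, so two characters are treated as one.
touching = [0, 3, 4, 3, 1, 1, 2, 5, 2, 0]

# A gap inside the first character (for example, a broken stroke):
# three spans are found, so one character is treated as two.
broken = [0, 3, 0, 3, 0, 0, 2, 5, 2, 0]

for name, profile in (("clean", clean), ("touching", touching), ("broken", broken)):
    print(name, segment(profile))
```

In both failure cases the recognition logic receives the wrong number of character images, so the resulting OCR word no longer has the same length as the word that was scanned.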