Optical character recognition, or OCR, is the process of transforming a graphical bit image of a page of textual information into a text file wherein the text information is stored in a common computer processable format, such as ASCII. The text file can then be edited using standard word processing software.
In transforming each of the characters on the page from a graphical image into an ASCII format character, prior art OCR methods first break the graphical page image into a series of graphical images, one for each character found on the page. They then extract the features of each character and classify the character based on those features. If the characters on the page are of high quality, such as on an original typed page, simple processing methods work well for converting the characters. However, as document quality degrades, such as through multiple generations of photocopies, carbon copies, facsimile transmission, or in other ways, the characters on a page become distorted, causing simple processing methods to make errors. For example, a dark photocopy may join two characters together, making it difficult to separate these characters for OCR processing. Joined characters can easily cause the process that segments characters to fail, since any method that depends on a "gap" between characters cannot separate characters that are joined. Since the feature extraction and character classification parts of the process assume that the segmentation was done correctly, they will give erroneous results on joined characters.
Light photocopies produce the opposite effect. Characters can become broken, and appear as two characters, such as the character "u" being broken in the bottom middle to create two characters, each of which may look like the "i" character. Also, characters such as the letter "e" may have a segment broken to cause them to resemble the character "c".
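By way of illustration only, the two segmentation failures described above can be sketched with a minimal gap-based segmenter. The function name `segment_by_gaps`, the tiny bitmaps, and the use of "#" for ink are assumptions made for this example and are not taken from any particular prior art system.

```python
def segment_by_gaps(bitmap):
    """Split a binary image (equal-length strings, '#' = ink) at blank columns.

    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(bitmap[0])
    # A column "has ink" if any row is dark at that position.
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new character begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends the character
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Two characters separated by one blank column.
clean = ["##.##",
         "#..#.",
         "##.##"]
# Dark copy: a stray ink pixel bridges the gap between the characters.
joined = ["##.##",
          "#.##.",
          "##.##"]
# Light copy: a "u" whose bottom middle pixel has dropped out.
whole_u = ["#.#",
           "#.#",
           "###"]
broken_u = ["#.#",
            "#.#",
            "#.#"]

print(segment_by_gaps(clean))     # two segments
print(segment_by_gaps(joined))    # one segment: the gap is gone
print(segment_by_gaps(whole_u))   # one segment
print(segment_by_gaps(broken_u))  # two segments: the "u" splits in two
```

A single bridging pixel merges two characters into one segment, and a single missing pixel splits one character into two, so any later stage that trusts these boundaries receives the wrong input.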
Early prior art OCR methods did not extract features from a character; instead, they simply compared a graphical bit map of the character to a template bit map of a known character. This method was commonly called "matrix matching". One problem with matrix matching is that it is very sensitive to small changes in character size, skew, shape, etc. Also, this technology was not "omni font"; that is, it had to be carefully trained on each type font to be read and would not generalize easily to new type fonts.
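The sensitivity of matrix matching can be seen in a minimal sketch. The scoring rule used here (fraction of agreeing pixels), the names `match_score` and `classify`, and the 4x4 templates are assumptions chosen for illustration; a one-pixel displacement stands in for the small size, skew, and shape changes mentioned above.

```python
def match_score(glyph, template):
    """Fraction of pixels that agree between two equal-size binary bitmaps."""
    total = len(glyph) * len(glyph[0])
    agree = sum(g == t
                for glyph_row, template_row in zip(glyph, template)
                for g, t in zip(glyph_row, template_row))
    return agree / total


def classify(glyph, templates):
    """Return the label of the template that best matches the glyph."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))


# Hypothetical templates for a single trained font.
TEMPLATES = {
    "I": [".#..",
          ".#..",
          ".#..",
          ".#.."],
    "L": ["#...",
          "#...",
          "#...",
          "###."],
}

# The same "I", displaced one pixel to the left.
shifted_i = ["#...",
             "#...",
             "#...",
             "#..."]

print(classify(TEMPLATES["I"], TEMPLATES))  # the exact template matches itself
print(match_score(shifted_i, TEMPLATES["I"]),
      match_score(shifted_i, TEMPLATES["L"]))
print(classify(shifted_i, TEMPLATES))       # the displaced "I" scores higher as "L"
```

Because the comparison is made pixel by pixel, the displaced stroke overlaps the vertical stroke of the "L" template better than it overlaps its own template, and the character is misclassified.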
To solve the "omni font" problem, prior art methods began to extract higher level features from a character image. The goal was to select a set of features that would be insensitive to unimportant differences, such as size, skew, presence of serifs, etc., while still being sensitive to the important differences that distinguish between different types of characters. High level features, however, can be very sensitive to certain forms of character distortion. For example, many feature extractors detect the presence of "closures", such as in the letters "e", "o", "b", "d", etc., and use this information to classify the character. Unfortunately, a simple break in a character can easily cause a closure to disappear, and a feature extraction method that depends on such closures would probably classify the character incorrectly.
Often the high level feature representation of a character contains very few features. Therefore, when a feature is destroyed, as when a break opens a closure, there is insufficient information left to correctly classify the character.
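The fragility of the closure feature can be illustrated with a minimal sketch, assuming a closure is defined as a region of background pixels that is completely surrounded by ink. The function name `count_closures`, the 4-connected flood fill, and the 3x3 bitmaps are assumptions made for this example and do not describe any specific prior art feature extractor.

```python
def count_closures(bitmap):
    """Count background regions ('.') fully enclosed by ink ('#')."""
    height, width = len(bitmap), len(bitmap[0])
    seen = [[False] * width for _ in range(height)]

    def flood(r0, c0):
        # Mark every background pixel 4-connected to (r0, c0).
        seen[r0][c0] = True
        stack = [(r0, c0)]
        while stack:
            r, c = stack.pop()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < height and 0 <= nc < width
                        and not seen[nr][nc] and bitmap[nr][nc] == "."):
                    seen[nr][nc] = True
                    stack.append((nr, nc))

    # Step 1: mark all background reachable from the image border (the outside).
    for r in range(height):
        for c in range(width):
            on_border = r in (0, height - 1) or c in (0, width - 1)
            if on_border and bitmap[r][c] == "." and not seen[r][c]:
                flood(r, c)

    # Step 2: each remaining unmarked background region is one closure.
    closures = 0
    for r in range(height):
        for c in range(width):
            if bitmap[r][c] == "." and not seen[r][c]:
                closures += 1
                flood(r, c)
    return closures


intact_o = ["###",
            "#.#",
            "###"]
# One pixel lost on the right side, as a light photocopy might cause.
broken_o = ["###",
            "#..",
            "###"]

print(count_closures(intact_o))  # the enclosed hole is found
print(count_closures(broken_o))  # the hole now connects to the outside
```

Losing a single ink pixel changes the closure count from one to zero. When the representation holds only a few such features, that one change can be enough to move the character into the wrong class.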
There is a need in the art, then, for an optical character recognition system that classifies characters by creating a set of features that is insensitive to character segmentation boundaries. There is a further need in the art for such a system that creates features having a low enough level to be insensitive to common noise distortions. Another need in the art is for such a system that creates a sufficient number of features that some will remain to allow character classification even if others are destroyed by noise. A still further need in the art is for such a system that provides a set of features that is insensitive to font variations. The present invention meets these needs.