The following relates to the optical character recognition (OCR) arts.
Optical character recognition (OCR) refers to the use of image processing to identify and extract textual content from an image. In a typical OCR processing sequence, the image is generated by optically scanning a printed page (hence the conventional term “optical” character recognition). The scanned image is analyzed to identify blocks or “zones”, which are classified as text zones or non-text zones (for example, images). The text zones are rotated to align the text with the “horizontal”, and suitable pattern matching techniques are then employed to match and identify images of letters, digits, or other textual characters.
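By way of illustration only, the final pattern matching step of such a sequence can be sketched as nearest-template matching over small binary glyph bitmaps. The sketch assumes that zoning, rotation, and character segmentation have already been performed. The names `TEMPLATES`, `pixel_mismatches`, and `match_glyph` and the 5x3 bitmaps are hypothetical and are not taken from any particular OCR system.

```python
# Toy 5x3 binary templates ('#' = ink, '.' = background). Illustrative only.
TEMPLATES = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def pixel_mismatches(glyph, template):
    """Count the pixel positions at which two same-sized bitmaps differ."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def match_glyph(glyph, templates=TEMPLATES):
    """Return the character whose template differs from the glyph in the fewest pixels."""
    return min(templates, key=lambda ch: pixel_mismatches(glyph, templates[ch]))


# A scanned "T" with one spurious ink pixel in the third row.
NOISY_T = ("###", ".#.", "##.", ".#.", ".#.")
```

In this sketch, `match_glyph(NOISY_T)` selects "T" because that template differs from the noisy glyph in only one pixel, whereas the "I" and "L" templates differ in three and nine pixels, respectively.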
Existing OCR techniques have a high recognition rate for typed text employing a Latin or Latin-derived alphabet. OCR accuracy generally decreases for other character sets and for handwritten text, and may also depend on font type, font size, or other text characteristics, on optical scan quality, and on other factors. In a favorable setting (good image quality, Latin alphabet, et cetera), OCR recognition rates on the order of 99% or higher are achieved using existing OCR systems. Nonetheless, further improvement in OCR recognition would be advantageous.
One approach for improving OCR recognition is to employ a dictionary or lexicon to perform spell correction. This approach can be beneficial, but the improvement depends on the comprehensiveness of the dictionary or lexicon, and in some instances spell correction can actually introduce errors (for example, by “correcting” the spelling of a correctly spelled word that is not in the dictionary or lexicon).
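The following minimal sketch illustrates both the benefit and the failure mode of lexicon-based spell correction. It assumes a simple nearest-match rule built on the `difflib.get_close_matches` function of the Python standard library. The function name `spell_correct`, the five-word `LEXICON`, and the similarity cutoff of 0.8 are hypothetical choices made for illustration and do not represent any particular existing OCR system.

```python
import difflib

# Deliberately small lexicon, to show the dependence on its comprehensiveness.
LEXICON = ["recognition", "character", "optical", "scanner", "zone"]


def spell_correct(token, lexicon=LEXICON, cutoff=0.8):
    """Replace an out-of-lexicon token with its closest lexicon entry, if any.

    Tokens already in the lexicon, and tokens with no entry at or above the
    similarity cutoff, are returned unchanged.
    """
    if token in lexicon:
        return token
    matches = difflib.get_close_matches(token, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

Under these assumptions, `spell_correct("recoqnition")` returns "recognition", repairing a plausible OCR misreading of "g" as "q". However, `spell_correct("scanned")` returns "scanner": the correctly spelled word "scanned" is absent from the lexicon and is therefore wrongly “corrected”, which is the kind of introduced error noted above.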
The following sets forth improved methods and apparatuses.