1. Technical Field
The invention relates to a method and computer program product for refining the segmentation of digitally scanned text in an optical character recognition (OCR) system. OCR systems rely on pattern recognition devices (classifiers) for character recognition.
2. Description of the Prior Art
Optical character recognition (OCR) is the process of transforming written or printed text into digital information. Pattern recognition classifiers are used in sorting scanned characters into a number of output classes. A typical prior art classifier is trained over a plurality of output classes using a set of training samples. The training samples are processed, data relating to features of interest are extracted, and training parameters are derived from this feature data. During operation, the system receives an input image associated with one of a plurality of classes. The input image is segmented into candidate objects and passed to a classifier. The relationship of each of the candidate objects to each class is analyzed via a classification technique based upon the training parameters. From this analysis, the system produces an output class and an associated confidence value for each of the candidate objects input to the classifier.
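By way of illustration only, the train-then-classify flow described above can be sketched as follows. The sketch assumes a nearest-mean classifier over a four-value quadrant-ink feature; this is a stand-in chosen for brevity, not the classifier of any particular prior art system. The names `extract_features`, `train`, and `classify` are hypothetical, as is the confidence formula.

```python
import math
from collections import defaultdict


def extract_features(image):
    """Hypothetical feature: fraction of ink falling in each image quadrant.

    `image` is a list of equal-length rows holding 0 (background) or 1 (ink).
    """
    mid_r, mid_c = len(image) // 2, len(image[0]) // 2
    quads = [0, 0, 0, 0]
    for r, row in enumerate(image):
        for c, pixel in enumerate(row):
            if pixel:
                quads[2 * (r >= mid_r) + (c >= mid_c)] += 1
    total = sum(quads) or 1  # avoid division by zero on a blank image
    return [q / total for q in quads]


def train(samples):
    """Derive training parameters (one mean feature vector per class)."""
    grouped = defaultdict(list)
    for image, label in samples:
        grouped[label].append(extract_features(image))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(image, params):
    """Return (output class, confidence) for one candidate object.

    Confidence here is an assumed heuristic: a distance-based weight for the
    winning class, normalized over all classes.
    """
    feats = extract_features(image)
    weights = {
        label: math.exp(-math.dist(feats, mean))
        for label, mean in params.items()
    }
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Two toy training samples: a vertical stroke and a horizontal stroke.
VERTICAL = [[1, 0, 0, 0]] * 4
HORIZONTAL = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
params = train([(VERTICAL, "I"), (HORIZONTAL, "-")])
label, confidence = classify(VERTICAL, params)
```

In this sketch each candidate object yields exactly one output class and one confidence value, mirroring the classifier output described above.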
Ideally, all samples in an OCR system would be properly segmented into recognizable characters. In practice, however, a number of characters will be improperly split or merged by the segmentation process. Even a small error in the printing or writing of the original or in the scanning of the sample can result in improper segmentation. In most systems, improperly segmented characters will not be recognized by the classifier, necessitating repeated human intervention in the process.
Single-character recognition has achieved accuracy levels on the order of ninety-nine percent. In some applications, however, such as mail processing, outside influences can degrade the scanning quality of images so that characters touch one another or break apart. These altered characters must be identified and either separated or combined, respectively, so that they correspond to the actual input data. If not handled properly, these scanning imperfections cause character recognition rates to drop significantly. Additional processing is then required either to return the localized character image to a state similar to the original or to classify the imperfect character image so that it can be mapped to a single-character classifier.
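The effect of touching and broken characters on segmentation can be illustrated with a minimal sketch. It assumes a simple segmenter that cuts a binary image wherever a column contains no ink; this blank-column rule is an assumption made for illustration and is not attributed to any particular system. The function name `segment_columns` and the toy images are hypothetical.

```python
def segment_columns(image):
    """Split a binary image into (start, end) column spans.

    A new span begins at the first inked column after a blank column and
    ends at the next blank column (end index is exclusive).
    """
    width = len(image[0])
    has_ink = [any(row[c] for row in image) for c in range(width)]
    spans, start = [], None
    for c, inked in enumerate(has_ink):
        if inked and start is None:
            start = c
        elif not inked and start is not None:
            spans.append((start, c))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Two characters separated by one blank column: segmented correctly.
clean = [[1, 1, 0, 1, 1],
         [1, 1, 0, 1, 1]]

# The same two characters joined by a stray bridge of ink: merged into one.
touching = [[1, 1, 1, 1, 1],
            [1, 1, 0, 1, 1]]

# One character with a dropped-out column: split into two pieces.
broken = [[1, 1, 0, 1],
          [1, 1, 0, 1]]

clean_spans = segment_columns(clean)
touching_spans = segment_columns(touching)
broken_spans = segment_columns(broken)
```

Under this rule the touching pair yields a single candidate object and the broken character yields two, so neither result corresponds to the actual input characters.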