The invention relates to optical object recognition (OOR), particularly to optical character recognition (OCR), and post-processing techniques therefor. The invention particularly concerns OCR systems that receive an image of a document, separate the image into blobs that contain characters, and analyze the blobs to recognize and extract the characters they contain.
Current OCR systems use a variety of approaches including template matching, statistical correlation, and font mapping. They typically perform an initial OCR in which the systems break a document image into blobs containing images of characters, try to recognize and extract the characters, and present the recognized characters in their original order. Unfortunately, the initial OCR leaves many unrecognized blobs that each contain more than one character or that contain fragments of characters.
We consider OCR systems a subset of a more general category we call "optical object recognition (OOR)" systems, where document images are arrays of data to be analyzed, characters are the particular objects recognized, and the blobs are "elements" of the arrays of data to be analyzed. For simplicity, we discuss our invention and its practice, and describe the exemplary embodiment of our invention, in terms of OCR systems. Our preferred initial OCR method also leaves such unrecognized blobs after a first pass.
As applied to text recognition, the initial OCR routine assumes that each blob is an individual character in a known font set, a set that the routine has been taught in one manner or another before it is run. This assumption fails in two situations: when the OCR routine cannot distinguish two or more characters by their spatial characteristics and thus merges the characters into one blob; and when the OCR routine misinterprets a character's spatial characteristics and splits it among two or more blobs.
The first situation typically arises when there is inadequate spacing between characters, as illustrated in FIGS. 1 and 3-6. In FIGS. 4 and 5, Blobs 1, 2, 3, and 5 contain single characters that the OCR will handle with no problem. However, in FIGS. 4 and 6, Blob 4 includes "345", Blob 6 includes "78", and Blob 7 includes "90", none of which will be recognized because multiple characters have been merged into a single blob.
The second situation typically arises when there is unusual formatting (italics, perhaps), an unusual font (such as MICR, as shown in the bottom row of FIGS. 2 and 7 with conventional font equivalents above), or light printing. In the case of the MICR characters of FIG. 2, the OCR routine will likely split each character into three blobs, none of which will be recognized. Thus, as shown in FIG. 7, the OCR routine will place the more conventional characters of the top row into individual Blobs 8, 12, 16, and 20, but will split their MICR equivalents in the bottom row into multiple Blobs 9-11, 13-15, 17-19, and 21-23.
Our initial OCR routine locates and determines the sizes of blobs in a region of interest (ROI) in an image of a document that are to be recognized by the OCR routine.
To reduce the number of unrecognized blobs, many current OCR systems include post-processing routines that take another look at blobs left by the initial OCR. These post-processing systems reduce the number of unrecognized blobs, but still have lower than desired success rates and suffer from a lack of robustness. Thus, a need exists for a more robust OCR system and method that can separate and recognize characters that are not distinguishable by their spatial characteristics with a higher success rate than current OCR systems and methods. Another need exists for a more robust OCR system and method that can recognize characters that would ordinarily be left unrecognized because unusual spatial characteristics lead prior art OCR systems and methods to break the characters into multiple blobs.
Our invention satisfies these needs using an improvement on existing OCR methods; the invention applies OCR to a document iteratively to simplify, strengthen, improve, and accelerate analysis as compared to prior art methods. Our improvement lies in the application of a post-processing routine that analyzes the blobs left unrecognized after the initial OCR routine is done. First, the post-processing routine breaks the ROI into unknown regions. Next, the routine analyzes each unknown region separately by attempting a correlation of the unknown region with a character from the known font set, starting at the upper left corner of the unknown region. In other words, the system holds a set of character templates from the known font set and computes a correlation coefficient between the current unknown character and each template to measure how well they match. If there is a good match, the unknown region size is reduced by the width of the recognized character and the correlation is attempted again on the reduced unknown region. This is repeated until the entire unknown region is recognized or until every character in the font set has been tried. A variation of the correlation sequence moves the template around in the unknown region rather than only trying the upper left corner. When this method is applied to merged characters, the post-OCR analysis recognizes the individual characters contained in the blobs not recognized by the initial OCR. Once this is complete, the recognized characters are reordered to reflect their arrangement in the original image.
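The merged-character pass described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the invention: the function names `correlation` and `recognize_merged`, the 0.9 threshold, the use of the Pearson correlation coefficient, and the toy 3x3 templates are all hypothetical choices. The sketch tries templates in sequence at the left edge of the unknown region and, on a good match, shrinks the region by the width of the recognized character.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation coefficient between two equal-sized pixel grids."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat grid carries no shape information
    return cov / sqrt(vx * vy)

def recognize_merged(region, templates, threshold=0.9):
    """Peel characters off the left edge of an unknown region.

    `templates` maps each character to a pixel grid with the same height
    as `region` (an assumption of this sketch). Templates are tried in
    dictionary order; the first one that correlates at or above
    `threshold` (assumed value) is accepted.
    """
    recognized = []
    left = 0
    width = len(region[0])
    while left < width:
        for char, tmpl in templates.items():
            w = len(tmpl[0])
            if len(tmpl) != len(region) or left + w > width:
                continue
            window = [row[left:left + w] for row in region]
            if correlation(window, tmpl) >= threshold:
                recognized.append(char)
                left += w  # reduce the region by the matched width
                break
        else:
            break  # every template was tried without a good match
    return recognized, left

# Toy font set: two 3x3 glyphs (hypothetical data).
X = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
O = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
templates = {"X": X, "O": O}

# An unknown region holding "XOX" with no spacing between the glyphs.
region = [x + o + x2 for x, o, x2 in zip(X, O, X)]
print(recognize_merged(region, templates))  # (['X', 'O', 'X'], 9)
```

The second return value is the number of columns consumed, so a caller can tell whether the whole region was recognized. The variation that moves the template around inside the region is not shown here.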
Our method then regroups unrecognized blobs left by the merged character analysis into new unknown regions for another pass of the OCR routine. Leftover blobs that meet predetermined criteria, such as having particular spatial relationships, and that meet the conditions of an unknown region as defined in the merged character recognition routine are placed into the unknown regions. Once the routine defines the unknown regions, it applies OCR to recognize characters that have been split among two or more blobs.
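A rough Python sketch of this regrouping step follows. The text does not spell out the predetermined criteria, so the two used here (a maximum horizontal gap and vertical overlap between bounding boxes) are assumptions made purely for illustration, as are the name `regroup_blobs` and the `(x, y, w, h)` box format.

```python
def regroup_blobs(blobs, max_gap=2):
    """Merge leftover blob boxes (x, y, w, h) into candidate unknown regions.

    Assumed criteria (not taken from the source): a blob joins the
    current region when it starts within `max_gap` pixels of the
    region's right edge and overlaps it vertically.
    """
    regions = []
    for x, y, w, h in sorted(blobs):  # left-to-right by x
        if regions:
            rx, ry, rw, rh = regions[-1]
            gap = x - (rx + rw)
            overlaps_vertically = y < ry + rh and ry < y + h
            if gap <= max_gap and overlaps_vertically:
                # Grow the region to the union of both bounding boxes.
                right = max(rx + rw, x + w)
                top = min(ry, y)
                bottom = max(ry + rh, y + h)
                regions[-1] = (rx, top, right - rx, bottom - top)
                continue
        regions.append((x, y, w, h))
    return regions

# Three narrow fragments of one split character, then a distant blob.
fragments = [(0, 0, 2, 10), (3, 0, 2, 10), (6, 0, 2, 10), (20, 0, 2, 10)]
print(regroup_blobs(fragments))  # [(0, 0, 8, 10), (20, 0, 2, 10)]
```

Each returned box could then be handed back to the recognition routine as a new unknown region, so that a character split into several fragments is correlated against the templates as a whole.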
We prefer to use an adaptive learning routine for sequencing the order in which characters are correlated in our new routine. The system correlates characters in order of the probability that each character is merged into another character. This probability is based on the frequency of observed mergings of each character in the observed font set. The character with the highest probability value is tried first in the correlation sequence of the new routine.
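One way to picture this adaptive sequencing is the short Python sketch below. It is a simplified illustration, not the learning routine of the invention: the class name `MergeOrder` and its methods are hypothetical, and the probability is estimated simply as each character's share of all observed mergings.

```python
from collections import Counter

class MergeOrder:
    """Order a font set by how often each character was seen merged."""

    def __init__(self, font_set):
        self.font_set = list(font_set)
        self.merge_counts = Counter()

    def record_merge(self, char):
        """Note one observed merging that involved `char`."""
        self.merge_counts[char] += 1

    def probability(self, char):
        """Estimated merge probability: share of all observed mergings."""
        total = sum(self.merge_counts.values())
        return self.merge_counts[char] / total if total else 0.0

    def sequence(self):
        """Characters sorted so the most frequently merged is tried first.

        The sort is stable, so ties keep their original font-set order.
        """
        return sorted(self.font_set, key=lambda c: -self.merge_counts[c])

order = MergeOrder("0123")
for seen in "331":  # '3' observed merged twice, '1' once
    order.record_merge(seen)
print(order.sequence())  # ['3', '1', '0', '2']
```

The resulting sequence could supply the order in which templates are tried during correlation, and it shifts as more mergings are recorded.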
OCR with our improved method achieves accuracy at least as high as that of prior art methods and is more robust.