1. Field of Invention
The present invention relates to string processing and, in particular, to automatic string correction. The preferred embodiment of the present invention relates to the processing of strings generated by Optical Character Recognition and, in particular, to the automatic correction thereof.
2. Discussion of Related Art
Optical Character Recognition (OCR) consists of recognizing a string of characters in an image and returning a corresponding string of characters (e.g. in text form). A typical OCR process comprises the steps of acquiring an image containing a string of characters, segmenting the image to isolate individual characters, recognizing each individual character as a character of an alphabet, and returning a string of characters.
OCR has a wide range of applications including the recognition of vehicle license plate numbers (for use in automated traffic law enforcement, surveillance, access control, tolling, etc.), the recognition of serial numbers on parts in an automated manufacturing environment, the recognition of labels on packages for routing purposes, and various document analysis applications.
Despite the sophistication of present OCR techniques, OCR errors frequently occur due to the non-ideal conditions of image acquisition, the partial occlusion or degradation of the depicted characters, and especially the structural similarity between certain characters (e.g. Z and 2, O and D, 1 and I). For example, the recognition of vehicle license plate numbers must overcome lighting conditions that are both variable (according to the time of day, weather conditions, etc.) and non-uniform (e.g. due to shadows and specular reflection), perspective distortion, and partial occlusion or degradation of the characters (e.g. due to mud, wear of the paint, etc.).
To improve the overall performance of OCR systems, it is essential to include a post-processing stage, during which OCR errors are automatically detected and corrected.
A popular technique to automatically correct errors in words is “dictionary lookup”: an incorrect word (that is, one that does not belong to a predefined “dictionary” of valid words) is replaced by the closest valid word in the dictionary. This is often achieved by selecting the dictionary word at the minimum “edit distance” from the incorrect word. The edit distance between two strings is the minimum number of edit operations (deletions, insertions, and substitutions) required to transform the first string into the second string. The edit distance has been generalized by assigning a weight to each edit operation according to the type of operation and/or the character(s) of the alphabet involved in the operation.
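As a minimal sketch of the dictionary lookup approach described above, the following Python code computes the edit distance by dynamic programming and selects the closest dictionary word. The function names (`edit_distance`, `closest_word`, `ocr_sub_cost`), the set of confusable character pairs, the weight of 0.5, and the sample dictionary are illustrative assumptions and are not taken from any particular prior-art system.

```python
from typing import Callable, Iterable, Optional


def edit_distance(a: str, b: str,
                  sub_cost: Optional[Callable[[str, str], float]] = None,
                  indel_cost: float = 1.0) -> float:
    """Minimum total cost of deletions, insertions, and substitutions
    required to transform string a into string b."""
    if sub_cost is None:
        # Unweighted case: every substitution of differing characters costs 1.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,            # delete ca
                curr[j - 1] + indel_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb),  # substitute ca by cb (or match)
            ))
        prev = curr
    return prev[-1]


def closest_word(word: str, dictionary: Iterable[str], **kwargs) -> str:
    """Return the dictionary word at minimum edit distance from word."""
    return min(dictionary, key=lambda w: edit_distance(word, w, **kwargs))


# Hypothetical weighting: structurally similar characters are cheaper to swap.
CONFUSABLE = {frozenset("Z2"), frozenset("OD")}


def ocr_sub_cost(x: str, y: str) -> float:
    if x == y:
        return 0.0
    return 0.5 if frozenset((x, y)) in CONFUSABLE else 1.0


if __name__ == "__main__":
    print(closest_word("he1lo", ["hello", "help", "world"]))
    print(edit_distance("2EBRA", "ZEBRA", sub_cost=ocr_sub_cost))
```

Passing a `sub_cost` function corresponds to the generalized (weighted) edit distance: under the assumed weights, confusing Z with 2 costs less than an arbitrary substitution, so likely OCR confusions are preferred when ranking candidate corrections.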
Methods of automatic string correction based on the dictionary lookup paradigm are useful in cases where valid input strings are those belonging to a limited dictionary of valid strings. However, they are inadequate to correct strings that are not of the word-type. There are an increasing number of OCR applications in which valid strings are not words but strings satisfying a “template” of some sort; such strings include vehicle license plate numbers, serial numbers, ID numbers, ZIP codes, etc. Consequently, there is a growing need for a method to correct such strings.
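One way to see why an enumerated dictionary scales poorly to template-type strings is to count the strings that satisfy even a simple template. The sketch below expresses a hypothetical template (three uppercase letters followed by three digits) as a regular expression; this pattern is an assumed example for illustration and is not an actual license plate format described here.

```python
import re

# Hypothetical template: three uppercase letters followed by three digits.
TEMPLATE = re.compile(r"[A-Z]{3}[0-9]{3}")


def satisfies_template(s: str) -> bool:
    """Return True if the whole string s matches the assumed template."""
    return TEMPLATE.fullmatch(s) is not None


# Number of distinct strings satisfying the template:
# 26 choices for each of 3 letters, 10 choices for each of 3 digits.
NUM_VALID = 26 ** 3 * 10 ** 3

if __name__ == "__main__":
    print(satisfies_template("ABZ123"))  # letters then digits
    print(satisfies_template("AB2123"))  # a digit in a letter position
    print(NUM_VALID)
```

Under this assumed template there are 17,576,000 valid strings, so validity is far more naturally expressed as a rule on character positions than as a list of entries to search.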