Optical character recognition (OCR) is a computerized method for converting printed or handwritten text from a scanned document into corresponding strings of character codes, such as ASCII codes. The OCR process typically includes several stages. First, the text on the scanned document is segmented into individual characters. A pattern recognition algorithm is then applied to each character in order to find the likeliest match among the possible character codes. Because these steps are error-prone, they are typically followed by an error-correction step. For example, the computer may look up each OCR-generated word in a dictionary and automatically correct words that are not found there by substituting the nearest match from the dictionary.
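The dictionary look-up step described above can be illustrated with a minimal sketch. It assumes a small in-memory word list, and it uses the similarity ratio of Python's standard `difflib` module as a stand-in for the nearest-match criterion; this is not an edit-distance measure. The function name `correct_word` and the sample words are hypothetical.

```python
import difflib

def correct_word(word, dictionary):
    """Return `word` if it is in the dictionary, else the most similar entry.

    Similarity here is difflib's ratio, used only as a stand-in for a
    nearest-match criterion; it is not an edit distance.
    """
    if word in dictionary:
        return word
    # cutoff=0.0 accepts any candidate, so the best-scoring entry is returned.
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.0)
    return matches[0] if matches else word

# Hypothetical dictionary and OCR output ("m" misread as "rn").
dictionary = ["modern", "morning", "number"]
print(correct_word("rnodern", dictionary))
print(correct_word("modern", dictionary))
```

A production system would typically replace the similarity ratio with a cost model tuned to the recognition errors actually observed.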
Dictionary-based OCR error correction typically uses an approximate string-matching algorithm to find the nearest match. Many of these algorithms are based on the notion of edit distance, as described, for example, by Damerau in “A Technique for Computer Detection and Correction of Spelling Errors,” Communications of the Association for Computing Machinery 7 (March, 1964), pages 171-176, which is incorporated herein by reference. The distance between two strings is determined by the number of edit operations that are needed to transform one string into another. This distance is commonly referred to as the “Levenshtein distance,” based on the work described by Levenshtein in “Binary Codes Capable of Correcting Deletions, Insertions and Reversals,” Soviet Physics Doklady 8 (1966), pages 707-710, which is incorporated herein by reference.
Wagner and Fischer describe a dynamic-programming approach for efficient computation of edit distance in “The String-to-String Correction Problem,” Journal of the Association for Computing Machinery 21 (January, 1974), pages 168-173, which is incorporated herein by reference. This approach is widely used in string matching engines. The permitted edit operations for the purpose of edit distance computation include changing one symbol into another single symbol, deleting a symbol from a string, and inserting a symbol into a string. A non-negative cost γ is assigned to each such edit operation, wherein the cost of changing one symbol into another is typically inversely proportional to the likelihood of confusion between the symbols. (For example, in OCR, characters that are similar in appearance, such as O and Q, have a high likelihood of confusion and therefore a low cost.) The edit distance between two strings is given by the sum of the costs of the successive edit operations that are required to transform one string into the other. Since there may be more than one possible trace (defined as a sequence of edit operations) that can transform one string into the other, the minimum cost is taken over all the possible traces between the two strings.
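One way to picture a substitution cost γ that falls as the likelihood of confusion rises is sketched below. The confusion probabilities, the `scale` and `max_cost` parameters, and the function name `substitution_cost` are all hypothetical values chosen for illustration; they are not taken from any of the cited works.

```python
# Hypothetical confusion probabilities for a few character pairs.
CONFUSION = {
    ("O", "Q"): 0.30,  # similar shapes, often confused
    ("l", "1"): 0.25,
    ("A", "Z"): 0.01,  # dissimilar shapes, rarely confused
}

def substitution_cost(a, b, scale=0.05, max_cost=1.0, default_p=0.01):
    """Cost of changing symbol a into symbol b.

    The cost is inversely proportional to the (assumed) confusion
    probability and is capped at max_cost. Identical symbols cost nothing.
    """
    if a == b:
        return 0.0
    # Look the pair up in either order; fall back to a small default.
    p = CONFUSION.get((a, b), CONFUSION.get((b, a), default_p))
    return min(max_cost, scale / p)

print(substitution_cost("O", "Q") < substitution_cost("A", "Z"))  # True
```

With these assumed numbers, the frequently confused pair O and Q receives a much lower cost than the rarely confused pair A and Z.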
Formally, the distance D(i,j) between the first i characters of string A and the first j characters of string B may be determined using the algorithm defined in Table I below, so that the edit distance between the complete strings is D(|A|,|B|). In accordance with the notation defined by Wagner and Fischer, A<i> is the ith character in A; |A| is the length of A; Λ is the null string; and γ(a→b) is the cost of transforming character a into character b.
TABLE I
MINIMUM EDIT DISTANCE COMPUTATION

 1. D(0,0) := 0;
 2. for i := 1 to |A| do D(i,0) := D(i-1,0) + γ(A<i>→Λ);
 3. for j := 1 to |B| do D(0,j) := D(0,j-1) + γ(Λ→B<j>);
 4. for i := 1 to |A| do
 5.   for j := 1 to |B| do begin
 6.     m1 := D(i-1,j-1) + γ(A<i>→B<j>);
 7.     m2 := D(i-1,j) + γ(A<i>→Λ);
 8.     m3 := D(i,j-1) + γ(Λ→B<j>);
 9.     D(i,j) := min(m1, m2, m3);
10.   end
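The computation of Table I can be sketched in Python as follows. The three cost callables stand for γ(a→b), γ(a→Λ) and γ(Λ→b); their names, the unit-cost defaults, and the reduced O/Q cost of 0.2 are assumptions made for illustration only.

```python
def edit_distance(a, b, sub_cost, del_cost, ins_cost):
    """Minimum edit distance between strings a and b, following Table I.

    sub_cost(x, y) plays the role of gamma(x -> y), del_cost(x) of
    gamma(x -> null), and ins_cost(y) of gamma(null -> y).
    """
    n, m = len(a), len(b)
    # D[i][j] is the distance between a[:i] and b[:j].
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # step 2: delete a prefix of a
        D[i][0] = D[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, m + 1):                      # step 3: insert a prefix of b
        D[0][j] = D[0][j - 1] + ins_cost(b[j - 1])
    for i in range(1, n + 1):                      # steps 4-10
        for j in range(1, m + 1):
            m1 = D[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1])
            m2 = D[i - 1][j] + del_cost(a[i - 1])
            m3 = D[i][j - 1] + ins_cost(b[j - 1])
            D[i][j] = min(m1, m2, m3)
    return D[n][m]

def unit_sub(x, y):
    """Unit substitution cost: 0 for identical symbols, 1 otherwise."""
    return 0.0 if x == y else 1.0

def ocr_sub(x, y):
    """Assumed OCR-style cost: the confusable pair O/Q is cheap to exchange."""
    if {x, y} == {"O", "Q"}:
        return 0.2
    return unit_sub(x, y)

def unit(_symbol):
    """Unit cost for a single insertion or deletion."""
    return 1.0

print(edit_distance("kitten", "sitting", unit_sub, unit, unit))  # 3.0
print(edit_distance("OUEEN", "QUEEN", ocr_sub, unit, unit))      # 0.2
```

Python indices are zero-based, so `a[i - 1]` corresponds to A<i> in the table.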
The method described by Wagner and Fischer determines edit distance in terms of single-character errors, i.e., substitution of one character for another, or insertion or deletion of a single character. In OCR, however, errors involving two characters are common, for example because of incorrect segmentation. Thus, the handwritten character “m” may be split into “r” and “n”, or “B” may be split into “1” and “3”. Other errors of this sort are well known in the art. Correcting such an error under a single-character error model requires two editing steps: a substitution and a deletion. As a consequence, the computed edit cost of transforming the incorrectly split characters (r and n, for example) back into the correct original character (m) will be high, and the computer may fail to correct this OCR error.
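The effect can be checked with a short sketch that assumes unit costs for every single-character operation. The sample words are hypothetical OCR outputs chosen to match the examples above.

```python
def levenshtein(a, b):
    """Unit-cost edit distance using single-character operations only."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
            ))
        prev = curr
    return prev[-1]

# A split "m" costs two operations (substitute r->m, delete n) ...
print(levenshtein("rnodern", "modern"))  # 2
# ... while an ordinary single-character misread costs only one.
print(levenshtein("nodern", "modern"))   # 1
# The split "B" -> "13" behaves the same way.
print(levenshtein("13ook", "Book"))      # 2
```

A single segmentation error is therefore penalized as heavily as two independent recognition errors.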
Seni et al. propose a solution to this problem in “Generalizing Edit Distance to Incorporate Domain Information: Handwritten Text Recognition as a Case Study,” Pattern Recognition 29 (1996), pages 405-414, which is incorporated herein by reference. They extend the basic dynamic-programming method for computing string differences to allow for merges, splits and pair substitutions (wherein one pair of letters is substituted for another pair due to incorrect segmentation). The extension is achieved by adding three new operations to the distance computation shown in Table I, corresponding to the incremental cost of a merge, split or pair substitution at each iteration. Implementing this approach requires modifications to string matching engines that are based on the algorithm of Wagner and Fischer, as well as development of a rationale for deciding the relative costs to associate with the new operations.
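The general idea of adding two-character operations to the recurrence of Table I can be sketched as below. This is an illustrative reading, not the formulation published by Seni et al.: the operation naming (here, "merge" maps two characters of the first string to one character of the second, and "split" the reverse), the lookup tables, and all cost values are assumptions. Single-character operations use unit costs, and a two-character operation is available only for pairs listed in its table.

```python
import math

def extended_edit_distance(a, b, merge_cost, split_cost, pair_cost):
    """Edit distance from a to b with merge, split and pair substitution.

    merge_cost maps (two chars of a, one char of b) to a cost,
    split_cost maps (one char of a, two chars of b) to a cost,
    pair_cost  maps (two chars of a, two chars of b) to a cost.
    Pairs missing from a table are treated as unavailable (infinite cost).
    """
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = math.inf
            if i >= 1:                       # delete a[i-1]
                best = min(best, D[i - 1][j] + 1.0)
            if j >= 1:                       # insert b[j-1]
                best = min(best, D[i][j - 1] + 1.0)
            if i >= 1 and j >= 1:            # substitute (free on a match)
                sub = 0.0 if a[i - 1] == b[j - 1] else 1.0
                best = min(best, D[i - 1][j - 1] + sub)
            if i >= 2 and j >= 1:            # merge: two chars -> one char
                c = merge_cost.get((a[i - 2:i], b[j - 1]), math.inf)
                best = min(best, D[i - 2][j - 1] + c)
            if i >= 1 and j >= 2:            # split: one char -> two chars
                c = split_cost.get((a[i - 1], b[j - 2:j]), math.inf)
                best = min(best, D[i - 1][j - 2] + c)
            if i >= 2 and j >= 2:            # pair substitution
                c = pair_cost.get((a[i - 2:i], b[j - 2:j]), math.inf)
                best = min(best, D[i - 2][j - 2] + c)
            D[i][j] = best
    return D[n][m]

# Hypothetical cost table for known segmentation confusions.
merges = {("rn", "m"): 0.3, ("13", "B"): 0.3}
print(extended_edit_distance("rnodern", "modern", merges, {}, {}))  # 0.3
print(extended_edit_distance("rnodern", "modern", {}, {}, {}))      # 2.0
```

With the assumed merge entry, the segmentation error costs 0.3 rather than the 2.0 charged by the single-character model. Choosing such costs is the design question noted above.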