This invention relates to optical character recognition systems and, more particularly, to line matching processes for such systems.
Optical character recognition (OCR) devices from different manufacturers often differ in their ability to recognize particular characters. It is common in some OCR applications to use multiple OCR devices to process digital images of scanned documents and then merge the results. If an individual OCR device misreads a character, the merging process, or voting process, will disregard that character identification and choose a character identified by a plurality of the other OCR devices or a character identified by an OCR device that has a known propensity to recognize that character. The voting process typically produces a more accurate result than that produced by any single OCR device. The voting process requires that the text files created by the OCR devices be line and character aligned. If the text files are not line and character aligned, the voting process usually is inoperative.
The line aligning process determines which lines of text in a plurality of text files match one another. One common technique for line matching employs the Hickey-Handley algorithm described in "Merging Optical Character Recognition Outputs for Improved Accuracy," by John C. Handley and Thomas B. Hickey, RIAO 91 Conference. This algorithm computes a string edit distance between lines of the text files under comparison. A string edit distance is the minimum number of insertions, deletions, and substitutions required to transform one line into the other. A record is kept of the string edit distances between a line of a first text file and all of the lines of a second text file. The string edit distances are then compared to identify matching lines. Matching lines are those that yield the minimum string edit distance.
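The edit-distance line matching described above can be illustrated with a minimal Python sketch. It uses the standard dynamic-programming (Levenshtein) formulation of string edit distance and is not taken from the cited paper; the function names `edit_distance` and `match_lines` are hypothetical, and the second file is assumed to contain at least one line.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def match_lines(first: list[str], second: list[str]) -> list[int]:
    """For each line of `first`, return the index of the line of `second`
    with the minimum edit distance (assumes `second` is non-empty)."""
    matches = []
    for line in first:
        # Record the distance to every line of the second file.
        distances = [edit_distance(line, other) for other in second]
        matches.append(distances.index(min(distances)))
    return matches
```

Every line of the first file is compared against every line of the second, and each comparison is itself quadratic in line length, which is where the computational cost of this approach comes from.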
The Hickey-Handley algorithm is also commonly used for character aligning files. Using this algorithm, characters contained in matching lines are positioned within those lines such that the alignment yields a minimum string edit distance.
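As a rough illustration of character alignment by minimum edit distance, the following Python sketch fills the full distance table for two matching lines and then traces back through it, padding with a gap symbol wherever one line has a character the other lacks. This is the generic dynamic-programming alignment, not necessarily the exact procedure of the cited paper; the name `align_chars` and the `~` gap symbol are assumptions made for the example.

```python
def align_chars(a: str, b: str, gap: str = "~") -> tuple[str, str]:
    """Pad a and b with `gap` so they have equal length and a minimum-edit alignment."""
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Trace back from the bottom-right corner, preferring the diagonal move.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            out_a.append(a[i - 1])   # match or substitution
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_a.append(a[i - 1])   # character present only in a
            out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap)        # character present only in b
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

The gap symbol should be a character that cannot occur in the OCR output, so that padded positions remain distinguishable from recognized text.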
After the text files are line and character aligned, a voting process compares matching lines on a character by character basis. Predetermined voting rules are utilized to select characters for output.
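A simple plurality rule, one possible voting rule among many, might be sketched in Python as follows. The function name `vote` and the `~` gap symbol are assumptions for illustration; the sketch does not model rules that favor a device with a known propensity to recognize a particular character.

```python
from collections import Counter


def vote(aligned_lines: list[str], gap: str = "~") -> str:
    """Pick the most common character at each position of the aligned lines."""
    if len({len(line) for line in aligned_lines}) != 1:
        raise ValueError("lines must be character aligned to equal length")
    out = []
    # zip(*...) yields one tuple per character position, one entry per OCR device.
    for column in zip(*aligned_lines):
        # On a tie, Counter keeps first-seen order, so the earliest device wins.
        winner, _ = Counter(column).most_common(1)[0]
        if winner != gap:  # a winning gap means "no character at this position"
            out.append(winner)
    return "".join(out)
```

The length check reflects the requirement that the inputs be aligned: without it, `zip` would silently truncate to the shortest line and the comparison would no longer be position by position.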
Previous line aligning processes, such as the Hickey-Handley algorithm, are computationally intensive and, therefore, require a considerable amount of processing time. If numerous files must be line aligned using these processes, processing efficiency is greatly reduced. The present invention is a process for line aligning that is less computationally intensive and more efficient than previous processes.