1. The Field of the Invention
The present invention relates generally to the field of optical character recognition (OCR). More specifically, the present invention relates to a system and method for creating a searchable word index of a scanned document including multiple interpretations of a word at a given location within the document.
2. Technical Background
In the field of optical character recognition (OCR), analog documents (e.g., paper, microfilm, etc.) are digitally scanned, segmented, and converted into text that may be read, searched, and edited by means of a computer. In order to provide for rapid searching, each recognized word is typically stored in a searchable word index with links to the location (e.g., page number and page coordinates) at which the word may be found within the scanned document.
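The conventional word index described above can be pictured as a mapping from each recognized word to the locations at which it occurs. The following is a minimal sketch only, not a description of any particular OCR product. The names `Location` and `build_index`, the coordinate units, and the lowercasing of index keys are assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Location(NamedTuple):
    """Hypothetical location record for one recognized word."""
    page: int  # page number within the scanned document
    x: int     # horizontal page coordinate (units assumed, e.g. pixels)
    y: int     # vertical page coordinate


def build_index(recognized_words):
    """Map each recognized word to every location at which it was found."""
    index = defaultdict(list)
    for word, location in recognized_words:
        # Assumption: keys are lowercased so that searches ignore case.
        index[word.lower()].append(location)
    return index


words = [
    ("The", Location(1, 100, 50)),
    ("word", Location(1, 160, 50)),
    ("the", Location(2, 40, 300)),
]
index = build_index(words)
print(index["the"])  # both locations of "the", in scan order
```

A search for a word then reduces to a single lookup in the mapping, which is what makes such an index fast to query.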
In some conventional OCR systems, multiple recognition engines are used to recognize each word in the document. The use of multiple recognition engines generally increases overall recognition accuracy, since the recognition engines typically use different OCR techniques, each having different strengths and weaknesses.
When the recognition engines produce differing interpretations of the same image of a word in the scanned document, one interpretation is typically selected as the “correct” interpretation. Often, the OCR system relies on a “voting” (winner takes all) strategy, with the majority interpretation being selected as the correct one. Alternatively, or in addition, confidence scores may be used. For example, suppose two recognition engines correctly recognize the word “may” with confidence scores of 80% and 70%, respectively, a third recognition engine interprets the same input data as “way” with a 90% confidence score, and a fourth recognition engine recognizes the input data as “uuav” with a 60% confidence score. In such an example, a combination of voting and confidence scores may lead to a selection of “may” as the preferred interpretation.
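The selection in the above example can be sketched as follows. The text does not specify how voting and confidence scores are combined, so this sketch assumes one simple rule: the confidence scores of all engines proposing the same interpretation are summed, and the interpretation with the highest total is kept. The function name `select_interpretation` is hypothetical.

```python
from collections import defaultdict


def select_interpretation(results):
    """Pick one interpretation from (word, confidence_percent) pairs.

    Assumed rule: sum the confidence of every engine that proposed a
    word, so that both the number of votes and the scores matter.
    """
    totals = defaultdict(int)
    for word, confidence in results:
        totals[word] += confidence
    # The highest total wins; every other interpretation is discarded.
    return max(totals, key=totals.get)


engine_results = [("may", 80), ("may", 70), ("way", 90), ("uuav", 60)]
print(select_interpretation(engine_results))  # prints "may"
```

Under this assumed rule, “may” totals 150, “way” totals 90, and “uuav” totals 60, so “may” is selected even though “way” has the highest individual confidence score.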
Unfortunately, when a single interpretation is selected and the rest are discarded, the objectively correct interpretation is frequently discarded as well. Often, image noise and other effects confuse a majority of the recognition engines, with only a minority of the recognition engines arriving at the correct interpretation. In the above example, the correct interpretation could have been “way,” which would have been discarded using standard methods. Accordingly, conventional OCR systems have never been able to approach total accuracy, no matter how many recognition engines are employed.
What is needed, then, is a system and method for creating a searchable word index of a scanned document including multiple interpretations of a word at a given location within the document. What is also needed is a system and method for creating a searchable word index that selectively reduces the size of the index by eliminating interpretations that are not found in a dictionary or word list. In addition, what is needed is a system and method for creating a searchable word index that permits rescaling of a scanned document without requiring modification of location data within the word index.