The present invention relates generally to digital images, and more particularly to searching for objects, text or handwriting within a static digital image document, real-time stroke data, or the like.
Technology today provides many ways for people to exchange electronic documents, such as by disk, e-mail, network file transfers, and the like. In addition, many systems are available that allow a hard copy of a document to be digitized and made electronically available, such as through the use of optical scanners or fax machines. One problem with the digitized version of the document is that the electronic file is typically an image file rather than a textual file, and hence is much more difficult to edit by computer. Equally important, image files cannot be electronically searched for instances of a particular text string or other string. Rather, the user is generally left to view the image file representation of the document manually, looking for the desired term. This method is labor intensive and subject to human error.
Consumer software applications may include an optical character recognition (OCR) component to convert the image file to a textual file. Using OCR applications allows a user to search for particular instances of a query string; however, the confidence of actually finding every instance of that query string may be low. The recognition process occasionally misrecognizes letters or combinations of letters with similar shapes, causing errors to appear in the resulting text. Typical error rates on a high-quality image can vary widely depending on the complexity of the layout, scan resolution, and the like. On average, for common types of documents, error rates for OCR are often in the range of 1% to 10% of the total characters on the page. These errors greatly diminish the user's confidence of locating every instance of a query string within a document that started out as an image file. A solution to this problem has eluded those skilled in the art.
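As a rough illustration of why character-level error rates in that range undermine exact search, suppose (as a simplifying assumption that is not part of the disclosure) that each character is misrecognized independently with probability p. The chance that an n-character instance of the query string survives conversion intact, and so can be found by an exact text search, is then:

```latex
P(\text{instance intact}) = (1 - p)^{n}
\qquad\text{e.g.}\qquad
\begin{aligned}
p = 0.01,\ n = 8 &: \quad 0.99^{8} \approx 0.923 \\
p = 0.05,\ n = 8 &: \quad 0.95^{8} \approx 0.663 \\
p = 0.10,\ n = 8 &: \quad 0.90^{8} \approx 0.430
\end{aligned}
```

Under this assumption, an exact search for an eight-character term could miss roughly one instance in thirteen at a 1% error rate and more than half of all instances at a 10% error rate. Real OCR errors are not independent, so these figures are indicative only.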
Briefly stated, the present invention provides a system and method for improved string matching within a document created under noisy channel conditions. The invention provides a method for identifying, within a document created by a noisy conversion process (e.g., OCR), potential matches to a user-defined query and the likelihood that the potential matches satisfy the query. Satisfaction can be determined by identifying whether any difference between the potential match and the query is likely the result of an error in generating the document. That identification may be made with reference to a pre-constructed table containing data indicating the probability that a particular error occurred during the noisy document conversion. Additionally, the invention provides optional steps to further assess the likelihood of the match. Such optional steps may include the use of OCR confidence data, word heuristics, language models, and the like.
In one aspect, the invention provides a system for identifying string candidates and analyzing the probability that each string candidate matches a user-defined query string. In one implementation, a document text file is created to represent a document image file through a noisy conversion process, such as OCR. A find engine searches the document text file for strings that match a query string within a defined tolerance. Any string that differs from the query string by no more than the defined tolerance is identified as a candidate. The find engine then analyzes the difference between each candidate and the query string to determine whether the difference was likely caused by an error in the noisy process. In that determination, reference is made to a confusion table that associates common errors in the noisy process with probabilities that those errors occurred. Candidates meeting a probability threshold are identified as matches. Optionally, this probability threshold may be adjusted by the user to dynamically narrow or widen the scope of possible matches returned by the find engine. The invention further provides for analysis options including word heuristics, language models, and OCR confidences.
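The find-engine flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the tolerance is taken to be a Levenshtein edit distance over whole words, the confusion-table entries and the probabilities P_CORRECT and P_UNLISTED are invented for the example, and the names edit_distance, channel_probability, and find_matches are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical confusion table: (intended text, OCR output) -> probability.
# The entries and values are illustrative assumptions, not measured data.
CONFUSION: Dict[Tuple[str, str], float] = {
    ("m", "rn"): 0.05,
    ("l", "1"): 0.04,
    ("O", "0"): 0.04,
    ("e", "c"): 0.02,
}
P_CORRECT = 0.95    # assumed chance a character is recognized correctly
P_UNLISTED = 0.001  # assumed chance of a substitution not in the table


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as the 'defined tolerance' measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def channel_probability(query: str, candidate: str) -> float:
    """Best-path probability that the noisy process turned query into candidate.

    Insertions and deletions outside the confusion table are not modeled,
    so such candidates score 0.0 in this sketch.
    """
    n, m = len(query), len(candidate)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            p = best[i][j]
            if p == 0.0:
                continue
            if i < n and j < m:
                # One-to-one step: correct recognition or unlisted substitution.
                step = P_CORRECT if query[i] == candidate[j] else P_UNLISTED
                best[i + 1][j + 1] = max(best[i + 1][j + 1], p * step)
            # Table-driven step: a known confusion, possibly multi-character.
            for (src, dst), q in CONFUSION.items():
                if query.startswith(src, i) and candidate.startswith(dst, j):
                    ti, tj = i + len(src), j + len(dst)
                    best[ti][tj] = max(best[ti][tj], p * q)
    return best[n][m]


def find_matches(query: str, text: str, tolerance: int = 2,
                 threshold: float = 0.01) -> List[Tuple[str, float]]:
    """Return (word, probability) for candidates that meet the threshold."""
    matches = []
    for word in re.findall(r"\w+", text):
        if edit_distance(query, word) <= tolerance:   # candidate selection
            prob = channel_probability(query, word)   # confusion-table analysis
            if prob >= threshold:
                matches.append((word, prob))
    return matches


if __name__ == "__main__":
    sample = "A rnodern design beats a modest one; modern wins."
    for word, prob in find_matches("modern", sample):
        print(f"{word}: {prob:.4f}")
```

In this sketch, "rnodern" and "modest" both fall within the tolerance of the query "modern", but only "rnodern" is explained by a listed confusion (m read as rn), so it clears the threshold while "modest" does not. Raising or lowering the threshold argument narrows or widens the set of returned matches, mirroring the user-adjustable threshold described above.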
In another aspect, the invention may be implemented as a computer-readable medium having computer-executable instructions for performing steps including receiving a query string request to locate every instance of the query string in a document image file, converting the document image file into a document text file, parsing the document text file to identify data strings that may be the query string, and analyzing the data strings to identify a probability that each of the data strings is the query string.