1. Field of the Invention
This invention relates generally to systems and methods for retrieving electronic documents using an image or portion of the electronic document. More particularly, the present invention is related to systems and methods for retrieving electronic documents by converting the electronic documents to synthetic text, converting an input image to synthetic text, and comparing the synthetic text for a match.
2. Description of the Related Art
Retrieving electronic documents in the presence of noise, or given only a very small part of the document, is a very difficult problem. The larger the collection of documents, the more difficult the problem becomes. For example, retrieving an electronic text document given a blurry or illegible image of a portion of a printed page taken with a camera cell phone is difficult when the corpus is large. This problem 100 is illustrated by FIG. 1A, which shows an example image input 102 and a corresponding original electronic document 104. Furthermore, identifying the location 106 in the document 104 and the corresponding text 106 is even more difficult. This problem is only exacerbated by the proliferation of low quality cameras and image capture devices and the ease with which they can be used to send images.
One attempt by the prior art to solve this problem is to extract features of the image and use an index to retrieve documents containing a majority of the features. For example, inverted files are used to index individual image features. However, the features do not provide enough context information for consistent and accurate matches. Moreover, because of the poor quality of the input image, it is often difficult to identify the features in the input image. Even when features can be identified in the input image, the noise degrades the information such that it is often not sufficient to find a matching electronic document. In other words, the features are incorrectly recognized, leading to matches with the wrong documents.
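The feature-index approach described above can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the feature labels (`f1`, `f2`, ...), the document names, the function names, and the "more than half of the query features" threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_inverted_index(doc_features):
    """Map each feature to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, features in doc_features.items():
        for feature in features:
            index[feature].add(doc_id)
    return index


def retrieve(index, query_features, min_fraction=0.5):
    """Return (doc_id, hits) pairs for documents that contain more than
    min_fraction of the distinct query features, best match first."""
    query = set(query_features)
    votes = defaultdict(int)
    for feature in query:
        # .get avoids creating empty entries for unknown features.
        for doc_id in index.get(feature, ()):
            votes[doc_id] += 1
    threshold = min_fraction * len(query)
    matches = [(doc_id, hits) for doc_id, hits in votes.items() if hits > threshold]
    return sorted(matches, key=lambda m: (-m[1], m[0]))


# Hypothetical feature labels standing in for real image features.
docs = {
    "doc_a": {"f1", "f2", "f3", "f4"},
    "doc_b": {"f3", "f4", "f5", "f6"},
}
index = build_inverted_index(docs)

# Features correctly extracted from an image of doc_a.
clean = retrieve(index, ["f1", "f2", "f3"])

# Same image, but noise causes f2 and f3 to be misread as f5 and f6.
noisy = retrieve(index, ["f1", "f5", "f6"])
```

In this toy example the clean query retrieves `doc_a`, while the noisy query retrieves only `doc_b`, which mirrors the failure mode described above: misrecognized features vote for the wrong document.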
Another approach is to apply Optical Character Recognition (OCR) to the input image and then use the output of the OCR process to search the text strings in the document. However, this suffers from the same problems noted above, namely that the image quality is so poor that OCR cannot be effectively performed. Even when it is performed, the error rate in the recognition is so high as to make the search ineffective, because the documents returned as matches are unrelated to the original image.
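The sensitivity of the OCR-based approach to recognition errors can be sketched as follows. This is only an illustration under stated assumptions: the OCR output is supplied as a literal string rather than produced by a real OCR engine, the search is a simple case-insensitive substring match, and the function name and sample documents are hypothetical.

```python
def find_text(ocr_output, documents):
    """Return ids of documents whose text contains the OCR output,
    ignoring case and differences in whitespace."""
    needle = " ".join(ocr_output.lower().split())
    hits = []
    for doc_id, text in documents.items():
        haystack = " ".join(text.lower().split())
        if needle in haystack:
            hits.append(doc_id)
    return hits


# Hypothetical two-document corpus.
documents = {
    "doc_a": "The retrieval of electronic documents is difficult.",
    "doc_b": "Inverted files index individual features.",
}

# Assumed OCR output from a clean image.
clean_hits = find_text("retrieval of electronic", documents)

# Assumed OCR output from a blurry image: "i" read as "l", "l" read as "1".
noisy_hits = find_text("retrleval of e1ectronic", documents)
```

With the clean string the search finds `doc_a`; with only two misread characters it finds nothing, and at higher error rates a tolerant matcher would begin returning unrelated documents instead.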