Many printed documents are being digitized so that their content can be searched. Optical character recognition (OCR) is the main process used to digitize printed text. It involves recognizing the printed characters and converting the recognized characters to machine-encoded character codes, such as ASCII, so that the resulting output can be searched as text. A number of factors can degrade the performance of OCR and cause misrecognition, including poor image resolution, the quality of the scanned paper and the shape of the font. Additionally, some languages, such as Arabic, Urdu and Pashto, have very challenging orthographic features which lead to poor OCR results. Poor OCR results in turn reduce the effectiveness of information retrieval (IR) when the text is searched.
A number of solutions have been proposed to address the problem of performing IR from printed texts. Some of these solutions address IR effectiveness on the OCR output, for example by using query degradation based on a character error model of the recognized text, or by finding the best index term for the degraded text. Other solutions perform text correction on the OCR output. However, these solutions are not effective when the error rate of the original OCR process is high.
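By way of illustration only, query degradation based on a character error model may be sketched as follows. The sketch assumes that the error model maps each clean character to the strings an OCR engine might output in its place, each with a probability. The table `ERROR_MODEL`, its values, the function name `degrade_query` and the `min_prob` threshold are hypothetical and are not taken from any particular known solution.

```python
from itertools import product

# Hypothetical character error model: for each clean character, the strings
# an OCR engine might emit instead, with assumed (made-up) probabilities.
ERROR_MODEL = {
    "m": [("m", 0.8), ("rn", 0.2)],
    "l": [("l", 0.9), ("1", 0.1)],
}


def degrade_query(query, error_model, min_prob=0.01):
    """Expand a clean query into the degraded forms it may take in OCR output.

    Returns (variant, probability) pairs sorted from most to least likely.
    Characters missing from the model are assumed to be recognized correctly.
    """
    per_char = [error_model.get(ch, [(ch, 1.0)]) for ch in query]
    variants = {}
    for combo in product(*per_char):
        text = "".join(s for s, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        # Drop combinations that are too unlikely to be worth searching for.
        if prob >= min_prob:
            variants[text] = variants.get(text, 0.0) + prob
    return sorted(variants.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for variant, prob in degrade_query("mail", ERROR_MODEL):
        print(f"{variant}\t{prob:.2f}")
```

Each degraded variant can then be searched for in the OCR output alongside the original query. The number of variants grows quickly as the per-character error rate rises, which is consistent with such approaches becoming less effective when OCR error rates are high.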
An alternative approach is to perform the search in the image domain, rather than the text domain, which avoids the need to perform OCR. In such an approach, the text query is converted into an image query and the document image is then searched for occurrences of the image query. Whilst this approach can achieve better results for documents that produce high OCR error rates, image searches require a large amount of processing power and, as a result, the approach does not scale to large collections of documents.
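By way of illustration only, the cost of searching in the image domain can be seen from a naive sliding-window comparison. The sketch below assumes that the text query has already been rendered into a small binary bitmap and that the document page is also a binary bitmap. The function name `find_matches` and the `max_mismatch` tolerance are hypothetical, and practical systems may use quite different matching techniques.

```python
def find_matches(page, query, max_mismatch=0):
    """Slide the query bitmap over the page bitmap.

    Both arguments are lists of rows of 0/1 pixel values. Returns the
    (row, col) of the top-left corner of every window that differs from
    the query in at most `max_mismatch` pixels.
    """
    page_h, page_w = len(page), len(page[0])
    query_h, query_w = len(query), len(query[0])
    hits = []
    for r in range(page_h - query_h + 1):
        for c in range(page_w - query_w + 1):
            # Count differing pixels between the query and this window.
            mismatch = sum(
                page[r + i][c + j] != query[i][j]
                for i in range(query_h)
                for j in range(query_w)
            )
            if mismatch <= max_mismatch:
                hits.append((r, c))
    return hits


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ]
    query = [
        [1, 0],
        [1, 1],
    ]
    print(find_matches(page, query))
```

In this sketch every query pixel is compared at every candidate position on every page, so the work is proportional to the page area multiplied by the query area and must be repeated for each page in the collection.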
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known methods of performing IR from printed text.