Numerous organizations must review and extract information from large repositories of imaged documents. Document images may contain information in a variety of languages and can be printed or handwritten. These document images cannot be searched directly using typical information retrieval techniques because their contents are represented as collections of pixels rather than as machine-readable characters.
Organizations attempting to exploit information from imaged document pages are the last link in a complex chain of circumstances that affect the quality of the pixel collections they are attempting to manipulate.
First, the page producer selects a typeface and size (e.g., Times 14) for use in imaging the textual information, producing a particular visual appearance. Next, the page producer selects a particular hardware device (e.g., an Epson 940 inkjet printer) to produce the paper copy; different printers affect the visual representation significantly. After production, the page may be subjected to a variety of processes that alter its visual representation. The page may be copied using copier devices that introduce distortions or other visual artifacts. The page may be subjected to environmental insults such as being crumpled or obscured with dirt or liquid. Finally, when the page is scanned into the database that the system will use for search operations, the visual representation of the page image is influenced by the quality and characteristics of the scanner as well as by the quality of the scanning technique employed.
Most approaches to the problem of searching imaged documents start with an initial step of converting written content from an image format to electronic text. Traditional solutions are based on optical character recognition (OCR) techniques, which have numerous problems. First, as discussed above, document images may be in less than ideal condition. Distortion, rotation, duplication artifacts, or transmission noise may be present and can preclude effective OCR processing. Second, the OCR conversion process can be too slow to cope with required document processing speeds. Third, normal error rates in OCR conversion have a significant negative impact on downstream use of the textual information. Fourth, there are many languages for which there are no OCR conversion engines at all or no engines of acceptable quality.
Because of these problems with existing practices, an approach was needed that searches directly for arbitrary written information contained in imaged documents, eliminating the OCR process. The approach of the present invention is called optical word recognition (OWR). The present invention advantageously uses techniques to search for arbitrary textual information contained in imaged documents. The result is a significant advance in high-speed search for textual information within imaged documents. The present invention can be used, for example, in language identification, signature identification, and signature detection. It is especially useful in searching for images in large databases.