1. Field of the Invention
This invention relates to image based document management, and in particular, it relates to image based document indexing and retrieval.
2. Description of Related Art
In an image based document management system, document indexing refers to storing images of documents in a document database in association with information regarding the documents (index information); document retrieval refers to retrieving desired document images for review, manipulation, management or other purposes, such as for comparing a stored document image with a scanned image of a hard copy document. A common type of document image indexing and retrieval method relies on a document ID placed on the document; images of the document are stored in a database along with the document ID for document management purposes. For example, a printed document can be scanned back and the document ID carried on the printed document read; the stored image is then retrieved from the database based on the document ID, and the stored image may be compared to the scanned image of the printed document. The document ID may be carried on the document itself either explicitly as alphanumerical symbols or barcodes (such as UPC code, QR code, etc.), or implicitly as watermarks, decorative glyphs or other data hiding patterns that are not perceptually visible.
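The document ID based workflow described above can be illustrated by the following minimal sketch. It is an illustration under stated assumptions rather than any particular system's implementation: the names DocumentStore, index, retrieve and matches_stored are hypothetical, the database is modeled as an in-memory dictionary, and byte equality stands in for whatever image comparison a real system would apply to a stored image and a scanned image.

```python
from typing import Dict, Optional


class DocumentStore:
    """Hypothetical in-memory stand-in for a document database keyed by document ID."""

    def __init__(self) -> None:
        self._images: Dict[str, bytes] = {}

    def index(self, doc_id: str, image: bytes) -> None:
        # Indexing: store the document image in association with its ID.
        self._images[doc_id] = image

    def retrieve(self, doc_id: str) -> Optional[bytes]:
        # Retrieval: look up the stored image by the ID read from the document.
        return self._images.get(doc_id)


def matches_stored(store: DocumentStore, scanned_id: str, scanned_image: bytes) -> bool:
    """Retrieve the stored image by ID and compare it with the scanned image.

    Byte equality is a placeholder assumption; a real comparison of a scanned
    hard copy against a stored image would need to tolerate scanning noise.
    """
    stored = store.retrieve(scanned_id)
    if stored is None:
        # The ID was damaged, missing, or never indexed, so retrieval fails.
        return False
    return stored == scanned_image
```

The sketch also shows the dependency that the following paragraph discusses: if the document ID cannot be read correctly from the scanned document, the lookup returns nothing and no comparison can be made.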
In certain applications, explicit marks on the documents are considered intrusive and not acceptable to customers. Implicit data hiding methods are generally sensitive to noise. In other cases, the added document ID, whether explicit or implicit, may be damaged, contaminated, or lost during print-and-scan or document circulation processes. Document image indexing and retrieval systems that use document characteristics and/or image features, if implemented properly, are more reliable than methods that rely on a document ID.
A number of methods have been proposed for retrieval of document images. D. Doermann, The Indexing and Retrieval of Document Images: A Survey (1998), available on the Internet at http://lampsrv02.umiacs.umd.edu/pubs/TechReports/LAMP_013/LAMP_013.pdf, summarizes the advances in this area up to 1998. Existing document image retrieval methods can be classified into two categories. The first and more popular approach uses text string codes that are obtained via user input, annotations, and/or Optical Character Recognition (OCR). Examples include U.S. Pat. Nos. 4,748,678, 5,628,003, 7,751,624 and US Patent Application Publication No. 2008/0162603. These methods are language dependent because they rely on OCR or user input. The second approach is image based. Image based document retrieval can be further separated into two types: (1) use of document layout and zone/block information, for example, U.S. Pat. Nos. 5,926,824, 6,002,798 and US Patent Application Publication No. 2008/0244384 A1; and (2) use of image features, for example, U.S. Pat. Nos. 5,943,443, 7,475,061 and 8,036,497 use character features or word level topology, U.S. Pat. Nos. 6,397,213 and 8,027,550 extract features from document zones/blocks, and U.S. Pat. No. 7,912,291 employs bit features in compressed JPEG format. Many of the aforementioned methods require user interaction to carry out retrieval correctly because the retrieval information is not sufficiently distinctive.