This invention relates to document management systems and, more particularly, to a method for processing document image information in a database of document images.
The proliferation of low-cost, high-capacity electronic storage of document images has enabled users to keep ever-increasing amounts and varieties of documents, previously stored in hard copy format, as electronic information online. While this revolution in storage technology has reduced the cost of document storage, it brings with it the need for more efficient methods of searching through a myriad of online documents to find a particular document or set of documents of interest to the user.
Methods for locating a document of interest have been rudimentary at best. Typically, in these methods, documents are scanned into the computer and an Optical Character Recognition ("OCR") program converts each image into a textual file. Next, a form of keyword matching search is performed, in which the system scans either the entire text of all documents or, for each document, a set of carefully chosen keywords that the person who initially classified the document thought to be representative of it. The problem with the first approach is the high search cost involved in traversing a large number of documents in their entirety. The difficulty with the second approach is that different persons will employ different strategies for filing and retrieval. As the heterogeneity of documents contained in databases increases, the reliability of these traditional search methods diminishes.
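The two conventional keyword-matching approaches described above can be sketched as follows. This is a minimal illustration only, assuming the OCR text is already available as a string; the `Document` type, its field names, and both function names are hypothetical and are not drawn from any particular system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Document:
    # Hypothetical record: identifiers and field names are illustrative only.
    doc_id: str
    text: str  # OCR output for the scanned document
    keywords: Set[str] = field(default_factory=set)  # chosen by the person who filed it


def full_text_search(docs: List[Document], term: str) -> List[str]:
    """First approach: scan the entire OCR text of every document."""
    needle = term.lower()
    return [d.doc_id for d in docs if needle in d.text.lower()]


def keyword_search(docs: List[Document], term: str) -> List[str]:
    """Second approach: compare only against the filer's chosen keywords."""
    needle = term.lower()
    return [d.doc_id for d in docs if needle in {k.lower() for k in d.keywords}]


docs = [
    Document("a", "Invoice for office chairs and desks", {"invoice", "furniture"}),
    Document("b", "Purchase order: two desks", {"order"}),
]

full_hits = full_text_search(docs, "desks")   # touches every character of every document
keyword_hits = keyword_search(docs, "desks")  # cheap, but depends on the filer's choices
```

In this sketch the full-text scan finds both documents at the cost of reading all of the text, while the keyword lookup finds neither, because no filer happened to choose "desks" as a keyword.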
Recognizing the opportunity to exploit the information content of the image portion of documents, several attempts have been made to search for documents based upon matching of small images contained in the documents. For example, M. Y. Jaisimha, A. Bruce and T. Nguyen, in their work "DocBrowse: A System for Textual and Graphical Querying on Degraded Document Image Data," describe a system which searches for documents based upon company logos in letterheads. D. Doermann et al., in "Development of a General Framework for Intelligent Document Retrieval," outline a system for matching documents based upon generation and matching of an image descriptor which describes low-level features and high-level structure of a document. Unfortunately, this method requires intensive processing of the image information, which greatly curtails its use in most commercial applications.
While such methods provide document search capability via elemental matching of image characterization vectors, they do not provide a basis for extracting image information useful for organizing a large database of document images. Additionally, since these methods apply to grayscale images, further work is needed to accommodate a database of binary images. These and other shortcomings indicate that what is needed is a method and system for efficiently examining document images.