This invention relates to document management systems and, more particularly, to a method of navigating through a database of document images.
The proliferation of low-cost, high-capacity electronic storage of document images has enabled users to keep ever-increasing amounts and varieties of documents, previously stored in hard copy format, as electronic information online. While this revolution in storage technology has reduced the cost of document storage, it brings with it the need for more efficient methods of searching through a myriad of online documents to find a particular document or set of documents of interest to the user.
Methods for locating a document of interest have been rudimentary at best. Typically, in these methods, documents are scanned into the computer and an Optical Character Recognition ("OCR") program converts each image into a text file. Next, a form of keyword matching search is performed, with the system scanning either the entire text of all documents or, for each document, a set of keywords chosen as representative of that document by the person who initially classified it. The problem with the first approach is the high search cost involved in traversing a large number of documents in their entirety. The difficulty with the second approach is that different persons will employ different strategies for filing and retrieval, so the keywords chosen at filing time may not match the terms used at retrieval time. As the heterogeneity of documents contained in databases increases, the reliability of these traditional search methods diminishes.
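The two keyword matching approaches described above can be illustrated with a minimal sketch. The document identifiers, OCR text, assigned keywords, and function names below are all hypothetical and chosen only for illustration; they are not taken from any particular prior system.

```python
from collections import defaultdict

# Hypothetical OCR output: document id -> recognized text.
ocr_text = {
    "doc1": "invoice for office supplies dated march",
    "doc2": "letter regarding annual supplies contract",
    "doc3": "memo on travel reimbursement policy",
}

# Hypothetical keywords assigned by whoever filed each document.
assigned_keywords = {
    "doc1": {"invoice", "billing"},
    "doc2": {"correspondence", "contract"},
    "doc3": {"memo", "travel"},
}


def full_text_search(term):
    """First approach: scan the entire text of every document."""
    return sorted(
        doc_id for doc_id, text in ocr_text.items() if term in text.split()
    )


def build_keyword_index(keywords):
    """Second approach: map each assigned keyword to its documents."""
    index = defaultdict(set)
    for doc_id, words in keywords.items():
        for word in words:
            index[word].add(doc_id)
    return index


def keyword_search(index, term):
    """Look up only the assigned keywords, not the document text."""
    return sorted(index.get(term, set()))


keyword_index = build_keyword_index(assigned_keywords)
```

In this sketch, full_text_search("supplies") must read every document and finds doc1 and doc2, whereas keyword_search(keyword_index, "supplies") finds nothing, because the person who filed those documents did not choose that keyword.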
Recognizing the opportunity to exploit the information content of the image portion of documents, several attempts have been made to search for documents based upon matching of small images contained in the documents. For example, M. Y. Jaisimha, A. Bruce and T. Nguyen, in their work "DocBrowse: A System for Textual and Graphical Querying on Degraded Document Image Data," describe a system which searches for documents based upon company logos in letterheads. D. Doermann et al., in "Development of a General Framework for Intelligent Document Retrieval," outline a system for matching documents based upon generation and matching of an image descriptor which describes low-level features and high-level structure of a document. Unfortunately, this method requires intensive processing of the image information, which greatly curtails its use in most commercial applications.
While such methods provide document search capability via elemental matching of image characterization vectors, they do not provide a useful method to organize a large database of document images. These and other shortcomings indicate that what is needed is a method and system for efficiently searching a database of document images. Such a method would expedite searching by organizing the database according to the textual as well as the visual characteristics of document images.