The present invention relates to the field of document storage and retrieval and, in particular, to the retrieval of a document from a document database using content from an example page taken from the document.
A general approach to the problem of retrieving a target document from a document database is to store a set of key words with each document, either physically with the document or, more probably, in a lookup table in which the keys are indexed and table entries point to documents in the database. Keys can be easily generated from documents if electronic versions of the documents are available. If only paper versions of the documents are available, they can be scanned to form digital images of the pages of the documents, and the digital images can be processed by a character recognizer to extract the text of the document and thus the keys. In a more labor-intensive system, the keys can be manually entered.
To retrieve a document, the keys are supplied to a search engine. Where a user is not likely to remember the keys for every document stored in the database, the user can retain an example page from each document as it is stored and supply that example page to a page analyzer for key extraction.
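By way of illustration only, the keyed lookup table and the retrieval step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular prior art system: the function names build_key_index and lookup, the document identifiers, and the rule that a document matches only if it is indexed under every supplied key are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_key_index(documents):
    """Build a lookup table mapping each key word to the ids of documents stored with it.

    `documents` is assumed to be a dict of {document id: list of key words}.
    """
    index = defaultdict(set)
    for doc_id, keys in documents.items():
        for key in keys:
            # Keys are lower-cased so that lookups are case-insensitive (an assumption).
            index[key.lower()].add(doc_id)
    return index


def lookup(index, keys):
    """Return the ids of documents indexed under every one of the supplied keys."""
    matches = None
    for key in keys:
        ids = index.get(key.lower(), set())
        matches = ids if matches is None else matches & ids
    # Copy the result so callers cannot mutate the sets held in the index.
    return set(matches) if matches else set()


# Hypothetical database of two documents and their key words.
documents = {
    "doc1": ["storage", "retrieval"],
    "doc2": ["retrieval", "hashing"],
}
index = build_key_index(documents)
single_key_result = lookup(index, ["Retrieval"])
two_key_result = lookup(index, ["retrieval", "hashing"])
```

In this sketch, the keys extracted from an example page would be passed to lookup in place of keys the user remembers. Any key that is misread during character recognition is simply absent from the table, so the lookup returns no match for it.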
The disadvantage of this general approach is that either the documents in the document database and the example pages must originate and remain in electronic form, or character recognition must be performed on the example pages to determine the keys. Thus, an example page must either be electronic or be of sufficient quality that errors do not occur in the scanning and character recognition process.
One example of a prior art system for document presentation is the RightPages document presentation system described in G. Story, "The Right Pages Image-Based Electronic Library for Alerting and Browsing", COMPUTER, Sept. 1992. In that system, a user is presented with a series of journal covers; the user browses the journal covers to find a desired journal, then browses its table of contents, and then selects an article from the journal. Once an example page of a journal article is selected, the system retrieves the target article from a document database. The disadvantages of the RightPages system are that the icons are presented on a computer monitor, and are therefore of lower resolution than print, and that the links between the journal covers and the pages must already exist. Thus, the user must be at the computer monitor to browse example pages.
The document storage and retrieval system taught by U.S. Pat. No. 5,465,353 to Hull, et al., entitled "IMAGE MATCHING AND RETRIEVAL BY MULTI-ACCESS REDUNDANT HASHING" (commonly owned by the assignees of the present application, incorporated by reference herein, and hereinafter "Hull") is a system for retrieving a target document from a document database by submitting a paper example page retained from the target document to a search engine. The search engine analyzes the example page and determines likely matches among the documents in the database. Where many documents are to be stored, however, storage and organization of the example pages raise some of the same problems that document database storage tries to alleviate, such as having to allocate storage space for paper pages and keeping them organized.
Thus, what is needed is a system for efficiently storing example pages for use in document retrieval and document management.