The present invention relates generally to the field of document storage and retrieval, and in particular to the retrieval of a document from a document database using content from an example page taken from the document.
Electronic storage of documents provides a number of advantages over storage of paper documents. An entire bitmap of a page image can now be scanned and stored on magnetic disk for less than the cost of a sheet of paper. Also, text and graphical editing operations, such as cut and paste, are easy to perform on electronic documents. These advantages exist whether or not information is electronically extracted via optical character recognition (OCR) or otherwise. However, extraction provides additional advantages, such as text editing and key word searching. "Extracted" is a term used to describe a document stored in a form which is not merely a bitmap of the document image. Word processing documents are a form of extracted documents.
However, the paper medium maintains some advantages over electronic media. Paper is portable and viewable without a reading apparatus or the need for a power source. The standard size of paper sheets allows for easy passage between a variety of containers, from envelopes to ring binders. Two characteristics of paper in particular make browsing easy, namely the high "flip speed" possible from collated sheets, and the much higher resolution available on the printed page relative to the resolution of computer monitors.
Other, less commonly considered attributes of paper include tangibility and social rituals. Harper and Sellen (Collaborative Tools and Practicalities of Professional Work at the International Monetary Fund, Conference Proceedings of CHI '95, Denver, p. 122-129) point out that paper can be a key part of interpersonal communication: "Paper documents can be the focus of a face-to-face meeting, can be placed on a desk in view of all parties . . . and paper documents can be ritually exchanged once an agreement as to its interpretation has been made." Whittaker and Schwarz (Back to the Future: Pen and Paper Technology Supports Complex Group Coordination, Conference Proceedings of CHI '95, Denver, p. 495-502) describe one group's replacement of computer coordination software with paper placed on wallboards, attributing this to the size, public nature, and visual and material characteristics of paper. They also suggest that the simple manual actions involved with paper manipulation, or note taking, encourage additional mental reflection on the work at hand.
Given the persistence of paper in the office environment, it is worthwhile to consider creating tools that allow electronic systems and paper documents to interact. Examples of this approach include Protofoil™ (Protofoil: Storing and Finding the Information Worker's Paper Documents in an Electronic File Cabinet, Conference Proceedings of CHI '94, Boston, p. 180-185), which uses a form of electronic-paper interaction in an office filing system. In that system, users place paper cover sheets before a document in an automatic document feeder to provide some job control and document attribute information.
A general approach in an electronic document database system to the problem of retrieving a target document from the document database is to store a set of key words with each document either physically with the document or, more probably, in a lookup table in which the keys are indexed and table entries point to documents in the database. Keys can be easily generated from documents if "extracted" versions of documents are available. If only paper versions of the documents are available, they can be scanned to form digital images of the pages of the documents and the digital images can be processed using OCR to extract the text of the document and thus the keys. In a more labor-intensive system, the keys can be manually entered.
In such a system, to retrieve a document, the keys are supplied to a search engine. Where a user is not likely to remember the keys for every document stored in the database, the user can retain an example page from each document as it is stored and supply that example page to a page analyzer for key extraction.
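The keyed lookup table and the retrieval step described above can be illustrated with a minimal sketch. It assumes that the text of each document is already available in extracted form (for example, after OCR). The names `extract_keys`, `KeyIndex`, and the `min_length` parameter are hypothetical and chosen only for illustration; they are not taken from any system described here.

```python
import re
from collections import defaultdict


def extract_keys(text, min_length=4):
    """Return the set of lowercase words with at least min_length letters.

    This is an assumed, deliberately simple key-selection rule.
    """
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_length}


class KeyIndex:
    """Lookup table in which each key points to the documents containing it."""

    def __init__(self):
        self._table = defaultdict(set)  # key -> set of document identifiers

    def add(self, doc_id, text):
        """Index a document under every key extracted from its text."""
        for key in extract_keys(text):
            self._table[key].add(doc_id)

    def search(self, keys):
        """Return the identifiers of documents that contain every supplied key."""
        matches = [self._table.get(k.lower(), set()) for k in keys]
        return set.intersection(*matches) if matches else set()


index = KeyIndex()
index.add("doc-1", "Quarterly budget report for the finance committee")
index.add("doc-2", "Budget proposal for the research laboratory")

print(sorted(index.search(["budget"])))
print(sorted(index.search(["budget", "research"])))
```

In this sketch the keys supplied to `search` could equally be produced by running `extract_keys` on the text of a retained example page, which corresponds to the page analyzer mentioned above.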
The disadvantage of this general approach is that the documents in the document database and the example pages either need to originate and remain in the extracted form, or OCR would need to be done on example pages to determine the keys. Thus, the example page either needs to be electronic or has to be of sufficient quality that errors do not occur in the scanning process or the character recognition process required to extract the keys from a bitmap.
One example of a prior art system for document presentation is the RightPages document presentation system described in G. Story, "The RightPages Image-Based Electronic Library for Alerting and Browsing", COMPUTER, September 1992. In that system, a user is presented with a series of journal covers; the user browses the journal covers to find a desired journal, then browses its table of contents, and then selects an article from the journal. Once an example page of a journal article is selected, the system retrieves the target article from a document database. A disadvantage of the RightPages system is that the icons are presented on a computer monitor, and are therefore of lower resolution than print, and the links between the journal covers and the pages must already exist. Thus, the user must be at the computer monitor to browse example pages.
The document storage and retrieval system taught by Hull is a system for retrieving a target document from a document database by submitting a paper example page retained from the target document to a search engine. The search engine analyzes the example page and determines likely matches among the documents in the database. Where a very large number of documents are to be stored, however, storage and organization of the example pages raise some of the same problems that document database storage tries to alleviate, such as having to allocate storage space for paper pages and keep them organized.
Thus, what is needed is a system for efficiently storing example pages for use in document retrieval and document management.