The present invention relates to a document management system, and more particularly to a method and apparatus for assisting a user with the tasks of querying and retrieving documents.
With advances in electronic media, documents are becoming widely available in electronic form. Some documents are available electronically by virtue of their creation using software applications. Other electronic documents are available via electronic mails, the Internet, and various other electronic media. Yet others become available in electronic form by virtue of being scanned-in, copied, or faxed.
Today's computing systems are becoming economical tools for organizing and manipulating these electronic documents. With the rapid development of storage system technology, the cost of storing an image of a page of a document on digital media has greatly decreased, perhaps becoming more economical than the cost of printing and storing the image on a sheet of paper. Digital document storage also provides additional advantages, such as facilitating later electronic searches and retrieval and making possible the automatic filing of documents.
For an efficient and useful digital storage system, the user must be able to query for and retrieve documents quickly and efficiently. In fact, the utility of many storage systems often depends on the effectiveness of the query and search mechanisms. This, in turn, depends largely on the techniques used to define, describe, and catalog documents. Naturally, these tasks become more complicated as the types of documents vary and the number of documents increases.
Many conventional digital storage systems allow for text-based searching of documents through the use of keyword extraction. Although several variants of this technique exist, the user generally defines a list of keywords and the system searches for and retrieves documents containing these keywords. The search is typically performed over whole documents, without distinguishing between sections of documents. Different weighting functions are used to improve the likelihood of success in retrieval of the desired documents.
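By way of illustration only, the conventional whole-document keyword search described above can be sketched as follows. The function names (`tokenize`, `keyword_score`, `keyword_search`) and the simple weighted term-count scoring are assumptions made for this sketch and are not taken from any particular conventional system.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(document_text, keywords, weights=None):
    """Score a whole document as a weighted count of keyword occurrences.

    `weights` is a hypothetical per-keyword weighting table; any keyword
    without an entry gets a weight of 1.0.
    """
    counts = Counter(tokenize(document_text))
    weights = weights or {}
    return sum(weights.get(k.lower(), 1.0) * counts[k.lower()] for k in keywords)


def keyword_search(documents, keywords, weights=None):
    """Return ids of documents containing any keyword, best score first.

    The score is computed over the whole document text, with no
    distinction between sections of the document.
    """
    scored = [
        (keyword_score(text, keywords, weights), doc_id)
        for doc_id, text in documents.items()
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]
```

Because the score is computed over the entire text of each document, a keyword appearing far from a picture counts exactly as much as one appearing in its caption, which is the limitation discussed next.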
Most conventional digital storage systems, including those that purely use keyword extraction, do not provide mechanisms to define and catalog documents using images (or pictures) contained in the documents. The images can include anything that is not recognized as text, such as graphics, applications, executable code, sounds, movies, and so forth. Many conventional systems process the text in the documents and ignore the picture information. However, many documents contain both text and images, and it is beneficial to make use of the image information for improved query and search performance. The benefits become greater as the use of images becomes more prevalent and the number of documents having images increases.
As can be seen, a document management system that utilizes images in documents to enhance the effectiveness of the query and retrieval process is highly desirable.
The invention provides powerful document query and search techniques. The documents to be searched are “decomposed” into “zones,” with each zone representing a grouping of text or graphical image (also referred to herein as a “picture”) or a combination thereof. The zones are generally defined within, and associated with, a particular document page. One or more of the zones in the documents are selected for annotation with text (e.g., keywords), image features, or a combination of both. Document query and search are based on a combination of text annotations and image features. The invention can be used to search for text and images. As a simple example, the user can enter a text query, such as “sunset”, and the system can return images of sunsets because they occur in documents (in the database) that contain the word “sunset” in close physical proximity to the image.
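The “sunset” example can be sketched in code as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `Zone` record, its bounding-box representation, and the use of center-to-center distance with a `max_distance` threshold as the measure of “close physical proximity” are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Zone:
    """Hypothetical record for one zone of a document page."""
    doc_id: str
    page: int
    kind: str            # "text" or "image"
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    text: str = ""       # recognized text; empty for image zones


def center(bbox):
    """Return the center point of a bounding box."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def images_near_text(zones, query, max_distance=200.0):
    """Return image zones lying near a text zone that contains `query`.

    "Near" is assumed here to mean: same document, same page, and zone
    centers no farther apart than `max_distance`.
    """
    needle = query.lower()
    matching_text = [
        z for z in zones if z.kind == "text" and needle in z.text.lower()
    ]
    hits = []
    for image in zones:
        if image.kind != "image":
            continue
        for text_zone in matching_text:
            same_page = (text_zone.doc_id, text_zone.page) == (image.doc_id, image.page)
            if not same_page:
                continue
            (ax, ay), (bx, by) = center(image.bbox), center(text_zone.bbox)
            if hypot(ax - bx, ay - by) <= max_distance:
                hits.append(image)
                break  # one nearby matching text zone is enough
    return hits
```

In this sketch the picture itself is never analyzed; it is returned solely because matching text lies near it on the same page. Image features, where available, would supplement this text-based association.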
A specific embodiment of the invention provides a method for operating a document retrieval system. In this method, an unindexed (also referred to as a “query” or “search key”) document is captured into electronic form. The unindexed document is then decomposed into a number of zones, with each zone including text or image or a combination thereof. The zones can be segmented into text zones and image zones. Descriptors are formed for at least one of the zones. The descriptors can include text annotations for text zones, and text annotations and image features for image zones. Documents in a document database are searched, based on the formed descriptors for the unindexed document and the descriptors for the documents in the database. At least one document in the database is identified as matching the unindexed document and reported as such.
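The descriptor-matching portion of this method can be sketched as follows, taking the zone decomposition as already done. The sketch is illustrative only and rests on several assumptions that are not part of the summary: a `Descriptor` holds a set of keywords plus an optional numeric feature vector, text annotations are compared by Jaccard overlap, image features by cosine similarity, the two are blended by a hypothetical `text_weight`, and a document matches when its average best-zone score reaches a hypothetical `threshold`.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical zone descriptor: text annotations plus image features."""
    keywords: frozenset = frozenset()
    features: tuple = ()   # empty for text zones


def jaccard(a, b):
    """Overlap of two keyword sets, in [0, 1]."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    if not u or not v or len(u) != len(v):
        return 0.0
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def zone_similarity(query, candidate, text_weight=0.5):
    """Blend text and image similarity when both zones carry features."""
    text_sim = jaccard(query.keywords, candidate.keywords)
    if query.features and candidate.features:
        image_sim = cosine(query.features, candidate.features)
        return text_weight * text_sim + (1 - text_weight) * image_sim
    return text_sim


def document_score(query_descriptors, doc_descriptors):
    """Average, over query zones, of the best-matching database zone."""
    if not query_descriptors or not doc_descriptors:
        return 0.0
    best = [
        max(zone_similarity(q, d) for d in doc_descriptors)
        for q in query_descriptors
    ]
    return sum(best) / len(best)


def search(query_descriptors, database, threshold=0.5):
    """Return (doc_id, score) pairs at or above `threshold`, best first."""
    scored = [
        (document_score(query_descriptors, descriptors), doc_id)
        for doc_id, descriptors in database.items()
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in ranked if score >= threshold]
```

Here `database` is assumed to map each document identifier to the list of descriptors previously formed for that document's zones, so the search compares descriptors with descriptors and never touches the stored page images.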
Another specific embodiment of the invention provides a method for generating search keys for querying a document database. In this method, a query (or search key) document is formed, and a number of zones is defined for that document. Each zone is associated with text or image or a combination thereof. Descriptors for at least one of the zones are formed. Each descriptor is associated with a particular zone and includes search key information. The descriptors are used as search keys for querying the document database.
Yet another specific embodiment of the invention provides a document management system that includes an electronic storage system and a control system. The electronic storage system is configured to store a database of documents and descriptors for documents in the database. The control system couples to the electronic storage system. The control system is configured to: (1) generate descriptors for at least one zone of an unindexed document, (2) search documents in the database using the generated descriptors for the unindexed document and the descriptors for documents in the database, (3) identify at least one document as matching the unindexed document, and (4) display the identified document.
The invention also provides software products that implement the methods described herein.
The foregoing, together with other aspects of this invention, will become more apparent when referring to the following specification, claims, and accompanying drawings.