1. Field of the Invention
The present invention relates to a method and apparatus for creating a collection of indexed document images whereby the document images may be retrieved through the index, and to a method and apparatus for rapidly browsing through document images by viewing abstract structural views of the document images rather than the document images themselves.
2. Description of the Related Art
Recently, as increasingly larger storage devices have become available, it has become possible to store documents not simply as ASCII text but also as a full facsimile image of the document. More specifically, it is now commonplace to convert a document into a computer-readable bit map image of the document and to store the bit map image of the document. Accordingly, whereas ASCII text storage permitted storage and display of only text portions of documents, it is now possible to store a document in computer-readable form and to display not only the text but also pictures, line art, graphs, tables and other non-text objects in the document. Likewise, it is also possible to store and display documents such that text attributes, such as size, font, position, etc., are preserved.
Despite these advances, however, it is still difficult to retrieve document images into computer memory quickly, and then to browse quickly through computer-displayed document images, for example, in a situation where a computer operator retrieves many document images and searches through those document images to find a particular document. These difficulties can be attributed to at least two limitations. First, current limitations on the bandwidth of computer systems limit the speed at which documents may be retrieved from storage and displayed. For example, at 300 dots-per-inch resolution, an ordinary 8 1/2 by 11 inch black and white document requires approximately 8.4 million bits to store a full document image. Adding halftone (grey levels) or color to the image, or increasing the resolution at which the image is stored, can easily increase storage requirements to many tens of millions of bits. The time required to retrieve those bits from storage and to create and display the resulting image is significant, even with current high speed computing equipment. The time is lengthened even further in situations where the document image is retrieved from storage in a first computer and electronically transmitted, for example, by modem, to a second computer for display on the second computer.
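The storage figure quoted above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of any described apparatus. It assumes an uncompressed bit map, one bit per pixel for black and white, and eight bits per pixel for grey levels. The function names and the 9600 bit-per-second modem rate are hypothetical values chosen for the example.

```python
def image_bits(width_in, height_in, dpi, bits_per_pixel=1):
    """Size in bits of an uncompressed bit map image of a scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel


def transfer_seconds(bits, bits_per_second):
    """Idealized transfer time, ignoring compression and protocol overhead."""
    return bits / bits_per_second


# 8 1/2 by 11 inch page at 300 dots per inch, black and white (1 bit per pixel).
bw_bits = image_bits(8.5, 11, 300)

# Same page with 8-bit grey levels (assumed depth, for illustration).
grey_bits = image_bits(8.5, 11, 300, bits_per_pixel=8)

# Assumed modem rate of 9600 bits per second (hypothetical).
modem_seconds = transfer_seconds(bw_bits, 9600)

print(bw_bits)        # 2550 * 3300 = 8415000 bits, about 8.4 million
print(grey_bits)      # 67320000 bits, i.e. tens of millions
print(modem_seconds)  # 876.5625 seconds, roughly 15 minutes
```

Under these assumptions, even the black and white page would take on the order of minutes to transmit uncompressed over such a link, which illustrates why retrieval and display of full document images is slow.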
Second, even when a full document image is displayed to an operator, there is ordinarily too much information for an average operator to comprehend quickly. Much of the information displayed to the operator is not relevant to the operator's query and much time is therefore wasted in displaying the non-relevant information. And the presence of such non-relevant information can slow the operator in his attempt to locate and understand document information that is relevant to the query.
Moreover, simply retrieving appropriate documents for presentation to an operator from a large collection of documents can prove difficult because of the large amount of information that must be searched. Conventional document retrieval systems ordinarily rely on the creation of a text index by which text documents may be retrieved. With document images (as opposed to text documents), it has been proposed to subject the document images to optical character recognition processing ("OCR processing") and to index the resulting text. Systems such as those proposed in U.S. Pat. No. 5,109,439 to Froessl suggest that it is only necessary to OCR-process specific areas of the document to simplify the indexing process, but it has nevertheless heretofore proved difficult to create an adequate index for retrieval of document images.