The invention relates to a method and apparatus for indexing the content of displayable digital documents and to a method and apparatus for querying and retrieving a portion of a displayable digital document using such an index.
As increasingly large storage devices have become available, it is now common practice to store documents in digitized form. For example, a hardcopy document containing text and graphics may be digitized using a scanner into a bit map image and stored as a computer readable bit map file. Many other types of digitized formats are used including .PNG (Portable Network Graphics), .JPEG (Joint Photographic Experts Group), .GIF (Graphics Interchange Format) and .TIFF (Tag Image File Format). Other types of file formats capable of handling images and text, such as .HTML (Hypertext Markup Language) and .PDF (Portable Document Format), are also commonly used and stored. Each of these formats can typically be displayed using a particular displayable digital document viewer software tool. Some tools are able to handle various formats and have the ability to convert from one to another.
Applications which store displayable digital documents have an advantage over older systems which stored documents simply as ASCII (American Standard Code for Information Interchange) text in that pictures, line art, images, graphs, tables, and other parts of the document are also stored and displayed. The term ASCII text as used herein shall include other text codes such as EBCDIC (Extended Binary Coded Decimal Interchange Code) text, BCD (Binary Coded Decimal) text, and equivalents including special codes for foreign language diacritical marks or different alphabets such as Cyrillic, Greek, Arabic, Armenian or Sanskrit. Indexing of such documents, in order to permit searching, browsing, and easy retrieval, is difficult, however, because index methods applied in the past to ASCII text documents do not work with these new formats. Various approaches have been tried to overcome this problem.
King et al. in U.S. Pat. No. 5,600,775 describe an indexing scheme to allow multimedia developers to change data in a vast file such as a full motion video. Individual frames of video are annotated with text, graphics, hand drawn images, and digital audio without modification to the original video information. The video data and annotations are stored separately. The annotations are related to a particular video frame by an index such as a frame video timing parameter.
Sotomayor in U.S. Pat. No. 5,708,825 describes an indexing method for text data. The method uses weighting rules to determine from the textual data what are the most significant phrases. Various types of summary pages are generated including key-topic index entries and hyperlinks to pages where the key-topics appear.
Yokoyama et al. in U.S. Pat. No. 5,983,171 describe a method of automatically compiling an index of a text document. Words and phrases are extracted using a word or phrase analysis program. The respective locations of the words or phrases in the document are also extracted at the same time. A user inputs an indexing object extraction condition. Words and phrases previously extracted are registered into an index candidate dictionary based on relevance to the indexing object extraction condition. Finally, an index is compiled using the index candidate dictionary.
Palmer et al. in U.S. Pat. No. 6,002,798 describe a method for creating an index for storage and retrieval of document images. A document image is obtained by scanning an original document. The structure of the document is determined by conventional block selection techniques which utilize a rule-based knowledge system for identifying specific areas in a document and for determining the content of the image within those areas so that the document image is decomposed into a general set of objects. One block selection technique is described in U.S. Pat. No. 5,680,479 by Wang et al. U.S. Pat. No. 6,002,798 filed Jan. 19, 1993 by Palmer et al. and U.S. Pat. No. 5,680,479 filed Apr. 24, 1992 by Wang et al. are hereby incorporated by reference in their entirety. The structure is stored along with the document. A retrieval index may be created by using the block selection techniques to identify areas of a first type, e.g., title areas. The areas are converted to text by optical character recognition (OCR) techniques. The converted text is then indexed to form the retrieval index which is stored together with the document image.
Downs et al. in U.S. Pat. No. 6,067,553 describe a method of re-organizing the data in a .PDF file in order to permit a user to view parts of the file before the entire file is loaded. By repeatedly accessing a recognition look-up table and dynamically updating an object definition look-up table, a graphics processor may display contents of a file as they arrive, rather than after the entire contents have been received.
Despite the foregoing developments, providing a satisfactory method of indexing displayable digital documents in a relational database remains a problem. In accordance with the present invention, there is defined a new method and system of indexing such documents into a relational database. It is believed that such a method and system constitute a significant advancement in the art.
It is therefore a principal object of the present invention to enhance the indexing art by providing a method of indexing a displayable digital document with enhanced capabilities.
It is another object to provide such a method having enhanced querying and retrieval capabilities.
It is a further object to provide a system with enhanced indexing capabilities.
It is yet another object to provide a computer program product capable of indexing a displayable digital document with enhanced capabilities.
These and other objects are attained in accordance with one embodiment of the invention wherein there is provided a method of indexing a displayable digital document, comprising the steps of, providing a displayable digital document, displaying the document with a displayable digital document viewer and selecting a field for indexing using a pointing device, recording offsets and a bounding rectangle of the selected field, comparing the bounding rectangle with other bounding rectangles in the displayable digital document, and recording in a relational database, a page number and offsets of the other bounding rectangles which compare.
In accordance with another embodiment of the invention there is provided a method of indexing a displayable digital document, comprising the steps of, providing a displayable digital document having one or more document fields, providing a database field in a relational database, displaying the document with a displayable digital document viewer and selecting one of the document fields for indexing corresponding to the database field, using a pointing device, recording offsets and a bounding rectangle of the selected field, comparing the bounding rectangle with other bounding rectangles in the displayable digital document, and recording in a relational database, a page number and offsets of the other bounding rectangles which compare.
In accordance with yet another embodiment of the present invention there is provided a system for indexing a displayable digital document, comprising, a displayable digital document, a displayable digital document viewer having a pointing device, the viewer adapted for selecting a field of the displayable digital document for indexing using the pointing device, means for recording offsets and a bounding rectangle of the selected field, means for comparing the bounding rectangle with other bounding rectangles in the displayable digital document, and means for recording in a relational database, a page number and offsets of the other bounding rectangles which compare.
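By way of illustration only, the recording and comparing steps recited in the foregoing embodiments may be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the names Rect, rects_compare, index_field and the table field_index are hypothetical, and because the text above does not state how bounding rectangles are compared, the sketch assumes two rectangles compare when their widths and heights agree within a small tolerance. The candidate rectangles are assumed to have been obtained already from the displayed document.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding rectangle of a field (hypothetical representation)."""
    page: int     # page number on which the rectangle appears
    x: int        # horizontal offset from the left edge of the page
    y: int        # vertical offset from the top edge of the page
    width: int
    height: int


def rects_compare(a: Rect, b: Rect, tol: int = 2) -> bool:
    """Assumed comparison rule: sizes agree within `tol` units."""
    return abs(a.width - b.width) <= tol and abs(a.height - b.height) <= tol


def index_field(conn, field_name, selected, candidates, tol=2):
    """Record page number and offsets of rectangles that compare with `selected`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS field_index "
        "(field TEXT, page INTEGER, x_offset INTEGER, y_offset INTEGER)"
    )
    # Keep every other rectangle that compares with the selected one.
    matches = [
        r for r in candidates
        if r != selected and rects_compare(selected, r, tol)
    ]
    conn.executemany(
        "INSERT INTO field_index VALUES (?, ?, ?, ?)",
        [(field_name, r.page, r.x, r.y) for r in matches],
    )
    conn.commit()
    return matches


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Rectangle selected by the user with the pointing device (assumed values).
    selected = Rect(page=1, x=100, y=50, width=200, height=20)
    others = [
        Rect(page=2, x=100, y=50, width=201, height=20),
        Rect(page=3, x=100, y=52, width=199, height=21),
        Rect(page=2, x=10, y=300, width=400, height=80),
    ]
    index_field(conn, "invoice_no", selected, others)
    for row in conn.execute("SELECT * FROM field_index ORDER BY page"):
        print(row)
```

In this sketch the relational database is an in-memory SQLite database chosen only for convenience; any relational database holding a page number and offsets per indexed field would serve the same illustrative purpose.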