1. Field of the Invention
The present invention relates to a new inverted index storage structure for dynamic document databases that can maintain document identifier (docID) order in the posting list upon frequent insertion, deletion, and modification of documents and that enables fast search for the posting of a specific document from a long posting list.
The invented storage structure uses large objects and subindexes. A large object stores a database object whose size exceeds one disk page and dynamically adjusts the storage space upon insertion and deletion. A large object typically employs a tree structure for dynamic allocation of space. We store each posting list in a large object and create a subindex on each posting list. A large object manages variable storage space of its posting list upon document addition and deletion. A subindex having docID as the key is used for docID-order preservation of a posting list upon document addition and for removal of a posting from a posting list upon document deletion.
2. Description of the Prior Art
A document can have several attributes such as the document number, modification date, authors, abstract, and body. Some of these attributes have a structured data type such as integer and character string, while others have an unstructured data type such as text, image, audio, and video. Document retrieval based on the attributes of the text data type is traditionally well performed by an information retrieval system, while document retrieval based on the attributes of structured data types by a database management system. Thus, for document retrieval based on the attributes of both text and structured data types, tight coupling of the database management system and information retrieval is necessary. In addition, the database management system can provide consistent management of documents and their indexes upon document addition, deletion, and modification.
Inverted index is an index used for information retrieval, which retrieves documents containing keywords given by a user. Inverted index keeps a list of postings for each keyword extracted from documents. A posting of a keyword consists of a document identifier (docID) and a list of locations in the document where the keyword occurs[Frakes, W. B. and Baeza-Yates, R., Information Retrievalxe2x80x94Data Structures and Algorithms, Prentice Hall, Englewood Cliffs, N.J., 1992.; Salton G., Automatic Text Processingxe2x80x94The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Press, 1988.]. The number of postings of a keyword, i.e., the number of documents containing the keyword, is called the document frequency, and the number of occurrences of a keyword in a document is called the term frequency. The document frequency indicates the length of the posting list of a keyword, and the term frequency the length of a posting. Information retrieval using inverted index is done by retrieving the posting list of the keyword given by a user. Thus, inverted index needs a storage structure to find the posting list of a keyword fast.
The storage space for the posting list increases when new postings of an added document are inserted into the posting list, while it decreases when old postings of a deleted document are removed from the posting list. Document modification also affects the length of the posting list because the old document is deleted and the modified document is added. Thus, inverted index needs a storage structure for managing variable storage space of the posting list efficiently.
If posting lists are kept in the order of docIDs, the boolean expression containing more than one keyword can be efficiently processed by sequentially scanning and merging the posting lists of the keywords only once. The query xe2x80x9cfind the documents about multimedia and database,xe2x80x9d for example, contains two keywords of xe2x80x9cmultimediaxe2x80x9d and xe2x80x9cdatabase,xe2x80x9d and is processed by sequentially scanning and merging the posting lists of xe2x80x9cmultimediaxe2x80x9d and xe2x80x9cdatabase.xe2x80x9d
In document databases where document addition or modification is frequent, inverted index is dynamically updated on adding or modifying documents. On updating the inverted index, new postings of the added document should be inserted into the posting list in the order of docIDs. Thus, inverted index needs a storage structure to insert new postings efficiently into the posting list in the order of docIDs.
Document deletion is frequent because modifying a document requires deleting the old document before adding the modified document. If a document is deleted from a document database, all the postings of the deleted document should also be removed from the inverted index. Thus, the inverted index needs the following two storage structures: one for finding all the posting lists where the postings of the deleted document are included; the other for searching the posting list to find the posting of the deleted document. Because many keywords having long posting lists are contained in most documents, searching a long posting list or a specific posting is an important part of the performance in document deletion.
The simplest method for storing an inverted index is to store a table of records consisting of a keyword and a posting in a database. This method, however, is known to have low query performance and to require excessive storage space due to redundancy of keywords [Stonebraker, M., xe2x80x9cDocument Processing in a Relational Database System,xe2x80x9d ACM Trans. on Office Information Systems, Vol. 1, No. 1, pp. 143-158, 1983; DeFazio, S., Daoud, A., Smith, L. A., and Srinivasan, J., xe2x80x9cIntegrating IR and RDBMS Using Cooperative Indexing,xe2x80x9d In Proc. Int""l Conf. on Information Retrieval, ACM SIGIR, pp. 84-92, Seattle, 1995.].
Therefore, a lot of studies have been done on the method using tree structures instead of database tables for storing inverted indexes. FIG. 1 shows such a conventional inverted index storage structure. The reference numeral 10 shows a B+-tree having the keywords as the key. A pointer to a posting list is stored at the pointer field of the index entry in the leaf node of the tree. The reference numeral 11 shows the storage space for the posting list. Samuel DeFazio et al. [DeFazio, S., Daoud, A., Smith, L. A., and Srinivasan, J., xe2x80x9cIntegrating IR and RDBMS Using Cooperative Indexing,xe2x80x9d In Proc. Int""l Conf. on Information Retrieval, ACM SIGIR, pp. 84-92, Seattle, 1995.] argued through experiments that this storage structure out performs the database table method in query performance and storage space. Doug Cutting [Cutting, D. and Pedersen, J., xe2x80x9cOptimizations for Dynamic Inverted Index Maintenance,xe2x80x9d In Proc. Int""l Conf. on Information Retrieval, ACM SIGIR, pp. 405-411, Brussels, 1990.], Christos Faloutsos [Faloutsos, C. and Jagadish, H. V., xe2x80x9cOn B-tree Indices for Skewed Distribution,xe2x80x9d In Proc. Int""l Conf. on Very Large Data Bases, pp. 363-374, Vancouver, 1992.], and Anthony Tomasic [Tomasic, A., Garcia-Molina, H., and Shoens, K., xe2x80x9cIncremental Updates of Inverted Lists for Text Document Retrieval,xe2x80x9d In Proc. Int""l Conf. on Management of Data, ACM SIGMOD, pp. 289-300, Minneapolis, 1994.] studied the problem of storage space allocation for the reference numeral 11 when the posting list continuously increases due to document addition.
The conventional storage structures for the inverted index, however, have the following drawbacks: 1) They do not consider maintaining the docID-order in the posting list when inserting a new posting into the posting list upon document addition. Thus, boolean expressions involving more than one keyword are not evaluated efficiently. 2) They consider only adding but not removing the postings from the posting lists. Thus, document deletion is not properly handled. 3) They do not provide fast access to the posting for a specific document in a posting list. They simply assume sequential search, which requires reading half of the posting list on the average.
The inverted index is an index used for information retrieval (IR), i.e., for retrieving documents containing the keywords given by the user. The inverted index maintains a list of postings for each keyword extracted from the documents. Insertion, deletion, and modification of documents entail dynamic update of the inverted index. Thus, the inverted index needs a storage structure for effectively managing the posting lists, which is of variable size and is potentially long. We also need an inverted index structure to support integrated database and information retrieval queries by processing the IR part of the query using the result of the database part of the query and vice versa in a database management system (DBMS) tightly coupled with information retrieval. In this patent, we solve these problems by inventing an inverted index structure using the subindex and the large object management scheme.
First, we store a long posting list in a large object and manage storage space using the large object management scheme. Thus, when a new posting is inserted in a long posting list or when an existing posting is deleted, there is no physical movement of other postings, which results in effective extension and shrinkage of the long posting list and thus in enhanced update performance.
Second, we construct the subindex on document identifiers (docID) for long posting lists. The subindex is used for finding fast the locations of postings that have a given docID in a long posting list. Thus, we can enhance the performance when storing new postings into the posting list in the order of docID""s, or when deleting postings of the deleted documents from the posting list. The performance of inserting and deleting postings in/from a long posting list is of importance since most documents contain keywords having long posting lists. Moreover, if postings are stored in a posting list in the order of docID""s, Boolean queries consisting of more than one keyword can be efficiently processed. Last, we store the database object ID together with docID in each posting. An object ID points to the database object that has attribute data of the document. Then, integrated processing is possible of the IR part of the query and of the database part of the query using the docId and the object ID. That is, when we process the IR part of the query first, we can process the database part of the query by accessing the desired database object through the object ID stored in the posting. Likewise, when we process the database part of the query first, we can process the IR part of the query by looking for the desired posting through the subindexxe2x80x94using the docID stored in the database objectxe2x80x94in the posting list of the given keyword. The IR part of the query in the latter checks whether a designated document (with docID) has the given keyword.