The present invention relates generally to information retrieval or filtering systems and, more particularly, to methods for dynamically indexing words contained in a set of documents in an information retrieval or filtering system.
Information retrieval or filtering systems generally employ an index file that indexes information stored in a database. The index file is used to locate information in the database. The index file contains reference information for respective words, where for each word, the reference information points to occurrences of the word in documents stored in the database. The reference information for a word is also referred to as the "postings" of the word.
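As an illustration of the terminology only, the relationship between words and postings can be sketched as follows. This is a minimal sketch, not the index file format of the invention: the function name `build_index`, the whitespace tokenization, and the choice of (document id, word position) pairs as postings are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to its postings.

    Assumption for illustration: a posting is a (doc_id, position) pair,
    and words are obtained by lowercasing and splitting on whitespace.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical two-document database.
docs = {1: "the cat sat", 2: "the dog sat down"}
index = build_index(docs)
sat_postings = index["sat"]  # every occurrence of "sat" in the database
```

Looking up a word in `index` returns the list of places where that word occurs, which is the role the postings play in the index file described above.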
Most indexing techniques are "static" because indexing is performed in two phases. In the first phase, input files are read to build temporary internal files. In the second phase, the temporary internal files are optimized to prepare for retrieval. The indices are static once the optimization is complete: new documents cannot be added without rebuilding the whole index, and retrieval queries cannot be answered until the second phase has been performed.
Dynamic indexing techniques have been introduced to overcome the limitations of static indexing techniques. Indexes are accumulated in an index file that can be consulted for each retrieval query without first optimizing an internal file. In the conventional dynamic indexing technique, the index file is organized into a set of fixed-length blocks in which postings for words are stored. Each block packs the postings for several words together, leaving a varying amount of free space. An address record table stores the block number for each posting, and a free block list stores information about blocks containing a sufficient amount of free space (see "Managing Gigabytes: Compressing and Indexing Documents and Images," by I. Witten, A. Moffat and T. Bell).
In such conventional indexing techniques, updating the index file takes a long time as new documents are added to the database and the collection of indexes grows larger. In addition, the amount of free space needed each time the postings for words are updated is unpredictable.
The present invention provides information retrieval or filtering systems for dynamically indexing a set of documents in a database. In particular, the present invention provides methods for indexing a set of documents in a single phase. Information retrieval or filtering systems of the present invention are able to respond to retrieval queries without generating and optimizing internal files.
A single phase indexing technique of the present invention enables the database to be queried at all times. The present invention provides information retrieval or filtering systems that respond to retrieval queries while in the process of indexing. In order to allow retrieval at all times, the present invention stores postings for a word sequentially in memory so that the postings can be retrieved from the memory with a minimum number of input/output (I/O) operations.
The present invention allows information retrieval or filtering systems to incrementally store and update the postings for a word while keeping the postings for each word sequential in memory. When a new document containing many words is inserted into a database, the postings for all of these words are expanded by "multipoint insertion" rather than by a simple append operation.
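The following sketch illustrates why inserting one document is a multipoint insertion: a single new document extends the postings of every distinct word it contains, not one list at the end of the file. The function name `add_document`, the in-memory dictionary, and the use of bare document ids as postings are assumptions made for the example; the storage layout of the invention is not modeled here.

```python
def add_document(index, doc_id, text):
    """Extend the postings of every distinct word in the new document.

    Returns the sorted list of words whose postings were expanded, to
    show how many insertion points one document produces.
    """
    touched = sorted(set(text.lower().split()))
    for word in touched:
        index.setdefault(word, []).append(doc_id)
    return touched


index = {}
add_document(index, 1, "a b a")
touched = add_document(index, 2, "b c")  # expands two separate posting lists
```

Because each of these posting lists is to remain sequential in memory, each expansion may require more room than the list currently occupies, which motivates the block allocation aspects described below.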
In accordance with one aspect of the present invention, a method for allocating blocks of an index file to the postings for words found in documents of a database is provided. The index file is provided with an initial block of a predetermined size, and the block is partitioned into blocks of successively decreasing size. A block is divided into n blocks of the next level. The blocks in each level have the same size. The sum of the sizes of the blocks in each level equals the size of the initial block. An information retrieval interface allocates to the postings for a first word a free block in the level that most closely matches the size of the postings for the first word. The free block is large enough to hold the postings for the first word in the index file.
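The level structure described in this aspect can be sketched numerically. The sketch assumes that level 0 is the initial block, that the initial size is divisible by n at every level, and that "closest matching level" means the deepest level whose blocks can still hold the postings; the function names and the `max_level` bound are hypothetical and do not come from the specification.

```python
def block_size(initial_size, n, level):
    """Size of one block at the given level (level 0 is the initial block)."""
    return initial_size // n ** level


def closest_level(initial_size, n, postings_size, max_level):
    """Deepest level whose blocks are still large enough for the postings."""
    if postings_size > initial_size:
        raise ValueError("postings do not fit in the initial block")
    level = 0
    while (level < max_level
           and block_size(initial_size, n, level + 1) >= postings_size):
        level += 1
    return level


# Hypothetical parameters: a 1024-unit initial block split in two per level.
INITIAL, N, MAX_LEVEL = 1024, 2, 6
sizes = [block_size(INITIAL, N, k) for k in range(MAX_LEVEL + 1)]
level = closest_level(INITIAL, N, 100, MAX_LEVEL)
```

With these parameters, postings of size 100 map to level 3, whose blocks hold 128 units, since the next level down holds only 64. At every level k there are N ** k blocks of size `sizes[k]`, so the level totals always equal the initial block size.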
In accordance with another aspect of the present invention, a method for updating postings for words in an index file is provided. An information retrieval interface allocates blocks of the index file to the postings for words contained in the index file. The blocks are partitioned into levels of blocks of successively decreasing size. The information retrieval interface updates postings for a word in a first block of the index file. The updated postings contain additional postings for the word in documents added to the database. The information retrieval interface searches a free block list for a second block that is free and can accommodate the updated postings for the word. The free block list contains information about whether or not a block is free. The information retrieval interface moves the postings for the word from the first block to the second block.
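The update-and-move step can be sketched with a simplified model. Several details here are assumptions rather than statements from the specification: the class and method names are hypothetical, block capacity is counted in postings, the postings stay in the first block when they still fit, the smallest adequate free block is chosen, and the first block is returned to the free block list after the move.

```python
class PostingStore:
    """Toy index file: a list of blocks plus a free block list."""

    def __init__(self, block_sizes):
        self.block_sizes = list(block_sizes)            # capacity per block
        self.is_free = [True] * len(self.block_sizes)   # the free block list
        self.contents = [[] for _ in self.block_sizes]  # postings per block
        self.block_of = {}                              # word -> block number

    def _find_free(self, needed):
        """Smallest free block that can hold `needed` postings."""
        candidates = [b for b, free in enumerate(self.is_free)
                      if free and self.block_sizes[b] >= needed]
        if not candidates:
            raise MemoryError("no free block large enough")
        return min(candidates, key=lambda b: self.block_sizes[b])

    def update(self, word, additional):
        """Add postings for `word`, moving its block if it outgrows it."""
        first = self.block_of.get(word)
        old = self.contents[first] if first is not None else []
        updated = old + list(additional)
        if first is not None and len(updated) <= self.block_sizes[first]:
            self.contents[first] = updated   # assumption: stay put if it fits
            return first
        second = self._find_free(len(updated))
        self.contents[second] = updated      # postings remain contiguous
        self.is_free[second] = False
        self.block_of[word] = second
        if first is not None:
            self.contents[first] = []        # assumption: release old block
            self.is_free[first] = True
        return second


store = PostingStore([2, 2, 4, 8])
first_block = store.update("cat", [1, 2])  # fits in a 2-posting block
second_block = store.update("cat", [5])    # outgrows it and is moved
```

In this run the postings for "cat" start in block 0 and are moved to block 2 once a third posting arrives, so they are always stored in one contiguous block.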
In accordance with a further aspect of the present invention, a method for allocating an index file containing postings for words found in documents of a database is provided. The index file is provided with blocks that are partitioned into levels of blocks of successively decreasing size. The blocks in each level have the same size. The size of the postings for a word in the index file is calculated to determine the level in the block structure that most closely matches the size of the postings. A free block list containing information about the free blocks of that level is searched for a free block that can hold the postings for the word. The free block in the level is allocated to the postings for the word.
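One way to picture a per-level free block list is sketched below. The specification does not state here how overlapping blocks at different levels are tracked, so the bookkeeping in `_overlapping`, which marks the ancestors and descendants of an allocated block as no longer free, is an assumption of this sketch, as are the class name, the method names, and the policy of taking the lowest-numbered free block.

```python
class LeveledIndexFile:
    """Toy block structure: level k holds n**k equally sized blocks."""

    def __init__(self, initial_size, n, max_level):
        self.initial_size = initial_size
        self.n = n
        self.max_level = max_level
        # One free block list per level, holding free block numbers.
        self.free = [set(range(n ** k)) for k in range(max_level + 1)]

    def block_size(self, level):
        return self.initial_size // self.n ** level

    def level_for(self, postings_size):
        """Deepest level whose blocks can hold the postings."""
        if postings_size > self.initial_size:
            raise ValueError("postings do not fit in the initial block")
        level = 0
        while (level < self.max_level
               and self.block_size(level + 1) >= postings_size):
            level += 1
        return level

    def _overlapping(self, level, index):
        """The block itself plus every block that shares space with it."""
        yield level, index
        lv, ix = level, index
        while lv > 0:                        # ancestors
            lv, ix = lv - 1, ix // self.n
            yield lv, ix
        lo, hi = index, index + 1
        for lv in range(level + 1, self.max_level + 1):  # descendants
            lo, hi = lo * self.n, hi * self.n
            for ix in range(lo, hi):
                yield lv, ix

    def allocate(self, postings_size):
        """Allocate a free block at the closest matching level."""
        level = self.level_for(postings_size)
        if not self.free[level]:
            raise MemoryError("no free block at level %d" % level)
        index = min(self.free[level])
        for lv, ix in self._overlapping(level, index):
            self.free[lv].discard(ix)
        return level, index


index_file = LeveledIndexFile(initial_size=1024, n=2, max_level=3)
a = index_file.allocate(100)  # needs a 128-unit block
b = index_file.allocate(200)  # needs a 256-unit block
c = index_file.allocate(100)  # another 128-unit block
```

The second allocation skips block 0 of the 256-unit level because half of it is already occupied by the first allocation, while the third allocation reuses the remaining half.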
A single phase indexing technique of the present invention makes it possible to construct static databases in multiple batches and develop dynamic systems such as information filtering systems. A single phase indexing technique of the present invention enables information retrieval or filtering systems to respond to retrieval queries while in the process of indexing without reorganizing internal files. The present invention supports a collection of dynamically changing variable-length postings for words.