1. Technical Field
The present invention relates generally to information retrieval and, in particular, to compressing index files in information retrieval.
2. Description of Related Art
The purpose of an information retrieval (IR) system is to search a database of documents to find the documents that satisfy a user's information need, expressed as a query.
Most of the current IR systems convert the original text documents into index files, which are used in the actual search. The index file contains information about terms (e.g., words and phrases) found in the individual documents. In particular, a data structure known as an “inverted index” or an “inverted file” stores for each term a list of documents containing the term, together with the number of occurrences (also interchangeably referred to herein as “counts” and “frequencies”) of the term in each of the documents. Similarly, a direct (or “word-based”) index contains for each document a list of terms with their frequencies in the document. An inverted index is shown in Table 1 and a direct index is shown in Table 2.
TABLE 1
term_1: doc_1, count_1, doc_2, count_2, . . .
term_2: doc_1, count_1, doc_2, count_2, . . .
TABLE 2
doc_1: term_1, count_1, term_2, count_2, . . .
doc_2: term_1, count_1, term_2, count_2, . . .
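The relationship between the two layouts can be illustrated with a minimal Python sketch. The nested-dictionary representation, the toy terms and counts, and the helper name `invert` are assumptions made for illustration only; they are not taken from any particular IR system.

```python
from collections import defaultdict

# Hypothetical toy inverted index (Table 1 layout): term -> {document: count}
inverted_index = {
    "index": {"doc_1": 2, "doc_2": 1},
    "query": {"doc_2": 3},
}

def invert(index):
    """Swap the two key levels of a nested {outer: {inner: count}} mapping."""
    swapped = defaultdict(dict)
    for outer, postings in index.items():
        for inner, count in postings.items():
            swapped[inner][outer] = count
    return dict(swapped)

# Direct index (Table 2 layout): document -> {term: count}
direct_index = invert(inverted_index)
```

Applying `invert` twice returns the original mapping, which reflects that the two index types hold the same term, document, and count triples and differ only in which key comes first.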
FIG. 1 is a flow diagram illustrating a method for generating an index file for information retrieval, according to the prior art. A text file (e.g., one or more documents) is identified (step 110). Next, terms that occur in the text file as well as counts of those terms are ascertained from the text file (step 120). An index file is created that specifies the terms and counts (step 130). Such an approach requires an extensive amount of media space to store the index file. For example, storing the actual number of occurrences of a term in a document for most applications requires eight or sixteen bits of storage per term and document, which provides 256 or 65,536 distinct values and thus allows for the storage of term frequencies up to 255 or 65,535, respectively.
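The steps of FIG. 1 and the resulting storage cost of the count fields can be sketched as follows. This is a simplified illustration, not the prior-art implementation itself: the regular-expression tokenizer, the function names, and the toy documents are all assumptions.

```python
import re
from collections import Counter

def build_inverted_index(documents):
    """documents: {doc_id: text}. Returns {term: {doc_id: count}}."""
    index = {}
    for doc_id, text in documents.items():               # step 110: identify text
        terms = re.findall(r"[a-z0-9]+", text.lower())   # step 120: terms (assumed tokenizer)
        for term, count in Counter(terms).items():       # step 120: counts
            index.setdefault(term, {})[doc_id] = count   # step 130: record term and count
    return index

def count_storage_bits(index, bits_per_count=8):
    """Bits used by the count fields alone, assuming fixed-width counts."""
    postings = sum(len(docs) for docs in index.values())
    return postings * bits_per_count

# Hypothetical two-document collection
docs = {"doc_1": "the index stores the terms", "doc_2": "the query terms"}
index = build_inverted_index(docs)
```

In this toy collection there are seven (term, document) pairs, so the counts alone occupy 56 bits at eight bits per count and 112 bits at sixteen, before any space is spent on the term or document identifiers.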
With respect to relevance scoring in information retrieval, most current information retrieval systems estimate the relevance of a document with respect to a query based on the terms co-occurring in the document and the query. Each such term contributes to the total relevance score by a quantity that depends on the following: (1) the frequencies of the term in the query and the document; and (2) the weight assigned to the term based on the frequency of the term in the corpus, e.g., the word “the” occurs in numerous documents and thus its weight is set lower than the weight of the word “computer”.
The Okapi formula is an example of such a relevance scoring technique. The Okapi formula is described by Robertson et al. in “Okapi at TREC-3”, Proceedings of the Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-226, ed. by D. K. Harman, pp. 109–26, 1995. According to the Okapi formula, terms in the intersection of the query and document contribute to a relevance score as follows:
$$s = \frac{\mathit{tf} \cdot \mathit{qtf} \cdot \mathit{idf}}{c_1 + c_2 \cdot (\mathit{dl}/\mathit{avdl}) + \mathit{tf}} \qquad (1)$$

where tf and qtf are the document and query frequencies for a given term, dl is the document length, avdl is the average length of the documents in the collection, c_1 and c_2 are constants (e.g., c_1 = 0.5, c_2 = 1.5), and idf is the inverse document frequency, computed as:
$$\mathit{idf} = \log\left[\frac{N - n + 0.5}{n + 0.5}\right] \qquad (2)$$

where N is the total number of documents in the collection and n is the number of documents containing the given term.
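Equations (1) and (2) can be written as a short Python sketch. The function names are hypothetical, the default constants are the example values given above, and the use of the natural logarithm is an assumption, since the base of the logarithm is not specified here.

```python
import math

def idf(N, n):
    """Equation (2): inverse document frequency (natural log assumed)."""
    return math.log((N - n + 0.5) / (n + 0.5))

def okapi_term_score(tf, qtf, dl, avdl, N, n, c1=0.5, c2=1.5):
    """Equation (1): contribution of one co-occurring term to the score."""
    return tf * qtf * idf(N, n) / (c1 + c2 * (dl / avdl) + tf)
```

For a document of average length (dl equal to avdl) and the example constants, the denominator reduces to 2 + tf, so the contribution of a term grows with tf but levels off rather than increasing linearly. Note also that equation (2) yields a negative idf when a term appears in more than half of the documents.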
FIG. 2 is a flow diagram illustrating a method for relevance scoring in information retrieval, according to the prior art.
A query having one or more terms is received (step 210). Counts for terms occurring in the query that also occur in one or more documents (i.e., that co-occur in the query and one or more documents) are respectively ascertained for the one or more documents (step 220). Relevance scores for the one or more documents are accumulated based on the counts (step 230). The relevance scores are then sorted (step 240). A list specifying the highest scoring documents is then output (step 250).
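The steps of FIG. 2 can be sketched in Python using the Okapi formula as the scoring function. This is only one possible arrangement under stated assumptions: the function name `rank`, the nested-dictionary index layout, the `doc_lengths` mapping, the `top_k` parameter, and the natural logarithm are all illustrative choices.

```python
import math
from collections import Counter, defaultdict

def rank(query_terms, inverted_index, doc_lengths, top_k=10, c1=0.5, c2=1.5):
    """Return up to top_k (doc_id, score) pairs, highest score first."""
    N = len(doc_lengths)
    avdl = sum(doc_lengths.values()) / N
    scores = defaultdict(float)
    for term, qtf in Counter(query_terms).items():      # step 210: query terms
        postings = inverted_index.get(term, {})         # step 220: look up counts
        n = len(postings)
        if n == 0:
            continue                                    # term occurs in no document
        idf = math.log((N - n + 0.5) / (n + 0.5))
        for doc_id, tf in postings.items():             # step 230: accumulate scores
            dl = doc_lengths[doc_id]
            scores[doc_id] += tf * qtf * idf / (c1 + c2 * (dl / avdl) + tf)
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)  # step 240
    return ranked[:top_k]                               # step 250
```

Only documents that share at least one term with the query receive a score, so the inverted index is read once per distinct query term, and each read requires the stored count tf for every posting.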
Conventional methods of compressing index files focus on reducing the space required to store the document or word identifier components of the index file. Such conventional methods are summarized by: Witten et al., “Managing Gigabytes: Compressing and Indexing Documents and Images”, Van Nostrand Reinhold, ISBN: 0442018630, pp. 82–95, January 1994; and Baeza-Yates et al., “Modern Information Retrieval”, ACM Press, ISBN: 0-201-39829-X, pp. 173–89, May 1999. However, as noted above, the storage of term frequencies nonetheless requires a significant amount of memory that increases the overhead of an IR system.
Accordingly, it would be desirable and highly advantageous to have a method for compressing index files in information retrieval that further reduces the media space required for such storage relative to prior art approaches. For example, such a method should obviate the need to store the actual term frequencies in the index file.