Fast and precise text search engines are widely used in web and desktop applications. Emerging hand-held devices such as the Palm Pilot™ possess enough storage capacity to allow entire moderate-size document collections to be stored on the device for quick reference and browsing purposes. Equipping these devices with advanced index-based search capabilities is desired, but storage on hand-held devices is still rather limited.
Most advanced information retrieval (IR) applications create an inverted index in order to support high quality search services over a given collection of documents. An example of such a system is the Guru search engine, which is described by Maarek and Smadja in “Full text indexing based on lexical relations, an application: Software libraries”, Proceedings of the Twelfth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-206 (1989), which is incorporated herein by reference. Each document in the document collection is analyzed and represented by a vector profile of indexing units, or terms, based on the content of the document. A term might be a word, a pair of closely related words (lexical affinities), or a phrase. Each term in a document is stored in an index with an associated posting list.
A posting list comprises postings, wherein each posting comprises an identifier of a document containing the term, a score for the term in that document, and possibly some additional information about the occurrences of the term in the document, such as the number of occurrences and offsets of the occurrences. A typical scoring model used in many information retrieval systems is the tf-idf formula, described by Salton and McGill in An Introduction to Modern Information Retrieval, (McGraw-Hill, 1983), which is incorporated herein by reference. The score of term t for document d depends on the term frequency of t in d (tf), the length of document d, and the inverse of the number of documents containing t in the collection (idf).
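The posting-list structure described above can be illustrated with a minimal sketch. The names `Posting` and `build_index` are hypothetical and are not taken from any cited system. The score stored here is simply the raw term frequency, used as a placeholder for a tf-idf style score, and terms are assumed to be whitespace-separated lowercase words.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    doc_id: int                                   # identifier of the document containing the term
    score: float                                  # score of the term in that document
    offsets: list = field(default_factory=list)   # word positions of the occurrences


def build_index(docs):
    """Map each term to its posting list; docs is a {doc_id: text} mapping."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        positions = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            positions[word].append(pos)
        for term, offs in positions.items():
            # Placeholder score (assumption): raw term frequency in the document.
            index[term].append(Posting(doc_id, float(len(offs)), offs))
    return index


docs = {1: "index search index", 2: "search engine"}
index = build_index(docs)
```

In this sketch the number of occurrences is recoverable as `len(posting.offsets)`, so it is not stored separately.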
An exemplary tf-idf formula is described by Chris Buckley, et al., in “New retrieval approaches using SMART: TREC 4,” Proceedings of the Fourth Text Retrieval Conference (TREC-4), pp. 25-48 (Gaithersburg, Md., November 1995), which is incorporated herein by reference. This formula provides that the score A(t, d) of document d for term t is
$$A(t, d) = \frac{\log(1 + \mathit{tf})}{\log(1 + \mathit{avgtf})} \cdot \frac{\log(N / N_t)}{|d|}$$

Here avgtf is the average term frequency in document d, N is the number of documents in the collection, N_t is the number of documents containing term t, and |d| is the length of document d. Usually, |d| is approximated by the square root of the number of (unique) terms in d.
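By way of illustration, the formula can be evaluated as in the following sketch. The function name `tfidf_score` and its parameter names are hypothetical. The natural logarithm is assumed, since the base is not specified above, and |d| is taken as the square root of the number of unique terms, per the stated approximation.

```python
import math


def tfidf_score(tf, avg_tf, n_docs, n_docs_with_term, n_unique_terms):
    """Score A(t, d) of document d for term t (sketch; natural log assumed)."""
    tf_part = math.log(1 + tf) / math.log(1 + avg_tf)   # normalized term frequency
    idf_part = math.log(n_docs / n_docs_with_term)      # inverse document frequency
    doc_len = math.sqrt(n_unique_terms)                 # approximation of |d|
    return tf_part * idf_part / doc_len


# A term occurring once in a document of 4 unique terms whose average
# term frequency is 1, in a collection of 100 documents of which 10
# contain the term.
score = tfidf_score(tf=1, avg_tf=1, n_docs=100, n_docs_with_term=10, n_unique_terms=4)
```

Under these assumptions, a term that occurs in every document of the collection receives a score of zero, because log(N / N_t) vanishes.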
At search time, terms are extracted from a user's query, and their respective posting lists are retrieved from the inverted index. The document posting scores are accumulated to form document scores by summing the scores of postings pertaining to the same document. At the end of this process, the documents are sorted by their scores, and the documents with the top scores are returned.
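The search-time accumulation described above can be sketched as follows. The `search` function and the toy index are illustrative assumptions: the index is modeled as a plain mapping from each term to a list of (document identifier, score) pairs, and query terms are extracted by simple whitespace splitting.

```python
from collections import defaultdict


def search(index, query, top_k=10):
    """Sum posting scores per document and return the top_k (doc_id, score) pairs."""
    accumulators = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, []):
            accumulators[doc_id] += score
    # Sort by descending score; ties are broken by ascending document identifier.
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


index = {
    "index": [(1, 0.8), (3, 0.3)],
    "pruning": [(1, 0.5), (2, 0.9)],
}
results = search(index, "index pruning", top_k=2)
```

Here document 1 appears in both posting lists, so its two posting scores are summed and it ranks ahead of document 2, which matches only one query term.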
Indexing a large document collection results in huge index files that are hard to maintain. There has been a large amount of work in the field of index compression, resulting in smaller index files. Two complementary approaches exist in the art. One approach is compression at the data structure level, that is, retaining all index data while trying to obtain a more compact representation of posting lists. The other approach is pruning the index by deleting or combining terms, such as stop-word omission, and Latent Semantic Indexing (LSI). The primary goal of this sort of index pruning is to reduce “noise” in the indexing system by removing from the index terms that tend to reduce search precision, but its practical effect of reducing index size is very relevant to the subject of index compression.
In stop-word omission, language statistics are used to find words that occur so frequently in the language that they will inevitably occur in most documents. Words that are very frequent in the language (stop-words) are ignored when building an inverted index. Words such as “the” and “is” do not contribute to the retrieval task. The TREC collection, as presented in “Overview of the Seventh Text Retrieval Conference (TREC-7),” Proceedings of the Seventh Text Retrieval Conference (TREC-7) (National Institute of Standards and Technology, 1999), which is incorporated herein by reference, enumerates the frequency of words in common text documents. Ignoring the set of the 135 most frequent words in the TREC collection was found to remove about 25% of the postings (Witten, et al., Managing Gigabytes, Morgan Kaufmann Publishers, San Francisco, Calif., 1999, which is incorporated herein by reference).
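Stop-word omission at indexing time can be sketched as a simple filter applied before terms are added to the index. The stop-word set below is a small illustrative sample, not the 135-word TREC list, and the function name `index_terms` is hypothetical.

```python
# Illustrative sample only (assumption); a real list would be derived from language statistics.
STOP_WORDS = frozenset({"the", "is", "a", "of", "and"})


def index_terms(text, stop_words=STOP_WORDS):
    """Return the terms of text that would be indexed, skipping stop-words."""
    return [word for word in text.lower().split() if word not in stop_words]


terms = index_terms("The index is a map of terms")
```

Because the filter runs before indexing, a stop-word never receives a posting list at all, which is what makes this a term-level pruning method.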
Latent Semantic Indexing (LSI) is described, for example, by Deerwester, et al., in “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, Vol. 41, No. 1, (1990), pp. 391-407, which is incorporated herein by reference. LSI represents the inverted index as a product of three matrices, using a statistical technique called “singular-value decomposition” (SVD). This representation enables the number of terms in the index to be reduced by keeping the most significant terms while removing all others. Both LSI and stop-word omission operate at the granularity of terms. In other words, they only enable pruning entire terms from the index, so that once pruned, the term no longer appears in the index at all. When a term is pruned, its entire posting list is removed from the index.
Dynamic pruning techniques decide during the document ranking process, after the index has already been created, whether certain terms or document postings are worth adding to the accumulated document scores, and whether the ranking process should continue or stop. An exemplary technique of this sort is described by Persin, “Document Filtering for Fast Ranking”, Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, July 1994, Special Issue of the SIGIR Forum), pages 339-348, which is incorporated herein by reference. The dynamic techniques are applied for a given query, thus reducing query time. Dynamic techniques have no effect on the index size, since they are applied to an already-stored index.
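The general idea of dynamic pruning can be illustrated by the following simplified sketch. It is not the method of the cited reference: the thresholding rule, the parameter `add_fraction`, and the function name are assumptions chosen for illustration. Query terms are processed from the shortest posting list to the longest, and a posting is skipped when its score is small relative to the best accumulated document score so far.

```python
from collections import defaultdict


def search_with_dynamic_pruning(index, query_terms, add_fraction=0.1):
    """Rank documents while skipping low-impact postings (illustrative rule only)."""
    accumulators = defaultdict(float)
    best = 0.0       # highest accumulated document score seen so far
    skipped = 0      # number of postings ignored during ranking
    # Rarer terms (shorter posting lists) are processed first.
    for term in sorted(query_terms, key=lambda t: len(index.get(t, []))):
        for doc_id, score in index.get(term, []):
            if score < add_fraction * best:
                skipped += 1
                continue
            accumulators[doc_id] += score
            best = max(best, accumulators[doc_id])
    ranked = sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))
    return ranked, skipped


index = {
    "rare": [(1, 2.0)],
    "common": [(1, 0.5), (2, 0.1), (3, 0.3)],
}
ranked, skipped = search_with_dynamic_pruning(index, ["common", "rare"])
```

In this example one posting of the term "common" is skipped at query time, yet the stored index still holds all four postings, consistent with the observation that dynamic techniques do not reduce index size.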