Modern Web search engines incorporate a variety of numerical Web-page attributes in their search ranking functions in an attempt to bring order to the ever-growing Web. Given the massive repositories that Web search engines must index, with large numbers of concurrent users issuing queries to the system, developing memory-efficient encodings for these numerical attributes, so that they can be cached in main memory, is an increasingly important challenge.
An overview of a scalable keyterm-search system helps make clear why per-document attributes, such as page popularity, are maintained in main memory. As depicted in FIG. 1, a typical Web search system utilizes an inverted text index I and a set of auxiliary ranking vectors {{right arrow over (R)}i}. For concreteness, consider a system with only one such vector, {{right arrow over (R)}p}, containing per-document popularity estimates. In FIG. 1, {{right arrow over (R)}p} is a single column (r1, r2 or r3) in index 102. The index I contains information about the occurrences of terms in documents and is used to retrieve the set of document IDs for documents satisfying some query Q. The index {{right arrow over (R)}p} is then consulted to retrieve the overall popularity score for each of these candidate documents. Using the information retrieved from I and {{right arrow over (R)}p}, a composite document score is generated for each candidate result, yielding a final ranked listing.
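The two-stage lookup described above can be illustrated with a minimal Python sketch. The toy index contents, the conjunctive (all-terms) matching, and the linear score combination with a `weight` parameter are assumptions made for illustration only; the text does not specify how the composite score is formed.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data. INVERTED_INDEX plays the role of I:
# term -> {document ID: within-document frequency}.
INVERTED_INDEX: Dict[str, Dict[int, int]] = {
    "web": {0: 3, 1: 1, 2: 2},
    "search": {0: 1, 2: 4},
}

# POPULARITY plays the role of the auxiliary ranking vector R_p:
# one popularity estimate per document ID, held in main memory.
POPULARITY: List[float] = [0.2, 0.7, 0.1]


def candidates(query: List[str]) -> List[int]:
    """Return IDs of documents containing every query term (assumed semantics)."""
    postings = [set(INVERTED_INDEX.get(term, {})) for term in query]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


def rank(query: List[str], weight: float = 0.5) -> List[Tuple[int, float]]:
    """Combine a text score from I with popularity from R_p (assumed formula)."""
    scored = []
    for doc_id in candidates(query):
        text_score = sum(INVERTED_INDEX[term][doc_id] for term in query)
        # One in-memory lookup into R_p per candidate document.
        scored.append((doc_id, text_score + weight * POPULARITY[doc_id]))
    # Highest composite score first; ties broken by document ID.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored
```

Calling `rank(["web", "search"])` on this toy data consults I once per query term and R_p once per candidate document, which is the access pattern that motivates keeping R_p in main memory.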
The inverted index I is constructed offline and provides the mapping {t→ƒdt}, where ƒdt describes the occurrence of term t in document d. In the simplest case, ƒdt could be the within-document frequency of t. The number of random accesses to I needed to retrieve the necessary information for answering a query Q exactly equals the number of terms in the query, |Q|. Because queries are typically small, consisting of only a few terms, it is practical to keep the index I on-disk and perform |Q| seeks for answering each query.
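As a rough illustration of the mapping and the access count, the following Python sketch builds a small in-memory inverted index whose entries are within-document frequencies, then counts one access per query term. The whitespace tokenization and the function names are hypothetical, and a real system would keep I on disk rather than in a dictionary.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Dict[int, int]]:
    """Map each term t to {d: f_dt}, with f_dt the within-document frequency."""
    index: Dict[str, Dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive lowercase/whitespace tokenization (an assumption of this sketch).
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)


def fetch_postings(
    index: Dict[str, Dict[int, int]], query: List[str]
) -> Tuple[Dict[str, Dict[int, int]], int]:
    """Fetch one posting list per query term and count the accesses made."""
    postings = {}
    accesses = 0
    for term in query:
        postings[term] = index.get(term, {})
        accesses += 1  # one random access (seek) per query term
    return postings, accesses
```

For a two-term query the access count returned is 2, regardless of how many documents the posting lists contain.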
The auxiliary index {right arrow over (R)}p is also constructed offline, and provides the mapping {d→rd}, where rd is the popularity of document d according to some computed notion of popularity. Note that in contrast to I, the index {right arrow over (R)}p provides per-document information. In some but not all cases, the search system accesses {right arrow over (R)}p once for each candidate document of the result set, which could potentially be very large. These random accesses would be prohibitively expensive, unless {right arrow over (R)}p can be kept entirely in main memory. Whereas the query length is the upper bound for the accesses to I, the number of candidate results retrieved from I is the upper bound for accesses to {right arrow over (R)}p. One way to reduce the number of random accesses required is to store the attribute values of {right arrow over (R)}p in I instead; e.g., create an index I′ that provides the mapping {t→{ƒdt,rd}}. However, this requires replicating the value rd once for each distinct term that appears in d, generally an unacceptable overhead, especially if more than one numeric property is used.
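The replication overhead of the combined index I′ can be made concrete with a small count. The sketch below is illustrative only: the function name and the toy documents are hypothetical, and it counts stored attribute values rather than measuring any real index.

```python
from typing import Dict, List, Tuple


def stored_attribute_values(docs: Dict[int, List[str]]) -> Tuple[int, int]:
    """Count copies of r_d stored under each layout.

    Returns (separate, embedded):
      separate: one value per document, as in the auxiliary vector R_p.
      embedded: one value per (distinct term, document) pair, as in I'.
    """
    separate = len(docs)
    embedded = sum(len(set(terms)) for terms in docs.values())
    return separate, embedded


# Hypothetical three-document collection, already tokenized.
TOY_DOCS = {
    0: ["web", "search", "web"],
    1: ["web"],
    2: ["search", "engine", "index"],
}
```

On `TOY_DOCS` the separate vector stores 3 values while the embedded layout stores 6. The ratio grows with the number of distinct terms per document, and the embedded count is multiplied again for each additional numeric attribute.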
Much work has been done on compressing I, although comparatively less attention has been paid to effective ways of compressing auxiliary numeric ranking vectors such as {right arrow over (R)}p. The typical keyterm search system has only one such auxiliary ranking vector {right arrow over (R)}l, the document lengths needed in computing the query-document cosine similarity. For more information on the query-document cosine similarity metric, see Witten et al., Managing Gigabytes, Morgan Kaufmann, San Francisco, 1999, which is hereby incorporated by reference in its entirety. This vector can be kept in main memory without much difficulty. However, for more comprehensive ranking schemes, such as PageRank and topic-sensitive PageRank, which require consulting a set of auxiliary ranking vectors, more consideration needs to be given to the encodings used for the attribute values. For more information on such ranking schemes see, for example, Page et al., “The PageRank citation ranking: Bringing order to the web,” Stanford Digital Libraries Working Paper, 1998; Haveliwala, “Topic-sensitive PageRank,” Proceedings of the Eleventh International World Wide Web Conference, 2002; Richardson and Domingos, “The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank,” volume 14. MIT Press, Cambridge, Mass., 2002; Jeh and Widom, “Scaling personalized web search,” Stanford University Technical Report, 2002; Brin and Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” 7th International World Wide Web Conference, Brisbane, Australia; and U.S. Pat. No. 6,285,999, each of which is hereby incorporated by reference in its entirety.
Falling main memory prices have not alleviated the need for efficient encodings. This is because increasingly affordable disk storage is leading to rapidly growing Web-crawl repositories, which in turn is leading to larger sets of documents that need to be indexed. Utilizing a rich set of per-document numeric ranking attributes for growing crawl repositories and growing numbers of users thus continues to require efficient encoding schemes.
In summary, the rapid growth of the Web has led to the development of many techniques for enhancing search rankings by using precomputed numeric document attributes such as the estimated popularity or importance of Web pages. For efficient keyterm-search query processing over large document repositories, it is important that these auxiliary attribute vectors, containing numeric per-document properties, be kept in main memory. When only a small number of attribute vectors are used by the system (e.g., a document-length vector for implementing the cosine ranking scheme), a 4-byte, single-precision floating point representation for the numeric values suffices. However, for richer search rankings, which incorporate additional numeric attributes (e.g., a set of page-importance estimates for each page), it becomes more difficult to maintain all of the auxiliary ranking vectors in main memory.
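The memory pressure can be estimated with simple arithmetic: an uncompressed set of ranking vectors needs (number of documents) × (number of vectors) × 4 bytes. The Python sketch below performs that calculation. The function name and the collection sizes in the usage note are hypothetical figures chosen for illustration, not values taken from any actual system.

```python
import struct

# Size of a single-precision float in bytes (4), computed via the struct module.
BYTES_PER_VALUE = struct.calcsize("<f")


def ranking_vector_bytes(
    num_docs: int, num_vectors: int, bytes_per_value: int = BYTES_PER_VALUE
) -> int:
    """Bytes needed to hold num_vectors uncompressed per-document vectors."""
    return num_docs * num_vectors * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30
```

Under these assumptions, a single vector over one billion documents occupies 4,000,000,000 bytes, while sixteen such vectors occupy 64,000,000,000 bytes. The footprint scales linearly with the number of attribute vectors, which is why richer ranking schemes call for a more compact encoding.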
Accordingly, given the above background, effective systems and methods for compressing precomputed auxiliary ranking vectors would be highly desirable.