1. Field
The described embodiments generally relate to techniques for compressing data.
2. Related Art
The relentless growth of the Internet is making it increasingly hard for search engines to comb through the billions of web pages that are presently accessible. Search engines typically operate by identifying web pages that contain occurrences of specific terms (i.e., words). For example, a search engine might search for all web pages containing the terms “military” and “industrial.” A search engine can also search for web pages containing a specific phrase, such as “flash in the pan.”
Search engines generally use an “inverted index” to facilitate searching for occurrences of terms. An inverted index is a lookup structure that specifies where a given term occurs in the set of documents. For example, an entry for a given term in the inverted index may contain identifiers for documents in which the term occurs, as well as offsets of the occurrences within the documents. This allows documents containing the given term to be rapidly identified.
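By way of illustration only, the lookup structure described above can be sketched in a few lines of Python. The function names (build_inverted_index, docs_containing_all), the whitespace tokenization, and the three sample documents are assumptions made for this sketch; they are not taken from the described embodiments.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to {document identifier: [word offsets of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Hypothetical tokenization: lowercase, split on whitespace.
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index

def docs_containing_all(index, terms):
    """Return identifiers of documents in which every given term occurs."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample corpus, keyed by document identifier.
documents = {
    1: "the military industrial complex",
    2: "industrial design",
    3: "a flash in the pan",
}
index = build_inverted_index(documents)
matches = docs_containing_all(index, ["military", "industrial"])
```

Here each entry holds both the document identifiers and the offsets of the occurrences, so a query for “military” and “industrial” reduces to intersecting two sets of identifiers, and the stored offsets are what a phrase query would compare.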
For example, referring to FIG. 1, an exemplary search engine 112 operates by receiving a query 113 from a user 111 through a web browser 114. This query 113 specifies one or more terms to be searched for in the set of documents. In response to query 113, search engine 112 uses inverted index 110 to identify documents that satisfy the query. Search engine 112 then returns a response 115 through web browser 114, wherein the response 115 contains references to the identified documents.
Documents can also be stored in compressed form in a separate compressed repository 106. This allows documents or portions of documents (snippets) to be easily retrieved by search engine 112 and to be displayed to user 111 through web browser 114.
As is illustrated in FIG. 1, web crawler 104 continually retrieves new documents from web 102. These new documents feed through a compressor 105, which compresses the new documents before they are stored in compressed repository 106. The new documents also feed through indexer 108, which adds terms from the new documents into inverted index 110. The inverted index 110 illustrated in FIG. 1 can be used to efficiently identify specific terms in documents.
Note that compressed repository 106 and inverted index 110 are composed of sequences of integers. Moreover, it is desirable to store these integers in compressed form because the document corpus can potentially contain billions of web pages.
Various techniques can be used to compress these integers, such as compression techniques which use an Elias gamma code (referred to as a “gamma code”). A gamma code is a universal code which can be used to encode positive integers. To encode an integer using a gamma code, the integer is separated into the highest power of two it contains, 2^N, and the remaining N binary digits of the integer. The number N is encoded in unary (for example, using a string of N zeros followed by a one), and this unary string is prepended to the remaining N binary digits. Hence, the number of zeros at the beginning of the encoded number indicates the number of bits which follow the one in the remaining binary number. (For example, see http://en.wikipedia.org/wiki/Elias_gamma_coding.)
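The encoding steps described above can be sketched as follows. This is a minimal illustration that represents code words as Python strings of '0' and '1' characters for readability; the function names gamma_encode and gamma_decode are assumptions of this sketch, and a practical implementation would presumably pack the bits rather than use strings.

```python
def gamma_encode(n):
    """Return the Elias gamma code word for a positive integer n as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]          # binary digits of n, leading digit is always '1'
    N = len(binary) - 1          # exponent of the highest power of two in n
    # N zeros, then the terminating one followed by the remaining N digits.
    return "0" * N + binary

def gamma_decode(bits):
    """Decode the single gamma code word at the start of a bit string."""
    N = 0
    while bits[N] == "0":        # count leading zeros to recover N
        N += 1
    # The one plus the next N digits form the binary representation.
    return int(bits[N:2 * N + 1], 2)

code_for_nine = gamma_encode(9)  # 9 = 2^3 + 1, so N = 3
```

For the integer 9, which is 1001 in binary, N is 3, so the code word is three zeros followed by 1001, giving 0001001.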
Although a gamma code can be efficient in some applications, it uses 2N+1 bits to encode an integer whose binary representation requires only N+1 bits, which is roughly twice the number of bits necessary to represent the number. Hence, the gamma code generally does not compress a sequence of integers as efficiently as possible.
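The overhead can be made concrete with a small calculation. The sketch below compares the length of a gamma code word with the length of the plain binary representation of the same integer; the helper names are hypothetical and are used only for this illustration.

```python
def plain_binary_length(n):
    """Number of bits in the binary representation of a positive integer n."""
    return n.bit_length()                  # N + 1 bits

def gamma_code_length(n):
    """Number of bits in the gamma code word for a positive integer n."""
    N = n.bit_length() - 1                 # exponent of the highest power of two
    return 2 * N + 1                       # N zeros, a one, and N remaining digits

# Ratio of gamma code length to plain binary length for a few sample values.
overhead = {n: gamma_code_length(n) / plain_binary_length(n) for n in (1, 9, 1000)}
```

For the integer 1000, the binary representation occupies 10 bits while the gamma code word occupies 19 bits, so the ratio approaches two as the integers grow. The extra bits carry the length information that a plain binary representation lacks, which is what allows code words to be concatenated without delimiters.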