1. Field of the Invention
The present invention relates to techniques for storing documents in computer systems. More specifically, the present invention relates to a document compression scheme that supports both searching and efficient decompression of portions of documents.
2. Related Art
The relentless growth of the Internet is making it increasingly difficult for search engines to comb through the billions of web pages that are presently accessible through the Internet. Search engines typically operate by identifying web pages that contain occurrences of specific terms (i.e., words). For example, a search engine might search for all web pages containing the terms “military” and “industrial”. A search engine can also search for web pages containing a specific phrase, such as “flash in the pan”.
Existing search engines generally use an “inverted index” to facilitate searching for occurrences of terms. An inverted index is a lookup structure that specifies where a given term occurs in the set of documents. For example, an entry for a given term in the inverted index may contain identifiers for documents in which the term occurs, as well as offsets of the occurrences within the documents. This allows documents containing the given term to be rapidly identified.
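By way of illustration only, the lookup structure described above can be sketched in Python as follows. The function names `build_inverted_index` and `docs_containing_all`, the whitespace tokenization, and the sample documents are assumptions made for this sketch and are not drawn from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to {document identifier: [word offsets of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        # Assumption: terms are lowercase, whitespace-separated words.
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def docs_containing_all(index, terms):
    """Return identifiers of documents in which every given term occurs."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical sample documents, keyed by document identifier.
documents = {
    1: "the military industrial complex",
    2: "industrial design for military vehicles",
    3: "a flash in the pan",
}
index = build_inverted_index(documents)
```

In this sketch, `docs_containing_all(index, ["military", "industrial"])` identifies documents 1 and 2 by intersecting the entries for the two terms, without scanning the text of any document.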
Referring to FIG. 1, a search engine 112 generally operates by receiving a query 113 from a user 111 through a web browser 114. This query 113 specifies a number of terms to be searched for in the set of documents. In response to query 113, search engine 112 uses inverted index 110 to identify documents that satisfy the query. Search engine 112 then returns a response 115 through web browser 114, wherein the response 115 contains references to the identified documents.
Documents can also be stored in compressed form in a separate compressed repository 106. This allows documents or portions of documents (snippets) to be easily retrieved by search engine 112 and to be displayed to user 111 through web browser 114.
As is illustrated in FIG. 1, web crawler 104 continually retrieves new documents from web 102. These new documents feed through a compressor 105, which compresses the new documents before they are stored in compressed repository 106. The new documents also feed through indexer 108, which adds terms from the new documents into inverted index 110.
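The two paths taken by each new document can be sketched as follows. This is a minimal illustration under stated assumptions: `zlib` stands in for compressor 105 (the description above does not name a compression algorithm), and the names `ingest`, `snippet`, `compressed_repository`, and `inverted_index` are hypothetical.

```python
import zlib

compressed_repository = {}  # document identifier -> compressed bytes
inverted_index = {}         # term -> {document identifier: [word offsets]}


def ingest(doc_id, text):
    """Feed one crawled document through both the compressor and the indexer."""
    # Compressor path: store the whole document in compressed form.
    compressed_repository[doc_id] = zlib.compress(text.encode("utf-8"))
    # Indexer path: add each term occurrence to the inverted index.
    for offset, term in enumerate(text.lower().split()):
        inverted_index.setdefault(term, {}).setdefault(doc_id, []).append(offset)


def snippet(doc_id, start, length):
    """Return `length` words starting at word offset `start`."""
    # With a general-purpose compressor, the whole document is
    # decompressed even though only a few words are needed.
    text = zlib.decompress(compressed_repository[doc_id]).decode("utf-8")
    return " ".join(text.split()[start:start + length])


ingest(1, "a flash in the pan")
```

Here `snippet(1, 1, 2)` yields the words at offsets 1 and 2 of document 1. The same words are held once in the repository and again, term by term, in the index.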
The inverted index 110 illustrated in FIG. 1 can be used to efficiently identify specific terms in documents. However, because the inverted index 110 loses the ordering of the terms, a search that matches a multi-word portion of a document (as in a phrase search) requires the position information (offsets) for each individual term to be retrieved and then aligned to match the ordering required by the query. This process can be time consuming.
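The alignment step for a phrase search can be sketched as follows. The function name `phrase_search` and the small hand-built index are assumptions for illustration; the index uses the layout term -> {document identifier: [word offsets]}.

```python
def phrase_search(index, phrase):
    """Return identifiers of documents containing the terms of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]

    # Step 1: keep only documents that contain every term of the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)

    # Step 2: retrieve the offsets of each term and align them, so that
    # term i of the phrase occurs exactly i words after the first term.
    matches = []
    for doc_id in sorted(candidates):
        later_offsets = [set(posting[doc_id]) for posting in postings[1:]]
        for start in postings[0][doc_id]:
            if all(start + i + 1 in offsets
                   for i, offsets in enumerate(later_offsets)):
                matches.append(doc_id)
                break
    return matches


# Hypothetical index: document 3 is "a flash in the pan" and
# document 4 is "pan in the dark flash".
index = {
    "flash": {3: [1], 4: [4]},
    "in": {3: [2], 4: [1]},
    "the": {3: [3], 4: [2]},
    "pan": {3: [4], 4: [0]},
}
```

Both documents contain all four terms, but `phrase_search(index, "flash in the pan")` matches only document 3, because only there do the offsets line up in the order required by the query. The per-document alignment loop is the additional work that a phrase search imposes on an inverted index.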
Furthermore, storing the documents in both inverted index form and compressed form is wasteful in terms of storage space because these forms largely contain the same information. This redundancy leads to considerable additional storage cost when billions of web pages are stored on the system.
Hence, what is needed is a method and an apparatus for compressing documents in a manner that supports both searching and decompression of portions of documents without the above-described problems.