1.0 Field of the Invention
The present invention relates to a method, system and computer program product implementing the method, for processing text search queries in a collection of documents.
2.0 Description of the Related Art
The purpose of a text search query is typically to find those documents in a collection of documents that fulfill certain criteria, called search conditions, such as those documents which contain certain words. In many cases, the “relevance” of documents fulfilling the given search conditions has to be calculated as well. Most often, users are only interested in seeing the “best” documents that result from a text search query. As the size of document collections to be searched is constantly increasing, the efficiency of text search query processing becomes an ever more important issue.
Text search query processing for a fulltext search is typically based on “inverted indexes”. To generate inverted indexes for a collection of documents, all documents are analyzed to identify the occurring words or search terms as index terms, together with their positions in the documents. In an “inversion step” this information is sorted so that the index term becomes the first-order sort criterion. The result is stored in a full posting index comprising the set of index terms and a full posting list for each index term of the set. The posting list of an index term enumerates all occurrences of the index term in all documents of the collection.
FIG. 1 illustrates an example of a collection of documents 100 and a corresponding full posting index 200. The collection of documents 100 comprises three text documents doc1, doc2 and doc3. For simplicity, FIG. 1 does not show the fulltext of each document but only sequences of index terms a, b, c and d representing the occurrences of the index terms a, b, c and d in the fulltext of the corresponding document.
The index terms a, b, c and d form the set of index terms which the full posting index 200 is based on. It comprises a full posting list for each index term a, b, c and d, enumerating all occurrences of the corresponding index term in all documents doc1, doc2 and doc3 of the collection. In the example described herein, the occurrences of an index term are grouped by document. Typically, the posting lists are coded and compressed for storing.
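The inversion step and the document-grouped posting lists described above can be illustrated with a minimal sketch. The function name `build_full_posting_index` and the sample term sequences are hypothetical and are not the contents of FIG. 1; coding and compression of the posting lists are omitted.

```python
from collections import defaultdict


def build_full_posting_index(docs):
    """Map each index term to {doc_id: [positions]}, i.e. occurrences grouped by document."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            # The index term is the first-order key; the document is the second.
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical term sequences for illustration only (NOT taken from FIG. 1).
docs = {
    "doc1": ["a", "b", "c"],
    "doc2": ["b", "a", "b"],
    "doc3": ["d", "a"],
}
index = build_full_posting_index(docs)
```

In this sketch, `index["b"]` holds the positions of “b” in doc1 and doc2 only, since “b” does not occur in the hypothetical doc3.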
The full posting index 200 can be used to process a query, for example, the query “find all documents containing the phrase ‘a b’”. To do so, the search engine looks up all positions for “a” and all positions for “b”. It then checks whether “a” and “b” occur in the same document and whether “b” occurs in the position immediately after “a”.
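The two checks named above (both terms in the same document, and the second term immediately after the first) can be sketched as follows. This is only an illustration under assumptions: the index layout `{term: {doc_id: [positions]}}`, the function name `phrase_match` and the sample postings are hypothetical.

```python
def phrase_match(index, first, second):
    """Return sorted doc ids in which `second` occurs immediately after `first`."""
    postings_first = index.get(first, {})
    postings_second = index.get(second, {})
    hits = []
    # Condition 1: both terms occur in the same document.
    for doc_id in postings_first.keys() & postings_second.keys():
        second_positions = set(postings_second[doc_id])
        # Condition 2: `second` occupies the position right after `first`.
        if any(pos + 1 in second_positions for pos in postings_first[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical full posting index, uncompressed, for illustration only.
sample_index = {
    "a": {1: [0], 2: [3], 3: [1]},
    "b": {1: [1], 2: [0, 1], 4: [0]},
}
result = phrase_match(sample_index, "a", "b")
```

Here document 2 contains both terms but not adjacently, and documents 3 and 4 each contain only one of them, so only document 1 qualifies.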
One issue associated with full posting indexes is that they tend to become very large. Even when appropriate compression techniques are applied, a full posting index can easily reach about 50 to 100% of the size of the original text document collection that has been indexed.
The overall runtime of text search query processing is dominated by two factors, namely reading the corresponding index posting lists from disk and finding the next possible match in a posting list. For example, if “a” was found in document 2576, the search engine should skip ahead in the posting list of “b” to this document. A match is indicated if document 2576 is mentioned in the posting list of “b”; otherwise, if no such entry exists, no match was found. In some text search query processing techniques, processing this next( ) operation requires decompressing and decoding the posting lists.
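A skip of this kind can be sketched on a document-level posting list. The sketch assumes the list is already decoded into a sorted in-memory list of document numbers, which sidesteps the decompression cost mentioned above; the function name `next_doc` and the sample lists are hypothetical.

```python
from bisect import bisect_left


def next_doc(doc_ids, target):
    """Return the first document number >= target in a sorted list, or None."""
    i = bisect_left(doc_ids, target)
    return doc_ids[i] if i < len(doc_ids) else None


# Hypothetical decoded posting lists for "b" (document numbers only).
b_docs_with_hit = [12, 900, 2576, 4100]
b_docs_without_hit = [12, 900, 3000]

# "a" was found in document 2576, so skip ahead in the list of "b".
match = next_doc(b_docs_with_hit, 2576) == 2576
no_match = next_doc(b_docs_without_hit, 2576) == 2576
```

When the skip lands on a later document, as in the second list, that document becomes the next candidate and no match is reported for document 2576.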
The generation of additional access structures to posting lists, which allow relevant parts of long posting lists to be addressed quickly, is well known. With such structures, a posting list is no longer treated as a purely sequential data stream but as a sequence of indexed blocks. Irrelevant parts of a posting list can thus be skipped by addressing only the blocks that comprise the relevant parts. In some cases, this selective addressing of pre-determined parts of a posting list can enhance query performance dramatically. Consider, for example, the text search query “find all documents containing ‘A’ and ‘B’”, where ‘A’ is a very rare term in the collection of documents and ‘B’ is a frequent term: ‘A’ occurs only in documents 57, 23456 and 123456, while ‘B’ occurs in every 10th document on average. In this case it is advantageous to look up the occurrences of ‘A’ first. By using the additional access structure to the posting list of ‘B’, it is then possible to directly address those blocks of the posting list for ‘B’ that cover a range containing one of the occurrences of ‘A’. This processing saves the I/O and decoding/matching expense for most of the occurrences in the posting list of ‘B’.
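The effect of such an access structure can be sketched for the ‘A’ and ‘B’ example. The sketch makes several assumptions that are not part of the description above: the access structure is simply the last document number of each block, the block size of 128 is arbitrary, ‘B’ is modeled as occurring in exactly every 10th document, and all names are hypothetical.

```python
from bisect import bisect_left

BLOCK_SIZE = 128  # hypothetical block size


def build_blocks(postings, block_size=BLOCK_SIZE):
    """Split sorted doc numbers into blocks; keep each block's last doc number as access structure."""
    blocks = [postings[i:i + block_size] for i in range(0, len(postings), block_size)]
    last_ids = [block[-1] for block in blocks]
    return blocks, last_ids


def intersect_with_skips(rare_docs, blocks, last_ids):
    """Intersect a short list with a blocked list, touching only candidate blocks."""
    hits, touched = [], set()
    for doc_id in rare_docs:
        b = bisect_left(last_ids, doc_id)  # first block whose last doc number >= doc_id
        if b == len(blocks):
            break  # doc_id lies beyond the end of the long list
        touched.add(b)
        block = blocks[b]
        i = bisect_left(block, doc_id)
        if i < len(block) and block[i] == doc_id:
            hits.append(doc_id)
    return hits, len(touched)


a_docs = [57, 23456, 123456]          # occurrences of 'A' as given in the text
b_docs = list(range(6, 200000, 10))   # assumed model: 'B' in every 10th document
blocks, last_ids = build_blocks(b_docs)
hits, touched = intersect_with_skips(a_docs, blocks, last_ids)
```

Under these assumptions only 3 of the 157 blocks of ‘B’ are examined, one per occurrence of ‘A’; the remaining blocks would never need to be read or decoded.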
As exemplified, the use of additional access structures to long posting lists can improve query performance dramatically for queries comprising at least one fairly restrictive condition. However, in cases in which all participating terms have a relatively high frequency, the approach does not help, since virtually every block in every posting list related to the search terms contains a hit candidate and needs to be examined.