1. Field of the Invention
The present invention relates to search engines for handling word and phrase queries over a set of documents.
2. Description of Related Art
Search engines routinely encounter the problem of handling very frequent words, referred to as stopwords. Stopwords like “the”, “of”, “and”, “a”, “is”, “in”, etc., occur so frequently in the corpus of documents covered by a search index that reading and decoding their index entries at query time becomes a very time-consuming operation. Most search engines therefore drop these words during a keyword query, hence the name “stopwords.” However, for a search engine to support phrase queries, these stopwords must be evaluated. As an example, consider a phrase query like “University of Georgia”. This query must return the documents that contain all three words in the same order. Therefore, the search engine must deal with the stopword “of”.
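By way of illustration only, the following minimal Python sketch shows why a phrase query cannot simply drop a stopword. It assumes a simple in-memory positional index (word to document to list of word positions); the function names `build_positional_index` and `phrase_query` and the sample documents are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each lowercased word to {doc_id: [positions]} (hypothetical layout)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return ids of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    # Candidate documents must contain every word, stopwords included.
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = set()
    for doc_id in candidates:
        # A match starts where word k of the phrase sits at position start + k.
        for start in index[words[0]][doc_id]:
            if all(start + k in index[w][doc_id] for k, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

# Sample documents (illustrative only).
docs = {
    1: "University of Georgia campus",
    2: "Georgia University of music",
    3: "the University of Georgia",
}
index = build_positional_index(docs)
result = phrase_query(index, "University of Georgia")
```

In this sketch, document 2 contains all three words but not in the queried order, so only the positions recorded for “of” allow it to be excluded. The postings for “of” must therefore be read in full, which is the cost discussed in this section.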
In a survey of web server search logs, it has been found that 20% of all phrase queries contain a frequently occurring word like “the”, “to”, “of” etc. Thus, solving this issue of phrase query performance is paramount to any search engine.
Performance of such phrase queries presents serious challenges because stopwords occupy a significant percentage of the search index data on disk. This taxes system performance in three ways:
    Disk performance on large disk reads from the indexes becomes a serious bottleneck.
    System processor performance in decompressing this data fetched from the indexes gets impacted.
    System memory usage is also increased.
Different methodologies can be used to speed up phrase queries. One method is to use specialized indexes called skiplists that allow selective access to the index postings. This method has the unfortunate side effect of further increasing both the index size and the complexity of the indexing engine.
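As a hedged illustration of the skiplist idea, the sketch below intersects two sorted postings lists while following skip pointers to jump over runs of postings that cannot match. The square-root spacing of the pointers and the names `build_skips` and `intersect_with_skips` are assumptions made for this example; actual skiplist layouts differ between engines.

```python
import math

def build_skips(postings):
    """Return {index: target_index} skip pointers spaced about sqrt(n) apart.

    The spacing is an assumption for this sketch, not a requirement.
    """
    n = len(postings)
    step = max(1, int(math.sqrt(n)))
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(a, b):
    """Intersect two sorted lists of document ids using skip pointers."""
    skips_a, skips_b = build_skips(a), build_skips(b)
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip pointer only if it does not overshoot b[j].
            if i in skips_a and a[skips_a[i]] <= b[j]:
                i = skips_a[i]
            else:
                i += 1
        else:
            if j in skips_b and b[skips_b[j]] <= a[i]:
                j = skips_b[j]
            else:
                j += 1
    return result

# A long postings list (as for a stopword) against a short one (illustrative).
frequent = list(range(100))
rare = [3, 50, 99, 150]
common = intersect_with_skips(frequent, rare)
```

The skip pointers are extra data stored alongside the postings, which is consistent with the drawback noted above: the index grows, and the indexing engine must build and maintain the pointers.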
Another technique that can be used is called “next word indexing”. In this technique, each stopword is coalesced with the word that follows it, and the combined word is stored as a separate word in the index. For instance, for the sentence fragment “The Guns of Navarone” in a document, coalescing the stopwords with their subsequent words creates the new words “TheGuns” and “ofNavarone”. These words are stored separately in the index. For a phrase query “The Guns of Navarone”, the search engine converts the four-word query into a two-word phrase query “TheGuns ofNavarone”. The speedup is enormous here because the number of postings for the words “TheGuns” and “ofNavarone” will be quite small compared to that for the words “The” and “of”.
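The coalescing step described above can be sketched as follows. This is a minimal example under stated assumptions: the stopword set `STOPWORDS` and the function name `coalesce` are hypothetical, tokens are split on whitespace, and a stopword is simply concatenated with the token that follows it, as in the “TheGuns” example.

```python
# Hypothetical stopword list for this sketch; a real engine would use its own.
STOPWORDS = {"the", "of", "and", "a", "is", "in", "to"}

def coalesce(tokens, stopwords=STOPWORDS):
    """Merge each stopword with the token that follows it into one word."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower() in stopwords and i + 1 < len(tokens):
            # Stopword plus next word becomes a single index term.
            out.append(tok + tokens[i + 1])
            i += 2
        else:
            # Ordinary word, or a trailing stopword with nothing after it.
            out.append(tok)
            i += 1
    return out

# The same transformation is applied at indexing time and at query time.
query_terms = coalesce("The Guns of Navarone".split())
# query_terms == ["TheGuns", "ofNavarone"]
```

Because documents and queries pass through the same function, the four-word phrase is looked up as two rare terms instead of two stopwords and two ordinary words. Each distinct stopword and next-word pair becomes a new entry in the vocabulary.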
A mechanism of “next-word” indexes (also referred to as combined indexes) is described in Hugh E. Williams, Justin Zobel, Dirk Bahle, “Fast Phrase Querying with Combined Indexes,” Search Engine Group, School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia. 1999.
This next-word indexing technique, though very interesting, is not preferable because it can increase the number of unique words in the search engine by more than a few million entries. This creates slowdowns both in indexing and querying.
It is desirable to provide systems and methods for speeding up the indexing and querying processes for search engines, and to otherwise make more efficient use of processor resources during indexing and querying large corpora of documents.