1. Field of the Invention
The present invention relates to search engines for handling contextual queries over a set of documents.
2. Description of Related Art
Search engines often include features that allow a user to find words in specific contexts. For example, words used in a common field (abstract, title, body, etc.) of the documents that make up the corpus being searched are often the subject of queries. Some search engines are set up to search for words used in grammatical contexts, such as subjects or objects in sentences. For documents written in markup languages, such as XML or HTML, search engines can search for words that are parts of tags. Search engines have also been implemented to search for words used as part of an entity name, like the name of a person, place or product.
Also, search engines routinely encounter the problem of handling words that are very frequent independent of context, referred to as stop words. Stop words like “the”, “of”, “and”, “a”, “is”, “in”, etc., occur so frequently in the corpus of documents that is the subject of a search index that reading and decoding them at query time becomes a very time-consuming operation. Most search engines therefore drop these words during a keyword query, hence the name “stop words.” However, for a search engine to support phrase queries, these stop words must be evaluated. As an example, consider a phrase query like “University of Georgia”. This query must return documents that match all three words in the same order. Therefore, the search engine must deal with the stop word “of”.
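The role of the stop word in a phrase query can be illustrated with a minimal sketch of a positional index. The function names `build_positional_index` and `phrase_query`, the whitespace tokenization, and the sample documents are assumptions made for illustration only and are not taken from any particular search engine.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return doc IDs in which the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    # Postings for every query word, including stop words such as "of".
    postings = [index.get(w, {}) for w in words]
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = []
    for doc_id in sorted(candidates):
        # A match needs word k at position start + k for every k.
        for start in postings[0][doc_id]:
            if all(start + k in postings[k][doc_id]
                   for k in range(1, len(words))):
                hits.append(doc_id)
                break
    return hits
```

In this sketch the postings for “of” must be read and checked for every candidate document, which is the cost that grows with the frequency of the stop word.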
In a survey of web server search logs, it has been found that 20% of all phrase queries contain a frequently occurring word like “the”, “to”, “of” etc. Thus, solving this issue of phrase query performance is paramount to any search engine. Likewise, contextual searching occupies a significant proportion of the queries for many types of search engines.
Performance of phrase queries and other contextual searches presents serious challenges, because indexes used for various searchable contexts and for stop words occupy a significant percentage of the search index data on disk. This taxes system performance in three ways:
Disk performance on large disk reads from the indexes becomes a serious bottleneck.
System processor performance in decompressing the data fetched from the indexes is impacted.
System memory usage is also increased.
Different methodologies can be used to speed up phrase queries. One method is to use specialized indexes called skip lists that allow selective access of the index postings. This method has the unfortunate side effect of further increasing both the index size and the complexity of the indexing engine.
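As an illustration of selective access to postings, the following minimal sketch intersects two sorted lists of document IDs using skip pointers. The function name `intersect_with_skips` and the square-root skip spacing are assumptions chosen for illustration, and the skip pointers are simulated by index arithmetic rather than stored, so the sketch does not show the additional index size noted above.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted, duplicate-free doc-ID lists.

    Skip pointers are assumed to sit at every sqrt(len)-th entry and are
    simulated here by index arithmetic instead of stored pointers.
    """
    skip_a = max(1, int(math.sqrt(len(a))))
    skip_b = max(1, int(math.sqrt(len(b))))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Follow the skip only if its target does not overshoot b[j].
            if (i % skip_a == 0 and i + skip_a < len(a)
                    and a[i + skip_a] <= b[j]):
                i += skip_a
            else:
                i += 1
        else:
            if (j % skip_b == 0 and j + skip_b < len(b)
                    and b[j + skip_b] <= a[i]):
                j += skip_b
            else:
                j += 1
    return result
```

When one list is much longer than the other, most entries of the longer list are stepped over without being compared individually.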
Another technique that can be used for stop words is called “next word indexing”. In this technique, words following stop words are coalesced with the stop word into one word and stored as a separate word in the index. For instance, for the sentence fragment “The Guns of Navarone” in a document, making an index entry by coalescing the stop words and their subsequent words creates the new words “TheGuns” and “ofNavarone”. These words are stored separately in the index. For a phrase query “The Guns of Navarone”, the search engine converts the four-word query into a two-word phrase query “TheGuns ofNavarone”. The speedup is enormous here, as the number of postings for the words “TheGuns” and “ofNavarone” will be quite small when compared to that for the words “The” and “of”.
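The coalescing step described above can be sketched as follows. The stop word list `STOP_WORDS` and the function name `coalesce` are assumptions made for illustration. The sketch covers only the rewriting of tokens, applied identically to document text and to query text, and does not address cases such as consecutive stop words or queries that begin in the middle of a coalesced pair.

```python
# Assumed, illustrative stop word list; a real engine would use its own.
STOP_WORDS = {"the", "of", "and", "a", "is", "in"}

def coalesce(tokens):
    """Merge each stop word with the word that immediately follows it."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.lower() in STOP_WORDS and i + 1 < len(tokens):
            # "The" + "Guns" becomes the single index word "TheGuns".
            out.append(tok + tokens[i + 1])
            i += 2
        else:
            out.append(tok)
            i += 1
    return out
```

Applied to the tokens of “The Guns of Navarone”, this produces the two words “TheGuns” and “ofNavarone”, so the phrase query needs only the two short postings lists for those words.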
A mechanism of “next-word” indexes (also referred to as combined indexes) is described in Hugh E. Williams, Justin Zobel, Dirk Bahle, “Fast Phrase Querying with Combined Indexes,” Search Engine Group, School of Computer Science and Information Technology, RMIT University, GPO Box 2476V, Melbourne 3001, Australia. 1999.
This next-word indexing technique, though very interesting, is not preferable because it can increase the number of unique words in the search engine by more than a few million entries. This creates slowdowns both in indexing and querying.
Traditionally, contextual matching requires multiple index structures over the documents, which consume significant resources. The problem is exacerbated when complex matching is needed over several contextual parameters and stop words.
It is desirable to provide systems and methods for speeding up the indexing and querying processes for search engines, and to otherwise make more efficient use of processor resources during indexing and querying large corpora of documents.