Information retrieval systems, generally called search engines, are now an essential tool for finding information in large scale, diverse, and growing information systems such as the Internet. Generally, search engines create an index that relates documents (or “pages”) to the individual words present in each document. The index is typically stored as an inverted index, in which, for each unique term in the corpus, there is stored a posting list identifying the documents that contain the word.
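The inverted index described above can be illustrated with a minimal sketch. The function name `build_inverted_index`, the whitespace tokenization, and the three sample documents are assumptions made for illustration only; a production index would also handle punctuation, stemming, and compression.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique term to a sorted posting list of document IDs.

    `docs` is a dict of {doc_id: text}. Tokenization is a plain
    lowercase whitespace split, which is an illustrative simplification.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Convert each set of IDs to a sorted list, the usual posting-list form.
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample corpus.
docs = {
    1: "australian shepherds are herding dogs",
    2: "border collies are herding dogs",
    3: "shepherds tend sheep in australia",
}
index = build_inverted_index(docs)
print(index["herding"])    # posting list for the term "herding"
print(index["shepherds"])  # posting list for the term "shepherds"
```

Each key is a unique term in the corpus, and each value is the posting list of documents that contain that term.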
A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document. Very generally, this is done by decomposing the query into its individual terms, and then accessing the respective posting lists of the individual terms. The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like. The retrieved documents are then presented to the user, typically in their ranked order, and without any further grouping or imposed hierarchy. In some cases, a selected portion of a text of a document is presented to provide the user with a glimpse of the document's content.
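The retrieval steps above (decompose the query, access each term's posting list, rank the results) might be sketched as follows. This is a simplified illustration, not any particular engine's method: it assumes every query term must be present (a Boolean AND), and it ranks only by summed term frequency, leaving out host domain and link analysis. The names `build_index` and `search` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Build {term: {doc_id: term_frequency}} from {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index


def search(index, query):
    """Return doc IDs containing every query term, ranked by term frequency."""
    terms = query.lower().split()
    if not terms:
        return []
    # One posting list per query term (empty if the term is unknown).
    postings = [index.get(term, {}) for term in terms]
    # Assumption: a document matches only if it appears in every posting list.
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    # Score each match by the total frequency of the query terms.
    scored = [(sum(p[doc_id] for p in postings), doc_id) for doc_id in matching]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical sample corpus.
docs = {
    1: "herding dogs and more herding dogs",
    2: "herding dogs",
    3: "dogs of australia",
}
index = build_index(docs)
print(search(index, "herding dogs"))
```

The result is a flat ranked list of document identifiers, with no grouping or hierarchy imposed on it.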
Direct “Boolean” matching of query terms has well-known limitations, and in particular does not identify documents that do not have the query terms, but have related words. For example, in a typical Boolean system, a search on “Australian Shepherds” would not return documents about other herding dogs such as Border Collies that do not have the exact query terms. Rather, such a system is likely to also retrieve and highly rank documents that are about Australia (and have nothing to do with dogs), and documents about “shepherds” generally.
The problem here is that conventional systems index documents based on individual terms, rather than on concepts. Concepts are often expressed in phrases, such as “dark matter,” “President of the United States,” or idioms like “under the weather” or “dime a dozen”. At best, some prior systems will index documents with respect to a predetermined and very limited set of ‘known’ phrases, which are typically selected by a human operator. Indexing of phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of, say, three, four, or five or more words. For example, on the assumption that any five words could constitute a phrase, and that a large corpus would have at least 200,000 unique terms, there would be approximately 3.2×10^26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate. A further problem is that phrases continually enter and leave the lexicon in terms of their usage, much more frequently than new individual words are invented. New phrases are always being generated, from sources such as technology, arts, world events, and law. Other phrases will decline in usage over time.
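The phrase-count estimate above follows directly from the stated assumptions: with 200,000 unique terms and any ordered sequence of five terms treated as a candidate phrase, the count is 200,000 raised to the fifth power. A short check, with variable names chosen only for illustration:

```python
# Assumptions taken from the text: 200,000 unique terms, five-word phrases,
# and any ordered sequence of five terms counted as a possible phrase.
vocabulary_size = 200_000
phrase_length = 5

possible_phrases = vocabulary_size ** phrase_length
print(f"{possible_phrases:.1e}")  # scientific notation with one decimal place
```

Since 200,000 = 2×10^5, the fifth power is 32×10^25, that is, 3.2×10^26.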
Some existing information retrieval systems attempt to provide retrieval of concepts by using co-occurrence patterns of individual words. In these systems a search on one word, such as “President” will also retrieve documents that have other words that frequently appear with “President”, such as “White” and “House.” While this approach may produce search results having documents that are conceptually related at the level of individual words, it does not typically capture topical relationships that inhere between co-occurring phrases themselves.
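A word-level co-occurrence approach of this general kind might look like the sketch below. It is an assumption-laden illustration rather than a description of any specific system: co-occurrence is counted once per document, expansion simply takes the `k` most frequent companions of a word, and the names `cooccurrence_counts` and `expand` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each pair of distinct words, the documents containing both."""
    counts = defaultdict(Counter)
    for text in docs:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def expand(word, counts, k=2):
    """Return the k words that most often co-occur with `word`."""
    ranked = sorted(counts[word].items(), key=lambda item: (-item[1], item[0]))
    return [other for other, _ in ranked[:k]]


# Hypothetical sample corpus.
docs = [
    "president white house",
    "president white house speech",
    "president election",
]
counts = cooccurrence_counts(docs)
print(expand("president", counts))
```

The expansion works on individual words only: “white” and “house” are returned as separate terms, and nothing in the counts records that “White House” is itself a phrase.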
Another problem with existing individual term based indexing systems lies in the arrangement of the server computers used to access the index. In a conventional indexing system for large scale corpora like the Internet, the index comprises the posting lists for upwards of 200,000 unique terms. Each term posting list can have hundreds, thousands, and not infrequently, millions of documents. The index is typically divided amongst a large number of index servers, in which each index server will contain an index that includes all of the unique terms, and for each of these terms, some portion of the posting list. A typical indexing system like this may have upwards of 1,000 index servers in this arrangement.
When a query with some number of terms is processed in such an indexing system, every index server must be accessed for each query. Thus, even a simple single word query requires each of the index servers (e.g., 1,000 servers) to determine whether it contains documents containing the word. Because all of the index servers must process the query, the overall query processing time is limited by the slowest index server.
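The arrangement described above, in which every server holds all terms but only part of each posting list, can be simulated in a few lines. The sketch assumes that documents are assigned to servers by document ID modulo the server count, that servers are queried in parallel so the overall latency is the maximum of the per-server latencies, and that each server has a fixed simulated latency. The class `IndexServer`, the function `query_all`, and all of the figures are hypothetical.

```python
from collections import defaultdict


class IndexServer:
    """Holds posting lists for only the documents assigned to this server."""

    def __init__(self, latency_ms):
        self.postings = defaultdict(list)
        self.latency_ms = latency_ms  # simulated response time (assumed)

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.postings[term].append(doc_id)

    def lookup(self, term):
        # Every server must answer, even when it has no matching documents.
        return self.postings.get(term, []), self.latency_ms


def query_all(servers, term):
    """Fan the term out to every server; return merged IDs and total latency."""
    results, slowest = [], 0
    for server in servers:
        doc_ids, latency = server.lookup(term)
        results.extend(doc_ids)
        # With parallel fan-out, the query finishes when the slowest server does.
        slowest = max(slowest, latency)
    return sorted(results), slowest


# Hypothetical setup: three servers with different simulated latencies.
servers = [IndexServer(5), IndexServer(8), IndexServer(40)]
docs = {0: "herding dogs", 1: "border collies", 2: "sheep dogs", 3: "dogs"}
for doc_id, text in docs.items():
    servers[doc_id % len(servers)].add(doc_id, text)

print(query_all(servers, "dogs"))
print(query_all(servers, "collies"))
```

Even the query for “collies”, which matches a document on only one server, takes as long as the slowest server, because every server has to be consulted before the result is complete.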