1. Field of the Invention
The invention relates to a search engine, and, more particularly, to a search engine which maps crawled documents into tiers and then searches those tiers in a hierarchical manner.
2. Description of the Related Art
The World Wide Web (“WWW”) is a distributed database including literally billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is a constant challenge. The device typically used to search the WWW is a search engine. Maintaining a working search engine is difficult because the WWW is constantly evolving, with millions of pages being added daily and existing pages continually changing. Additionally, the cost of search execution typically corresponds directly to the size of the index searched. To deal with the massive size and amount of data in the WWW, most search engines are distributed and use replication and partitioning techniques (both discussed below) to scale the query rate and to reduce the number of documents that any single search node must search.
A typical prior art search engine 50 is shown in FIG. 1. Pages from the Internet or other source 100 are accessed through the use of a crawler 102. Crawler 102 aggregates documents from source 100 to ensure that these documents are searchable. Many algorithms exist for crawlers, and in most cases these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 102 are stored in a database 108. Thereafter, these documents are indexed by an indexer 104. Indexer 104 builds a searchable index of the documents in database 108. Typical prior art methods for indexing include inverted files, vector spaces, suffix structures, and hybrids thereof. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations. A primary index of the whole database 108 is then broken down into a plurality of sub-indices (discussed below) and each sub-index is sent to a search node in a search node cluster 106.
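The word-and-location indexing described above can be illustrated with a minimal sketch of an inverted file. The function name build_positional_index, the whitespace tokenization, and the lowercase normalization are assumptions made for illustration only and are not taken from any particular prior art indexer.

```python
from collections import defaultdict


def build_positional_index(pages):
    """Build a simple inverted file: word -> {page_id: [word positions]}.

    `pages` maps a page identifier to its text. Tokenization is a naive
    lowercase whitespace split (an illustrative assumption).
    """
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            # Record every location at which the word occurs on the page.
            index[word].setdefault(page_id, []).append(position)
    return index


pages = {"p1": "search engine index", "p2": "index the index"}
index = build_positional_index(pages)
print(index["index"])
```

A lookup for a word returns the pages containing it together with the positions of each occurrence, which is the information a search node needs to compute term frequency and term distance.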
In use, a user 112 enters a search query to a dispatcher 110. Dispatcher 110 compiles a list of search nodes in cluster 106 to execute the query and forwards the query to those selected search nodes. The compiled list ensures that each partition is searched once. The search nodes in search node cluster 106 search respective parts of the primary index produced by indexer 104 and return sorted search results along with a document identifier and a score to dispatcher 110. Dispatcher 110 merges the received results to produce a final list, sorted by relevance score, that is displayed to user 112. The relevance score is a function of the query itself and the type of document produced. Factors that are used for relevance include: a static relevance score for the document such as link cardinality and page quality, superior parts of the document such as titles, metadata and document headers, authority of the document such as external references and the “level” of the references, and document statistics such as query term frequency in the document, global term frequency, and term distances within the document.
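The merge step performed by the dispatcher can be sketched as follows. This is an illustrative sketch only: the function name merge_results, the (document identifier, score) tuple layout, and the limit parameter are assumptions, and each search node is assumed to return its results already sorted by descending score.

```python
import heapq
from itertools import islice


def merge_results(node_results, limit=10):
    """Merge per-node result lists into one list sorted by descending score.

    `node_results` is a list of lists of (doc_id, score) tuples, each list
    already sorted by descending score (an assumption of this sketch).
    """
    merged = heapq.merge(*node_results, key=lambda hit: hit[1], reverse=True)
    # Only the top `limit` hits are materialized for display.
    return list(islice(merged, limit))


node_a = [("d1", 0.9), ("d4", 0.3)]
node_b = [("d2", 0.7), ("d3", 0.5)]
print(merge_results([node_a, node_b], limit=3))
```

Because every node's list is already sorted, a k-way merge avoids re-sorting the combined results and can stop as soon as enough hits have been produced.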
Referring now to FIG. 2, a cluster 106 of search nodes is shown. For illustrative purposes, cluster 106 is shown in a matrix grouped in columns 122a, 122b, etc. and rows 124a, 124b, etc. In each column 122 of search nodes, the same set of indices is replicated for each respective search node. For example, the search node in column 122a, row 124a, includes the same subset of indices as the search node in column 122a, row 124b. Within each row 124 of search nodes, each search node holds a different subset of indices. The indices are split equally so that the time for a search is divided evenly among the search nodes.
For example, the search node in column 122a, row 124a includes a different subset of indices than the search node in column 122b, row 124a. In each search node, “I” represents the index for the entire database 108, “S” corresponds to a search node, “Sn(In)” indicates that search node n holds sub-index n of the entire index I, and “Snm(In)” indicates that replication number m of search node n holds sub-index n of the entire index I.
Each query from dispatcher 110 is sent to respective search nodes so that a single node in every partition is queried. For example, all the nodes in one row 124, i.e., one node from each of columns 122a, 122b, etc., are queried, as the combination of these nodes represents the total index. That is, each row in cluster 106 is a set of search nodes comprising all the partitions of an entire index. The results are merged by dispatcher 110 and a complete result from the cluster is generated. By partitioning data in this way, the data volume is scaled. For example, if there are n columns, then the search time for each node is reduced by roughly a factor of n, excluding the time used for merging results by dispatcher 110.
By replicating the search nodes, the query processing rate for each index is increased. In FIG. 2, all search nodes in each column hold the same index. This allows dispatcher 110 to rotate among the nodes in a column for each index partition when selecting a set of search nodes to handle an incoming query.
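The node selection described above, one node per index partition with rotation among the replicas in each column, can be sketched as follows. The class name Dispatcher, the method select_nodes, and the use of simple round-robin rotation are illustrative assumptions; the actual rotation policy of a prior art dispatcher is not specified here.

```python
from itertools import cycle


class Dispatcher:
    """Pick one replica per index partition, rotating round-robin.

    Partitions correspond to columns of the matrix and replicas to rows.
    All names here are hypothetical and chosen for illustration.
    """

    def __init__(self, num_partitions, num_replicas):
        # One independent rotation over the replicas of each partition.
        self._rotations = [cycle(range(num_replicas)) for _ in range(num_partitions)]

    def select_nodes(self):
        """Return a list of (partition, replica) pairs covering every partition once."""
        return [(partition, next(rotation))
                for partition, rotation in enumerate(self._rotations)]


dispatcher = Dispatcher(num_partitions=3, num_replicas=2)
print(dispatcher.select_nodes())  # first query
print(dispatcher.select_nodes())  # second query uses the other replicas
```

Each call yields exactly one node per partition, so the whole index is searched once per query, while successive queries are spread across the replicas.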
However, the inventors have determined that there is a highly skewed distribution of unique search queries in a typical search engine. For example, the top 25 queries may account for more than 1% of the total query volume. As a consequence, equally dividing a primary index into smaller sub-indices may not provide optimum results.
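The degree of skew in a query log can be measured as the share of total query volume taken by the k most frequent queries. The following is a minimal sketch; the function name top_k_share and the representation of the log as a list of query strings are assumptions for illustration, and the sample data is invented solely to exercise the function.

```python
from collections import Counter


def top_k_share(queries, k):
    """Return the fraction of total query volume taken by the k most frequent queries."""
    if not queries:
        return 0.0
    counts = Counter(queries)
    top_volume = sum(count for _, count in counts.most_common(k))
    return top_volume / len(queries)


# Toy log, not real data: one query accounts for three of five requests.
log = ["a", "a", "a", "b", "c"]
print(top_k_share(log, 1))
```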
Therefore, there is a need in the art for a search engine that organizes its documents and indices in light of the distribution of search queries.