1. Field of the Invention
This disclosure relates to a system and method for generating an approximation of a search engine ranking algorithm.
2. Description of the Related Art
Referring to FIG. 1, the World Wide Web (“WWW”) is a distributed database including billions of pages accessible through the Internet. Searching and indexing these pages to produce useful results in response to user queries is a constant challenge. A search engine is typically used to search the WWW.
A typical prior art search engine 20 is shown in FIG. 1. Pages from the Internet or other source 22 are accessed through the use of a crawler 24. Crawler 24 aggregates pages from source 22 to ensure that these pages are searchable. Many algorithms exist for crawlers and in most cases these crawlers follow links in known hypertext documents to obtain other documents. The pages retrieved by crawler 24 are stored in a database 36. Thereafter, these pages are indexed by an indexer 26. Indexer 26 builds a searchable index of the pages in a database 34. For example, each web page may be broken down into words and respective locations of each word on the page. The pages are then indexed by the words and their respective locations.
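By way of illustration only, the word-and-location indexing described above may be sketched as a positional inverted index. The function name build_index, the page identifiers, and the sample page text are hypothetical and are not taken from any particular search engine; tokenization by whitespace is a simplifying assumption.

```python
from collections import defaultdict


def build_index(pages):
    """Build a positional inverted index: word -> {page_id: [positions]}.

    Assumption: pages is a dict of page_id -> plain text, and words are
    separated by whitespace. Real indexers use far richer tokenization.
    """
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            # Record every location at which the word occurs on the page.
            index[word].setdefault(page_id, []).append(position)
    return index


# Hypothetical sample pages, for illustration only.
pages = {
    "page1": "cordless telephone with answering machine",
    "page2": "telephone repair and telephone parts",
}
index = build_index(pages)
print(index["telephone"])  # {'page1': [1], 'page2': [0, 3]}
```

In this sketch, looking up a word returns every page containing it together with the word's locations on that page, which is the information a ranking function could later use for statistics such as term frequency and term distance.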
In use, a user 32 sends a search query to a dispatcher 30. Dispatcher 30 compiles a list of search nodes in cluster 28 to execute the query and forwards the query to those selected search nodes. The search nodes in search node cluster 28 search respective parts of the index 34 and return search results along with a document identifier to dispatcher 30. Dispatcher 30 merges the received results to produce a final result set displayed to user 32 sorted by ranking scores based on a ranking function.
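The query flow described above may be sketched, under simplifying assumptions, as a scatter-gather operation: the dispatcher forwards the query to each search node, each node searches only its own part of the index, and the dispatcher merges the partial results by ranking score. The names search_node and dispatch, the shard layout, and the scores shown are hypothetical; in this sketch every node is queried, and the precomputed per-term scores stand in for whatever ranking function an actual engine applies.

```python
import heapq


def search_node(shard, query):
    """Search one part of the index.

    Assumption: shard maps doc_id -> {term: ranking score}.
    Returns (score, doc_id) pairs for documents containing the query term.
    """
    return [(terms[query], doc_id) for doc_id, terms in shard.items() if query in terms]


def dispatch(shards, query, top_k=10):
    """Forward the query to every node, then merge results by score."""
    merged = []
    for shard in shards:
        merged.extend(search_node(shard, query))
    # Highest-scoring results first, limited to top_k entries.
    return heapq.nlargest(top_k, merged)


# Hypothetical index split across two search nodes.
shards = [
    {"d1": {"telephone": 0.9, "cordless": 0.4}, "d2": {"camera": 0.7}},
    {"d3": {"telephone": 0.6}, "d4": {"telephone": 0.95}},
]
results = dispatch(shards, "telephone")
print(results)  # [(0.95, 'd4'), (0.9, 'd1'), (0.6, 'd3')]
```

Each result carries its document identifier so that the merged list can be presented to the user in order of ranking score.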
The ranking function is a function of the query itself and the type of page produced. Factors that are used for relevance include hundreds of features extracted, collected or identified for each page, including: a static relevance score for the page, such as link cardinality and page quality; superior parts of the page, such as titles, metadata and page headers; authority of the page, such as external references and the “level” of the references; the GOOGLE page rank algorithm; and page statistics, such as query term frequency in the page, words on a page, global term frequency, and term distances within the page.
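As a purely illustrative sketch, a ranking function over extracted page features could take the form of a weighted sum. The actual form of any search engine's ranking function is not given here, so the linear form, the three feature names, and the weight values below are all assumptions chosen only to show how features map to a ranking score.

```python
def ranking_score(features, weights):
    """Combine page features into one score.

    Assumption: a simple weighted sum; features missing from a page count as 0.
    """
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())


def rank(pages, weights):
    """Return page ids sorted from highest to lowest ranking score."""
    return sorted(pages, key=lambda pid: ranking_score(pages[pid], weights), reverse=True)


# Hypothetical weights and per-page feature values for one query.
weights = {"static_relevance": 0.5, "title_match": 2.0, "term_frequency": 1.0}
pages = {
    "a": {"static_relevance": 0.8, "title_match": 1.0, "term_frequency": 0.1},
    "b": {"static_relevance": 0.9, "title_match": 0.0, "term_frequency": 0.3},
}
ordering = rank(pages, weights)
print(ordering)  # ['a', 'b']
```

In this example page "a" outranks page "b" because the query term appears in its title, even though page "b" has the higher static relevance score and term frequency.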
The use of search engines has become one of the most popular online activities, with billions of searches being performed by users every month. Search engines are also a starting point for consumers for shopping and various day-to-day purchases and activities. With billions of dollars being spent by consumers online, it has become ever more important for web sites to organize and optimize their web pages in an effort to be more visible and accessible to users of a search engine.
As discussed above, for each web page, hundreds of features are extracted and a ranking function is applied to those features to produce a ranking score. A merchant with a web page would like his page to be ranked higher in a result set for relevant search keywords than the web pages of his competitors for the same keywords. For example, a merchant selling telephones would like his web page to acquire a higher ranking score, and appear higher in a result set produced by a search engine, for the keyword query “telephone” than the web sites of his competitors for the same keyword. There are some prior art solutions available to guess the ranking algorithm used by a search engine and to provide recommendations about improvements that can be made to web pages so that the ranking score for a web page relating to particular keywords may improve. However, most of these systems rely on manual human judgment and historical knowledge about search engines. Humans must be trained to perform this analysis. These judgments are mostly guesses or are arrived at by trial and error. Consequently, most prior art solutions are inaccurate, time consuming, and require expensive human capital. Moreover, these solutions are available only for specific search engines, are not immune to changes in the search or ranking algorithms used by known search engines, and do not have the ability to adapt to new search engines.