The Internet is a large network of computers, including both a large number of client devices and server devices. Among other functions, a server device sometimes provides, over the network, a document to a client device in response to a request sent by the client over the network. The request typically includes an address of the document. On the Internet, a uniform resource locator (URL) is often used to specify the address of the document, the URL identifying both the server and the particular document on the server that a client is requesting. The document may be one of any number of types of information that can be transmitted over the network, including text files, word processing files, audio clips, video clips, and any other type of electronic data. The collection of documents made available to client computers over the Internet in this way is commonly referred to as the World Wide Web (“the Web”).
A computer connected to the Internet may be a client device, a server device, or both. A special type of server device on the Internet is referred to as a search engine system. Search engine systems also exist on networks other than the Internet, for example on corporate intranets. A user of a client device who desires information from the Web, but is unsure of the URL of any or all relevant documents, typically submits a request, referred to as a query, to a search engine. A query includes one or more terms that describe the type of information in which the user of the client device is interested. The search engine typically maintains a database of documents on the Web. The database may include key terms, which may be words or any type of electronically storable data, and corresponding URLs of documents that contain the key terms. More generally, in place of or in addition to key terms, the database may store features of documents. Some features are values that directly represent properties of the document; the length of the document is one example. Other features enable some type of comparison between a document and a query; one example is the frequency with which a given term in the query appears in the document.
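By way of illustration only, the kind of database described above can be sketched as a mapping from key terms to the URLs of documents containing them, together with two example features: document length and the frequency of a query term in a document. The corpus, the whitespace tokenizer, and the names `DOCS`, `tokenize`, `build_database`, and `term_frequency` are hypothetical and are not taken from any particular search engine system.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (URL -> document text); assumed for illustration.
DOCS = {
    "http://example.com/a": "blended search results from search engines",
    "http://example.com/b": "search engine ranking function",
}


def tokenize(text):
    """Split text into lowercase key terms (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_database(docs):
    """Return (index, lengths).

    index maps each key term to the set of URLs of documents containing it.
    lengths maps each URL to the document length in terms, a feature that
    directly represents a property of the document.
    """
    index = defaultdict(set)
    lengths = {}
    for url, text in docs.items():
        terms = tokenize(text)
        lengths[url] = len(terms)
        for term in terms:
            index[term].add(url)
    return index, lengths


def term_frequency(term, text):
    """A feature comparing a document with a query: occurrences of term."""
    return Counter(tokenize(text))[term.lower()]


index, lengths = build_database(DOCS)
```

In this sketch, looking up `index["search"]` yields the URLs of both documents, while `term_frequency` is computed per query term at query time rather than stored.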
In response to a query submitted by a user of a client device, a search engine typically determines, based on its database, a subset of the documents in the database that are relevant to the query. Additionally, the search engine system typically includes a ranking function that estimates the relevance of each document in the subset to the query, generating a “relevance score” for each document in the subset relative to the query. Finally, the ranking function creates a search engine result, including an ordered list of entries. Each entry corresponds to one of the documents in the subset. An entry includes the URL of a corresponding document, so that the user can request the document from the Web, and a position of the entry in the list. The list is ordered so that documents having positions nearer to the beginning of the list (i.e., documents having numerically lower positions) have higher relevance scores (i.e., the relevance scores monotonically decrease as one moves from the beginning of the list toward its end).
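The steps above (selecting a relevant subset, scoring each document, and emitting an ordered list of entries) can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring rule (a count of query-term occurrences) is a stand-in chosen for brevity, and the corpus and the names `relevance_score` and `search` are hypothetical.

```python
def relevance_score(query_terms, doc_terms):
    """Stand-in relevance score: total occurrences of query terms.

    Assumed for illustration; a practical ranking function would be
    considerably more elaborate.
    """
    return sum(doc_terms.count(term) for term in query_terms)


def search(query, docs):
    """Return a search engine result: entries ordered by decreasing score."""
    query_terms = query.lower().split()
    scored = []
    for url, text in docs.items():
        doc_terms = text.lower().split()
        # Keep only the subset of documents sharing a term with the query.
        if any(term in doc_terms for term in query_terms):
            scored.append((url, relevance_score(query_terms, doc_terms)))
    # Higher scores first, so numerically lower positions are more relevant.
    scored.sort(key=lambda entry: entry[1], reverse=True)
    return [
        {"position": position, "url": url, "score": score}
        for position, (url, score) in enumerate(scored, start=1)
    ]


# Hypothetical toy corpus (URL -> document text); assumed for illustration.
docs = {
    "http://example.com/b": "search engine",
    "http://example.com/a": "search search engine",
    "http://example.com/c": "cooking recipes",
}
result = search("search", docs)
```

Here the document at `http://example.com/c` shares no term with the query and is excluded from the subset, and the remaining entries receive positions 1 and 2 in order of decreasing score.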
The creation of a database of documents that accurately represents the content of the documents on the Web is a difficult problem. First, there is a large number of documents on the Web; estimates of the number of documents are currently in the billions. This makes it difficult to create a single database that can both store information about all the documents and quickly retrieve that information when needed. Second, the content of the documents on the Web is highly diverse. Documents are produced by authors of varying skill, from professional reporters to young children, are composed in a variety of languages (sometimes employing different alphabets and electronic encoding schemes thereof), and are produced for a wide variety of purposes, from recreational use to electronic commerce. This makes the determination of which key terms and features of documents to include in a database a difficult problem. Finally, the contents of the documents on the Web, as well as the locations of the documents themselves, change rapidly. Various “crawling” strategies have been employed to mitigate this difficulty, each having its own respective advantages and disadvantages. Thus, the use of more than one database may be advantageous in a search engine system for use with a large, diverse, and time-varying collection of documents (such as the collection of documents on the Web).
Even if a single database were developed that accurately and efficiently characterized the documents available on the Web, the determination of the relevance score of a particular document in the database to a query would remain a difficult problem. The relevance score of a document is used to determine the position of the corresponding entry in the search engine result. A user typically examines only the first three or four entries in a search engine result, so accurate relevance score determination, at least for highly relevant documents, is an important factor in the user's satisfaction with the search engine result. A user's perception of the relevance of a particular document to a query is difficult to accurately replicate in a single algorithm for determining a relevance score. For this reason, it may be desirable to have more than one method available for determining a relevance score in a search engine system. For example, one method may be well suited to determining the relevance scores of documents written in a single language (for example, English) and a second method may be best suited to determining the relevance scores of documents in a second language (for example, Chinese). When a document in the database contains content in both languages, however, it will be difficult to decide which ranking function to use. As another example, the search engine system may include more than one database and have a separate ranking function for each database. In this way, the individual ranking functions may be optimized for determining relevance scores for documents from their respective databases. Statistical and machine learning techniques are increasingly used to perform this type of optimization. Thus, there is a need for a method and system to blend the search engine results that come from more than one ranking function.
Given the above background, it is desirable to devise a method and system for combining the search engine results from one or more search sources, each search source possibly employing a different database, ranking function, or both. In particular, it is desirable to devise a method for determining a blended search engine result in such a way that a user's perception of the relevance of, say, the top three documents in the blended search engine result is superior, or at least not inferior, to the user's perception of the relevance of the top three documents from any of the individual search sources.