The capability of organizing information has grown along with the ever-increasing availability of information. A vast source of available information may be found on Internet-related networks (e.g., the World Wide Web (Web)) or other Internet sources. The Internet is an extensive network of computer networks through which information is exchanged by methods well known to those in the art (e.g., the use of the TCP and IP protocols). The Internet permits users to send and receive data between computers connected to this network. This data may include web sites, home pages, databases, text collections, audio, video, or any other type of information made available over the Internet from a computer server connected to the Internet. This information may be referred to as articles or documents, and may include a web page, data on a web page, attachments to a web page, or other data contained in a storage device (e.g., a database).
Making sense of such a large collection of documents, and foraging for information in such an environment, is difficult without specialized aids. One such aid for locating information is the use of key terms. That is, the articles may include key terms representing selected portions of the information contained in the article. These key terms are available over the Internet to other computers and permit these other computers to locate the article.
To locate articles on the Internet, a user of a remote computer searches for the key terms using a search program known as a search engine. Search engines are programs that allow the remote user to type in one or more search terms, which together form a search query. The search engine then compares the search query with the key terms from the articles and retrieves at least a portion of each article having key terms that match the search query. The search engine then displays to the user the retrieved portion of each article, such as the title. The user can then scroll through these retrieved portions of the articles and select a desired article.
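The key-term matching described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular search engine. The `Article` type, the `search` function, and the sample articles are hypothetical, and the sketch assumes that an article matches when its key terms include every term of the query; the text above does not specify whether all or only some terms must match.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical article record: a title plus its published key terms."""
    title: str
    key_terms: set[str] = field(default_factory=set)


def search(query: str, articles: list[Article]) -> list[str]:
    """Return the titles of articles whose key terms include every query term.

    Assumption: matching is case-insensitive and requires all query terms.
    """
    query_terms = set(query.lower().split())
    if not query_terms:
        return []
    # Set inclusion: every query term must appear among the article's key terms.
    return [a.title for a in articles if query_terms <= a.key_terms]


# Made-up sample data for illustration.
articles = [
    Article("Travel guide to Mexico", {"mexico", "travel", "guide"}),
    Article("New Mexico state parks", {"new", "mexico", "parks"}),
    Article("Bread recipes", {"bread", "recipes"}),
]

results = search("Mexico", articles)
print(results)
```

In this sketch the query "Mexico" retrieves both the Mexico travel guide and the New Mexico article, because pure term matching has no way to tell which of the two the user intended.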
Early key-term search engines have exhibited serious drawbacks. For example, to increase exposure of a particular document, the document provider may use as many search terms as are possibly related to the article. In fact, some articles or search engines use every word in the article as key terms. As a result, search engines will retrieve many articles that are unrelated, or only peripherally related, to the subject matter that the user desires to find through a combination of search terms. Additionally, many users of such search engines are not skilled in formulating key-term search queries and produce extremely broad searches that often retrieve thousands of articles. The user must then examine the excerpted information regarding each document to locate the desired information.
This drawback was addressed by the evolution of search engines to include the organization of information based upon the search activity of one or more users. Such schemes rank results based upon a consensus of user preferences instead of document-oriented parameters (e.g., text). One such scheme ranks documents according to an evolving score based upon the key terms used. That is, the documents receive a relevancy score relative to the key terms of the search query. As users enter search queries and select documents from among the list of documents the query produces, the relevancy scores of the documents are adjusted. The scores are used to organize the resulting list of documents for subsequent searches. Such schemes typically base relevancy, at least in part, on the number of “clicks” the document received (i.e., the number of times a document was selected). Such schemes, known generally as “popularity ranking schemes” or “click popularity schemes”, provide a search result list in which the highest ranked documents are those that attracted and satisfied the greatest number of previous users. Moreover, click popularity schemes generate results that reflect search context. For example, previous search schemes would return documents containing all of the query terms, but would not automatically exclude words that are not part of the query. Thus, a text-matching search for “Mexico” might return mostly results about “New Mexico.” A click popularity scheme search will reduce such erroneous results, as users seeking “Mexico” will generally refrain from clicking pages about “New Mexico” and will tend to click those pages that they discern are most relevant to “Mexico,” thus raising the relevancy of the desired documents.
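A click popularity scheme of the kind described above might be sketched as follows. This is an illustrative sketch under stated assumptions rather than the method of any specific prior-art system: the `ClickPopularityIndex` class, its method names, and the sample documents are hypothetical, and relevancy is assumed to be simply the raw click count per query.

```python
from collections import defaultdict


class ClickPopularityIndex:
    """Hypothetical per-query click counter used to reorder search results."""

    def __init__(self) -> None:
        # clicks[query][document] -> number of times the document was
        # selected by users who entered that query.
        self._clicks: dict[str, dict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def record_click(self, query: str, document: str) -> None:
        """Adjust the relevancy score when a user selects a document."""
        self._clicks[query.lower()][document] += 1

    def rank(self, query: str, candidates: list[str]) -> list[str]:
        """Order candidate documents by descending click count for the query.

        sorted() is stable, so documents with equal counts keep the order
        in which the text-matching step returned them.
        """
        counts = self._clicks.get(query.lower(), {})
        return sorted(candidates, key=lambda doc: -counts.get(doc, 0))


# Made-up example: a text match for "Mexico" returned these candidates.
candidates = [
    "New Mexico state parks",
    "Travel guide to Mexico",
    "Mexico City museums",
]

index = ClickPopularityIndex()
for _ in range(3):
    index.record_click("Mexico", "Travel guide to Mexico")
index.record_click("Mexico", "Mexico City museums")

ranked = index.rank("Mexico", candidates)
print(ranked)
```

Because users searching for "Mexico" did not click the New Mexico page in this made-up history, it falls to the bottom of the list for subsequent searches, while a query with no recorded clicks leaves the text-match order unchanged.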
Basing relevancy on the number of clicks may, however, lead to erroneous results over time as information related to the query terms changes. For example, for the particular query of “democratic frontrunner,” documents referring to the early-stage frontrunner Howard Dean may have been selected numerous times in December 2003, but a user entering that query in March 2004 may have been anticipating results for John Kerry, who was then leading. Additionally, top-ranked results generally receive disproportionately greater use, resulting in increasingly skewed search results in which the top-ranked results may never be displaced.
Some of these drawbacks have been addressed by search engines that organize information provided in response to queries using numerous factors including time-based and use-based factors. For example, such a scheme may use the activity of previous users in response to particular queries to adjust the relevancy of the query response documents. Such user activity may include the number of clicks in conjunction with the timing of prior users' selections or use of particular information. Such schemes may also consider where in a prior results listing a particular document was ranked when prior users selected it, actual versus expected use frequency of a document, and how the selected documents were used.
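One way to combine click timing with the position at which a document was listed is sketched below. This is a hedged illustration only; the text above does not disclose a formula. The exponential decay, the 30-day half-life, the `EXPECTED_CTR` table of expected click rates per rank, and all names and sample data are assumptions introduced for the example. Dividing each click by the expected click rate at its position is one simple stand-in for comparing actual with expected use frequency.

```python
from dataclasses import dataclass

# Assumed tuning values, chosen only for illustration.
HALF_LIFE_DAYS = 30.0
EXPECTED_CTR = [0.40, 0.20, 0.10, 0.05]  # hypothetical click rate by rank


@dataclass
class Click:
    """Hypothetical record of one prior user selection."""
    document: str
    position: int    # 0-based rank at which the document was listed
    age_days: float  # how long ago the click occurred


def click_weight(click: Click) -> float:
    """Weight a click by recency and by how unexpected it was at its rank."""
    # Recency: a click loses half its weight every HALF_LIFE_DAYS.
    recency = 0.5 ** (click.age_days / HALF_LIFE_DAYS)
    # Ranks beyond the table reuse the last (lowest) expected rate.
    expected = EXPECTED_CTR[min(click.position, len(EXPECTED_CTR) - 1)]
    return recency / expected


def score(clicks: list[Click]) -> dict[str, float]:
    """Sum the weighted clicks for each document."""
    totals: dict[str, float] = {}
    for click in clicks:
        totals[click.document] = totals.get(click.document, 0.0) + click_weight(click)
    return totals


# Made-up history: many old clicks on a top-ranked document versus a few
# fresh clicks on a document that was listed third.
history = (
    [Click("dean-article", position=0, age_days=90.0) for _ in range(10)]
    + [Click("kerry-article", position=2, age_days=0.0) for _ in range(3)]
)

scores = score(history)
print(scores)
```

Under these assumed parameters, three recent clicks on a lower-listed document outweigh ten clicks that are three months old on the top-listed one, so the older document can be displaced.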
Yet, many drawbacks still exist in the current schemes. For example, current schemes do not address the problem of very rare queries for which sufficient user activity data has not been compiled. In such cases, results may be poor or non-existent. Additionally, click results are dependent upon the quality and integrity of the data source. Current schemes fail to account for the wide variations in data source quality. Moreover, current schemes are subject to spurious influences that may affect the integrity of the search results. One exemplary scheme, in accordance with the prior art, attempts to address certain drawbacks by updating search engine results based upon user activity. This scheme is described in U.S. Pat. No. 6,421,675 entitled “Search Engine” which is hereby incorporated by reference herein, to provide a fuller description of the prior art and clearly distinguish features of various embodiments of the present invention.