Field of the Invention
Implementations described herein relate generally to information searching and, more particularly, to deriving and using document quality signals from search query streams.
Description of Related Art
Existing information searching systems use search queries to search through aggregated data and retrieve specific information that corresponds to the received search queries. Such information searching systems may search information stored locally or in distributed locations. The World Wide Web (“web”) is one example of information in distributed locations. The web contains a vast amount of information, but locating a desired portion of that information can be challenging. This problem is compounded because both the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly.
Search engines attempt to return hyperlinks to web documents in which a user is interested. Generally, search engines base their determination of the user's interest on search terms (e.g., in a search query provided by the user). The goal of the search engine is to provide the user with links to high-quality, relevant results based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are considered “hits” and are returned to the user.
To return the “best” results of a search, it is important to measure, in some fashion, the quality of documents, such as web documents. One existing document quality measurement technique calculates an Information Retrieval (IR) score that is a measure of how relevant a document is to a search query. The IR score can be weighted in various ways. For example, matches in a document's title might be weighted more than matches in a footer. Similarly, matches in text that is set in a larger font, bolded, or italicized may be weighted more than matches in normal text. A document's IR score may be influenced in other ways. For example, a document matching all of the terms of the search query may receive a higher score than a document matching only one of the terms. All of these factors can be combined in some manner to generate an IR score for a document that may be used in determining the quality of the results from an executed search.
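The weighting described above can be illustrated with a minimal sketch. It is not the scoring formula of any particular search engine: the field names, the numeric weights in FIELD_WEIGHTS, and the choice to scale the score by the fraction of query terms matched are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical weights: the text says only that title or emphasized matches
# may count more than matches in normal text or a footer.
FIELD_WEIGHTS: Dict[str, float] = {
    "title": 3.0,
    "emphasized": 2.0,
    "body": 1.0,
    "footer": 0.5,
}


def ir_score(query_terms: List[str], doc_fields: Dict[str, str]) -> float:
    """Return an illustrative IR score for one document.

    doc_fields maps a field name (e.g. "title") to that field's text.
    """
    score = 0.0
    matched = set()
    for field, text in doc_fields.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count as body text
        tokens = text.lower().split()
        for term in query_terms:
            count = tokens.count(term.lower())
            if count:
                matched.add(term.lower())
                score += weight * count
    # Assumed coverage factor: documents matching all query terms keep their
    # full score, documents matching only some terms are scaled down.
    distinct_terms = {t.lower() for t in query_terms}
    coverage = len(matched) / len(distinct_terms) if distinct_terms else 0.0
    return score * coverage


query = ["jaguar", "speed"]
doc_a = {"title": "jaguar speed", "body": "the jaguar is fast"}
doc_b = {"footer": "jaguar"}
print(ir_score(query, doc_a), ir_score(query, doc_b))
```

In this sketch the document that matches both terms, with matches in its title, scores higher than the document that matches a single term only in its footer.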
Scores derived from an existing link-based document ranking algorithm may additionally be used in conjunction with IR scores. PageRank is one existing global, link-based document ranking algorithm that derives quality signals from the link structure of the web. Often, however, link structure may be unavailable, unreliable, or limited in scope, thus limiting the value of using PageRank in ascertaining the relative quality of some documents.
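For orientation, a link-based score of this kind can be sketched with the commonly described power-iteration formulation of PageRank. The sketch below is a simplified illustration rather than a production implementation: the damping value of 0.85, the fixed iteration count, the even redistribution of rank from documents with no outgoing links, and the toy link graph are all assumptions.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Return an approximate PageRank-style score for each document.

    links maps a document to the documents it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this document's rank evenly across its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling of documents with no outgoing links:
                # spread their rank evenly over all documents.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "a" and "b" both link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In the toy graph, document "b" has no incoming links and so receives only the baseline share, which illustrates why such a score says little about documents for which link structure is sparse or unavailable.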