1. Field of the Invention
The present invention generally relates to information retrieval systems and methods. In particular, the present invention relates to information retrieval systems and methods that identify and rank resources in response to a query.
2. Background
Generally speaking, an information retrieval system is an automated system that assists a user in searching for and obtaining access to information. A search engine is one type of information retrieval system. A search engine is designed to help users search for and obtain access to information that is stored in a computer system or across a network of computers. Search engines help to minimize the time required to find information as well as the amount of information that must be consulted. The most publicly visible form of a search engine is a Web search engine, which is designed to search for information on the World Wide Web. Some well-known Web search engines include Yahoo!® Search (www.yahoo.com), provided by Yahoo! Inc. of Sunnyvale, Calif., Bing™ (www.bing.com), provided by Microsoft® Corporation of Redmond, Wash., and Google™ (www.google.com), provided by Google Inc. of Mountain View, Calif.
A search engine provides an interface that enables a user to specify criteria about one or more resources of interest and then operates to find resources that match the specified criteria. The criteria are referred to as a query. In the case of text search engines, the query is typically expressed as a set of words that identify a desired concept to which one or more resources relate. The list of resources identified by a search engine as meeting the criteria specified by the query is typically sorted, or ranked. Ranking resources by relevance (from highest to lowest) reduces the time required to find the desired information.
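By way of illustration only, the matching and ranking behavior described above can be sketched as follows. The sketch assumes a toy relevance measure (the number of query words that appear in a resource); the function name `rank_resources` and the sample data are hypothetical and are not drawn from any particular search engine.

```python
from typing import Dict, List, Tuple


def rank_resources(query: str, resources: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (resource_id, score) pairs for matching resources, best first.

    Assumption: relevance is approximated by counting distinct query words
    that occur in the resource text. Real engines use far richer signals.
    """
    query_terms = set(query.lower().split())
    scored = []
    for resource_id, text in resources.items():
        resource_terms = set(text.lower().split())
        score = len(query_terms & resource_terms)
        if score > 0:  # keep only resources that match at least one word
            scored.append((resource_id, score))
    # Highest score first; ties are broken by identifier for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


if __name__ == "__main__":
    sample = {
        "a": "red shoes for sale",
        "b": "red wine",
        "c": "blue hats",
    }
    print(rank_resources("red shoes", sample))
```

In this sketch, a resource containing both query words is ranked ahead of a resource containing only one, and a resource containing neither is omitted from the list.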
To quickly provide a set of matching resources that are sorted according to some criteria, some search engines are designed to collect metadata about the group of resources under consideration beforehand and store such metadata in an index. The metadata associated with a resource typically constitutes less information than the full content of the resource itself. Consequently, some search engines store only the indexed information and not the full content of each resource. Such search engines may provide a user with a method of navigating to the actual resources in a search engine results page. Alternatively or additionally, a search engine may store a copy of each resource in a cache so that users can see the state of the resource at the time it was indexed, for archive purposes, or to make repetitive processes work more efficiently and quickly.
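One common way of organizing such an index is an inverted index, which maps each word to the resources that contain it. The following is a minimal sketch under that assumption; the names `build_index` and `lookup` are hypothetical, and the sketch stores only resource identifiers, not the full content of each resource.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(resources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Build an inverted index (word -> set of resource ids) ahead of query time."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for resource_id, text in resources.items():
        for term in text.lower().split():
            index[term].add(resource_id)
    return dict(index)


def lookup(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return ids of resources containing every query word, using only the index."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())  # intersect posting sets
    return sorted(matches)


if __name__ == "__main__":
    index = build_index({
        "u1": "earthquake news today",
        "u2": "earthquake safety tips",
        "u3": "red shoes",
    })
    print(lookup(index, "earthquake tips"))
```

Because the index is built before any query arrives, answering a query requires only a few dictionary lookups and set intersections rather than a scan of every resource.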
Web search engines serve a wide spectrum of user information needs. These include, for example, handling navigational queries (e.g., queries such as “yahoo” that refer to a destination on the Web) and transactional queries (e.g., queries such as “red shoes” that refer to a product or service in which a user is interested), among other query classes. Although the different classes of information needs account for varying proportions of the total queries issued to a Web search engine, an effective system will support each.
Recency-sensitive queries are queries for which the user expects resources that are both topically relevant and fresh. For example, consider the occurrence of a natural disaster such as an earthquake. A user interested in this topic desires resources that are both relevant and fresh. For example, a relevant resource may be a document that discusses the earthquake, while a fresh resource may be a document that provides novel information about the earthquake.
A Web search engine must effectively retrieve resources for recency-sensitive queries because failures can be more severe than with other query classes. First, the desire for information is immediate. A user searching for recent information might only want an update on a topic. The user might also have just heard of an event (e.g., a death) and be less willing to reformulate a query or scan a ranked list for relevant resources. Second, recency-sensitive queries are more likely to suffer from what is referred to as the zero recall problem. Such queries often refer to events for which resources have not yet been published or have been only lightly published. Because the resource metadata indexed by Web search engines is typically derived from content fetched by a Web crawler, the freshness of the resources represented in the index will depend upon the crawl policy. Zero recall queries are detrimental because no amount of user effort, whether through reformulation or scanning, can find the relevant resources. In order to avoid catastrophic failures for recency-sensitive queries, a search engine needs not just an effective model of which queries are recency-sensitive but also algorithms for effectively retrieving fresh resources.
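The dependence of zero recall on crawl policy can be illustrated with a small sketch. It assumes a hypothetical `Resource` record with an integer publication time and models the index as a snapshot taken at a single crawl time; the names `crawl_snapshot` and `search` are illustrative only.

```python
from typing import List, NamedTuple


class Resource(NamedTuple):
    resource_id: str
    text: str
    published_at: int  # hypothetical timestamp, e.g. hours since some epoch


def crawl_snapshot(resources: List[Resource], crawl_time: int) -> List[Resource]:
    """Return only the resources that existed when the crawler last ran."""
    return [r for r in resources if r.published_at <= crawl_time]


def search(snapshot: List[Resource], query: str) -> List[str]:
    """Return ids of snapshot resources sharing at least one word with the query."""
    terms = set(query.lower().split())
    return [r.resource_id for r in snapshot if terms & set(r.text.lower().split())]


if __name__ == "__main__":
    web = [
        Resource("old", "city council budget report", 1),
        Resource("new", "earthquake strikes city center", 10),
    ]
    stale = crawl_snapshot(web, crawl_time=5)   # crawled before the event
    fresh = crawl_snapshot(web, crawl_time=12)  # crawled after the event
    for query in ("earthquake", "earthquake strikes", "strikes center"):
        print(query, search(stale, query), search(fresh, query))
```

In this sketch, every reformulation of the query returns nothing against the stale snapshot, because the only relevant resource was published after the crawl; the same queries succeed once the snapshot is refreshed.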
Even if a search engine were capable of retrieving fresh resources, such resources typically lack the highly effective ranking features that derive from long-term popularity and usage, such as in-link statistics, Web page rank, click-based statistics, or the like. Thus, some method must also be provided for computing novel and effective features for ranking fresh resources, which otherwise will have impoverished representations.