1. Field of the Invention
Systems and methods consistent with the principles of the invention relate generally to information searching and, more particularly, to determining the freshness of retrieved documents and possibly using this freshness to score the retrieved documents.
2. Description of Related Art
Existing information searching systems use search queries to search through aggregated data and retrieve specific information that corresponds to the received search queries. Such information searching systems may search information stored locally or in distributed locations. The World Wide Web (“web”) is one example of information stored in distributed locations. The web contains a vast amount of information, but locating a desired portion of that information can be challenging. This problem is compounded because both the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly.
Search engines attempt to return hyperlinks to web documents in which a user is interested. Generally, search engines base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to provide links to high quality, relevant results to the user based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are “hits” and are returned to the user.
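By way of illustration only, the term-matching step described above can be sketched as follows. This is a minimal sketch, not the method of any particular search engine: the function name find_hits, the in-memory dictionary standing in for the corpus of pre-stored web documents, and the whitespace tokenization are all assumptions made for the example.

```python
def find_hits(query, corpus):
    """Return the URLs of documents that contain every term in the query.

    `corpus` is a hypothetical stand-in for the pre-stored web documents:
    a dict mapping a URL to the plain text of the document.
    """
    # Assumption: terms are separated by whitespace and matched case-insensitively.
    terms = query.lower().split()
    if not terms:
        return []

    hits = []
    for url, text in corpus.items():
        words = set(text.lower().split())
        # A document is a "hit" only if it contains all of the search terms.
        if all(term in words for term in terms):
            hits.append(url)
    return hits


corpus = {
    "http://example.com/a": "fresh news about search engines",
    "http://example.com/b": "old news archive",
    "http://example.com/c": "search tips",
}
hits = find_hits("news search", corpus)
```

In this sketch only the first document contains both search terms, so it is the only hit. Nothing in the matching step considers how recently a document was modified.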
Frequently, the web documents returned to the user as “hits” include out-of-date documents. If the freshness of web documents were reliably known, it could be used in ranking the search results to keep out-of-date web documents out of the top results. Currently, however, a reliable freshness attribute for web documents does not exist. HTTP supports a “Last-Modified” header that indicates the date and time at which the corresponding web document was last modified. This header, however, is optional in HTTP and is not supplied for all web documents. Additionally, the date indicated in the HTTP “Last-Modified” header may not be correct.
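By way of illustration only, the limitations of the HTTP last-modification attribute described above can be sketched as follows. The sketch assumes the standard HTTP header name Last-Modified and a plain dictionary of response headers keyed by that exact name; the function name last_modified_age_days and the decision to reject future dates are assumptions made for the example, not part of HTTP.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified_age_days(headers, now):
    """Return the document age in days per the Last-Modified header, or None.

    `headers` is a hypothetical dict of HTTP response headers. None is
    returned when the header is absent, unparseable, or implausible.
    """
    raw = headers.get("Last-Modified")
    if raw is None:
        # The header is optional, so many responses simply omit it.
        return None
    try:
        modified = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # A malformed date string cannot be used as a freshness signal.
        return None
    if modified.tzinfo is None:
        # Assumption: treat a date without a usable time zone as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    if modified > now:
        # A modification date in the future is clearly incorrect.
        return None
    return (now - modified).total_seconds() / 86400.0


now = datetime(2015, 10, 22, 7, 28, 0, tzinfo=timezone.utc)
age = last_modified_age_days({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}, now)
```

Even when a value is returned, the sketch can only detect dates that are obviously wrong. A server that reports a plausible but inaccurate date is indistinguishable from one that reports the true date, which is why the header alone is not a reliable freshness attribute.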