1. Field of the Invention
The invention generally relates to systems and methods for updating a search engine in a computer network, such as the Internet. More particularly, the invention is directed to a system and method for improving the freshness of links identified by the search engine in response to a search query.
2. Description of the Related Art
Computer networks have become convenient and popular means for the exchange of information. An example of such computer networks is, of course, the Internet. The Internet is a vast, decentralized public computer network that allows communication between millions of computers around the world. The large volume of information on the Internet, however, creates daunting challenges for those desiring to identify and locate specific information.
For example, a part of the Internet known as the World Wide Web (“the Web”) consists of millions of computers that store electronic files that may be accessed via the Internet. The computers and electronic files are respectively known as “web sites” and “web pages.” Web pages are created to present all kinds of information, from commercial catalogs and advertisements, to scientific literature, to governmental regulations, etc. It has been reported that there are already more than a billion web pages, and the Web is expected to grow to 100 billion web pages within two years. Without the appropriate tools, finding specific information stored somewhere in the billions of web pages amounts to the proverbial task of finding a needle in a haystack.
A search engine is one such tool that facilitates locating the desired information in a network such as the Web. A user usually accesses a web site that hosts a “search engine” and submits one or more search queries related to the information sought. Generally, a search engine is a computer program that, when queried for information, evaluates its database and retrieves related information, pointers to the location of related information, or both. In the Web context, when a user submits a query, the search engine usually responds with a list of links pointing to information resources, typically web pages hosted on other web sites, that are derived from matching entries in the search engine's database. As used herein, the term “link” generally means any representation or symbol (e.g., an address) that points to the location of an information resource, such as a web page. For example, a link on the Web is typically a pointer found in one file that references another file. A link on the Web commonly refers to a Uniform Resource Locator (URL), the global address of documents and other resources on the Web.
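The query-to-links behavior described above can be illustrated with a minimal sketch. It assumes a toy in-memory database that maps each word to the set of URLs whose indexed text contains it; the function names `build_index` and `search` and the example URLs are hypothetical and are not taken from any particular search engine.

```python
def build_index(pages):
    """Build a toy database mapping each word to the URLs that contain it.

    pages: dict mapping URL -> page text (illustrative data only).
    """
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index


def search(index, query):
    """Return the sorted list of links whose indexed text matches every query word."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Hypothetical usage: the links returned reflect the database contents,
# not the current state of the pages themselves.
pages = {
    "http://example.com/a": "fresh search links",
    "http://example.com/b": "search engine",
}
index = build_index(pages)
results = search(index, "fresh search")
```

Because the response is derived entirely from the stored entries, any page that changes after it was indexed is still reported as it was at indexing time.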
However, because web pages, or the URLs pointing to them, may be modified at random times by their maintainers (“webmasters”), the search engine often responds to the user's request with URLs from its database that are outdated. When a webmaster changes a web page, whether by adding or removing content or by deleting the page altogether, a search engine database does not immediately reflect these changes. A typical search produces a large number of links that point either to a web site that does not exist or to a web page that has been modified, moved, or deleted. Consequently, when a user clicks on an outdated URL provided by a search engine, an error results and the user is unable to access the intended content. For this reason, search engines strive to keep track of the ever-changing Web by continuously finding, indexing, and re-indexing web pages. As used here, “indexing” means the storing of links pointing to information resources, as well as some or all of the data associated with the information resource.
Most, although not all, search engines utilize computer applications called “spiders” or “robots” to index the myriad web sites on the Internet and gather content information for the search engine's database. The term “content information” as used here means either a URL or the data on the web page associated with the URL, or both. Inherently, a search engine robot indexes a significant portion of the information resources (e.g., web pages) on the Internet. For example, it has been reported that the search engines maintained by Inktomi Corporation and Google Inc. index nearly 500 million and 200 million web pages, respectively.
Usually a robot updates the links in the search engine's database in a sequential manner, i.e., starting at the first link and continuing to the last, then starting over again. The cycle time of most search engine robots, that is, the time between sampling the same web site and incorporating any changes into the search engine's database, can be significant: as long as several months. Moreover, if a particular site is not accessible when a robot comes around to examine it, the robot will not index the web pages on that web site until some future time. In the worst case, the URL pointing to the web site (including the URLs of any of its web pages) could be excluded from the search engine's database entirely. As more web sites come online, the time required for a search engine's robot to cover the entire Internet continues to increase, requiring additional computing resources.
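The sequential update cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the database is modeled as an ordered mapping from URL to last indexed content, `fetch` is a hypothetical callable that returns the current content or `None` when a site is unreachable, and the fetch rate used in the example is an invented figure, not a reported one.

```python
def sequential_pass(database, fetch):
    """Refresh every stored link once, from the first link to the last.

    database: dict mapping URL -> last indexed content (kept in stored order).
    fetch: hypothetical callable returning current content, or None if the
           site is unreachable at the time of the visit.
    Returns the URLs that were skipped because the site was unreachable.
    """
    skipped = []
    for url in list(database):
        content = fetch(url)
        if content is None:
            skipped.append(url)  # entry stays stale until a later pass
            continue
        database[url] = content
    return skipped


def cycle_time_days(num_links, fetches_per_day):
    """Time between two visits to the same link under sequential refresh."""
    return num_links / fetches_per_day


# Hypothetical usage with a three-page "web"; page "b" is unreachable.
web = {"a": "new-a", "b": None, "c": "new-c"}
db = {"a": "old-a", "b": "old-b", "c": "old-c"}
skipped = sequential_pass(db, web.get)

# Assumed rate of 10 million fetches per day over 500 million links.
days = cycle_time_days(500_000_000, 10_000_000)
```

In this sketch every link receives the same share of fetches per cycle, so the cycle time grows in direct proportion to the number of links, and a link skipped in one pass waits a full additional cycle before it is examined again.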
It is clear that the time delay between indexing and re-indexing any one content resource, e.g., a web page, leads to stale information in the search engine's database, e.g., URLs that are outdated or not “fresh.” Currently, over a given time period, an equal amount of computing resources is dedicated to refreshing each link stored in the search engine's database. However, given the large number of dynamically changing Internet resources to monitor, and the limited resources (bandwidth and storage) available to do the monitoring, there is a need in the relevant technology for a system and a method of deciding which resources should be updated first and when.