1. Field of the Invention
The invention generally relates to systems and methods for updating a search engine in a computer network, such as the Internet. More particularly, the invention is directed to a system and method for improving the freshness of links identified by the search engine in response to a search query.
2. Description of the Related Art
Computer networks have become convenient and popular means for the exchange of information. An example of such computer networks is, of course, the Internet. The Internet is a vast, decentralized public computer network that allows communication between millions of computers around the world. The large volume of information on the Internet, however, creates daunting challenges for those desiring to identify and locate specific information.
For example, a part of the Internet known as the World Wide Web ("the Web") consists of millions of computers that store electronic files that may be accessed via the Internet. The computers and electronic files are respectively known as "web sites" and "web pages." Web pages are created to present all kinds of information, from commercial catalogs and advertisements, to scientific literature, to governmental regulations, etc. It has been reported that there are already more than a billion web pages, and the Web is expected to grow to 100 billion web pages within two years. Without the appropriate tools, finding specific information stored somewhere in the billions of web pages amounts to the proverbial task of finding a needle in a haystack.
A search engine is one of the tools that facilitate locating the desired information in a network such as the Web. A user usually accesses a web site that hosts a "search engine" and submits one or more search queries related to the information sought. Generally, a search engine is a computer program that, when queried for information, retrieves either related information or pointers to the location of related information, or both, by evaluating its database. In the Web context, when a user submits a query, the search engine usually responds with a list of links pointing to information resources, typically web pages hosted on other web sites, that are derived from matching entries in the search engine's database. As used herein, the term "link" is generally any representation or symbol (e.g., an address) that points to the location of an information resource, such as a web page. For example, typically a link on the Web is a pointer found in one file which references another file. The link on the Web commonly refers to a Uniform Resource Locator (URL), the global address of documents and other resources on the Web.
However, because web pages, or the URLs pointing to them, may be modified at random times by their maintainers ("webmasters"), the search engine often responds to the user's request with URLs from its database that are outdated. When a webmaster changes the content of a web page, including adding or removing content or deleting the page altogether, a search engine database does not immediately reflect these changes. A typical search produces a large number of links that either point to a web site that does not exist, or to a web page that has been modified, moved or deleted. Consequently, when a user clicks on the outdated URL provided by a search engine, an error results and the user is unable to access the intended content. For this reason, search engines strive to keep track of the ever changing Web by continuously finding, indexing, and reindexing web pages. As used here, "indexing" means the storing of links pointing to information resources, as well as some, or all, of the data associated with the information resource.
Most, although not all, search engines utilize computer applications called "spiders" or "robots" to index the myriad of web sites on the Internet and gather content information for their search engine's databases. The term "content information" as used here means either a URL or the data on the web page associated with the URL, or both. Inherently, a search engine robot indexes a significant number of all the information resources (e.g., web pages) in the Internet. For example, it has been reported that the search engines maintained by Inktomi Corporation and Google Inc. index nearly 500 million and 200 million web pages, respectively.
Usually a robot updates the links in the search engine's database in a sequential manner, i.e., starting at the first link and continuing to the last, then starting over again. The cycle time of most search engine robots, that is, the time between sampling the same web site and incorporating any changes into the search engine's database, can be a significant period of time, as long as several months. Moreover, if a particular site is not accessible when a robot comes around to examine it, the robot will not index the web pages on that web site until some future time. In the worst case scenario, the URL pointing to the web site (including any URLs to any of its web pages) could be excluded from the search engine's database entirely. As more web sites come online, the amount of time for a search engine's robot operation to cover the entire Internet continues to increase, requiring additional computing resources.
It is clear that the time delay between indexing and reindexing any one content resource, e.g., a web page, leads to information stored in the search engine's database that is stale, e.g., outdated or not "fresh" URLs. Currently, over a given time period, an equal amount of computing resources is dedicated to refreshing each link stored in the search engine's database. However, given the large number of dynamically changing Internet resources to monitor, and only limited resources (bandwidth and storage) available to do the monitoring, there is a need in the relevant technology for a system and a method of deciding which resources should be updated first and when.
The invention disclosed here seeks to overcome the problem of stale information in a search engine's database by providing a system and method of improving the freshness of the contents of the database. In one embodiment, the invention provides a method of updating contents of a search engine database comprising a plurality of links each associated with a resource. The method may comprise determining popularity of each of the plurality of links based, at least in part, on the frequency of retrieval of the link by the search engine in response to a search request. The method may further comprise determining whether the popularity of the link exceeds a predetermined popularity threshold. The method may further include updating information associated with the link, provided that the popularity exceeds the popularity threshold.
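By way of illustration only, the threshold-based method described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names `LinkRecord`, `popularity`, and `refresh_popular_links` are hypothetical, popularity is assumed to be the fraction of search requests in which the link was retrieved, and `fetch` is a caller-supplied stand-in for whatever mechanism re-reads the resource.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class LinkRecord:
    """Hypothetical database entry: a link plus its retrieval statistics."""
    url: str
    retrieval_count: int = 0  # times the link was returned in search results
    content: str = ""         # last indexed data for the resource


def popularity(record: LinkRecord, total_requests: int) -> float:
    """Assumed popularity measure: retrievals per search request."""
    if total_requests <= 0:
        return 0.0
    return record.retrieval_count / total_requests


def refresh_popular_links(
    records: Iterable[LinkRecord],
    total_requests: int,
    threshold: float,
    fetch: Callable[[str], str],
) -> List[str]:
    """Update only those links whose popularity exceeds the threshold.

    Returns the URLs that were updated, in the order they were visited.
    """
    updated = []
    for record in records:
        if popularity(record, total_requests) > threshold:
            record.content = fetch(record.url)  # re-read the resource
            updated.append(record.url)
    return updated
```

In this sketch, links at or below the threshold are simply left untouched; a practical system would presumably still refresh them on some slower schedule.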
In another embodiment, the invention provides a system for updating contents of a search engine database comprising a plurality of links each associated with a resource. The system may comprise a first module that is configured to determine the popularity of each of the plurality of links based, at least in part, on the frequency of retrieval of said link by a search engine in response to a search request; the first module may further determine whether the popularity of the link exceeds a predetermined popularity threshold. The system may further comprise a second module, operationally connected to the first module, that is configured to access the search engine database and the resource for updating information associated with the link, provided that the popularity exceeds the popularity threshold.
Another aspect of the invention is a system for updating contents of a search engine database comprising a plurality of links each associated with a resource. The system of this embodiment may comprise means for determining popularity of each of the plurality of links based, at least in part, on the frequency of retrieval of said link by the search engine in response to a search request. The system may further comprise means for determining whether the popularity of the link exceeds a predetermined popularity threshold. The system may further include means for updating information associated with the link, wherein updating information is performed if the popularity exceeds the popularity threshold.
In another embodiment, the invention provides a method of updating contents of a search engine database comprising a plurality of links each associated with a resource. The method of this embodiment may comprise determining popularity of each of the plurality of links based, at least in part, on the frequency of retrieval of said link by the search engine in response to a search request. The method may further comprise updating information associated with at least one of the plurality of links, wherein the most popular link among not-yet-updated links of the plurality of links is selected first for updating.
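As a non-limiting illustration, selecting the most popular not-yet-updated link first could be sketched with a priority queue. The function name `update_most_popular_first`, the `budget` parameter (an assumed cap on how many links are refreshed in one pass), and the `fetch` callable are hypothetical and introduced here only for illustration; ties in popularity are broken by URL order, which is an arbitrary choice of this sketch.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def update_most_popular_first(
    retrieval_counts: Dict[str, int],
    budget: int,
    fetch: Callable[[str], str],
) -> Tuple[List[str], Dict[str, str]]:
    """Refresh links in descending order of retrieval count.

    retrieval_counts maps each URL to how often the search engine
    returned it; at most `budget` links are refreshed.
    """
    # heapq is a min-heap, so counts are negated to pop the largest first.
    heap = [(-count, url) for url, count in retrieval_counts.items()]
    heapq.heapify(heap)

    order: List[str] = []
    refreshed: Dict[str, str] = {}
    while heap and len(order) < budget:
        _, url = heapq.heappop(heap)  # most popular not-yet-updated link
        refreshed[url] = fetch(url)
        order.append(url)
    return order, refreshed
```

Because each link is popped from the queue exactly once, every selection is made among the links that have not yet been updated, which mirrors the ordering described in this embodiment.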