The present invention relates generally to searching and gathering information on computer networks. More specifically, it relates to improved techniques for gathering large amounts of information from a large number of resources on a network, e.g., web crawling.
The world wide web (or simply, "the web") has enjoyed explosive growth in recent years, and now contains enormous amounts of information. This information is not centrally stored, but is distributed throughout millions of web servers. Moreover, the information is not static, but is constantly changing as web servers update, add, delete, or otherwise modify the information they make available to the network.
Popular web search engines allow users to quickly search the dynamic, distributed information on the web. Because searching the web directly would take an enormous amount of time, these search engines search a centralized index that summarizes the information stored on the web. An essential component of this approach to web searching is the task of gathering information from web servers ("web crawling") and creating the searchable index from the gathered information.
Conventional web crawling typically involves a systematic exploration of the web to discover and gather information. Because the amount of information on the entire web is so large, web crawling consumes a proportionately large amount of time and network bandwidth, and places a large burden on both the crawlers and the servers. Moreover, because information on the web is constantly and unpredictably changing, the entire web crawling procedure is periodically repeated in order to keep the index information current. If this recrawling is not performed frequently enough, the web index will contain a large amount of obsolete information for some web sites ("undercrawled sites") whose content changes often. On the other hand, if recrawling is performed too frequently, valuable computational and network resources are wasted because a large portion of information has not changed at many web sites ("overcrawled sites").
Conventional web crawling techniques also have the problem that they often do not discover all the information actually available on the web. Their normal strategy for discovering new information is to examine the hyperlinks within known documents. Some information, however, may not have direct hyperlinks from other documents, or may only have direct hyperlinks from other undiscovered documents. As a result, this information is not discovered, gathered, or included into the index used by the search engine.
The present inventors are not aware of any existing techniques by others that effectively address these problems. U.S. Pat. No. 5,860,071 and AT&T Labs Tech. Report #97.23.1 discuss the AT&T Internet Difference Engine (AIDE). The primary purpose of AIDE is to track changes to web documents and display the update information to a user in a personalized manner. This and similar techniques are directed to the problem experienced by users who are browsing large collections of changing web documents, and want to be automatically notified when certain information of interest to them has changed. They are not directed to the problems associated with web crawling, and do not teach any solution to these problems.
To address the above problems with the current state of the art, the present inventors have developed a network repository service directory for efficient web crawling. The directory provides a centralized registry of web servers currently providing a repository service. The repository service supplements the functions of a web server to enable an increase in the efficiency of web crawling. In particular the repository service: (a) automatically maintains a file modification list that contains the names of files on the server that have been modified (i.e., added, deleted, or otherwise modified), together with the date and time of the file modification; and (b) provides a requesting crawler with the file modification list (or a portion of the list corresponding to a time period specified by the crawler). The repository service may also (c) limit or restrict access privileges of crawlers that do not request the file modification list, thereby protecting the server from overcrawling. The repository service enables a crawler to request the file modification list, and avoid unnecessarily recrawling files that have not been modified since its last visit, thereby preventing considerable waste of time, network bandwidth, server processing resources, and crawler processing resources. Using the file modification list, the crawler can remove all prior references to deleted files, and efficiently recrawl only those files that have been added or changed since the crawler last visited the web server.
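The behavior described above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: the names `Modification`, `RepositoryService`, `modifications_since`, and `apply_modifications`, the three modification kinds, and the in-memory index are all assumptions made for illustration. The sketch shows a server-side file modification list that can be filtered by a crawler-specified time, and a crawler that removes deleted files from its index and refetches only added or changed files.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Modification:
    """One entry of the file modification list (hypothetical layout)."""
    path: str
    kind: str        # assumed values: "added", "deleted", or "changed"
    when: datetime   # date and time of the modification


class RepositoryService:
    """Hypothetical server-side component that keeps the modification list."""

    def __init__(self) -> None:
        self._log: List[Modification] = []

    def record(self, path: str, kind: str, when: datetime) -> None:
        # Called whenever a file on the server is added, deleted, or changed.
        self._log.append(Modification(path, kind, when))

    def modifications_since(
        self, since: Optional[datetime] = None
    ) -> List[Modification]:
        # Return the whole list, or only the portion after the given time.
        if since is None:
            return list(self._log)
        return [m for m in self._log if m.when > since]


def apply_modifications(
    index: Dict[str, str],
    mods: List[Modification],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Update a crawler's index using only the reported modifications."""
    for m in sorted(mods, key=lambda entry: entry.when):
        if m.kind == "deleted":
            index.pop(m.path, None)        # drop references to deleted files
        else:
            index[m.path] = fetch(m.path)  # recrawl only added/changed files
    return index
```

In this sketch, files that do not appear in the returned list are never fetched, which corresponds to the crawler avoiding unnecessary recrawling of unmodified files.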
The repository service solves the problems associated with both overcrawling and undercrawling. Because crawlers that request the file modification list will not unnecessarily recrawl unmodified files, they will no longer overcrawl web servers whose data is infrequently modified. Crawlers that do not request the file modification list, on the other hand, will have their access limited or restricted, preventing them from overcrawling the web server. The problem of undercrawling is solved by virtue of the increased efficiency in crawling. Because all unnecessary crawling is eliminated, resources are made available for more frequent crawling of information that actually is changing, as well as for other uses. Consequently, any index produced from the information gathered by the crawler will have more current information. The repository service also has the advantage that it informs the crawlers of all new web content. As a result, web crawlers will not miss documents that are not linked to known documents, and the information gathered by the crawler will be more complete.
In order to increase the advantageous use of repository services, a repository service directory (or "master repository service") is also provided. For example, when a new web server is initially connected to the internet, crawlers and other web sites are not automatically aware of its presence. Consequently, these new sites will initially be undercrawled and their content will remain unindexed until crawlers discover them. To address this problem, the master repository service provides a registry for web servers. New web sites that register with the master repository service are included in a directory. Crawlers query the master repository service and are then informed of the most recent servers on the web.
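As a rough illustration of the registry described above, the following Python sketch keeps a directory of registered servers and lets a crawler ask which servers have registered since a given time. The class name `MasterRepositoryService`, its method names, and the example URLs are hypothetical; the source does not specify an interface or storage format.

```python
from datetime import datetime
from typing import Dict, List


class MasterRepositoryService:
    """Hypothetical in-memory registry of servers offering a repository service."""

    def __init__(self) -> None:
        # Maps server URL to the time the server first registered.
        self._registered: Dict[str, datetime] = {}

    def register(self, server_url: str, when: datetime) -> None:
        # Keep the first registration time; repeat registrations are ignored.
        self._registered.setdefault(server_url, when)

    def servers_registered_since(self, since: datetime) -> List[str]:
        # A crawler calls this to learn about servers it has not yet seen.
        return sorted(
            url for url, when in self._registered.items() if when > since
        )


directory = MasterRepositoryService()
directory.register("http://new.example", datetime(2000, 5, 1))
directory.register("http://old.example", datetime(1999, 1, 1))
newly_registered = directory.servers_registered_since(datetime(2000, 1, 1))
```

Here a crawler that last queried the directory at the start of 2000 learns only about the server that registered afterward, without having to discover it through hyperlinks.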
According to another aspect of the invention, the master repository service can include web site modification information for registered web servers. For example, the web site modification information may include, for each registered web server, the time and date of the most recent modification to the web content. Prior to a new crawling session, a crawler queries the master repository service and obtains a list of web servers whose web content has been modified since the time of the crawler's last crawling session. Consequently, the crawler needs to visit only those web servers in the list provided by the master repository service. This technique, therefore, increases the efficiency of the crawler. In addition, it reduces the burden on web servers whose content is relatively constant.
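The query described in this aspect might look like the following Python sketch, which stores the most recent modification time per registered server and returns only the servers modified after the crawler's last session. The class `SiteModificationDirectory`, its methods, and the example URLs are assumptions for illustration, not an interface taken from the source.

```python
from datetime import datetime
from typing import Dict, List


class SiteModificationDirectory:
    """Hypothetical store of the latest modification time per registered server."""

    def __init__(self) -> None:
        self._last_modified: Dict[str, datetime] = {}

    def report_modification(self, server_url: str, when: datetime) -> None:
        # Keep only the most recent modification time for each server.
        previous = self._last_modified.get(server_url)
        if previous is None or when > previous:
            self._last_modified[server_url] = when

    def servers_modified_since(self, last_crawl: datetime) -> List[str]:
        # Servers whose content changed after the crawler's last session.
        return sorted(
            url
            for url, when in self._last_modified.items()
            if when > last_crawl
        )


site_directory = SiteModificationDirectory()
site_directory.report_modification("http://a.example", datetime(2000, 1, 10))
site_directory.report_modification("http://b.example", datetime(2000, 1, 1))
to_visit = site_directory.servers_modified_since(datetime(2000, 1, 7))
```

In this example the crawler would visit only `http://a.example`, because the other server has not changed since the crawler's last session.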
In summary, the master repository service provides advantages to both web servers and web crawlers. Web servers enjoy quick visibility on the web, and are contacted by crawlers only when their web content has changed. Web crawlers, on the other hand, immediately know about new web sites, and need only spend time visiting web sites that have changed their content.