1. Field of the Invention
The present invention relates to a scalable method for collaborative web crawling and information processing. More particularly, the invention concerns a distributed collection of web crawlers used to cover a large portion of cyberspace, in which the crawlers divide the overall crawling task among themselves and collaborate to maximize the utilization of computing resources.
2. Description of the Related Art
Cyberspace is a popular way for people and industries to rapidly gather information. However, because of the immense amount of information available in cyberspace, automatic information gathering, screening, and delivering systems have become a necessity.
One such system is the Grand Central Station (GCS) system being developed at the IBM Almaden Research Center in San Jose, Calif. This system combines numerous aspects of information discovery and dissemination into a single, convenient system. GCS performs many functions by providing an infrastructure that supports the discovery and tracking of information in a digital domain such as cyberspace, and disseminates these discoveries to those who have an interest.
One of the key components of virtually all information discovery system infrastructures accessing cyberspace (i.e., the Internet) is a Gatherer that systematically gathers (crawls) data sources and transforms or summarizes them into a single, uniform metadata format. This format generally reflects the format found in the system used by the person requesting the information. Webcasting technology, referred to as an "Internet push," is used to match the summarized information with users' profiles and re-channel each piece of information to those who need it.
To assist in gathering information, cyberspace data located at a particular site being reviewed is logically arranged into a graph or tree, commonly referred to as a directed graph. The Gatherer traverses this web-graph looking for desired information. Because of the sheer volume of data available, the graph reviewed may be very large. For example, a directed graph representing one million pieces of potentially interesting information would be enormous in both size and complexity, and the Gatherer would require a considerable amount of time to process it.
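By way of illustration only, the traversal of a web-graph by a single Gatherer may be sketched as a breadth-first walk over a directed graph. The sketch below assumes the graph is already held in memory as a mapping from each page to the pages it links to; the names `crawl_order` and `links` are hypothetical and do not appear in the GCS system, and a real Gatherer would discover links by fetching pages rather than by reading a prepared mapping.

```python
from collections import deque


def crawl_order(links, start):
    """Return pages in the order a breadth-first traversal visits them.

    links: dict mapping a page to the list of pages it links to
           (an assumed in-memory stand-in for the web-graph).
    start: the page at which the traversal begins.
    """
    seen = {start}          # pages already queued, so cycles are not revisited
    queue = deque([start])  # pages waiting to be processed
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


if __name__ == "__main__":
    example = {"a": ["b", "c"], "b": ["c", "a"], "c": ["d"]}
    print(crawl_order(example, "a"))
```

Because every page passes through one queue, the time required grows with the number of pages and links, which is the cost that motivates dividing the graph among several processors.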
To make a Gatherer more efficient, a system that allows partitioning of a web-space directed graph is needed. Preferably, the system would also allow "team-crawling," where web-space information could be gathered using multiple processors, each assigned to crawl part of the same space. However, for such a partitioning to work, the problems encountered in automatically partitioning the cyberspace for load balancing among gathering processors need to be overcome. This is a different and much more challenging problem than that addressed in traditional graph partitioning studies dealing with very large scale integrated (VLSI) circuit design and parallel scientific computing.
For example, one difficulty comes from the fact that a web-space directed graph, used to model the information at the site, is usually not discoverable before the crawling occurs. This is because web sites are dynamic, that is, they are always changing, having information added and deleted up to the point the crawling actually takes place. This constant changing of the information--and therefore of the directed graph used to model the information--prevents directly applying the previously mentioned graph partitioning methods, which are designed for static (non-changing) graphs. Because the web-graph is not fully known before a web space is partitioned, the amount of load and the number of hyperlinks crossing a partition can change at any stage of collaborative crawling, and hence dynamic re-partitioning and load re-balancing would also be necessary.
Another problem that would need to be overcome is the addressing problem that arises in attempting to partition a web-space. For example, given a uniform resource locator (URL)--a commonly used designator for the location of a piece of information (object)--a quick decision needs to be made as to which partition the URL belongs to. Depending upon the partition, the URL would then be sent to a designated processor for crawling and processing. Further, because the web-graph is dynamic, a problem can arise in simply organizing a partition.
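To make the addressing problem concrete, one simple way to decide quickly which partition a URL belongs to is to hash its host name. The sketch below is an assumption offered for illustration and is not a method described in this section; the function name `partition_for_url` and the choice of hashing by host are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def partition_for_url(url, num_partitions):
    """Map a URL to a partition index in the range [0, num_partitions).

    Assumption: URLs are grouped by host name, so all pages of one
    site are handled by the same processor.
    """
    # urlsplit(...).hostname is already lowercased; fall back to "" if absent.
    host = urlsplit(url).hostname or ""
    # A stable digest is used so every processor computes the same answer.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    for u in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(u, "->", partition_for_url(u, 4))
```

A fixed hash of this kind answers the addressing question in constant time, but it takes no account of how much work each host represents, so it does not by itself provide the dynamic re-partitioning and load re-balancing discussed above.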