1. Field of the Invention
This invention relates to methodologies for distributed web crawling and, more particularly, to a web crawling system that uses an IP address and/or an IP address range to assist in efficiently downloading the websites that belong to that IP address and/or IP address range.
2. Description of Background
A crawler, or robot, is a software component that continuously visits websites on the Internet or an intranet, downloads web pages from those websites, and stores them in a local repository for further analysis and data mining. There are many types of crawlers, and each category of crawler can be configured to carry out specific functions. For example, focused or topical crawlers limit their crawling to sites belonging to specific taxonomies or geographical regions. The crawlers are configured with such limitations in order to ensure that the sites being crawled are relevant to an overall goal of the system. Focused and topical crawling is typically implemented by specifying a web space that is to be crawled. A web space is determined according to utilization need and comprises a set of allow and forbid rules, the rules being used to control the set of sites and directories that a focus crawler is allowed to visit. Configuring the web space for a focus crawler is critical, as these rules are used to ensure that the focus crawler crawls all the pages that have been determined to be of interest.
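By way of illustration only, a web space of the kind described above may be sketched as a set of allow and forbid rules that is consulted before a URL is visited. The sketch below is not taken from any particular crawler: the class name WebSpace, the method permits, the representation of a rule as a (host, directory prefix) pair, and the policy that a forbid rule overrides an allow rule are all assumptions made for the example.

```python
from urllib.parse import urlsplit


class WebSpace:
    """Hypothetical web space: allow and forbid rules over sites and directories."""

    def __init__(self, allow, forbid):
        # Assumption: each rule is a (host, directory_prefix) pair,
        # e.g. ("example.com", "/news/"). Prefixes should end with "/".
        self.allow = list(allow)
        self.forbid = list(forbid)

    @staticmethod
    def _matches(rule, host, path):
        rule_host, rule_prefix = rule
        return host == rule_host and path.startswith(rule_prefix)

    def permits(self, url):
        """Return True if the crawler may visit the given URL."""
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        path = parts.path or "/"
        # Assumption: a forbid rule always overrides an allow rule.
        if any(self._matches(rule, host, path) for rule in self.forbid):
            return False
        # A URL that matches no allow rule lies outside the web space.
        return any(self._matches(rule, host, path) for rule in self.allow)


# Example web space: the whole site is allowed except one directory.
space = WebSpace(
    allow=[("example.com", "/")],
    forbid=[("example.com", "/private/")],
)
```

In this sketch every site of interest must be listed by name in a rule, so the rule set has to be extended by hand whenever a new relevant site appears.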
The continual growth in the number of sites on the Internet leads to an increasing number of challenges when defining the web space for a focus crawler. Therefore, there exists a need for a methodology that improves the efficiency of determining a web space and, further, of implementing policies that configure focus crawlers to crawl the defined web spaces.