1. Field of the Invention
The present invention is directed to spider engines and, in particular, to regulating the rate of data retrieval by a spider engine.
2. Related Art
"Web crawlers", "robots", or "spider engines" are programs used to automatically search the Internet for web pages or documents of interest. The information found by the spider engine may be collected, cataloged, and otherwise used by search engines. For example, a spider engine may be directed to search for and collect particular types of data, such as product catalog information, or may randomly search and catalog all found web pages to create a web index. The spider engine may enter a particular web site, and search one or more web pages of the web site for information of interest. The web site being searched may maintain a large number of web pages. Hence, searching with a spider engine may entail downloading, via the Internet, hundreds, thousands, and even more pages of information in a relatively short amount of time, from a single web site server.
Searching a web site in this manner with a spider engine may cause a web site server to become heavily loaded with web page requests. A web site server may be physically limited to supporting a particular number of web page requests at any one time. The loading due to requests from a single spider engine may approach this web page request limit, and impair the web server's ability to respond to other requests for information during this period. This overloading may be detrimental to the web site provider's goal of making information available to interested parties, and may discourage interested parties from visiting the web site because they receive denials of service. Hence, what is needed is a method and system for limiting such web site requests of a web server by a spider engine, while still yielding acceptable search results.
The present invention prevents a spider engine from overloading a web site with web page requests. The present invention includes a timing module that is coupled to the spider engine. The timing module of the present invention prevents the overloading of a web site server. The timing module monitors data transfer between the web site server and the spider engine, and provides the spider engine with information to adjust the data transfer rate accordingly. The timing module can insert a "wait" state of a calculated length of time between data requests by the spider engine. By controlling this wait time inserted between data requests, the timing module is able to adjust the overall data transfer rate between the web site server and the spider engine to a desired level.
The present invention is directed to a system for retrieving web-site based information using a spider engine at a target bandwidth. A timing module is coupled to or otherwise associated with the spider engine. The timing module includes a data receiver, a bytes accumulator, a current time determiner, a wait time calculator, and a wait time transmitter. The data receiver receives a target bandwidth, BT, and at least one bytes count from the spider engine. The bytes accumulator accumulates the at least one bytes count received from the spider engine to create an aggregate bytes count, bytesAGG. The current time determiner determines a start time, TSTART, and current time, TNOW, for the at least one received bytes count. The wait time calculator calculates a wait time as a function of bytesAGG, BT, and an elapsed time (TNOW - TSTART). The wait time is the amount of time the spider engine should wait to initiate a next web-site data retrieval to reach the target bandwidth. A wait time transmitter transmits the wait time, TWAIT, calculated by the wait time calculator to the spider engine.
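By way of illustration only, the components of the timing module described above might be sketched in Python as follows. The class name, method names, and injectable clock are hypothetical and are not taken from the specification. The sketch assumes that BT is expressed in bytes per second and that times are in seconds, and it additionally assumes that a negative calculated wait time is treated as zero.

```python
import time
from typing import Callable, Optional


class TimingModule:
    """Illustrative timing module; all names and units here are assumptions."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock                             # current time determiner
        self.target_bandwidth: Optional[float] = None   # BT, assumed bytes/second
        self.bytes_agg = 0                              # bytesAGG
        self.t_start: Optional[float] = None            # TSTART

    def set_target_bandwidth(self, bt: float) -> None:
        """Data receiver: accept the target bandwidth BT from the spider engine."""
        if bt <= 0:
            raise ValueError("target bandwidth must be positive")
        self.target_bandwidth = bt

    def start(self) -> None:
        """Record TSTART and reset the aggregate bytes count."""
        self.t_start = self._clock()
        self.bytes_agg = 0

    def report_bytes(self, count: int) -> float:
        """Accept one bytes count and return TWAIT for the spider engine."""
        if self.target_bandwidth is None or self.t_start is None:
            raise RuntimeError("set_target_bandwidth() and start() must be called first")
        self.bytes_agg += count                         # bytes accumulator
        t_now = self._clock()                           # TNOW
        # Wait time calculator: function of bytesAGG, BT, and (TNOW - TSTART).
        wait = self.bytes_agg / self.target_bandwidth - (t_now - self.t_start)
        # Assumption: a negative result means the engine is already at or below
        # the target rate, so no wait is returned.
        return max(0.0, wait)
```

In such a sketch, the spider engine would call `report_bytes` after each retrieval and pause for the returned number of seconds before issuing its next request.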
The present invention is further directed to a method of retrieving web site based information at a target bandwidth. A target bandwidth, BT, is received. The target bandwidth, BT, defines a desired information transfer rate with the web site. A wait time, TWAIT, is calculated. Data retrieval from the web site is delayed by the calculated wait time so that the data is retrieved at the desired target bandwidth, BT.
A start time, TSTART, is calculated. Retrieval of data is initiated from a remote web-site across a network. A number of bytes received is detected. An aggregate bytes count, bytesAGG, is incremented by the number of bytes received. A current time, TNOW, is calculated. The wait time, TWAIT, is calculated. TWAIT may be calculated according to the equation:
TWAIT = (bytesAGG)/BT - (TNOW - TSTART)
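As a hedged worked example of the above equation, the following Python sketch walks through the steps with a simulated clock rather than real network retrievals. The function names, page sizes, and download durations are hypothetical, BT is assumed to be in bytes per second, and a non-positive wait time is assumed to mean that no delay is inserted.

```python
def wait_time(bytes_agg: int, bt: float, t_now: float, t_start: float) -> float:
    """TWAIT = bytesAGG / BT - (TNOW - TSTART); seconds if BT is bytes/second."""
    return bytes_agg / bt - (t_now - t_start)


def simulate(page_sizes, download_seconds, bt):
    """Walk the retrieval steps with a simulated clock.

    Returns (aggregate bytes, total elapsed seconds including waits).
    """
    t_start = 0.0                  # start time, TSTART
    clock = t_start                # simulated current time
    bytes_agg = 0                  # aggregate bytes count, bytesAGG
    for size, took in zip(page_sizes, download_seconds):
        clock += took              # the retrieval itself consumes time
        bytes_agg += size          # increment by the number of bytes received
        t_wait = wait_time(bytes_agg, bt, clock, t_start)
        if t_wait > 0:             # assumption: non-positive wait means no delay
            clock += t_wait        # delay the next retrieval by TWAIT
    return bytes_agg, clock - t_start


if __name__ == "__main__":
    # 50,000 bytes received 2 s after TSTART with BT = 10,000 bytes/s:
    # TWAIT = 50,000 / 10,000 - 2 = 3 s.
    print(wait_time(50_000, 10_000, 2.0, 0.0))
    total, elapsed = simulate([50_000, 20_000, 30_000], [2.0, 0.5, 1.0], 10_000)
    print(total / elapsed)         # effective bytes per second over the run
```

With these illustrative figures, the first retrieval yields a wait of 3 seconds, so that 50,000 bytes are spread over 5 seconds. Over the whole run, 100,000 bytes are spread over 10 seconds, which matches the assumed target of 10,000 bytes per second.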