1. Field of the Invention
The present invention relates generally to a bandwidth control system and a method therefor capable of reducing traffic congestion on content servers, and more particularly to a method for controlling network traffic and to a method and a device for content crawling, each capable of reducing traffic congestion on content servers.
2. Description of the Background Art
With the development of information technology and the spread of information communication equipment, great volumes of web information, i.e. information described in mark-up languages such as HTML (HyperText Markup Language), have become accessible through the World Wide Web over the Internet.
However, because of the huge amount of information, it has become difficult to find the information that is actually needed. A number of search engines are available on the Internet for this purpose. These search engines include not only general-purpose ones but also specialized ones for searching for information in particular fields, such as job information.
To implement a search engine, it is necessary to build a crawler that automatically accesses the Web and collects documents therefrom; a morphological analyzer that performs morphological analysis of text in a specific language, such as Japanese; an index generator that generates indices for enabling retrieval of necessary information from the collected documents; and other units for performing other necessary processes.
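By way of illustration only, the role of the index generator can be sketched as follows. This is a minimal sketch and not the method of any cited reference: the names `tokenize`, `build_index`, and `search` and the sample documents are hypothetical, and a simple lowercase word split stands in for a real morphological analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: a plain word split stands in for morphological analysis.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Return the ids of documents that contain every term of the query.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical collected documents.
documents = {
    "doc1": "Job information for network engineers",
    "doc2": "Network traffic and bandwidth control",
}
index = build_index(documents)
```

In this sketch, a query is answered from the index alone, without rereading the collected documents.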
In this connection, U.S. Patent Application Publication No. US 2005/0071766 A1 to Brill et al. discloses systems and methods for obtaining information from a networked system utilizing a distributed web crawler. The distributed nature of clients of a server is leveraged to provide fast and accurate web crawling data. Information collected by a server's web crawler is compared to data retrieved by clients of the server to update the crawler's data. In one instance of this prior art technique, data comparison is achieved by utilizing information disseminated via a search engine results page. In another instance of this prior art technique, data validation is accomplished by client dictionaries, emanating from a server, which summarize web crawler data. This prior art technique also facilitates data analysis by providing means to resist spoofing of a web crawler to increase data accuracy.
A web crawler or spider is a program that accesses the Web in a methodical, automated manner, and collects content.
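The general behavior of such a program can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler of Brill et al.: the names `LinkParser`, `crawl`, and `FAKE_WEB` are hypothetical, and a small in-memory dictionary stands in for the Web so that no network access takes place.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href values of anchor tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, start_url, max_pages=100):
    # Breadth-first traversal: fetch a page, store it, queue unseen links.
    seen = {start_url}
    queue = deque([start_url])
    collected = {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page not available
            continue
        collected[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


# Assumption: an in-memory stand-in for pages on a content server.
FAKE_WEB = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a>',
    "/c": "<p>no links</p>",
}
pages = crawl(FAKE_WEB.get, "/a")
```

The `seen` set keeps the sketch from fetching the same page twice, and `max_pages` bounds the amount of content collected in one run.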
In the prior art technique described in Brill et al., the web crawler continues accessing the server from which content is collected until the collection of content is completed, and does so over several parallel connections at the same time, so that a certain amount of the network bandwidth is consumed.
However, if the network bandwidth is consumed by the crawling process, the network bandwidth available for providing the service of the server may become deficient. Particularly for heavily trafficked servers, a shortage of available network bandwidth may substantially degrade the quality of service. The crawling process must therefore not cause communication delay or congestion.
For this reason, there is a need for a network communications traffic control method which can reduce the consumption of the network bandwidth.
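The effect of parallel connections on bandwidth can be illustrated with a rough estimate. The sketch below is a simplification that ignores protocol overhead and congestion control; the function name `crawler_bandwidth_share` and all numeric values are hypothetical and are not taken from Brill et al.

```python
def crawler_bandwidth_share(connections, per_connection_mbps, link_capacity_mbps):
    # Fraction of the server's link that crawler traffic would occupy,
    # assuming every connection transfers at the same steady rate.
    if link_capacity_mbps <= 0:
        raise ValueError("link capacity must be positive")
    demand = connections * per_connection_mbps
    used = min(demand, link_capacity_mbps)  # a link cannot exceed its capacity
    return used / link_capacity_mbps


# Hypothetical figures: 8 parallel connections at 5 Mbit/s on a 100 Mbit/s link.
share = crawler_bandwidth_share(
    connections=8, per_connection_mbps=5.0, link_capacity_mbps=100.0
)
```

Under these assumed figures the crawler alone would occupy 40% of the link, and the share grows in proportion to the number of parallel connections until the link is saturated.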