Web crawlers are software programs that automatically download and extract information from the World Wide Web. A crawler selectively accesses webpages and their relevant links to obtain required information according to an established objective. Crawlers are often used to collect data from the network and upload it to the databases of search engines. Some crawlers, however, gather specific types of information on webpages, such as email addresses, for malicious purposes such as sending spam. Crawlers sometimes also harvest useful content from websites and misappropriate it without obtaining permission from its creators. Thus, some webpage content should be protected from web crawlers.
Existing anti-crawl techniques typically involve setting a maximum number of access requests that a single IP user may make in a unit of time, tracking the requests of every IP user that accesses the website, and recording the number of requests each user makes to the website in a unit of time. The system determines whether the recorded number of requests by a user in a unit of time exceeds the maximum set by the website. If the maximum number is not exceeded, the user's requests are accepted; otherwise, the requests are determined to be crawling requests by a crawler and the requests of that user are refused. Other actions, such as sending a notification to the user or blocking the IP address directly, may also be performed.
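The per-IP request limit described above can be illustrated with a minimal sketch. The class name `FixedWindowLimiter`, its parameters, and the choice of fixed (non-sliding) time windows are assumptions made for illustration only; they are not taken from any particular existing system.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Hypothetical helper: counts requests per IP in fixed time windows."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests      # maximum requests per IP per window
        self.window_seconds = window_seconds  # length of the "unit of time"
        # Maps ip -> [window_index, request_count_in_that_window]
        self._counters = defaultdict(lambda: [None, 0])

    def allow(self, ip, now):
        """Return True if the request is accepted, False if it is refused."""
        window = int(now // self.window_seconds)
        entry = self._counters[ip]
        if entry[0] != window:
            # A new unit of time has started for this IP: reset its count.
            entry[0] = window
            entry[1] = 0
        entry[1] += 1
        return entry[1] <= self.max_requests


# Example: at most 3 requests per 60-second window from one IP.
limiter = FixedWindowLimiter(max_requests=3, window_seconds=60)
decisions = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 61)]
print(decisions)  # [True, True, True, False, True]
```

In this sketch the fourth request within the same window is refused, and the count resets once a new window begins. Requests from a different IP address are counted separately.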
Existing anti-crawl techniques can lead to a poor experience for legitimate users who make frequent requests to the website, since their requests may be deemed malicious crawling and refused. Moreover, a crawler can evade the website server's inspection by forging its IP address in order to crawl the information on targeted websites.