As both usage of the Internet and the number of web pages on the Internet grow, there is an increasing need to provide relevant information. A general web crawler is often used to find information for presentation to users. A general web crawler typically browses the Internet for the purpose of indexing. Further, general web crawlers are often utilized with web scrapers to copy pages for later processing by a search engine.
A general web crawler starts with a list of uniform resource locators (seed URLs) to visit. As the general web crawler visits the seed URLs, the crawler identifies all hyperlinks in each page and adds them to a list of hyperlinks (e.g., a list of URLs) to visit. Web scrapers typically scrape the pages that the general web crawler visits.
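The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not a description of any particular crawler: the names `LinkExtractor`, `crawl`, `fetch`, `max_pages`, and the in-memory `PAGES` table are all assumptions introduced here, and the "web" is a small dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    frontier = deque(dict.fromkeys(seed_urls))  # URLs still to visit
    seen = set(frontier)                        # URLs already queued
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory pages standing in for real web servers.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

order = crawl(["http://example.test/"], PAGES.get)
```

Because every hyperlink found is queued, a single seed URL is enough to reach all three pages here, which is also why such a crawler can generate far more requests than an ordinary visitor.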
Unfortunately, general web crawlers may disproportionately utilize web site resources when compared to normal traffic. For example, crawling all URLs on a web page and scraping the crawled pages may require significant resources from one or more hosting web servers. As the number of general web crawlers increases, the resources required to service their requests will likely impair a hosting web server's ability to provide service to individual users.
Further, many operators of hosting web servers value the information that is provided on the hosted web pages and may wish to guard against excessive scraping of that information. For example, many web sites generate advertising revenue by attracting users to their sites with valuable aggregated information (e.g., reviews). As a result, operators of hosted web servers may limit scraping of information from their web pages.
Techniques that operators of hosted web servers may utilize to limit scraping of information include, for example, rate limiting, monthly limits, and total limits. A web server that utilizes rate limiting limits the number of times a site or set of web pages is visited by a particular IP address or particular machine over a short period of time (e.g., over a second or minute). For example, a web server may utilize rate limiting to eliminate spikes of requests from a particular device over a short period of time. A web server that utilizes monthly limits utilizes a process that is similar to rate limiting but over a longer period of time (e.g., over a month). A web server may utilize monthly limits (or limits over a predetermined period of time, not necessarily a month) to eliminate a volume of visits that may fall below the rate limit but that nonetheless indicates behavior that is not consumer or customer behavior. A web server that utilizes total limits utilizes a process that limits the total number of visits over any period of time.
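The three kinds of limits can be illustrated with a small sketch. It is only one possible arrangement under stated assumptions: the class name `VisitLimiter`, the sliding-window approach, the 60-second and 30-day window lengths, and the threshold values are all hypothetical and not taken from any particular web server.

```python
from collections import defaultdict, deque


class VisitLimiter:
    """Tracks visits per client and applies rate, monthly, and total limits."""

    MINUTE = 60                # assumed short window, in seconds
    MONTH = 30 * 24 * 3600     # assumed long window, in seconds

    def __init__(self, per_minute, per_month, total):
        self.per_minute = per_minute
        self.per_month = per_month
        self.total = total
        self.recent = defaultdict(deque)   # allowed visits in the short window
        self.monthly = defaultdict(deque)  # allowed visits in the long window
        self.count = defaultdict(int)      # all allowed visits, never reset

    @staticmethod
    def _trim(window, now, span):
        # Drop timestamps that have aged out of the window.
        while window and now - window[0] >= span:
            window.popleft()

    def allow(self, client, now):
        """Return True if a visit by `client` at time `now` (seconds) is allowed."""
        recent, monthly = self.recent[client], self.monthly[client]
        self._trim(recent, now, self.MINUTE)
        self._trim(monthly, now, self.MONTH)
        if (len(recent) >= self.per_minute
                or len(monthly) >= self.per_month
                or self.count[client] >= self.total):
            return False
        recent.append(now)
        monthly.append(now)
        self.count[client] += 1
        return True


limiter = VisitLimiter(per_minute=2, per_month=4, total=5)
results = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 61, 122, 200)]
# results == [True, True, False, True, True, False]
```

In this run the third request is refused by the rate limit (a spike within one minute), while the last request is refused by the monthly limit even though it is well spaced, which mirrors the distinction drawn above.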
Another technique an operator of a hosted web server may utilize to identify and blacklist visitors is the use of honeypots. In one example, a honeypot is a link that may not be viewable from a web page (e.g., a link encoded in the page that has no width or is otherwise not displayable on the web page, or a link that changes dynamically via JavaScript). Since a general web crawler typically scans a web page's code for links, the general web crawler may crawl the honeypot link, thereby allowing the operator of the hosted web server to identify and blacklist the general web crawler.
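A honeypot of this kind might look roughly like the sketch below. Everything in it is an assumption made for illustration: the trap path `/trap-7f3a`, the use of `display:none` to hide the link, the `handle_request` and `naive_links` helpers, and the use of status code 403 for refused requests.

```python
import re

TRAP_PATH = "/trap-7f3a"  # hypothetical honeypot URL, never shown to human visitors

# Page markup with one visible link and one hidden honeypot link.
PAGE = (
    '<a href="/reviews">Reviews</a>'
    f'<a href="{TRAP_PATH}" style="display:none"></a>'
)

blacklist = set()


def handle_request(client, path):
    """Return an HTTP-like status code; blacklist any client that requests the trap."""
    if client in blacklist:
        return 403
    if path == TRAP_PATH:
        blacklist.add(client)
        return 403
    return 200


def naive_links(html):
    """Extract every href from the markup, as a crawler scanning page code would."""
    return re.findall(r'href="([^"]+)"', html)


# A crawler follows every link in the markup, including the hidden one.
crawler_status = [handle_request("crawler", p) for p in naive_links(PAGE)]
# A human visitor only follows the link that is actually displayed.
human_status = handle_request("human", "/reviews")
```

After this run the crawler's identifier is in `blacklist` and its later requests are refused, while the human visitor, who never sees the hidden link, is unaffected.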
As a result of these techniques and others, general web crawlers are often limited in their ability to acquire information.