Web scraping generally refers to extracting data or content from a website through manual or automated processes. The extracted data may be used in various ways, such as indexing the website to facilitate search, running a separate website, or powering a separate application. In some cases, the data may be sold to third parties or used by a competitor for analysis, often without attribution to the originator.
While some web scraping activities are friendly and welcome, others are damaging to the website. For example, a search engine may use an automated software tool, called a bot, to visit the web pages of the website and index them. When a user searches the web using the search engine, the index information can be used to determine whether the web pages match the user's search request. If they match, the search results can direct the user to those web pages. Since the search engine helps drive web traffic to the website, web scraping by the search engine is generally welcome.
However, a scraper may use the extracted data to set up a scraper site, which serves its users using the data extracted through web scraping without referring the users to the original website. This or any other unauthorized use of the data by a web scraper is generally not welcome.
Web scraping may also overload the website, causing degradation in response performance for regular users of the website.
Some techniques exist to stop or slow a bot. For example, if the IP address of the bot is known, it can be blocked to prevent further access by the bot. As another example, bots may be blocked using tools that automatically determine whether there is a real person behind the request, such as “Completely Automated Public Turing test to tell Computers and Humans Apart” (CAPTCHA) tests.
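The IP-blocking technique can be illustrated with a minimal sketch. It assumes a hand-maintained blocklist and a simplified request handler; the names `BLOCKED_NETWORKS`, `is_blocked`, and `handle_request` are hypothetical and chosen only for illustration, and the addresses come from ranges reserved for documentation. A real deployment would typically enforce such rules in a firewall or web server rather than in application code.

```python
import ipaddress

# Hypothetical blocklist of known bot addresses (illustrative only).
# These ranges are reserved for documentation and match no real host.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # an entire subnet
    ipaddress.ip_network("198.51.100.7/32"),  # a single address
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


def handle_request(client_ip: str) -> int:
    """Return an HTTP-style status: 403 for blocked clients, 200 otherwise."""
    return 403 if is_blocked(client_ip) else 200
```

Calling `handle_request("203.0.113.5")` under these assumptions would reject the request, while an address outside the listed ranges would be served normally. The sketch also shows the limitation stated above: the check only helps when the bot's address is already known.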