This disclosure relates to the field of computers. More particularly, a system, method, and apparatus are provided for resisting the scraping of data from a website or other electronic data repository.
Many websites make information available to visitors without requiring them to log in or otherwise verify or authenticate their identities. For example, even a website that requires a member or user to log in to access some information may make other information available to anonymous and first-time visitors, as well as to users who do not log in. Such publicly available information may include directory pages, basic information about an organization associated with the website, public profiles of some or all members, etc.
In general, website scraping allows the scraper to quickly assemble large amounts of data (which may be proprietary) without much effort. Although in some cases scraping of a website may be relatively harmless, in other cases malicious actors may scrape a website or data repository to obtain information (e.g., names, electronic mail addresses, telephone numbers) that can be used for undesirable purposes (e.g., identity theft, spamming). Additionally or alternatively, the scraping activity itself may interfere with normal (i.e., non-malicious) use of the website, because it consumes resources that were intended for legitimate use of the site.
Sometimes it can be difficult to determine that a given user session is a scraping attempt, especially if the user/scraper masks its activity to avoid appearing to be an automated entity (e.g., a bot). For example, a website may capture an IP (Internet Protocol) address of an entity that scrapes the site, but because IP addresses can be spoofed and because multiple people may use a single address, a later connection from the same address may represent either another scraping attempt or a legitimate user connection.