WWW robots, also called Web Wanderers, Web Crawlers or Web Spiders, and often just referred to as bots (bot is short for robot), are programs devised to automatically traverse the hypertext structure of the Web. Having retrieved a document, such a bot can also recursively retrieve all the pages linked from that document. This is especially the case for the robots of the numerous search engines, which roam the World Wide Web finding and indexing content to add to their databases. Although most robots provide a valuable service, concern has developed amongst Web site administrators about exactly how much of their precious server time and bandwidth is being used to service requests from these engines.
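The traversal described above can be illustrated with a minimal sketch. It is not taken from any particular robot: the names LinkExtractor, crawl and SITE are hypothetical, the pages are held in memory rather than fetched over HTTP, and an iterative queue stands in for literal recursion so that pages linking back to each other are retrieved only once.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Retrieve start_url, then every page reachable from it through links.

    `fetch` is an assumed callable that returns the HTML of a URL, or None
    when the page cannot be retrieved. Returns URLs in retrieval order.
    """
    visited = []
    seen = {start_url}       # URLs already queued, to avoid repeat retrieval
    pending = [start_url]    # queue of URLs still to be retrieved
    while pending and len(visited) < max_pages:
        url = pending.pop(0)
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return visited


# Hypothetical three-page site kept in memory so the sketch runs offline.
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "http://example.test/b.html": "<p>no links</p>",
}

print(crawl("http://example.test/", SITE.get))
```

The max_pages bound and the seen set are design choices of this sketch: without them, a robot of this kind would retrieve the same files repeatedly or never terminate on a large site.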
While the majority of robots are well designed, professionally operated and cause no problems, there are occasions where robots visiting Web servers are not welcome because of the way they behave. Some may swamp servers with rapid-fire requests, or retrieve the same files repeatedly. If done intentionally, this is a form of Denial of Service (DoS) attack, although it is more often just the result of a poor or defective robot design. In other situations robots traverse parts of WWW servers that are not suitable for being searched, e.g., parts that contain duplicated or temporary information, large documents, or CGI scripts (CGI is a standard for running external programs from a World-Wide Web HTTP server). In this latter case and in similar situations, scripts, when accessed and executed, tend to consume significant server resources in generating dynamic pages and thus slow down the system.
In recognition of these problems many Web robots offer facilities for Web site administrators and content providers to limit what the robot is allowed to do. Two mechanisms are provided. One is referred to as the ‘Robots Exclusion Protocol’, even though it is not really an enforced protocol but a working draft document discussed as an Internet-Draft by the Internet Engineering Task Force (IETF) in 1996 under the title ‘A Method for Web Robots Control’. According to this document, a Web site administrator can indicate which parts of the site should not be visited by a robot by providing a specially formatted file at http:// . . . /robots.txt. The second mechanism allows a Web author to indicate whether a page may or may not be indexed, or analyzed for links, through the use of a special Hyper Text Markup Language (HTML) META tag, i.e., a ‘Robots META tag’. However, both of these mechanisms rely on cooperation from the robots, and are not guaranteed to work for every robot. Moreover, as already suggested above in relation to DoS attacks, some robots may not be so friendly. They could be run, e.g., with the malicious intent of attacking a Web site (in which case they simply ignore the robots.txt file and the Robots META tags), so that the site becomes overloaded and starts refusing to serve legitimate users, i.e., the human beings trying to make normal use of the site.
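As an illustration of how a cooperating robot might honor both mechanisms, the following sketch uses Python's standard urllib.robotparser and html.parser modules. The robots.txt content, the user-agent name ExampleBot, the example.test URLs and the helper names load_rules, RobotsMetaParser and page_permissions are all assumptions made for the example; only the noindex and nofollow directives of the Robots META tag are handled.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real robot would retrieve it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for token in content.split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def page_permissions(html):
    """Return (may_index, may_follow_links) according to the Robots META tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


rules = load_rules(ROBOTS_TXT)
# A cooperating robot asks before each retrieval.
print(rules.can_fetch("ExampleBot", "http://example.test/index.html"))
print(rules.can_fetch("ExampleBot", "http://example.test/cgi-bin/search"))
# After retrieval, it checks the page's own META tag.
print(page_permissions('<meta name="robots" content="noindex, nofollow">'))
```

Both checks are performed by the robot itself, which is why a robot that omits them is not constrained by either mechanism.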
Also, although the information made available on a site may not be confidential, an administrator may want to prevent the unlimited dissemination of it that would otherwise result from the indexing and referencing activities of all sorts of robots. The standard way of achieving this is to protect a Web site through some form of authentication, of which the most common method is to manage a list of registered users, each having a password, so that they have to sign on upon accessing the site. The obvious drawback is that administrators must manage and update a closed list of users. This requires a registration step before a first consultation of a site and also assumes that users will remember their passwords for subsequent consultations. This may not be at all what the administrator wanted to achieve, and may even be counterproductive, since a registration request will certainly discourage some individuals who would otherwise be willing to browse the site from going any further.