In the World Wide Web (WWW) environment, client machines communicate with web servers using the Hypertext Transfer Protocol (HTTP). The web servers provide users with access to files such as text, graphics, images, sound and video, using a standard page description language known as Hypertext Markup Language (HTML). HTML provides basic document formatting and allows a developer to specify connections, known as hyperlinks, to other servers and files. In the Internet paradigm, a network path to a server is identified by a Uniform Resource Locator (URL), which has a special syntax for defining a network connection. So-called web browsers, for example Netscape Navigator (Netscape Navigator is a registered trademark of Netscape Communications Corporation) or Microsoft Internet Explorer, are applications running on a client machine that enable users to access information by specifying a link via the URL and to navigate between different HTML (web) pages.
When the user of the web browser selects a link, the client issues a request to a naming service to map a hostname (in the URL) to a particular network IP (Internet Protocol) address at which the server is located. The naming service returns an IP address that can respond to the request. Using the IP address, the web browser establishes a connection to a server. If the server is available, it returns a web page. To facilitate further navigation within a web site, a web page typically includes one or more hypertext references known as “anchors” or “links”.
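The two pieces of information a browser works with in this exchange, the host name inside a URL and the anchors inside a returned page, can be illustrated with a minimal sketch using only the Python standard library. The page content, the URL and the helper names (`AnchorCollector`, `hostname_of`, `links_in`) are assumptions made for illustration; no network request or name lookup is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hostname_of(url):
    """Return the host name that would be passed to a naming service."""
    return urlparse(url).hostname


def links_in(page_url, html):
    """Return absolute URLs for the anchors found in an HTML page."""
    collector = AnchorCollector()
    collector.feed(html)
    # Relative links are resolved against the URL of the page itself.
    return [urljoin(page_url, href) for href in collector.hrefs]


# Hypothetical page, used only for illustration.
SAMPLE_URL = "http://www.corp.com/england/index.html"
SAMPLE_HTML = (
    '<html><body><a href="london.html">London</a> '
    '<a href="/france/paris">Paris</a></body></html>'
)

if __name__ == "__main__":
    print(hostname_of(SAMPLE_URL))
    print(links_in(SAMPLE_URL, SAMPLE_HTML))
```

In a real browser the host name would then be resolved to an IP address before a connection is opened; that step is omitted here so that the sketch runs without network access.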
Today, a vast number of web pages exist, and the information within them is dynamic, decentralised and diverse. For a user, the task of traversing this information can be very difficult and time-consuming. Therefore, there is a need for an efficient and automated method of traversing this information, so that a user is able to find relevant information amongst the vast number of pages that exist.
A “robot” is a type of “agent” that offers one solution to this problem. An agent is a computer program that is goal-oriented, that is, an agent tries to achieve some end result. For example, an agent could perform a task on behalf of a user. This is shown in FIG. 1, using the example of the Internet. In FIG. 1, a user at a client computer (100) dispatches two agents via a controlling application program running on the client (100). “Agent 1” and “Agent 2” are dispatched over a network (110), which in this example is the Internet. Since agents can be customised, the user can dispatch “Agent 1” to find a first piece of information held on a remote server (120), for example, the address of the nearest pizza restaurant. The user can also dispatch “Agent 2” to find a second piece of information, for example, the phone number of a taxi firm, which in this example is held on the same remote server (120).
A robot is a special automated form of agent. The robot may simply react to changes in its environment or to stimuli. “Web” robots are widely used for search and extraction of information held in web pages. They also have other uses, such as personal shopping, whereby the robot collects information about products and prices from the WWW and presents it to the user. Robots can also be utilised in other media, such as databases.
Information gathering robots, typically used to retrieve unstructured information such as text or images, are also known as “spiders”, “crawlers” or “wanderers”. These types of robots are most often used in highly interconnected data environments, such as the WWW. The term “crawling” is often used to denote the process of moving through an environment in a managed way. Specifically, an information gathering robot is a program that automatically explores the WWW by retrieving a document and recursively retrieving some or all of the documents that are linked to it. In doing so, the robot generates a web index of documents.
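The retrieve-and-follow behaviour described above can be sketched as a short traversal. This is a minimal illustration under stated assumptions, not a description of any particular robot: the link graph `LINKS` is a hypothetical in-memory stand-in for the WWW, `fetch_links` stands in for retrieving a document and extracting its links, and the traversal uses an explicit queue (breadth-first) rather than literal recursion.

```python
from collections import deque

# Hypothetical link graph standing in for the WWW: document -> linked documents.
LINKS = {
    "/index.html": ["/a.html", "/b.html"],
    "/a.html": ["/b.html", "/c.html"],
    "/b.html": ["/index.html"],
    "/c.html": [],
}


def fetch_links(document):
    """Stand-in for retrieving a document and returning the links it contains."""
    return LINKS.get(document, [])


def crawl(start, fetch, max_documents=100):
    """Visit documents reachable from start and return them in visit order."""
    index = []          # the "web index" of documents visited so far
    seen = {start}      # prevents revisiting a document linked more than once
    queue = deque([start])
    while queue and len(index) < max_documents:
        document = queue.popleft()
        index.append(document)
        for linked in fetch(document):
            if linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return index


if __name__ == "__main__":
    print(crawl("/index.html", fetch_links))
```

The `max_documents` limit reflects the "some or all" wording: a robot need not follow every link it discovers.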
There are two main categories of crawling, namely unfocussed and focussed. In unfocussed crawling, the robot is not looking for anything in particular and its main aim is to gather as much information as possible. This technique is often used by a “search engine”, which searches through a web index in order to help locate information, for example by keyword. In focussed crawling, the robot is looking for a particular piece of information. This technique is used by a specialised robot such as a shopping robot.
More information about agents and web robots can be found in the book “Internet Agents: Spiders, Wanderers, Brokers and Bots” by Fah-Chun Cheong, New Riders Publishing, 1996.
Many robots are used for legitimate reasons, such as searching. Robots are often developed by well-known organisations, for example, for the search engine technology of Yahoo, Lycos, Google and so forth. However, the first robots to be developed had a reputation for sending hundreds or thousands of requests to each web site when gathering documents, which often resulted in the web site being overloaded. Although the development of robots has improved, some robots may still exhibit unfriendly behaviour, and it is this type of behaviour that an administrator may not be willing to tolerate.
Another reason for an administrator to want to block access to robots is to prevent them from indexing dynamic information. Using the example of searching again, many search engines will use information collected from a web site repeatedly, for weeks or months to come. Obviously, this feature is not much use if the web site is providing stock quotes, news, weather reports or any other information that will be out of date by the time a user finds it via a search engine. In addition, malicious robots are routinely used to systematically copy content assets from public web sites.
Currently, there are a number of methods of excluding robots from web sites. One example is the “Standard for Robot Exclusion” proposed by Martijn Koster and available at http://www.robotstxt.org/wc/wxclusion-admin.html. The protocol specifies a format for a file “Robots.txt”, located in a web server's root directory. This file provides a means to request that a named robot limits its activities at a particular web site, or requests that a robot leave a web site. In FIG. 2, the first line in the robots.txt file (200) identifies that the exclusion policies refer to a robot called “Robot—1”. The second line of the file (200) specifies that Robot—1 should not visit any URLs where “/england/london” is present after the host name in the URL, where a host name may take the form “www.corp.com”. In the third line, the robot is also excluded from visiting any URLs where “/france/paris” is present after a host name.
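How a cooperating robot might consult such a file can be sketched with Python's standard `urllib.robotparser` module. This is an illustrative sketch only: the file content below is an assumed rendering of the file (200) in FIG. 2, the robot name is written here as `Robot-1` as an assumption, and the URLs beyond the paths named above are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rendering of the robots.txt file (200) described for FIG. 2.
ROBOTS_TXT = """\
User-agent: Robot-1
Disallow: /england/london
Disallow: /france/paris
"""


def build_parser(robots_txt):
    """Parse robots.txt content held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent, url in [
        ("Robot-1", "http://www.corp.com/england/london/map.html"),
        ("Robot-1", "http://www.corp.com/france/paris"),
        ("Robot-1", "http://www.corp.com/spain/madrid"),
        ("OtherBot", "http://www.corp.com/england/london"),
    ]:
        print(agent, url, parser.can_fetch(agent, url))
```

The check is performed by the robot itself, so the file expresses a request rather than an enforced rule; a robot that never calls anything like `can_fetch` is unaffected by it.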
However, a disadvantage of the Standard is that the exclusion policies may or may not be obeyed. This is because, although a robot may review the robots.txt file, it is for the robot's creator to decide whether or not the file is obeyed. In the case of malicious robots, the Standard is often ignored or misinterpreted, resulting in web sites being adversely affected by the actions of uncontrolled robots. If this occurs, a major challenge for administrators is to identify malicious robots and put in place manual methods for explicitly dealing with them promptly and effectively.
Some robots may be relatively simple to detect, since their activity may be concentrated into a short time period. Alternatively, the robot may manifest itself as a form of “denial-of-service” or “ping attack”. In this case, a server is repeatedly hit by requests, which limits its capability to respond effectively. However, other robots use techniques that make them difficult to detect. One example is hiding amongst the “noise” of traffic created by legitimate users of the system. Another example is taking hours to complete a navigation of a system. In these cases, the manual and explicit exclusion of robots is difficult and unreliable.
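The simple case, activity concentrated into a short time period, can be illustrated with a sliding-window request counter. This is a minimal sketch under stated assumptions and is not a method described in the text: the class name `BurstDetector`, the threshold and the window length are hypothetical, clients are identified by an arbitrary string, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flags a client that makes too many requests within a time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # client -> timestamps inside the window

    def record(self, client, timestamp):
        """Record one request; return True if the client exceeds the limit."""
        recent = self._recent[client]
        recent.append(timestamp)
        # Drop requests that have fallen out of the window.
        while timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_requests


if __name__ == "__main__":
    detector = BurstDetector(max_requests=3, window_seconds=10.0)
    # A client sending four requests within three seconds is flagged on the fourth.
    print([detector.record("fast", t) for t in (0.0, 1.0, 2.0, 3.0)])
    # A client that spaces its requests widely is never flagged.
    print([detector.record("slow", t) for t in (0.0, 20.0, 40.0, 60.0)])
```

The second client shows the limitation noted above: a robot that takes hours to navigate a system stays under any such threshold.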
Another method of controlling robots, or spiders in the case of this method, can be found at http://www.spiderhunter.com. The method described at this web site uses data collected when a user visits a web site, rather than using analysis of log files. To collect data, the method utilises three pieces of information, namely, an IP address associated with the user, the name of the spider being used and the file being requested. The method uses a neural net to check for new information and compares the new information against known information. For example, an IP address of a potential spider is checked to see whether it matches a known IP address of a spider. The neural net uses a baseline to determine whether the user is legitimate and uses weights to determine the likelihood of the user being a spider.
There are many disadvantages with using a neural net for detection of robots. For example, the accuracy of the output results from this method depends on the amount of information input into it. Also, an administrator will not be able to modify the underlying detection method to suit their needs; only the weights can be modified. This particular method also relies on a potential spider providing an IP address. However, if a spider enters a site through multiple proxies, it may be able to hide its IP address. Another web site offering a similar service is “Spider Central”, which can be found at http://www.john.php4hosting.com.
Therefore, there is a need for a method of automatically detecting and managing malicious robots, so that administrators can control access to their web sites, servers and systems more effectively.