This invention relates generally to methods for limiting access of client computers over a computer network to data accessed through a server machine. More particularly, it relates to methods for monitoring client requests and denying access to clients whose requests significantly reduce server performance, or who are attempting to obtain excessively large portions of server resources.
The popularization of the Internet is changing the ways in which information is typically distributed. Rather than using a limited number of print publications, such as books or magazines, or gaining access to libraries, a person can obtain a great deal of information by accessing a Web server using a browser on a client computer.
Specialized Web sites exist that share large databases with the general public. For example, the U.S. Patent and Trademark Office (www.uspto.gov) provides a searchable full-text patent database containing all U.S. patents issued since 1976. Similarly, IBM hosts a Java™ Web site (www.ibm.com/java) through which developers access technical articles and case studies, and download code segments and other tools. Gourmet© and Bon Appetit© magazines jointly produce the Epicurious© (www.epicurious.com) Web site, which contains an enormous recipe database. Each of these sites allows users at client browsers to enter particular search queries, for example, patent classifications, code segment titles, or recipe ingredients. In response, the Web server provides the user with a set of matching Web pages. Each individual Web page result can also be accessed directly using its Universal Resource Locator (URL).
Most Web servers track the number of times their sites are accessed, termed "hits"; popular Web sites receive thousands of hits in a single day. When a request is made to a server (a GET message), the request is logged in a log file. Log files are not standardized, but generally contain a timestamp, an identifier for the client, and a request string. Web sites can then use the number of hits to attract advertisers to their site, offsetting their maintenance costs and allowing them to continue to provide unlimited and free access.
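A single entry in such a log might look like the following (the format is illustrative only, since log files are not standardized; the address and request string are hypothetical):

    [05/Mar/1999:14:02:11 -0500] 192.0.2.17 "GET /recipes?ingredient=basil HTTP/1.0"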
In addition to individual users, Web servers are also accessed heavily by robots, programs that automatically traverse the Web to create an index. Robots, also known as spiders or webcrawlers, retrieve a document and then retrieve all the linked documents contained within the initial retrieved document, rapidly spreading throughout the Web. They may also systematically march through every document on a server. Robots are most commonly, but not exclusively, used by search engines. One robot (ImageLock) records every single image it encounters to determine possible copyright infringers. Robots are not inherently destructive, but they can cause two significant problems for a Web server, both of which are referred to as "overcrawling." First, if they request documents too frequently, they may significantly reduce a server's performance. Second, it is possible (although often a violation of copyright law) to systematically download an entire Web site information repository using a robot, and then publish the information elsewhere.
Currently, these problems are addressed manually. If a system administrator notices a significant performance decrease, he or she can examine the log files to determine the source of the problem. If one robot is causing the problem, it can be excluded using the Robot Exclusion Standard: the system administrator creates a structured text file called /robots.txt that indicates parts of the server that are off-limits to specific robots. In general, robots read the file before making a request, and do not request files from which they are excluded. However, even if a robot does not follow the standard, it is possible to exclude it if its Internet Protocol (IP) address is known.
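For example, a /robots.txt file of the following form (an illustrative sketch; "SlowBot" is a hypothetical robot name) bars one robot from a database directory while leaving the rest of the site open to it and to all other robots:

    User-agent: SlowBot
    Disallow: /database/

    User-agent: *
    Disallow: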
Manual patrolling of log files is quite time-consuming for the system administrator, especially as a Web site's hit count grows. Because it cannot be done in real time, a crawler is blocked only after it has slowed down site performance dramatically, or after it has downloaded significant amounts of server resources.
A standard method for automatically limiting access to data is through the use of a firewall. A firewall is a set of related programs that protect the resources of a private network by regulating access of outsiders to the network (and often also by regulating access of insiders to the Internet). Firewalls may allow outside access only to users with specific IP addresses or passwords, or may provide alarms when network security is being breached. However, they are generally not designed for protecting the resources of servers that provide information to the general public.
A variety of systems have been developed to monitor access of clients to server data. Two broad categories are found: systems for clients who have previously registered to access a server, and who provide an identification that must be authorized; and systems for analyzing client activity to develop statistical data and client profiles, which can be used for marketing or advertising purposes. Both types of monitoring systems may also include features to determine if there is excessive traffic that will crash the server. Examples of the first category include U.S. Pat. No. 5,553,239, issued to Heath et al., which discloses a system and apparatus for monitoring a client's activity level during connection to a server; and U.S. Pat. No. 5,708,780 to Levergood et al., which provides a system for monitoring the requests an authorized client makes to a server. These systems cannot be used to address the current problem, which occurs in publicly accessible servers.
In the second category is U.S. Pat. No. 5,796,952, issued to Davis et al. In this system, a client profile is developed based on client requests and time spent using each requested file. A server stores information on the amount of data downloaded and the choices the client has made. Based on the data analysis, specific advertising can be sent to the client. This system does not address the problems detailed above, and is mainly concerned with the user's behavior after the requested file is sent to the client machine.
Real-time log file analysis is commonly performed; commercial software packages are available and can be tailored to suit a Web server's specific needs. These software packages maintain and analyze log files to create reports of demographics, purchasing habits, average time per visitor, and other information. In U.S. Pat. No. 5,787,253 to McCreery et al., an Internet activity analyzer is disclosed. The analyzer provides source and destination information and indications of Internet usage. It also detects potential server problems so that users may be notified. A real-time log file analyzer is also provided by U.S. Pat. No. 5,892,917, issued to Myerson. This analyzer creates supplemental log records for cached files that were likely used to satisfy user requests, in order to create a more accurate profile of user activity. None of the prior art log file analyzers use the gathered information to dynamically determine whether crawlers are abusing their access, either by making excessively frequent requests or by downloading excessive portions of the server database, and none can dynamically decide to refuse access.
An additional problem, not addressed by the prior art, is that there is not always a one-to-one correlation between robots and IP addresses, or other client identifiers. For example, in many corporations, users access the Internet through a gateway server. All of the users then have the same IP address, and may appear in a log file as a single user. Conversely, a robot might deceptively use multiple IP addresses to systematically download Web site information without being detected.
There is a need, therefore, for a method for dynamically limiting robot access to server data as requests are being made.
Accordingly, it is a primary object of the present invention to provide a system and method for dynamically blocking access of abusive robots to server resources.
It is an additional object of the invention to provide a method that dynamically blocks a client from accessing a server if it has made too many requests.
It is another object of the present invention to provide a method that dynamically blocks a client from accessing a server if it is attempting to download a significant portion of the server's database.
It is a further object of the present invention to determine whether excessive requests from a single client identifier are from a gateway server and represent legitimate requests from multiple users.
It is an additional object to track overcrawling from different client identifiers that represent one robot.
These objects and advantages are attained by a method for limiting access of a client computer to data objects accessible through a server computer in a distributed computer network. Preferably, the distributed computer network is the Internet, and the data object is a Web page. The method is implemented in the server and automatically recognizes when a client computer is making requests too frequently or is accessing too much of the server computer's resources. The quantitative definitions of "too frequently" and "too much" are selected by a system administrator or equivalent to accommodate the needs and limitations of the particular server. The method can detect three types of clients: a single client making too frequent requests and accessing too much of the server resources; a group of clients behind a single subnet mask, whose combined requests are too frequent for a single client but which does not access too much of the server data; and a single entity operating from multiple client computers but accessing too much of the server resources.
The method has four steps: receiving a request for a data object from a client computer, recording a log entry for the request in a log file, calculating client request values associated with the client identifier from the log entry and from previous log entries, and refusing to send the requested data object if at least one of the client request values exceeds one of a set of corresponding predefined maximum request values. Preferably, the server sends a refusal message to the client computer over the distributed computer network when the request is refused.
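A minimal server-side sketch of these four steps follows. The helper names, the single request-frequency value, and the threshold are assumptions chosen for illustration, not a definitive implementation of the claimed method.

    import time

    MAX_VALUES = {"requests_per_minute": 100}   # assumed threshold, chosen by the administrator

    def compute_request_values(client_id, log):
        """Derive client request values from the log; here only a request frequency."""
        now = time.time()
        recent = [e for e in log if e["client_id"] == client_id and now - e["time"] <= 60.0]
        return {"requests_per_minute": len(recent)}

    def handle_request(client_id, requested_object, log):
        # Step 1: the request has been received; Step 2: record a log entry for it.
        log.append({"client_id": client_id, "object": requested_object, "time": time.time()})
        # Step 3: calculate client request values from this and earlier entries.
        values = compute_request_values(client_id, log)
        # Step 4: refuse the object if any value exceeds its corresponding maximum.
        if any(values[name] > MAX_VALUES[name] for name in MAX_VALUES):
            return "403 Refused"            # refusal message returned to the client
        return "200 OK: " + requested_object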
The log entry comprises a client identifier, preferably an IP address, and a timestamp for the request. In one embodiment, the calculated client request values include a request frequency for that client, calculated from the current log entry and from previous log entries associated with the same client identifier. The set of corresponding predefined maximum request values includes a maximum request frequency, and the client's request frequency is compared with the maximum request frequency to determine whether the client should be refused access. The maximum request frequency is defined as a number of requests x1 in a time period t1. Preferably, the predefined maximum request values also include at least one additional maximum request frequency: x2 requests in a time period t2, where x1 is not equal to x2 and t1 is not equal to t2. Multiple, independently selectable maximum request frequencies help detect irregular patterns the robot may use to escape detection.
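One way to test a client's logged timestamps against several independently selected maximum request frequencies (x1 requests in t1 seconds, x2 in t2, and so on) is sketched below; the particular limit values are assumptions for illustration only.

    # Assumed example limits: at most 30 requests in 60 s (x1, t1) and 200 in 3600 s (x2, t2).
    FREQUENCY_LIMITS = [(30, 60.0), (200, 3600.0)]

    def exceeds_frequency(timestamps, now, limits=FREQUENCY_LIMITS):
        """True if the client's request timestamps violate any (x_i, t_i) limit."""
        for max_requests, period in limits:
            if sum(1 for t in timestamps if now - t <= period) > max_requests:
                return True
        return False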
In a second embodiment, the log entry also includes at least one data object identifier, which may be a Universal Resource Locator (URL) for the data object. Alternately, the method includes an additional step of processing the request to generate a result set containing at least one result data object. In this case, the data object identifier corresponds to the result data object, and the request must be processed before the log entry can be completed. In this embodiment, the client request values include a cumulative data request, a measure of how much of the server resources the client has already requested and received in the past. The set of corresponding predefined maximum request values includes a data access threshold, the maximum amount or fraction of data the client may receive. If the client's cumulative data request exceeds the data access threshold, the client request is refused. Either embodiment (frequency or data threshold) may be used separately, or both may be used together, and the client may be refused access if any one of the client request values exceeds the corresponding predefined maximum request value, or only if all of them do.
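The cumulative-data embodiment can be pictured as follows; the threshold value and the way resources are measured here (distinct data objects already sent, as a fraction of the whole collection) are assumptions for illustration.

    DATA_ACCESS_THRESHOLD = 0.05   # assumed: a client may receive at most 5% of the collection

    def exceeds_data_threshold(client_id, log, total_objects):
        """Cumulative data request measured as the fraction of distinct objects already sent."""
        received = {e["object"] for e in log if e["client_id"] == client_id}
        return len(received) / total_objects > DATA_ACCESS_THRESHOLD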
Alternately, the cumulative data request value may cover all previous requests, including those made under different client identifiers, rather than only requests associated with a single client identifier. However, only the current request is refused.
The invention also provides a method having additional steps of comparing the client identifier with a deny list including denied client identifiers and refusing to send the requested data object when the client identifier is on the deny list. If one or all of the client request values exceed the corresponding predefined maximum request values, the client identifier is added to a dynamically-generated deny list. In an alternate embodiment, if the client identifier is on an exception list, the client identifier cannot be added to the deny list, even if the request values are too high.
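A sketch of the deny-list and exception-list behavior follows; the in-memory set representation and the example exception entry are assumptions, and the handling of excepted clients is one possible reading of the embodiment.

    deny_list = set()                 # dynamically generated deny list of failed client identifiers
    exception_list = {"192.0.2.250"}  # assumed example: identifiers that may never be denied

    def check_and_update(client_id, request_values, max_values):
        """Refuse clients on the deny list; add newly failing identifiers unless excepted."""
        if client_id in deny_list:
            return False                      # already denied: refuse this request
        if client_id in exception_list:
            return True                       # excepted identifiers are never added to the deny list
        if any(request_values[k] > max_values[k] for k in max_values):
            deny_list.add(client_id)          # failed client identifier joins the deny list
            return False
        return True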
Finally, the invention provides a data protection system associated with the server. The system includes a log file described above, a request analyzer, and a dynamically-generated deny list. The request analyzer calculates the request values and compares them with the corresponding predefined maximum request values to generate failed client identifiers. The failed client identifiers are added to the deny list. When the server receives a new request from a known client, it refuses the request if the known client has a client identifier matching one of the failed client identifiers. In a preferred embodiment, the system also contains means for removing a specific failed client identifier from the deny list.