1. Field of the Invention
The present invention relates in general to computer systems and in particular to distributed data processing environments. More specifically, the present invention relates to a system and method for automatically rejuvenating software in a distributed data processing environment.
2. Description of Related Art
A collaborative data processing environment is a collection of two or more individual data processing systems that cooperate to perform one or more tasks by sharing resources (such as information). Relationships between some data processing systems may change rapidly, with the result that some collaborative data processing environments exist only briefly. For example, when a person utilizes a personal computer (PC) to retrieve data from a Web server, the PC and the Web server typically cooperate in the performance of that task only briefly before reallocating resources to other tasks, resulting in a short-lived collaborative data processing environment containing that PC and that Web server. On the other hand, the relationships between certain data processing systems may be relatively permanent, giving rise to more stable collaborative data processing environments.
One common type of collaborative data processing environment that is usually relatively stable is a distributed data processing environment. A distributed data processing environment is a collaborative data processing environment which includes two or more data processing systems that are each configured to perform at least a subset of common tasks on behalf of the collaborative data processing environment. When two or more data processing systems are configured and grouped in such a way that the group's work can be processed by any one of the data processing systems, the data processing systems are said to be clustered. Among the benefits that may be realized from clustering are scalability, load balancing, and increased availability.
A common type of distributed data processing environment or data processing system cluster is the server cluster. In a server cluster, two or more data processing systems are configured to perform at least a subset of common server tasks, such as responding to requests for information from external data processing systems. Another universal characteristic of server clusters is that each server cluster is configured to interact with external data processing systems substantially as if the server cluster were a single server machine.
Server clusters are typically configured to distribute the workload of the server cluster among multiple server machines, thereby providing for better performance (e.g., increased reliability, processing power, and/or input/output (I/O) throughput) than can be obtained from one server machine in isolation. Web servers, for example, are frequently implemented as server clusters. A web server is a data processing system or a server cluster which has been assigned an internet protocol (IP) address and which contains server control logic (typically implemented as server software) that receives and processes requests addressed to that IP address from external data processing systems. Typically, a web server will service a client request by utilizing hypertext transfer protocol (HTTP) to transmit information to the originating client. The information provided by a web server can be in the form of programs which run locally on the client or in the form of data such as files that are used by other programs. When a web server is implemented as a server cluster, multiple server machines within the server cluster cooperate to service the client requests.
When operating as a Web server, a typical server cluster includes a dispatching component (i.e., a dispatcher) that dynamically monitors and balances application workload among individual servers in real time. Lightly loaded servers are preferentially given workloads over heavily loaded servers, in an attempt to keep all servers equally loaded and to prevent any server from becoming overloaded. A main advantage of load balancing is that it allows heavily accessed Web sites to increase capacity, since multiple server machines can be dynamically added while the cluster continues to appear in the network as a single logical server. In addition, failure of one or more of the server machines in a server cluster need not completely disable the operation of the remainder of the server cluster. Additional detail regarding dispatcher operation is provided in the related application referenced above.
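The dispatching behavior described above, in which lightly loaded servers are preferred over heavily loaded ones, can be illustrated with a minimal sketch. The `Server` class, the `active_requests` load metric, and the `dispatch` function are hypothetical names chosen for illustration; they are not taken from the related application, and an actual dispatcher may weigh other factors.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active_requests: int = 0  # assumed load metric: requests currently in progress


def dispatch(servers):
    """Assign one incoming request to the most lightly loaded server."""
    if not servers:
        raise ValueError("no servers available")
    # Prefer the server with the fewest active requests.
    target = min(servers, key=lambda s: s.active_requests)
    target.active_requests += 1
    return target


cluster = [Server("a", 3), Server("b", 1), Server("c", 2)]
chosen = dispatch(cluster)
print(chosen.name)
```

Repeated calls to `dispatch` tend to even out the load, since each assignment raises the chosen server's count before the next comparison.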
While distributed data processing environments such as server clusters provide important advantages, among the disadvantages associated with distributed data processing environments, relative to isolated data processing systems, are increased system configuration and maintenance requirements. That is, it is not sufficient to simply configure and maintain the hardware and software of a single machine. Rather, it is necessary to configure and maintain multiple machines, as well as the mechanisms that allow those machines to interact with external data processing systems as if the cluster were a single machine. Moreover, it is often desirable to keep distributed data processing environments, such as server clusters, operational continuously. For example, Web servers are often expected to be operational 24 hours a day, 7 days a week (24/7). Therefore, when such a Web server is implemented as a server cluster, reconfiguring or performing maintenance on one or more of the components (e.g., server machines) of the server cluster should be accomplished without disabling operations of the server cluster as a whole.
One problem that system maintenance alleviates or counteracts is a phenomenon known as software aging. Software aging is a common condition, wherein a data processing system's probability of failure (i.e., failure rate) increases over time and/or the data processing system's performance decreases over time, typically because of programming errors that generate increasing and unbounded resource consumption, or due to data corruption and numerical error accumulation (e.g., rounding errors). Examples of the effects of such errors are memory leaks, file systems that fill up over time, and spawned threads or processes that are never terminated. Software aging may be caused by errors in a program application, operating system software, or “middleware” (software adapted to provide an interface between applications and an operating system). As the allocation of a system's resources gradually approaches a critical level (i.e., as the system approaches resource exhaustion), the probability that the system will suffer an outage increases, and the system's performance may decrease. Among the possible consequences of software aging are overall system failure, software application failure, hanging, performance degradation, etc.
One way to counteract software aging is to reset at least a portion of the system to recover any lost and unused resources. For example, this may be accomplished by resetting just the application that is responsible for the aging or by resetting the entire system (see, e.g., U.S. Pat. No. 5,715,386). These processes are known as partial software rejuvenation and complete software rejuvenation, respectively (or simply partial rejuvenation and complete rejuvenation). When the part of the system that is undergoing aging is reinitialized via rejuvenation, the system's failure rate reverts to its initial (i.e., lower) level because resources have been released and/or the effects of numerical errors have been removed, etc. However, when the failure rate begins to climb again due to the above-mentioned causes, subsequent rejuvenations become necessary. Nevertheless, software rejuvenation can dramatically lengthen a system's time between failures.
However, it can be difficult to perform software rejuvenation in a distributed data processing environment without adversely affecting the performance of the distributed data processing environment, especially if the distributed data processing environment is expected to be operational 24/7. For example, in conventional server clusters, workload can be steered away from a faulty server, but only after that server has catastrophically failed. Waiting for a component of a distributed data processing environment to fail before steering workload away from that component typically results in adverse consequences. For instance, waiting for failure of a server in a server cluster before steering workload away from that server makes it necessary to process additional workload to recover from the failure. In particular, when a component fails unexpectedly, in addition to the cluster's usual workload, the cluster must service additional requests, such as a large temporary surge in session reconnection attempts, which may cause increased network traffic, dispatcher CPU utilization, and, in some cases, client reconnections. Such disruptive behavior is highly undesirable in a distributed data processing environment, particularly during times of high utilization of the data processing environment.
As recognized by the present invention, it would therefore be beneficial to devise a method of reducing or eliminating performance degradation, partial outages, and/or complete outages in a distributed data processing environment caused by effects such as software aging. It would be further advantageous if such a method could be implemented transparently to external data processing systems utilizing the distributed data processing environment. Yet additional advantages could be realized if the effects of software aging could be countered automatically and without noticeably reducing the performance of the distributed data processing environment while rejuvenation is being performed.
The present invention relates to a method of automatically rejuvenating a component of a distributed data processing environment while minimizing the disruptive effects of the rejuvenation. According to that method, a usage history for a distributed data processing environment is stored, the usage history describing multiple levels of overall usage of the distributed data processing environment over time. Also, health data relating to at least one component of the distributed data processing environment is received, and, in response, the health data is automatically utilized to determine a failure time within which that component is likely to require rejuvenation. In response to determining the failure time, the usage history is automatically utilized to identify an optimum rejuvenation time. In response to identifying the optimum rejuvenation time, that component is automatically rejuvenated according to the optimum rejuvenation time.
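As one illustration of how health data might be used to determine a failure time, the following sketch fits a straight line to resource usage samples and extrapolates to the point of exhaustion. This is only an assumed technique: the text above does not specify how the failure time is derived, and the function name `estimate_hours_to_exhaustion`, the sample format, and the `capacity` parameter are hypothetical.

```python
def estimate_hours_to_exhaustion(samples, capacity=1.0):
    """Estimate hours remaining until a resource is exhausted.

    samples:  list of (hour, fraction_used) health readings (assumed format).
    capacity: usage level treated as exhaustion (assumed to be 1.0, i.e. 100%).
    Returns hours from the last sample until the fitted line reaches capacity,
    or None if usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # no upward trend, so no exhaustion is predicted
    intercept = mean_u - slope * mean_t
    exhaustion_time = (capacity - intercept) / slope
    return max(0.0, exhaustion_time - samples[-1][0])


# Hypothetical readings: memory usage grows by 5 percentage points per hour.
readings = [(0, 0.50), (1, 0.55), (2, 0.60), (3, 0.65)]
hours_left = estimate_hours_to_exhaustion(readings)
print(round(hours_left, 2))
```

A linear trend is the simplest possible model; real aging behavior may be nonlinear, in which case a different predictor would be substituted without changing the rest of the method.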
In an illustrative embodiment, the distributed data processing environment is a server cluster. Also, the optimum rejuvenation time is identified by ascertaining a minimum level of overall usage of the server cluster within the remaining time to the predicted failure. The usage time that corresponds to the minimum level is utilized as the optimum rejuvenation time.
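The selection step of the illustrative embodiment can be sketched as follows. The sketch assumes that the usage history has already been turned into expected usage levels for upcoming times (for example, historical averages by hour); that representation, the function name `optimum_rejuvenation_time`, and the sample numbers are hypothetical.

```python
def optimum_rejuvenation_time(usage_history, now, failure_time):
    """Return the time of lowest expected usage before the predicted failure.

    usage_history: list of (time, usage_level) pairs (assumed format).
    now:           current time, in the same units as the history.
    failure_time:  predicted time by which rejuvenation is needed.
    Returns None if the history has no entry inside the window.
    """
    # Consider only the time remaining before the predicted failure.
    window = [(t, u) for t, u in usage_history if now <= t < failure_time]
    if not window:
        return None
    # Lowest usage wins; ties go to the earliest time.
    best_time, _ = min(window, key=lambda entry: (entry[1], entry[0]))
    return best_time


# Hypothetical hourly usage levels (e.g., requests per second).
history = [(0, 80), (1, 60), (2, 30), (3, 45), (4, 10), (5, 70)]
print(optimum_rejuvenation_time(history, now=1, failure_time=4))
```

In this example the overall quietest hour (hour 4) is excluded because it falls at or after the predicted failure, so the quietest hour inside the window (hour 2) is chosen instead.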
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.