1. Field of the Invention
The present invention relates to software rejuvenation, and more particularly to transparent symptom-based selective software rejuvenation.
2. Description of the Related Art
The software executing in computer and networking systems can exhibit a behavior such that its failure rate increases over time, typically because of programming errors that generate increasing and unbounded resource consumption, or due to data corruption and numerical error accumulation (e.g., round-off errors). Examples of the effects of such errors are memory leaks, file systems that fill up over time, and spawned threads or processes that are never terminated.
The above-mentioned effects constitute a phenomenon known as “software aging,” and may be caused by errors in application, middleware, or operating system software. As the allocation of a system's resources gradually approaches a critical level, the probability that the system will suffer an outage increases. This may be viewed as an increase in the software system's failure rate. Such a software system failure may result in overall system failure, “crashing”, “hanging”, performance degradation, etc.
One way of reducing the software's failure rate is to reset a portion of the system to recover any lost and unused resources. For example, this may be resetting just the application that is responsible for the aging, or it may be resetting the entire computer system. This is referred to as “software rejuvenation.” When only a part of the system is selected for rejuvenation, this is called “selective rejuvenation.” FIG. 1 is a diagram showing the software failure rate over time. As shown, software ages over time (reference numeral 100), and the effects of software rejuvenations are shown at reference numeral 110.
When the part of the system that is undergoing aging is reinitialized via rejuvenation, its failure rate falls back to its initial (e.g., lower) level because resources have been freed up and/or the effects of numerical errors have been removed. This has a dramatic effect on overall system availability. However, when the failure rate begins to climb again due to the above-mentioned causes, subsequent rejuvenations become necessary.
The transparent nature of this function is partly achieved by being incorporated into a management framework separate and independent of the operating system and application. This approach not only assists in providing transparency, but also allows for portability and interoperability across multiple operating systems. This is important as it is becoming more common for systems management to span multiple platforms in the enterprise.
It should be noted that error-induced resource exhaustion is a fundamentally different phenomenon from performance-based capacity exhaustion. “Error-induced resource exhaustion” is due to the causes outlined above and its effects can be eliminated temporarily by rejuvenation. “Performance-based capacity exhaustion” results from increases in performance requirements over time, and its effects can only be eliminated by reducing the computational performance requirements or adding new resources to the system.
In a conventional method and system, a time-based rejuvenation is performed within a single node and a multi-node environment. However, the conventional method and system assumes that the application must be modified to include calls to a fault tolerance library. Thus, modification of the application software source code is required. This is problematic because if the application source or executable code must be modified, then software rejuvenation technology cannot be applied to the broad class of applications for which the source or executable code cannot be economically modified, such as “shrink-wrapped” applications. Transparency, e.g., the lack of necessity to modify the application source or executable code, allows software rejuvenation to be applied to any application, regardless of whether the code can be modified or not. This extends its applicability to essentially all applications that can be executed on the computer system.
Additionally, the conventional method typically requires proprietary, dedicated system support for the rejuvenation functionality. Thus, the conventional methods and systems significantly restrict rejuvenation's potential applicability and ease of use to systems that support the proprietary rejuvenation functionality.
In the above-mentioned co-pending application, it was described how to periodically rejuvenate all or part of a software system to reduce its failure rate to its initial, lower level, based on time.
However, in the co-pending application, there is no mechanism to notify the system operator to only rejuvenate prior to an unplanned outage, as opposed to at statically predetermined intervals.
Further, some system outages are characterized not by one parameter reaching an extreme value, but by several values reaching a characteristic region in which not every value is necessarily at a global extreme. In these cases, monitoring single-parameter extremes is ineffective.
Further, prior to the present invention, there has been a problem of predicting impending resource exhaustion. Some methods monitor a small number of key resource parameters of the operating system (e.g., the UNIX operating system), performing trend detection and slope estimation for those parameters, and projecting the mean time to their exhaustion based on the extrapolated trend. However, this technique has been limited to confirming the existence of software aging due to resource exhaustion. In this approach, there has been no successful prediction of outage, since only individual indicators of outage are examined independently. There has been no approach of examining several indicators in combination to form a predictor of outage, even though none is necessarily at a global extreme.
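By way of illustration only, the single-parameter trend technique described above might be sketched as follows. The function name, the use of an ordinary least-squares fit, and the choice of "free resource reaches zero" as the exhaustion point are assumptions made for this sketch and are not taken from the conventional methods themselves.

```python
from typing import Optional, Sequence


def projected_time_to_exhaustion(
    times: Sequence[float], free: Sequence[float]
) -> Optional[float]:
    """Fit a straight line to (time, free resource) samples and extrapolate.

    Returns the projected time remaining after the last sample until the
    fitted line reaches zero, or None if the fitted trend is not declining.
    (Hypothetical helper; least squares is an assumed estimator.)
    """
    n = len(times)
    if n < 2 or n != len(free):
        raise ValueError("need at least two paired samples")

    mean_t = sum(times) / n
    mean_f = sum(free) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, free))

    slope = sxy / sxx                      # estimated rate of change
    if slope >= 0:
        return None                        # no downward trend detected
    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope            # time at which the line hits zero
    return max(0.0, t_zero - times[-1])
```

Such a projection treats each parameter in isolation, which is the limitation noted above: it can confirm that a resource is being consumed, but it does not combine indicators into a predictor of outage.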
Finally, the conventional methods have been unable to pinpoint and identify which subsystem, process, and thread are causing the exhaustion, and selectively rejuvenate solely the offending function.
In view of the foregoing and other problems, disadvantages, and drawbacks of the conventional methods and structures, an object of the present invention is to provide a method and system for predicting outages of a software system.
Another object is to provide a method and system for providing non-intrusive symptom-based prediction (and subsequent rejuvenation) of unplanned software outages.
In a first aspect of the invention, a method (and system) for increased software dependability, includes learning how to predict an outage of a software system running on a computer, based on the learning, predicting an imminent outage, and avoiding the outage.
In a second aspect of the invention, a method for software rejuvenation, includes waiting for symptoms associated with an outage of the software, determining whether a fail-to node has adequate resources to accept a failover workload, if the determining determines that the fail-to node cannot accept the failover workload, sending an alert that adequate redundancy does not exist to support fault tolerance requirements, and suspending rejuvenation until an operator acknowledges and corrects the deficiency.
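By way of illustration only, the gating logic of this second aspect might be sketched as follows. The class names, the two capacity fields (memory and CPU), and the alert text are hypothetical and chosen for this sketch; the description above does not specify how adequacy of the fail-to node is measured.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Capacity:
    """Assumed capacity measure: memory in MB and a CPU fraction (0..1)."""
    memory_mb: float
    cpu_fraction: float


@dataclass
class RejuvenationGate:
    """Decides what to do once outage symptoms have been observed."""
    alerts: List[str] = field(default_factory=list)
    suspended: bool = False

    def on_symptoms(self, fail_to_free: Capacity, workload: Capacity) -> str:
        # While suspended, no rejuvenation is attempted.
        if self.suspended:
            return "suspended"
        # Check whether the fail-to node can accept the failover workload.
        if (fail_to_free.memory_mb >= workload.memory_mb
                and fail_to_free.cpu_fraction >= workload.cpu_fraction):
            return "rejuvenate"
        # Inadequate redundancy: alert the operator and suspend rejuvenation.
        self.alerts.append(
            "adequate redundancy does not exist to support "
            "fault tolerance requirements"
        )
        self.suspended = True
        return "suspended"

    def operator_acknowledged(self) -> None:
        """Called after the operator acknowledges and corrects the deficiency."""
        self.suspended = False
```

In this sketch, rejuvenation stays suspended even if later checks would pass, until `operator_acknowledged()` is called, mirroring the requirement that an operator both acknowledge and correct the deficiency.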
In another aspect of the invention, a signal-bearing medium (e.g., storage medium) is provided for storing a program representing the inventive method.
Thus, the present invention provides symptom-based rejuvenation in several environments (e.g., the preferred embodiment being within a clustered environment). In the invention, the application need not be modified to include calls to a fault tolerance library. Indeed, no such modification is necessary, and applications obtain the benefits of software rejuvenation without any modification. In addition, no proprietary, dedicated system support is required for the rejuvenation functionality. The use of rejuvenation by the invention is performed within the context of any industry-standard clustering environment. These enhancements significantly expand rejuvenation's potential applicability and ease of use.
Further, the present invention can predict impending resource exhaustion and aging due to resource exhaustion by incorporating the effects of variance on an extrapolated trend, incorporating time integral tests for secondary indicators, and including increased degrees of variance as symptomatic of outages. The multivariate approach of the present invention is especially effective when several indicators combine to form a predictor of outage, where no one indicator is necessarily at a global extremum. Finally, the provided indicators are used to identify which subsystem, process, and thread are causing the exhaustion.
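By way of illustration only, treating an increased degree of variance as a symptom might be sketched as follows. Comparing the variance of the most recent window of samples against an earlier baseline window, and the threshold ratio of 4.0, are assumptions made for this sketch; the description above does not prescribe a particular test.

```python
from statistics import pvariance
from typing import Sequence


def variance_ratio(samples: Sequence[float], window: int) -> float:
    """Ratio of the variance of the last `window` samples to that of the
    first `window` samples (hypothetical baseline choice)."""
    if window < 2 or len(samples) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    base_var = pvariance(samples[:window])
    recent_var = pvariance(samples[-window:])
    if base_var == 0:
        # A flat baseline: any recent variability counts as a large increase.
        return float("inf") if recent_var > 0 else 1.0
    return recent_var / base_var


def variance_symptom(
    samples: Sequence[float], window: int, ratio_threshold: float = 4.0
) -> bool:
    """True when recent variance has grown by at least `ratio_threshold`."""
    return variance_ratio(samples, window) >= ratio_threshold
```

A check of this kind would be only one input among several; it flags growing instability in an indicator even when the indicator's mean has not yet moved toward an extreme.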
Transparent Symptom-Based Selective Software Rejuvenation (SSR), according to the present invention, expands time-based rejuvenation by providing a way to rejuvenate only prior to an unplanned outage, as opposed to rejuvenating periodically (e.g., at statically predetermined periods) as in the co-pending application.
Additionally, it allows the operator to identify exactly (e.g., pinpoint) which subsystem (and/or process and/or thread) is responsible for the exhaustion, opening up the possibility of only rejuvenating the offending subsystem. This is achieved by non-intrusively monitoring and analyzing the state of a software system so as to predict an impending resource exhaustion-induced outage.
Thus, for example, consider the possibilities of 1) a memory leak with no corruption to the rest of the system, 2) an error not contained and thereby the system has been contaminated, and 3) the intermediate case of a small amount of leakage occurring which over time may contaminate the entire system.
In the first case, only the thread or subsystem need be rejuvenated and the effects of the leak will have been eliminated. In the second case, the entire system would need to be rejuvenated to eliminate the effects of the leak. In the third case, the rejuvenation could be performed selectively at a time least disruptive to the system, with the operator knowing that over time more subsystems/threads would have to be rejuvenated; the earlier the rejuvenation, the less disruptive the rejuvenation would be, and the later the rejuvenation, the more of the system would have to be rejuvenated. Selection of the amount of time to rejuvenation in this case (and the resultant risk of an outage) could be at the discretion of the operator.
The present invention uses existing resource monitoring mechanisms provided by an operating system to non-intrusively monitor system resources such as file space, handles, threads, free virtual memory, and system memory. Several techniques, as described below, can then be used to aggregate these indicators into a reliable predictor of impending outage. When an analytical aggregate of these indicators approaches a region associated with increased likelihood of unplanned outage, the system operator can be notified and then can initiate a graceful planned outage instead of an ungraceful unplanned outage.
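By way of illustration only, aggregating several indicators into a single predictor might be sketched as follows. The root-mean-square aggregate, the normalization of each indicator to a utilization between 0 and 1, and the threshold of 0.75 are all assumptions made for this sketch; they are not the specific techniques of the invention.

```python
import math
from typing import Mapping


def aggregate_score(utilization: Mapping[str, float]) -> float:
    """Root-mean-square of per-resource utilization values in [0, 1].

    The RMS aggregate is an assumed, illustrative choice.
    """
    if not utilization:
        raise ValueError("at least one indicator is required")
    for name, u in utilization.items():
        if not 0.0 <= u <= 1.0:
            raise ValueError(f"utilization for {name!r} must be in [0, 1]")
    return math.sqrt(sum(u * u for u in utilization.values()) / len(utilization))


def outage_likely(
    utilization: Mapping[str, float], aggregate_threshold: float = 0.75
) -> bool:
    """True when the aggregate enters the (assumed) high-risk region."""
    return aggregate_score(utilization) >= aggregate_threshold
```

For example, with file space, handles, threads, and virtual memory each at 0.8 utilization, no single indicator would cross a per-indicator limit of 0.9, yet the aggregate of about 0.8 exceeds the assumed threshold and the operator would be notified.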
Symptom-based prediction of unplanned software outages can be beneficially combined with some concepts of time-based software rejuvenation, for example by suggesting to the operator that rejuvenation occur during the next acceptable interval defined under a time-based software rejuvenation management policy.
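By way of illustration only, suggesting the next acceptable interval might be sketched as follows. The representation of the time-based policy as a list of (start, end) windows and the function name are hypothetical; the predicted outage time is assumed to be supplied by the symptom-based predictor.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of an acceptable interval


def suggest_window(
    now: datetime, predicted_outage: datetime, windows: Sequence[Window]
) -> Optional[Window]:
    """Return the earliest policy window that is still open or upcoming
    and that starts before the predicted outage, or None if there is none."""
    candidates = [
        w for w in windows
        if w[1] > now and w[0] < predicted_outage
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w[0])
```

A result of None indicates that no acceptable interval precedes the predicted outage, in which case the decision of when to rejuvenate would be left to the operator.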
Additionally, when symptom-based software rejuvenation discovers that a resource exhaustion condition is developing, an additional function provided by the present invention is capable of determining which subsystem, process, and thread (e.g., rejuvenation can be performed at a granular level) is responsible for the exhaustion condition and selectively causing only that part of the system to be rejuvenated, assuming the underlying system has this capability.
With the unique and unobvious aspects of the present invention, unplanned outages due to resource exhaustion are avoided by non-intrusively monitoring the state of a computer system and, based on indicative symptoms, signaling an impending unplanned outage to the operator or cluster management software in time to perform a graceful rejuvenation. This benefit is provided transparently to the application software.
Rejuvenation can increase a system's mean-time-to-outage by a factor of two or more (e.g., based on an analytical model), depending on the degree of aging. Once it has been set up, the rejuvenation process is completely automated and therefore not susceptible to human error, either in terms of forgetting to perform the rejuvenation, or in terms of errors in performing the rejuvenation itself.
Additionally, as in the time-based system described in the co-pending application, rejuvenation can be scheduled to occur at a time of least system workload, when an operator may not be present. Selective rejuvenation allows the operator to rejuvenate only that part of the system that is the cause of the aging, further reducing any impact on system operation. Because transparent rejuvenation requires no modifications to the application software, it can be used for any application running on the system.
Because the present invention can use the fail-over capabilities of a clustered system, very little downtime is incurred to perform the rejuvenation. Finally, the invention provides an automatic and continual check on the system's capability to tolerate an unplanned failure.