1. Field of the Invention
The present invention relates to software rejuvenation, and more particularly to transparent time-based selective software rejuvenation.
2. Description of the Related Art
The software executing in computer and networking systems can exhibit a behavior such that its failure rate increases over time, typically because of programming errors that generate increasing and unbounded resource consumption, or due to data corruption and numerical error accumulation (e.g., round-off errors). Examples of the effects of such errors are memory leaks, file systems that fill up over time, and spawned threads or processes that are never terminated.
The above-mentioned effects constitute a phenomenon known as "software aging," and may be caused by errors in application, middleware, or operating system software. As the allocation of a system's resources gradually approaches a critical level, the probability that the system will suffer an outage increases. This may be viewed as an increase in the software system's failure rate. Such a software system failure may result in overall system failure, "crashing", "hanging", performance degradation, etc.
One way of reducing the software's failure rate is to reset a portion of the system to recover any lost and unused resources. For example, this may mean resetting just the application that is responsible for the aging, or resetting the entire computer system. This is referred to as "software rejuvenation." When only a part of the system is selected for rejuvenation, this is called "selective rejuvenation." FIG. 1 is a diagram showing the software failure rate over time. Software ages over time, as shown by reference numeral 100, and the effects of software rejuvenations are shown at reference numeral 110.
When the part of the system that is undergoing aging is reinitialized via rejuvenation, its failure rate falls back to its initial (e.g., lower) level because resources have been freed up and/or the effects of numerical errors have been removed. This has a dramatic effect on overall system availability. However, when the failure rate begins to climb again due to the above-mentioned causes, subsequent rejuvenations become necessary.
The transparent nature of this function is partly achieved by being incorporated into a management framework separate and independent of the operating system and application. This approach not only assists in providing transparency, but also allows for portability and interoperability across multiple operating systems. This is important as it is becoming more common for systems management to span multiple platforms in the enterprise.
It should be noted that error-induced resource exhaustion is a fundamentally different phenomenon from performance-based capacity exhaustion. "Error-induced resource exhaustion" is due to the causes mentioned above and its effects can be eliminated temporarily by rejuvenation. "Performance-based capacity exhaustion" results from increases in performance requirements over time, and its effects can only be eliminated by reducing the computational performance requirements or adding new resources to the system.
In a conventional method and system, a time-based rejuvenation is performed within a single node and a multi-node environment. However, the conventional method and system assumes that the application must be modified to include calls to a fault tolerance library. Thus, modification of the application software source code is required. This is problematic because if the application source or executable code must be modified, then software rejuvenation technology cannot be applied to the broad class of applications for which the source or executable code cannot be economically modified, such as "shrink-wrapped" applications. Transparency (e.g., the lack of necessity to modify the application source or executable code) allows software rejuvenation to be applied to any application, regardless of whether the code can be modified or not. This extends its applicability to essentially all applications that can be executed on the computer system.
Additionally, the conventional method typically requires proprietary, dedicated system support for the rejuvenation functionality. Thus, the conventional methods and systems significantly restrict rejuvenation's potential applicability and ease of use.
In view of the foregoing and other problems, disadvantages, and drawbacks of the conventional methods and structures, an object of the present invention is to provide a method and system for rejuvenating a software system.
Another object is to provide a system and method for rejuvenating a software system such that the rejuvenation is transparent to the user and such that modification may be performed from outside of the system.
In a first aspect of the invention, a method of enhancing software dependability is provided which includes measuring a time elapsed in a software system running on a computer, determining whether the time matches a threshold, and when the time matches the threshold, rejuvenating at least a portion of the software system to avoid or reduce the likelihood of an outage and without modifying an application running in the software system.
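The first aspect above can be illustrated with a minimal sketch. It is not the claimed implementation: the class name `RejuvenationTimer`, the injected `rejuvenate` callback, and the choice to treat "matches a threshold" as "elapsed time is greater than or equal to the threshold" are all assumptions made for illustration. The monitored application is never touched; the timer only invokes an external action.

```python
import time
from typing import Callable


class RejuvenationTimer:
    """Hypothetical helper: triggers an external rejuvenation action
    once a configured amount of time has elapsed."""

    def __init__(
        self,
        threshold_s: float,
        rejuvenate: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold_s = threshold_s
        self.rejuvenate = rejuvenate      # assumed: restarts part of the system
        self.clock = clock                # injectable so the logic can be tested
        self.started_at = clock()

    def elapsed(self) -> float:
        """Time elapsed since the start or the last rejuvenation."""
        return self.clock() - self.started_at

    def check(self) -> bool:
        """Rejuvenate and restart the measurement if the threshold is reached."""
        if self.elapsed() >= self.threshold_s:
            self.rejuvenate()
            self.started_at = self.clock()
            return True
        return False
```

A caller would invoke `check()` periodically, for example from a management agent's polling loop. Using a monotonic clock avoids spurious triggers when the wall-clock time is adjusted.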
In a second aspect of the present invention, a method for software rejuvenation includes waiting for a selected inter-rejuvenation interval to expire, determining whether a fail-to node has adequate resources to accept a failover workload, if the determining is that the fail-to node cannot accept the failover workload, sending an alert that adequate resources do not exist to support fault tolerance requirements, and suspending rejuvenation until an operator acknowledges and corrects the deficiency.
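The control flow of the second aspect can be sketched as a small state machine. Everything named here is hypothetical: `Node`, its single `free_capacity` figure, `RejuvenationScheduler`, and the `send_alert` and `rejuvenate` callbacks stand in for whatever resource accounting and cluster services a real system would use. Waiting for the inter-rejuvenation interval is left to the caller, which is assumed to invoke `on_interval_expired()` each time the interval elapses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    """Hypothetical fail-to node with one abstract capacity figure."""
    name: str
    free_capacity: float


class RejuvenationScheduler:
    def __init__(
        self,
        fail_to: Node,
        workload: float,
        send_alert: Callable[[str], None],
        rejuvenate: Callable[[], None],
    ) -> None:
        self.fail_to = fail_to
        self.workload = workload          # assumed size of the failover workload
        self.send_alert = send_alert
        self.rejuvenate = rejuvenate
        self.suspended = False

    def _has_capacity(self) -> bool:
        return self.fail_to.free_capacity >= self.workload

    def on_interval_expired(self) -> str:
        """Called by the caller when the inter-rejuvenation interval expires."""
        if self.suspended:
            return "suspended"            # stay suspended until acknowledged
        if not self._has_capacity():
            self.send_alert(
                f"fail-to node {self.fail_to.name} lacks resources "
                "to support fault tolerance requirements"
            )
            self.suspended = True
            return "suspended"
        self.rejuvenate()
        return "rejuvenated"

    def acknowledge(self) -> bool:
        """Operator acknowledgement; resumes only if the deficiency is fixed."""
        if self._has_capacity():
            self.suspended = False
        return not self.suspended
```

In this sketch the alert is sent once per deficiency rather than on every expired interval, which is a design choice of the example rather than something the text above specifies.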
In other aspects of the invention, a system and signal-bearing medium storing the method are provided.
Thus, the present invention provides a time-based rejuvenation in several environments (e.g., the preferred embodiment being within a clustered environment). In the invention, the application need not be modified to include calls to a fault tolerance library. Indeed, no such modification is necessary, and applications obtain the benefit of software rejuvenation without any modification. In addition, no proprietary, dedicated system support is required for the rejuvenation functionality. The use of rejuvenation by the present invention is performed within the context of any industry-standard clustering environment. These enhancements significantly expand rejuvenation's potential applicability and ease of use by incorporating its set-up and use within an already familiar management and operational infrastructure and not requiring the operator to use yet another framework.
Transparent Time-Based Selective Software Rejuvenation (TSR), according to the present invention, allows a system operator to cause selective software rejuvenation to be performed automatically, on a periodic basis, without operator intervention, and at a time which is least disruptive to system operation.
TSR is transparent to the application program. That is, no changes to the application software are required to obtain the benefits of TSR. When software rejuvenation is invoked within a cluster environment, the cluster fail-over services (e.g., Microsoft® Cluster Services or HACMP) may be used to "gracefully" stop the offending subsystem and restart it on the same or another node in the cluster, in a controlled manner. For purposes of the present application, "gracefully stopping" means terminating a program or operation in a pre-planned, purposeful manner, such as closing out files properly, and without any effect on any other subsystem or on system performance.
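As a single-node stand-in for the stop-and-restart step, the following sketch uses only the Python standard library and does not call any cluster fail-over service. The function name `rejuvenate_process`, the `grace_s` parameter, and `DEMO_CMD` are hypothetical. The subsystem is assumed to be an ordinary child process that cleans up when asked to terminate; the process itself is not modified, which is the sense of "transparent" here.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical stand-in for the aging subsystem: a process that just sleeps.
DEMO_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def rejuvenate_process(
    proc: subprocess.Popen,
    cmd: Sequence[str],
    grace_s: float = 10.0,
) -> subprocess.Popen:
    """Stop `proc` in a controlled way, then start a fresh instance of `cmd`."""
    # Ask the process to exit. On POSIX this sends SIGTERM, which a program
    # can handle to close files properly; on Windows it ends the process
    # immediately, so it is not graceful there.
    proc.terminate()
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        # Fallback if the process ignores the request within the grace period.
        proc.kill()
        proc.wait()
    # Restart on the same node; a cluster service could instead restart
    # the subsystem on another node.
    return subprocess.Popen(list(cmd))
```

For example, `new = rejuvenate_process(old, DEMO_CMD)` replaces a running instance with a fresh one whose resources start from their initial state.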
With the unique and unobvious aspects of the present invention, unplanned outages due to resource exhaustion are avoided by periodically rejuvenating the software system. This benefit is provided transparently to the application software. Moreover, since the fail-over capabilities of a clustered system can be used, very little downtime is incurred to perform the rejuvenation. Further, an automatic and continual check on the system's capability to tolerate an unplanned failure can be performed.