1. Field of the Invention
The present invention relates to techniques for designing highly reliable software systems. More specifically, the present invention relates to a method and an apparatus that estimate time to failure in a software system and provide quantitative confidence bounds around this estimation.
2. Related Art
When computer systems run for long periods of time, they are increasingly affected by a phenomenon known as “software aging,” which is typically accompanied by performance degradation of the computer systems over time, and can eventually lead to a crash of user applications and even the entire computer system. Software aging can be caused by a multitude of factors, including memory leaks, unreleased file locks, accumulation of unterminated threads, accumulation of numerical errors, file space fragmentation, shared memory pool latching and thread stack bloating.
For example, a memory leak is a common type of software aging mechanism which is caused by a failure to release memory when the memory is no longer needed by a program. Long-running programs with memory leaks and programs that allocate memory extensively can consume enough memory to seriously hinder overall performance of the computer system, or even worse, to cause an application or the entire system to crash. This problem becomes even more acute in multi-user environments, where a large number of users can be affected by a single application with a memory leak.
Note that a memory leak causes the computer system as a whole, not merely the erroneous process, to use an ever-growing amount of memory. Eventually, much (or all) of the available memory will be allocated (and not freed), thereby causing the entire system to become severely degraded or to crash. System administrators typically do not receive a warning about this problem until 95%-98% of the available memory has been used up. In most cases, this is too late to initiate any preventive maintenance actions and can end up causing costly system downtime.
Although we have discussed the software aging problem using the example of memory leaks, similar problems arise with other system resources, such as file tables, process tables and other kernel structures. Hence, solutions to the memory leak problem can be generalized and extended to these other system resources as well.
A number of approaches have been taken to deal with the problems related to software aging. For example, some existing tools facilitate debugging programs and detecting resource leaks when the source code is available. However, these existing tools cannot be used when the source code is not available; for example, when third-party and off-the-shelf software is used.
Another approach to dealing with resource leaks is based on threshold limits. In this approach, alarms are issued when resource consumption exceeds a predetermined limit. When such a limit is reached, preventive actions such as software rejuvenation operations can be initiated. Unfortunately, such a predetermined threshold limit is usually set arbitrarily or subjectively. Note that a threshold limit that is set too low causes increased false alarms, thereby making preventive maintenance policies inefficient, whereas a threshold limit that is set too high results in missed alarms, which cause unplanned outages.
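The trade-off in the threshold approach can be seen in a small sketch. The function name `threshold_alarms` and the sample values are assumptions made purely for illustration; they do not come from any existing tool.

```python
def threshold_alarms(usage_percent, limit_percent):
    """Return the indices of samples whose usage exceeds the limit."""
    return [i for i, u in enumerate(usage_percent) if u > limit_percent]


# Illustrative data only: a healthy workload with brief, harmless spikes.
benign = [40, 85, 45, 88, 50]
# Illustrative data only: a leaking workload sampled shortly before exhaustion.
leaking = [90, 93, 96, 98.5]

low_limit_alarms = threshold_alarms(benign, 80)    # spikes trigger false alarms
high_limit_alarms = threshold_alarms(leaking, 99)  # no alarm despite the leak
```

With the limit set at 80%, the healthy workload raises alarms on both spikes. With the limit set at 99%, the leaking workload raises none in the samples shown, even though it is about to exhaust the resource.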
Preventive maintenance policies based on time are sometimes used to solve the problem of software aging. In this approach, the threshold which triggers preventive actions is an “elapsed time”. Specifically, preventive maintenance is initiated at predetermined deterministic time intervals. For example, there can be a policy to reboot the system every Saturday at midnight. However, this technique also suffers from the same problems of possible false alarms and missed alarms as described above.
Note that none of the above-described techniques provide estimates for a remaining time to crash/hang, i.e., a remaining time to failure of a system due to software aging. If estimates can be made for the remaining time to failure due to a software aging mechanism, preventive actions such as software rejuvenation can be optimally scheduled to avoid potentially serious unplanned outages.
One technique that detects software aging and predicts remaining time to failure involves detecting gradual resource exhaustion in a computer system. This technique performs time-series analysis to detect trends in resource usage and to estimate the time to resource exhaustion based on the detected trends. Preventive actions can be taken accordingly to avoid impending failures. Unfortunately, this technique has several drawbacks. Firstly, it does not pinpoint the offending process, and hence, the entire system may have to be rebooted. Secondly, it provides no feedback to facilitate root-cause analysis. Furthermore, subtle memory leaks cannot be detected when the memory usage is heavy and “noisy,” which is commonly the case in multi-user server systems.
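A trend-based estimate of this kind might be sketched as follows. The sketch assumes a simple ordinary least-squares line fit, which is only one possible form of time-series analysis and is not necessarily what the technique referred to above uses; the function name `estimate_time_to_exhaustion` and its parameters are hypothetical.

```python
def estimate_time_to_exhaustion(times, usage, capacity):
    """Fit usage = slope * t + intercept by least squares and return the
    time remaining, after the last sample, until usage reaches capacity.

    Returns None when no upward trend is detected.
    """
    n = len(times)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_u = sum(usage) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be identical")
    sxy = sum((t - mean_t) * (u - mean_u) for t, u in zip(times, usage))
    slope = sxy / sxx
    intercept = mean_u - slope * mean_t
    if slope <= 0:
        return None  # flat or decreasing usage: no exhaustion predicted
    t_exhaust = (capacity - intercept) / slope
    # Clamp at zero in case the fitted line has already crossed capacity.
    return max(0.0, t_exhaust - times[-1])
```

For example, samples of 10, 12, 14, 16 and 18 units taken at times 0 through 4, against a capacity of 100 units, give a fitted slope of 2 units per time step and an estimated 41 time steps remaining. On noisy data the fitted slope becomes unreliable, which is consistent with the drawback noted above for heavily loaded multi-user systems.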
Hence, what is needed is a method and apparatus for estimating remaining time to failure for computer systems due to software aging without the above-described problems.