1. Field of the Invention
The present invention relates to the design of highly reliable computer systems. More specifically, the present invention relates to a method and an apparatus that uses pattern-recognition techniques to trigger software rejuvenation in order to enhance performance and availability in computer systems.
2. Related Art
As electronic commerce becomes more prevalent, businesses increasingly rely on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure reliability in such enterprise computing systems.
Unfortunately, as computer systems run for longer periods of time, they are increasingly affected by a phenomenon known as “software aging.” Software aging is typically caused by resource contention problems that build up over time until the computer system eventually hangs, panics, crashes or otherwise grinds to a halt. Software aging can be caused by a multitude of factors, including memory leaks, unreleased file locks, accumulations of unterminated threads, data round-off accrual, file space fragmentation, shared memory pool latching and thread stack bloating.
Many of the adverse effects of software aging can be mitigated through a technique known as “software rejuvenation.” Software rejuvenation operates by cleaning up the internal state of a computer system and/or application to prevent the occurrence of more severe crash failures in the future. For some extreme problems, software rejuvenation can involve therapeutic reboots. However, for the vast majority of software aging problems, less drastic measures suffice, such as flushing stale locks, reinitializing application components, preemptively rolling back, defragmenting memory and shutting down individual applications.
If the software aging is caused by parasitic resource consumption (for example, a memory leak), periodic software rejuvenation can restore the resource and thereby avoid a system crash caused by a shortage of the resource, as is illustrated in FIG. 1.
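By way of illustration only, the effect of periodic rejuvenation on a leaking resource can be sketched with a simple simulation. The function name `simulate_leak`, its parameters, and the constant leak rate are assumptions introduced for this example and are not taken from FIG. 1 or from any particular system.

```python
def simulate_leak(total_steps, capacity, leak_per_step, rejuvenate_every=None):
    """Simulate a resource that leaks at a constant (assumed) rate.

    Returns the step at which the resource is exhausted (a "crash"),
    or None if the system survives all total_steps.
    """
    used = 0
    for step in range(1, total_steps + 1):
        used += leak_per_step          # parasitic consumption accumulates
        if used >= capacity:
            return step                # resource shortage: system crashes
        if rejuvenate_every and step % rejuvenate_every == 0:
            used = 0                   # rejuvenation restores the resource
    return None


if __name__ == "__main__":
    # Without rejuvenation, the leak eventually exhausts the resource.
    print(simulate_leak(1000, capacity=100, leak_per_step=1))
    # Rejuvenating before the resource runs out avoids the crash.
    print(simulate_leak(1000, capacity=100, leak_per_step=1, rejuvenate_every=50))
```

In this sketch, rejuvenation only helps when its interval is shorter than the time the leak needs to exhaust the resource; an interval of 200 steps with the same parameters would still end in a crash.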
Unfortunately, it is very hard to determine when these software rejuvenation operations are required. Some existing systems monitor a single system parameter. For example, some systems monitor an amount of free memory, and if the amount of free memory falls below a threshold value, they perform a software rejuvenation operation in an attempt to free up some memory. However, this technique is only effective in mitigating known types of software aging problems (such as memory leaks) that can be detected by monitoring a single system parameter.
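The single-parameter approach described above might be sketched as follows. This is a minimal illustration under stated assumptions: the names `needs_rejuvenation` and `monitor`, the sample values, and the `rejuvenate` callback are hypothetical, and a real system would obtain its free-memory readings from the operating system rather than from a list.

```python
def needs_rejuvenation(free_memory_bytes, threshold_bytes):
    """Return True when the monitored parameter falls below its threshold."""
    return free_memory_bytes < threshold_bytes


def monitor(samples, threshold_bytes, rejuvenate):
    """Check each free-memory sample and trigger rejuvenation when needed.

    samples    -- iterable of free-memory readings (hypothetical input)
    rejuvenate -- callback that performs the rejuvenation operation
    Returns the indices of the samples that triggered rejuvenation.
    """
    triggered = []
    for index, free in enumerate(samples):
        if needs_rejuvenation(free, threshold_bytes):
            rejuvenate()
            triggered.append(index)
    return triggered


if __name__ == "__main__":
    readings = [900, 500, 90, 800, 50]   # assumed free-memory samples
    hits = monitor(readings, threshold_bytes=100,
                   rejuvenate=lambda: print("rejuvenating"))
    print(hits)
```

Because the check inspects only one parameter, an aging mechanism that leaves free memory untouched (for example, an accumulation of stale file locks) would never trigger rejuvenation in this sketch.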
Other systems perform software rejuvenation at periodic intervals. However, this approach may not catch software aging problems that arise between the periodic intervals. On the other hand, if these periodic rejuvenation operations are performed too frequently, they can unnecessarily degrade system performance.
What is needed is a method and an apparatus for performing software rejuvenation operations without the limitations and problems of the above-described techniques.