1. Technical Field
The present invention relates to software rejuvenation, and more particularly to a system and method for tuning a software rejuvenation method using a customer-affecting performance metric.
2. Discussion of Related Art
Replication of components is often used to preserve continuity of service in web-based systems, telecommunication systems, and other systems needing a high degree of reliability. Replication improves performance by allowing the load to be spread among multiple servers. When paired replicates are engineered so that the peak load does not cause the utilization of any resource on either of them to exceed a threshold, e.g., 40%, replication increases reliability by allowing each replicate to act as a standby for the other while maintaining acceptable service. If the offered load is balanced among replicated servers that are programmed identically, which is the case with clusters of web server platforms such as those supported by WebSphere™ and WebLogic™, faults that are consequences of software aging are likely to occur in all replicates at about the same time if they are booted or rejuvenated at the same time. Likewise, if the parameters governing rejuvenation are substantially identical in all replicates, a traffic-based method of rejuvenating aging software will be triggered on all of them at about the same time. Simultaneous rejuvenation undermines service continuity, because no replicate remains available to act as a standby while the others are being rejuvenated.
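The 40% example above can be checked with simple arithmetic. The following is a minimal sketch, assuming the load is perfectly balanced and that the load of a failed replicate is redistributed evenly among the survivors; the function name and parameters are hypothetical and are not taken from the invention.

```python
def utilization_after_failover(peak_utilization: float,
                               replicates: int,
                               failed: int = 1) -> float:
    """Per-server utilization once `failed` replicates drop out.

    Assumes (for illustration only) a perfectly balanced load that is
    redistributed evenly among the surviving replicates.
    """
    if replicates < 1 or not 0 <= failed < replicates:
        raise ValueError("need at least one surviving replicate")
    survivors = replicates - failed
    # The total load is unchanged; it is simply shared by fewer servers.
    return peak_utilization * replicates / survivors


if __name__ == "__main__":
    # Paired replicates engineered for a 40% peak: the survivor carries
    # the whole load and still stays below saturation.
    print(utilization_after_failover(0.40, replicates=2))
```

Under these assumptions, a pair engineered for a 40% peak leaves the surviving replicate at roughly 80% utilization, which is why a threshold well below 50% lets each replicate stand in for the other.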
Large industrial software systems need extensive monitoring and management to deliver expected performance and reliability. Some specific types of software failures, called soft failures, have been shown to leave the system in a degraded mode, in which the system is still operational but the available system capacity is greatly reduced. Examples of soft bugs have been documented in several software studies. Soft failures can be caused by the evolution of the state of one or more software data structures, which degrades performance. This performance degradation is called software aging. Software aging has been observed in widely used software.
An approach to restoring system capacity in telecommunications systems has been proposed that takes advantage of the cyclical nature of telecommunications traffic. Telecommunications operating companies understand the traffic patterns in their networks well, and therefore can plan to restore their smoothly degrading systems to full capacity in the same way they plan their other maintenance activities.
Experience has demonstrated that soft bugs occur as a result of problems with synchronization mechanisms, e.g., semaphores; kernel structures, e.g., file table allocations; database management systems, e.g., database lock deadlocks; and other resource allocation mechanisms that are essential to the proper operation of large multi-layer distributed systems. Since some of these resources are designed with self-healing mechanisms, e.g., timeouts, some systems may recover from soft bugs after a period of time. For example, in a specific Java-based e-commerce system, when the soft bug was revealed, users complained of very slow response times for periods exceeding one hour, after which the problem would clear by itself.
If the parameter settings are equal in all replicates of a system, all of the replicates are likely to undergo rejuvenation at about the same time. This diminishes the utility of having replicates.
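The effect of equal parameter settings can be illustrated with a toy model. The sketch below assumes round-robin load balancing and a hypothetical traffic-based trigger that fires once a replicate has served a fixed number of requests; both the trigger rule and the function name are assumptions made for illustration, not the rejuvenation method of the invention. For simplicity, the model records only the first trigger and does not take a replicate out of rotation.

```python
from typing import List, Optional


def first_trigger_times(thresholds: List[int],
                        total_requests: int) -> List[Optional[int]]:
    """Index of the request at which each replicate first triggers.

    Requests 0, 1, 2, ... are dealt round-robin to the replicates.
    Replicate i triggers when its served count reaches thresholds[i]
    (a hypothetical rule). None means it never triggered.
    """
    n = len(thresholds)
    served = [0] * n
    triggered: List[Optional[int]] = [None] * n
    for t in range(total_requests):
        i = t % n  # round-robin load balancing
        served[i] += 1
        if triggered[i] is None and served[i] >= thresholds[i]:
            triggered[i] = t
    return triggered


if __name__ == "__main__":
    # Identical settings: all three replicates trigger on consecutive requests.
    print(first_trigger_times([100, 100, 100], 1000))
    # Differing settings, shown only for contrast: triggers are spread out.
    print(first_trigger_times([100, 150, 200], 1000))
```

In this model, identical thresholds cause every replicate to trigger within a single round of the load balancer, whereas differing thresholds separate the trigger points by many requests.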
Therefore, a need exists for a system and method in which software rejuvenation is triggered on different servers at different times.