Application server environments are prone to a variety of problems, such as malfunctions caused by inefficiently designed hosted applications. Typical problems include memory leaks, deadlocks, inconsistent state and user errors. These deficiencies adversely affect the near-term performance and/or availability of the application. In most cases, a human administrator can detect these conditions through appropriate instrumentation and then decide on the best course of action to correct the problem.
Each condition requires a particular corrective action, ranging from non-intrusive software reconfiguration to more drastic techniques, such as restarting the application server and its hosted applications. The latter is also known as “software rejuvenation,” and is commonly used to remedy many software problems, including memory leaks and deadlocks. See, for example, Y. Huang, et al., Software Rejuvenation: Analysis, Module and Applications, IEEE Twenty-Fifth International Symposium on Fault-Tolerant Computing, 381-390 (1995), the disclosure of which is incorporated herein by reference. A system can selectively rejuvenate software based on measurements that indicate an impending outage. See, for example, U.S. Pat. No. 6,629,266 issued to R. E. Harper et al., entitled “Method and System for Transparent Symptom-Based Selective Software Rejuvenation,” the disclosure of which is incorporated herein by reference. If the system is part of a cluster, the system may determine whether another cluster member can accept the workload serviced by the application requiring rejuvenation. If so, the system can interact with a cluster manager to start an instance of the application on another node.
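By way of illustration only, the symptom-based decision described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the cited references: the linear memory-growth projection, the names Node, seconds_until_exhaustion and plan_rejuvenation, and the returned action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Node:
    name: str
    spare_capacity: float  # hypothetical units of workload the node can still accept


def seconds_until_exhaustion(samples: Sequence[Tuple[float, float]],
                             limit: float) -> Optional[float]:
    """Project time until memory use reaches `limit`.

    `samples` holds (timestamp_seconds, bytes_used) pairs, oldest first.
    Assumption: growth is linear between the first and last sample.
    Returns None when no upward trend can be measured.
    """
    if len(samples) < 2:
        return None
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0 or m1 <= m0:
        return None
    rate = (m1 - m0) / (t1 - t0)  # bytes per second
    return (limit - m1) / rate


def plan_rejuvenation(samples: Sequence[Tuple[float, float]], limit: float,
                      horizon: float, workload: float,
                      peers: Sequence[Node]) -> str:
    """Choose an action when exhaustion is projected within `horizon` seconds."""
    remaining = seconds_until_exhaustion(samples, limit)
    if remaining is None or remaining > horizon:
        return "no_action"
    # Prefer moving the workload to a cluster member that can accept it.
    for peer in peers:
        if peer.spare_capacity >= workload:
            return f"failover_to:{peer.name}"
    # Otherwise rejuvenate on the local node.
    return "restart_in_place"
```

In this sketch, the first peer with sufficient spare capacity is chosen; an actual cluster manager would likely apply its own placement policy.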
In cluster systems, such as the Windows NT® cluster system, failure detection is provided for applications running unmodified on a cluster. See, for example, R. Gamache et al., Windows NT Clustering Service, IEEE COMPUTER, 55-62 (October 1998), the disclosure of which is incorporated herein by reference. An application-specific cluster interface layer, through which an application can be started, stopped and monitored for failures, may also be provided. For example, a monitor may include application requests that serve as probes to determine if the application is operating correctly.
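For purposes of illustration, an application-specific interface layer of the kind described above, with start, stop and probe-based monitoring, might resemble the following sketch. It is not the Windows NT cluster interface; the class MonitoredApplication, its methods, and the consecutive-failure threshold are hypothetical and are assumed here only to make the idea concrete.

```python
from typing import Callable


class MonitoredApplication:
    """Hypothetical interface layer: start, stop, and probe-based monitoring."""

    def __init__(self, start: Callable[[], None], stop: Callable[[], None],
                 probe: Callable[[], bool], max_consecutive_failures: int = 3):
        self._start = start
        self._stop = stop
        self._probe = probe  # issues an application request; True means a correct reply
        self._max_failures = max_consecutive_failures
        self._failures = 0

    def start(self) -> None:
        self._failures = 0
        self._start()

    def stop(self) -> None:
        self._stop()

    def poll(self) -> bool:
        """Send one probe; return True while the application is considered healthy."""
        try:
            ok = self._probe()
        except Exception:
            ok = False  # a probe that raises is treated as a failed probe
        self._failures = 0 if ok else self._failures + 1
        return self._failures < self._max_failures
```

Requiring several consecutive failed probes before declaring a failure is one possible design choice; it trades detection latency for fewer false alarms caused by transient errors.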
An extensible infrastructure for detecting and recovering from failures in a cluster system is described, for example, in U.S. Pat. No. 5,805,785 issued to D. Dias et al., entitled “Method for Monitoring and Recovery of Subsystems in a Distributed/Clustered System,” the disclosure of which is incorporated herein by reference. Basic failure detection using heartbeating (e.g., noting nodes that have gone down or come up on a particular network) is augmented by user-defined monitors that detect failures in specific subsystems, and by user-defined recovery programs that recover from the detected failures. A “rolling upgrade,” in which upgrades in a cluster are performed in a wave so that only one node is unavailable at a time, is described, for example, in E. A. Brewer et al., Lessons from Giant-Scale Services, IEEE INTERNET COMPUTING, 46-55 (July/August 2001), the disclosure of which is incorporated herein by reference.
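As a rough illustration of heartbeating augmented by user-defined monitors and recovery programs, consider the sketch below. It does not reproduce the infrastructure of the cited patent; the class HeartbeatDetector, the function run_monitors, and the fixed timeout are hypothetical names and assumptions introduced only for this example.

```python
from typing import Callable, Dict, List


class HeartbeatDetector:
    """Tracks the last heartbeat time per node (timeout value is an assumption)."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def record(self, node: str, now: float) -> None:
        """Note a heartbeat received from `node` at time `now` (seconds)."""
        self.last_seen[node] = now

    def down_nodes(self, now: float) -> List[str]:
        """Nodes whose most recent heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


def run_monitors(monitors: Dict[str, Callable[[], bool]],
                 recoveries: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each user-defined monitor; invoke the matching recovery on failure.

    `monitors` maps a subsystem name to a callable returning True if healthy.
    Returns the names of subsystems for which a recovery program was run.
    """
    recovered = []
    for name, is_healthy in monitors.items():
        if not is_healthy() and name in recoveries:
            recoveries[name]()
            recovered.append(name)
    return recovered
```

Here, node-level liveness and subsystem-level health are checked separately, so a node that still sends heartbeats can nonetheless have a failed subsystem detected and recovered.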
Despite the recent progress in application server failure detection and rejuvenation, there exists a need for improved techniques for efficiently and effectively monitoring application server environments and addressing errors occurring therein.