1. Field of the Invention
The present invention is related to computer systems. More particularly, the present invention is directed to a method and system for diagnostic preservation of the state of a computer system after a failure.
2. Background
Computer systems employ a variety of error handling procedures to address the occurrence of a system or application failure. Often, the error handling procedures perform actions that are designed to aid in the later “debugging” or diagnostic analysis of the failure.
One common approach taken by error handling procedures to assist in debugging a failure is to perform a “core dump” operation after the occurrence of a system or application failure. A core dump operation generates a core dump file, which is an image of the system memory at the time of the core dump. The system memory information in the core dump file contains data relating to the identity and state of pending application processes on the system. The core dump operation is normally performed at the moment the failure is detected, so that the state of the system memory at the time of the failure is preserved for future analysis. Analysis of the core dump file can be performed to assist in debugging the underlying programming or system errors that caused the failure.
One significant drawback to debugging using core dump analysis is that under certain circumstances, the core dump file may not contain enough information about the state of the system to sufficiently debug a failure. For example, since the core dump file only contains an image of the system memory, the state of other resources on the system, such as network and I/O resources, is not preserved in the core dump file. Successful debugging operations using core dump analysis may be rendered impractical or impossible if the state of these other resources is needed to fully perform an analysis of the failure.
In addition, core dump files are typically very large files that are notoriously difficult to analyze. Even with advanced core dump analysis tools, debugging a failure by analyzing a core dump file remains a very involved and complex process. Moreover, since the computer system's main memory can be quite large, generating one or more core dump files can consume a costly amount of disk space. Generating the core dump file may also take a significant amount of time, which can delay recovery and error handling operations on the system. Under certain circumstances, an application that is completely frozen may not be able to execute error handling procedures to generate a core dump file, thereby leaving no record of the state of the system at the time of failure for future diagnosis.
To address these issues, an aspect of the present invention is directed to a method for diagnosing failures on a computer system without the need to generate and analyze core dump files. Failures can be diagnosed by performing real-time debug operations on a live computer system, to directly analyze and examine the system resources of the failed system. In this manner, failures can be analyzed without the use of core dump files.
To effectively perform real-time debugging on a live computer system after a failure, the states of system/application resources must be properly preserved from the moment of a failure. A significant problem in implementing this aspect of the invention on conventional computer systems is that resources on the computer system may be modified or reallocated subsequent to a failure. For example, error handling procedures on conventional computer systems commonly change the state of resources on computer systems after a failure, often in an attempt to maintain the continued availability of the system to users.
One approach to achieving continued availability of a computer application after a failure is to utilize “fail-over” procedures. With fail-over procedures, each primary computer is associated with one or more redundant computers on a network. If a failure occurs on a primary computer, the network address and other identification settings for that primary computer are moved over to a backup computer. All further requests to that system will thereafter be directed to the backup computer, which effectively becomes the new primary computer.
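The fail-over behavior described above can be illustrated with a minimal Python sketch. It is not taken from any particular fail-over product; the names `Node`, `Cluster`, and `fail_over`, and the example address, are hypothetical and chosen only for illustration. The sketch assumes that "moving" an address amounts to reassigning a field, whereas a real system would reconfigure network interfaces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical model of one computer in a fail-over group."""
    name: str
    address: Optional[str] = None  # network address currently held, if any
    healthy: bool = True


@dataclass
class Cluster:
    """Hypothetical primary computer plus its redundant backups."""
    primary: Node
    backups: List[Node] = field(default_factory=list)

    def fail_over(self) -> Node:
        """Move the failed primary's address to the first healthy backup."""
        failed = self.primary
        for i, candidate in enumerate(self.backups):
            if candidate.healthy:
                # The backup takes over the address, so later requests
                # sent to that address reach the backup instead.
                candidate.address = failed.address
                # Note: this step changes the state of the failed computer.
                failed.address = None
                self.primary = self.backups.pop(i)
                return self.primary
        raise RuntimeError("no healthy backup available")


if __name__ == "__main__":
    cluster = Cluster(Node("primary-1", "10.0.0.1"), [Node("backup-1")])
    cluster.primary.healthy = False  # simulate a failure
    new_primary = cluster.fail_over()
    print(new_primary.name, new_primary.address)
```

In this sketch the failed computer loses its address as part of the fail-over, which is one small instance of a recovery action altering the state of the machine on which the failure occurred.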
Conventional fail-over techniques are optimized around application availability, with the goal of restoring access for users as quickly as possible on a redundant computer, and of repairing the failed computer so that it can act as a standby should a new failure occur on the redundant computer (which is the new primary). However, such techniques often involve changes to the state of system resources on the failed computer, causing modification or reallocation of resources that may be needed to debug/diagnose the cause of the failure. Moreover, the diagnostic analysis operations themselves will require the use of system resources, thereby potentially causing changes or modifications to resources that must be examined to diagnose the failure. Conventional systems do not provide a method or mechanism to prevent the change or destruction of the state of resources that need to be analyzed in real-time during the debug operations.
Based upon the foregoing, it is desirable to provide a mechanism and method for handling a failure on a computer system that preserves the state of resources on the system for diagnostic purposes.