The present invention relates generally to distributed computing systems and, more specifically, to preserving data for diagnosing crashes in such systems.
A crash in a computer system is a serious failure in which the computer stops working or a computer program aborts unexpectedly. A crash signifies either a hardware or a software malfunction. Exemplary causes of system crashes include memory access violations, bad pointers, and violations of assertion conditions in a program. Effectively diagnosing a crash is complex, and this complexity is exacerbated in distributed systems in which multiple nodes participate in an operation. This is because, in distributed systems, multiple nodes interface with each other, and a crash on a particular node does not necessarily mean that the cause of the crash originated on that node. The cause of the crash may be, for example, a message that was transmitted to the crashed node and that subsequently caused the crash. In various cases, the sequence of events leading to the crash may be spread across numerous nodes. Further, because only one of the multiple nodes crashes, the non-crashed nodes continue to function and thus change the overall state of the system, which makes it more difficult to identify the causes of the crash.
Currently, when a system crashes, diagnostic programs typically perform a "core dump," which provides information to be analyzed as to the cause of the crash. Such information reflects the system state of the crashed node at the time of the crash, including memory addresses, program counters, etc. However, because other nodes interfacing with the crashed node are still functioning, the state of the non-crashed nodes continues to change. Having data from the crashed node is useful but, in many cases, is not sufficient for identifying the cause of the crash.
Based on the foregoing, it is clearly desirable to provide better techniques for diagnosing crashes in systems in which multiple nodes participate in operations.
Mechanisms are provided for preserving state information in response to errors that occur in operations in which multiple nodes are participating. In one embodiment, when an error occurs, one or more execution units are suspended. These execution units may be on the node on which the error occurred (the "error node") and/or on other, non-error nodes. In this context, the term "execution unit" refers to a program that executes a particular task. State information is collected from both the suspended execution units and the error node. All suspended execution units are then released, i.e., allowed to continue execution from the point at which they were suspended. The data collected during suspension is then used for diagnosing the error.
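By way of illustration only, the suspend, collect, and release sequence described above can be sketched as follows. This is a minimal single-process simulation and not the claimed implementation: the names ExecutionUnit, preserve_state, counter, and node_state are hypothetical and are introduced solely for this example, and a real system would suspend execution units on remote nodes through inter-node messaging rather than by setting a flag.

```python
from dataclasses import dataclass


@dataclass
class ExecutionUnit:
    """Hypothetical stand-in for a program that executes a particular task."""
    name: str
    node: str
    counter: int = 0          # illustrative piece of state that changes as the unit runs
    suspended: bool = False

    def step(self) -> None:
        # A suspended unit makes no progress, so its state stays frozen.
        if not self.suspended:
            self.counter += 1

    def snapshot(self) -> dict:
        # State information collected from this unit while it is suspended.
        return {"unit": self.name, "node": self.node, "counter": self.counter}


def preserve_state(units, error_node, node_state):
    """Suspend the given units, collect state, then release the units."""
    for unit in units:
        unit.suspended = True
    try:
        collected = {
            # State of the error node itself (assumed to be supplied by the caller).
            "error_node": {"node": error_node, **node_state},
            # State of every suspended execution unit, on any node.
            "units": [unit.snapshot() for unit in units],
        }
    finally:
        # Release every unit so that it continues from the point of suspension.
        for unit in units:
            unit.suspended = False
    return collected
```

In this sketch, the returned dictionary plays the role of the preserved data that is later used for diagnosis, and the units resume stepping from exactly the state captured in the snapshot.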
According to one embodiment, the type of error event dictates which execution units are suspended and the type of information that is collected from the suspended execution units.
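As a hedged illustration of how an error type might select both the execution units to suspend and the information to collect, consider the following sketch. The error type names, the Policy structure, the "local" and "all" scopes, and the field names are all assumptions made for this example; the description above does not specify any particular mapping.

```python
from typing import NamedTuple


class Policy(NamedTuple):
    suspend_scope: str   # "local" = units on the error node only; "all" = units on every node
    fields: tuple        # kinds of information to collect from each suspended unit


# Hypothetical mapping from error type to suspension and collection policy.
POLICIES = {
    "assertion_failure": Policy("local", ("call_stack",)),
    "bad_message": Policy("all", ("call_stack", "message_log")),
}
DEFAULT_POLICY = Policy("local", ("call_stack",))


def plan_collection(error_type, error_node, units):
    """Return (unit name, fields to collect) pairs for the units to suspend.

    Each unit is a dict with "name" and "node" keys (an assumed representation).
    """
    policy = POLICIES.get(error_type, DEFAULT_POLICY)
    if policy.suspend_scope == "all":
        targets = list(units)
    else:
        targets = [u for u in units if u["node"] == error_node]
    return [(u["name"], policy.fields) for u in targets]
```

Under these assumed policies, an assertion failure suspends only the units on the error node, whereas an error attributed to a received message suspends units on every participating node and additionally collects their message logs.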
In accordance with various embodiments of the invention, suspension of execution units provides a window of opportunity to collect all relevant information necessary for identifying causes of a crash. Further, the collected data are analyzed "off-line," without affecting usage of the involved system.