1. Field of the Invention
The present invention relates to determining the cause of computer system failures. More particularly, the present invention relates to a method and an apparatus for analyzing post-mortem information from a computer system failure on a remote computer system by downloading a code module that executes on the remote computer system.
2. Related Art
When a computer system crashes, an exception handling routine typically saves post-mortem information specifying the state of the computer system after the failure to a crash dump file. This crash dump file typically contains much of the contents of the memory of the computer system immediately after the failure, including the state of various threads and the contents of various buffers. By viewing this crash dump file, an engineer is often able to diagnose the cause of the computer system failure.
As computer systems increase in size and complexity, crash dump files can become large. It is not uncommon for a crash dump file to be as large as one gigabyte. This large size creates logistical problems in bringing the crash dump file and the engineer together. Requiring the engineer to travel to the customer site can be very expensive and can involve long delays, especially if the engineer must travel across the country or between continents.
Alternatively, the crash dump file can be sent to the engineer's computer system. Unfortunately, transmitting a very large file across a computer network can take many hours, if not days. Consequently, it is common for a crash dump file to be copied onto a magnetic tape in order to be mailed to the engineer.
Additional copies of the crash dump file may have to be made if system developers and/or engineers for third party subsystems become involved in the debugging process. It is not uncommon for five or six copies of a crash dump file to be made and distributed to different people during the debugging process. This process of making additional copies is very time-consuming and takes up a great deal of storage space on the various computer systems that are involved.
Furthermore, security is a concern in making a crash dump file available to the engineer or other interested parties. For security reasons, it is undesirable to allow anyone to log into the customer computer system in order to view the crash dump file. It is also undesirable to make the crash dump file publicly available because the crash dump file can potentially contain any of the information that is stored on the computer system, such as payroll information or technical trade secrets.
What is needed is a method and an apparatus that allow an engineer and other interested parties to view and manipulate post-mortem information from a computer system failure without the delay and costs involved in transporting the engineer to a remote location, or in transporting a large crash dump file to the engineer.
One embodiment of the present invention provides a system for analyzing post-mortem information specifying a state of a remote computer system after a failure of the remote computer system. The system operates by receiving, at the remote computer system, a code module sent from a debugging computer system. The remote computer system executes the code module, and allows the executing code module to read the post-mortem information from the remote computer system. The remote computer system also allows the executing code module to generate a result, and returns the result to the debugging computer system.
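The flow described above can be sketched as follows. This is a minimal illustration only, written in Python for brevity; the names `run_code_module`, `analyze`, and `CODE_MODULE`, as well as the dump contents, are hypothetical and are not taken from the embodiments. Note that `exec` by itself provides no sandboxing.

```python
import os
import tempfile

# Hypothetical code module, as it might be sent by the debugging computer system.
# By assumption, it defines analyze(dump), where dump is a binary file object.
CODE_MODULE = '''
def analyze(dump):
    data = dump.read()
    return {"size": len(data), "panic_count": data.count(b"PANIC")}
'''

def run_code_module(module_source, dump_path):
    """Execute a received code module on the remote system and return its result."""
    namespace = {}
    exec(module_source, namespace)  # illustration only: no sandbox is applied here
    with open(dump_path, "rb") as dump:
        return namespace["analyze"](dump)

# Stand-in for a crash dump file that already resides on the remote system.
with tempfile.NamedTemporaryFile(delete=False, suffix=".dump") as f:
    f.write(b"thread 1: ok\nthread 2: PANIC null pointer\n")
    dump_path = f.name

# Only this small result, not the dump itself, would travel back to the debugger.
result = run_code_module(CODE_MODULE, dump_path)
os.remove(dump_path)
print(result)
```

In this sketch the crash dump never leaves the remote computer system; only the small code module and the small result cross the network.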
In one embodiment of the present invention, the code module includes platform-independent JAVA byte codes that are executed on a JAVA virtual machine located on the remote computer system.
In one embodiment of the present invention, the system allows a user of the remote computer system to specify a security policy for the executing code module.
In one embodiment of the present invention, specifying the security policy includes specifying a file on the remote computer system that can be accessed by the executing code module, and specifying a valid source from which the code module can be accepted.
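A security policy of this kind might be represented as shown below. This is a sketch under stated assumptions: the class `SecurityPolicy`, its method names, the file path, and the host name are all hypothetical, and the sketch is written in Python rather than as a policy for a JAVA virtual machine.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

@dataclass(frozen=True)
class SecurityPolicy:
    """Hypothetical policy specified by a user of the remote computer system."""
    readable_files: frozenset = field(default_factory=frozenset)
    valid_sources: frozenset = field(default_factory=frozenset)

    def accepts_source(self, source):
        # A code module is accepted only from an explicitly listed source.
        return source in self.valid_sources

    def may_read(self, path):
        # PurePosixPath collapses redundant separators; ".." is not resolved,
        # so a path containing ".." never matches and is denied.
        return str(PurePosixPath(path)) in self.readable_files

# Example values are assumptions chosen for illustration.
policy = SecurityPolicy(
    readable_files=frozenset({"/var/crash/vmcore.0"}),
    valid_sources=frozenset({"debug.example.com"}),
)

print(policy.accepts_source("debug.example.com"))
print(policy.may_read("/etc/passwd"))
```

The policy is deny-by-default: any file or source that the user has not listed is rejected.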
In one embodiment of the present invention, the post-mortem information includes a crash dump file specifying the state of the remote computer system after the failure of the remote computer system.
In one embodiment of the present invention, the system additionally maintains a log of actions performed by the executing code module.
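One way such a log could be kept is to hand the code module a wrapper around the post-mortem information instead of the raw file. The following Python sketch assumes a hypothetical `LoggedDump` wrapper and an in-memory stand-in for the crash dump; it is not the logging mechanism of any particular embodiment.

```python
import io
import time

class LoggedDump:
    """Wraps a binary file object and records each action performed on it."""

    def __init__(self, raw, log):
        self._raw = raw
        self._log = log

    def seek(self, offset):
        # Record the requested offset before moving the file position.
        self._log.append((time.time(), "seek", offset))
        return self._raw.seek(offset)

    def read(self, size):
        # Record the number of bytes actually returned to the code module.
        data = self._raw.read(size)
        self._log.append((time.time(), "read", len(data)))
        return data

action_log = []
dump = LoggedDump(io.BytesIO(b"HEADERthread-state..."), action_log)

# Actions a code module might perform while examining the dump.
dump.seek(6)
chunk = dump.read(12)

# Drop timestamps to show the recorded sequence of actions.
actions = [(name, arg) for _, name, arg in action_log]
print(actions)
```

Because the code module only sees the wrapper, every access it makes can be reviewed later by a user of the remote computer system.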
In one embodiment of the present invention, upon detecting the failure of the remote computer system, the system records post-mortem information for the remote computer system, and notifies a user of the debugging computer system that the remote computer system has failed.
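The record-then-notify sequence can be illustrated with the following Python sketch. The function `on_failure`, the JSON file standing in for a crash dump, and the list-based notification channel are assumptions made for illustration; an actual crash dump would typically be a memory image written by an exception handling routine.

```python
import json
import os
import tempfile

def on_failure(state, dump_dir, notify):
    """Record post-mortem information, then notify the debugging side."""
    dump_path = os.path.join(dump_dir, "crash.json")
    with open(dump_path, "w") as f:
        json.dump(state, f)
    notify(f"remote system failed; post-mortem information at {dump_path}")
    return dump_path

notifications = []  # stand-in for a message channel to the debugging system

with tempfile.TemporaryDirectory() as d:
    try:
        raise RuntimeError("simulated failure")
    except RuntimeError as exc:
        # Failure detected: record the state first, then send the notification.
        path = on_failure({"error": str(exc)}, d, notifications.append)
    with open(path) as f:
        recorded = json.load(f)

print(recorded)
print(len(notifications))
```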
In one embodiment of the present invention, generating the result involves analyzing the post-mortem information in order to determine a cause of the failure of the remote computer system.