The invention relates to computer operating systems, and in particular to the art of diagnosing failures in computer systems so that corrective action may be taken.
As with anything else made by man, computer operating systems are known to fail. Operating system failures that result in cessation of operation are known as fatal operating system crashes. Frequent crashes can cause substantial interference with use of a system, both through system unavailability and possible data loss.
In addition to fatal crashes, operating systems may encounter recoverable abnormal conditions. These abnormal conditions may also interfere with use of the system, and severe ones can also be considered system crashes.
Operating system crashes have many different causes. These include hardware defects, programming errors in operating system modules, misconfiguration of the system or of driver modules, programming errors in application programs running on the system, and incompatibilities between operating system and driver modules. Commercial operating systems may have hundreds of potential causes of crashes.
Many suppliers of operating systems have contractual obligations to provide maintenance by helping their customers avoid repeated crashes. Maintenance contractors also contract to help customers avoid repeat crashes. Many problems that cause system crashes can be fixed to prevent repeated crashes. Fixing crash causes requires that the causes be understood, because "fixes" applied blindly can not only fail to fix the problem but also introduce new problems into a system or aggravate old ones.
Many crash causes that occur on a customer's machine have caused or will cause crashes on machines of other customers. Many maintainers of operating systems therefore maintain crash databases of information about past crashes, with underlying cause information and possible fix information for those crashes.
Analysis of operating system crashes to determine underlying causes is often performed manually by skilled technicians. These technicians perform dump analysis by reviewing "crash dumps" and error logs recorded by the system at the time of the crash, as well as a crash database. A "crash dump" is typically a recording, often formatted for printing, of relevant portions of system memory and register contents as they existed at the time the system crashed. Crash dumps are often recorded in a dump file on a filesystem of the machine that has suffered an operating system crash.
Manual dump analysis by skilled technicians is time consuming and expensive. Dump analysis is particularly expensive because of the high level of training and experience required before a technician is sufficiently expert to perform manual dump analysis accurately. It is therefore desirable that dump analysis be automated.
Crash dump files may be extremely large. Individual dump files may be tens to several hundreds of megabytes in size; it is therefore undesirable to store large numbers of crash dump files on a customer's machine.
U.S. Pat. No. 5,111,384 describes a system wherein portions of dump files are transmitted on request from a remotely located host system that has crashed to a centralized system having an expert system. The expert system thereupon analyzes the dump files to determine whether they match a known pattern in its knowledge base, and reports which if any known pattern scores a match.
Many operating systems have diagnostic modes wherein their functionality is restricted, but their reliability is enhanced. For example, the UNIX and Linux operating systems have a single-user mode, and the Windows system has its Safe Mode. Further, a second, diagnostic, copy of an operating system may be installed on a machine with the minimum set of drivers needed for basic functions. These diagnostic modes may permit access to a system despite significant misconfiguration or bugs; it is known that these diagnostic modes can be substantially more robust than the normal operating mode for the same operating system on the same machine.
An intelligent system, the Crash Analysis Tool (CAT), for interpreting and analyzing operating system crashes has been constructed.
This CAT has a parameter extraction module that runs when the system reboots. In the event that the reboot was a result of a system crash, this module collects a predetermined set of operating-system-dependent key fields and parameters, including parameters expected to be of use in diagnosing the underlying causes of crashes. Extracted parameters are stored as a crash footprint in a footprint file.
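The extraction step described above can be illustrated with a minimal Python sketch. It is not the CAT implementation: the field names in `KEY_FIELDS`, the dictionary representation of the dump, the JSON footprint format, and the function names are all assumptions made for illustration, since the actual key fields are operating-system dependent.

```python
import json

# Hypothetical key fields; the real set is predetermined per operating system.
KEY_FIELDS = (
    "os_name",
    "os_version",
    "panic_string",
    "program_counter",
    "faulting_module",
)


def extract_footprint(dump_fields, key_fields=KEY_FIELDS):
    """Keep only the predetermined key fields that are present in the dump."""
    return {name: dump_fields[name] for name in key_fields if name in dump_fields}


def on_reboot(was_crash, dump_fields, footprint_path):
    """Run at reboot; write a footprint file only if the reboot followed a crash."""
    if not was_crash:
        return None
    footprint = extract_footprint(dump_fields)
    with open(footprint_path, "w", encoding="utf-8") as handle:
        json.dump(footprint, handle, indent=2, sort_keys=True)
    return footprint
```

The point of the sketch is the size reduction: the footprint holds only a handful of selected fields, so it stays small even when the dump it was taken from is hundreds of megabytes.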
When analysis is desired, a collector and parser module gathers the key fields of the footprint from the footprint file and translates this information into a suitable format for an analysis engine. The analysis engine then locates any matching rule in its knowledge base. If a match is found, repair suggestions from a repair suggestion file are merged with the footprint and formatted for display to a technician. If no match is found, the footprint information alone is formatted and displayed.
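The matching flow just described can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual analysis engine: rules are modeled as exact-match conditions on footprint fields, and the rule structure, the `suggestions` mapping, and all names are hypothetical.

```python
def rule_matches(rule, footprint):
    """A rule matches when every condition equals the corresponding footprint field."""
    return all(footprint.get(key) == value for key, value in rule["conditions"].items())


def analyze(footprint, knowledge_base, suggestions):
    """Return a report: the footprint, plus a repair suggestion if a rule matches."""
    report = dict(footprint)
    for rule in knowledge_base:
        if rule_matches(rule, footprint):
            report["matched_rule"] = rule["name"]
            report["repair_suggestion"] = suggestions.get(rule["name"], "")
            break
    return report


def format_report(report):
    """Format the report as plain text for display to a technician."""
    return "\n".join(f"{key}: {report[key]}" for key in sorted(report))
```

In this sketch the first matching rule wins, and a footprint that matches no rule is returned unchanged, so the technician still sees the extracted parameters.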
CAT can be run under any of several operating systems, including systems selected from Linux, OpenVMS, Windows NT, and Compaq Tru64 Unix, and is operable on a variety of hardware, including Alpha and Intel Pentium family and Xeon processors. CAT can run under a different hardware and operating system combination than that of the crashed system.
The foregoing and other features, utilities and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention as illustrated in the accompanying drawings.
FIG. 1 is a block diagram of a computer known in the art, showing operating system run-time storage and a dumpfile;
FIG. 2 is a flow diagram of the automated dump analysis of the present invention;
FIG. 3 is a flow diagram illustrating alternatives for analysis on systems suffering problems of various severity; and
FIG. 4 is a flow diagram illustrating second pass analysis.