As computers and operating systems become more complex, failure and performance analysis becomes more difficult. For reliability reasons, it is highly desirable to test computer systems exhaustively before introducing them into service. Generally speaking, it is possible to write code that tests specific portions of the computer system hardware. However, such narrowly focused tests do not stress the computer system as a whole, and do not reflect the way in which the computer system tends to be used in the field.
Another approach involves employing diagnostic tools, such as diagnostic exercisers, that execute under operating systems to detect errors. Nowadays, diagnostic exercisers are typically designed to execute under a production operating system (“OS”), e.g., Linux, HPUX™ (Hewlett-Packard Company, Palo Alto, Calif.), or Windows (Microsoft Corporation, Redmond, Wash.). By employing specifically designed applications that use special drivers and/or operating system calls, diagnostic exercisers executing under a production OS can stress the computer system in specific ways and detect certain errors when they occur. Because these diagnostic exercisers employ the OS, their applications can stress the hardware more fully and in ways that more closely resemble use in the field, e.g., through features such as multi-threading and rapid disk I/O operations.
In general, the amount of stress experienced by the computer system varies depending on which diagnostic exerciser applications are being executed. A diagnostic exerciser that executes under a production operating system is, however, restricted in its capabilities by that operating system. For competitive reasons, production operating systems tend to be optimized for performance in the field and not for extensive diagnostic capabilities, so a diagnostic exerciser executing under such a production OS can use only the limited diagnostic facilities that the OS exposes. Likewise, other features useful for data gathering, error detection, and error analysis may not be available in a production OS, since providing such features would unduly hamper OS performance and/or compromise the security of the computer system in the field.
Furthermore, most diagnostic exercisers executing under a production OS tend to run as privileged processes and/or drivers. Because such processes and drivers depend on a running kernel, their usefulness in a kernel crash situation tends to be limited. For example, if the kernel crashes during kernel booting, the common diagnostic monitor of the prior art diagnostic exerciser would not yet have been started, and would therefore be of little value for the detection and analysis of kernel boot failures.
Additionally, when the kernel crashes, most common diagnostic techniques involve obtaining a kernel dump and analyzing the kernel dump data. However, such diagnostic techniques are inherently limited by the information saved in the kernel dump file. If the programmer who originally designed the kernel dump program decided not to save certain information, that information will not be available in the kernel dump file for analysis. Furthermore, the content and format of the information in the kernel dump file may vary at the discretion of the programmer who originally designed the kernel dump. Accordingly, the kernel dump information may be difficult to decipher, and it may be necessary to find the programmer who originally designed the kernel to obtain assistance in analyzing the kernel dump file. If a long period of time has elapsed between the time the OS was produced and the time of the error that results in the kernel dump, the programmer may no longer be available to assist, and kernel experts may be required to sift through the kernel dump data.