The business success of an enterprise can be highly dependent upon the availability of information technology (IT) resources. System downtime can be very expensive, for some business organizations in the range of millions of dollars per hour. Thus, when a system crash occurs, business success can depend heavily on performance measured by metrics such as time-to-recovery and time-to-problem-resolution. A conventional system crash analysis paradigm includes a dump of system information, reboot and recovery of the system, and then analysis of the dump. The dump analysis generally occurs long after the crash and recovery and is performed by persons with expertise in the software and/or hardware of the crashed system. The dump files are commonly transferred to experts at a service organization of a supplier of the crashed system, adding a long delay to the time-to-problem-resolution due to the logistics involved in transferring the dump files to the service organization. Under common conditions, the transfer can take hours, days, or even weeks because some dumps are gigabytes in size, resulting in delays for mailing, handling, and receiving a dump tape.
System crashes can be considered to fall into three main categories: operating system crashes, hardware machine checks, and hung systems. Operating system crashes and hardware machine checks are commonly addressed by a system memory dump, also called a core dump. Memory dumps can take a very long time to perform due to ever-increasing maximum memory configurations, up to one terabyte for large servers and expected to rise to eight terabytes in the near future. After a system crash, acquisition of the memory dump can therefore greatly lengthen system time-to-recovery. Some IT system users, due to business pressures, now forgo acquisition of memory dumps after a system crash to accelerate time-to-recovery. The practice increases business risk: without the memory dump, no data is available for problem analysis, the root cause of the system crash is not determined, and the problem can recur.
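The scale of the dump-time problem can be illustrated with a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function name `dump_duration_hours` and the sustained throughput of 100 MiB/s are assumptions chosen for the example, not figures taken from any particular system, and a terabyte is treated as 2^40 bytes.

```python
def dump_duration_hours(memory_bytes: int, throughput_bytes_per_s: float) -> float:
    """Estimate the hours needed to write a full memory image to a dump device."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return memory_bytes / throughput_bytes_per_s / 3600.0


TIB = 1024 ** 4                        # one terabyte, taken here as 2**40 bytes
ASSUMED_THROUGHPUT = 100 * 1024 ** 2   # hypothetical sustained rate: 100 MiB/s

# Memory sizes mentioned in the text: one terabyte today, eight expected.
for size_tib in (1, 8):
    hours = dump_duration_hours(size_tib * TIB, ASSUMED_THROUGHPUT)
    print(f"{size_tib} TiB at 100 MiB/s: about {hours:.1f} hours")
```

Under these assumed numbers, a full dump of one terabyte takes roughly 2.9 hours and a dump of eight terabytes roughly 23.3 hours, all of which adds directly to time-to-recovery. A faster dump device shortens the estimate proportionally but does not change the linear growth with memory size.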
A hardware crash can be caused by either a hardware error or software passing an invalid address to the hardware. A dump-and-then-analyze paradigm generally demands that all possible data be dumped because the information useful for a particular problem analysis is typically unforeseen and unforeseeable. Therefore, a hardware crash typically demands both a hardware crash dump file and a system memory dump file for suitable analysis, resulting in a long time-to-recovery because system memory dumps are large and take a long time to perform. Field data from users with large IT installations indicate that a high percentage of hardware crashes do not result from failures related to data addressing, that is, from software passing an invalid address. Accordingly, for many or most hardware crashes, a system memory dump is a waste of time.