Modern computer systems are complex electronic systems made up of many hardware devices, such as processors, memory modules, storage devices, and networking devices, and of sophisticated software programs, such as operating systems, device drivers, and application programs. Computer system maintenance is therefore essential to protect computer systems against abnormal conditions or failures. However, with the ever-growing complexity of modern computer systems, it is sometimes difficult to determine the root cause of a system problem. Computer operating systems and other computer diagnostic programs often provide debugging methods or diagnostic tools to help identify system problems.
One of these tools is called a crash dump, which, upon the occurrence of a system problem, saves status information of various computer devices in a predetermined memory or storage location for diagnostic purposes. The status information is often manually reviewed by troubleshooting personnel to determine the underlying causes of the system problem. Sometimes, after the troubleshooting personnel review the crash dump, the system problem may appear to be caused by failures of numerous input/output (I/O) devices, such as disk drives. Traditionally, the troubleshooting personnel would try to fix the system problem by replacing a first seemingly bad I/O device. If the problem persists, a second seemingly bad device is then replaced. This process is repeated until either the problem goes away or all the devices have been replaced while the problem remains unresolved. However, this “trial and error” method generally fails to pinpoint and isolate the device problem, and thus increases the computer system downtime. Moreover, this method may fail altogether if the system problem is caused by the failure of an intermediate or internal device, such as an I/O controller; that is, the apparent failure of the I/O devices is merely a side effect of the failure of the internal or intermediate device.
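The limitation of the trial-and-error approach described above can be illustrated with a minimal sketch in Python. It is not part of any actual diagnostic tool: the `Controller` and `Disk` classes and the `trial_and_error` function are hypothetical names introduced only for illustration, and the model assumes that a disk appears failed whenever either the disk itself or the controller it is attached to is faulty.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    """Hypothetical intermediate device, e.g. an I/O controller."""
    name: str
    healthy: bool = True


@dataclass
class Disk:
    """Hypothetical I/O device attached to a controller."""
    name: str
    controller: Controller
    healthy: bool = True

    def appears_failed(self) -> bool:
        # Assumption: a disk looks bad if it is faulty itself
        # or if the controller it sits behind is faulty.
        return not self.healthy or not self.controller.healthy


def trial_and_error(disks: list[Disk]) -> tuple[int, bool]:
    """Replace seemingly bad disks one at a time.

    Returns (number of replacements, whether the problem was resolved).
    """
    replacements = 0
    for disk in disks:
        # Stop as soon as no disk appears failed any more.
        if not any(d.appears_failed() for d in disks):
            break
        if disk.appears_failed():
            disk.healthy = True  # model swapping in a known-good disk
            replacements += 1
    resolved = not any(d.appears_failed() for d in disks)
    return replacements, resolved


# Case 1: the controller is the real fault; every disk behind it looks bad.
bad_ctrl = Controller("ctrl0", healthy=False)
disks_a = [Disk(f"disk{i}", bad_ctrl) for i in range(3)]
print(trial_and_error(disks_a))  # (3, False): all disks replaced, still broken

# Case 2: a single disk is the real fault.
good_ctrl = Controller("ctrl1")
disks_b = [Disk(f"disk{i}", good_ctrl) for i in range(3)]
disks_b[1].healthy = False
print(trial_and_error(disks_b))  # (1, True): one replacement resolves it
```

In the first case every disk is replaced and the problem persists, because the model never examines the shared controller; in the second case a single replacement is enough. The sketch only mirrors the reasoning in the text and makes no claim about how real crash dump data is structured.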