1. Field of the Invention
The invention relates generally to the field of data storage in computer systems and, more specifically, to a technique for handling hardware errors while avoiding a system crash.
2. Description of the Related Art
In a computing system such as a typical UNIX system, a hardware error such as a machine check causes the system to crash. Normally, applications are not even given a chance to log any information. When information can be logged, it is used to identify the faulty component only after the image is rebooted. A machine check is always treated as a system-fatal error. In a data storage facility, an example of which is the IBM pSeries system, many conditions can cause a machine check, such as a target abort, a master abort, or a parity error. In a general purpose UNIX server, it is reasonable to invoke a machine check for those conditions. In a data storage facility, however, the result is that the facility becomes temporarily unavailable.
Furthermore, a multi-cluster data storage facility, an example of which is the IBM TotalStorage ESS storage server, is a closed environment with its own host adapters and device adapters and their respective device drivers. If any of these hardware adapters causes a peripheral component interconnect (PCI) error such as a target abort, the entire cluster, or computer-electronic complex (CEC), crashes and is rebooted. During this time, the data storage facility runs in single-cluster mode. This is undesirable because the functionality and performance of the data storage facility are impaired.
Accordingly, it would be desirable to provide a procedure for handling hardware errors in a computing system that enables the system to continue to function without a system crash.