After a piece of computer hardware, such as a memory device, is developed, it undergoes a testing phase to identify all possible errors, which are then debugged. In a cluster, where multiple identical units of hardware are used, hardware errors need to be analyzed in every single unit, for instance, in each of the host adapter cards.
When a hardware error flag, such as a Header Longitudinal Redundancy Check (Header LRC check) flag, is raised, the hardware goes through a delayed reset. If multiple host adapter cards are diagnosed with errors, the system delay time increases because of the multiple hardware resets. Most of the time, however, the indicated hardware errors are caused by errors in the microcode, and the data in the system cache is correct. A microcode error can raise the same type of hardware error flag in most of the host adapter cards, and that flag is indistinguishable from one raised by a true hardware error. As a result, each indicated error takes down host adapter cards one after another, which finally leads to the failure of I/O and loss of access.
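To make the Header LRC check concrete, the following is a minimal sketch of one common form of longitudinal redundancy check, in which all header bytes are XORed into a single check byte. The function names `lrc` and `header_lrc_ok` and the sample header bytes are hypothetical, and the actual check performed by a host adapter card may use a different LRC variant.

```python
def lrc(data: bytes) -> int:
    """Return the XOR of all bytes in data (one common LRC variant)."""
    result = 0
    for byte in data:
        result ^= byte
    return result


def header_lrc_ok(header: bytes, stored_lrc: int) -> bool:
    """Return True when the recomputed LRC matches the stored check byte."""
    return lrc(header) == stored_lrc


# Hypothetical three-byte header, for illustration only.
header = bytes([0x12, 0x34, 0x56])
stored = lrc(header)  # 0x70

# A single flipped bit in the header makes the check fail.
corrupted = bytes([0x12, 0x35, 0x56])
print(header_lrc_ok(header, stored))     # True
print(header_lrc_ok(corrupted, stored))  # False
```

A failed check of this kind only says that the header and its check byte disagree. It does not say whether the disagreement came from faulty hardware or from microcode that built or verified the header incorrectly, which is why the two cases raise the same flag.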
A true hardware error can be related to problems with tracks on the drive, in the cache, or in the host adapter cards. Hardware errors are not unexpected, but numerous occurrences of similar hardware errors on different pieces of similar hardware are unexpected and can be attributed to a microcode error. In this case, it is desirable not to take any further recovery actions on the hardware. The general ability to drive I/O depends on successful error recovery when problems occur on the I/O processing hardware. Thresholding errors at too coarse a granularity can result in a loss of access to hardware.
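The reasoning above can be sketched as a small tracker that counts how many distinct host adapter cards have reported the same type of error, and stops recommending resets once that count reaches a limit. This is only an illustration under stated assumptions: the class name `ErrorThreshold`, the limit of three adapters, the adapter identifiers, and the action strings are all hypothetical and are not taken from any particular system.

```python
from collections import defaultdict


class ErrorThreshold:
    """Track which adapters reported each error type (illustrative only)."""

    def __init__(self, max_adapters: int = 3):
        # Assumed limit: the number of distinct adapters reporting the same
        # error type at which the error is treated as a likely microcode error.
        self.max_adapters = max_adapters
        self._adapters_by_error = defaultdict(set)

    def report(self, adapter_id: str, error_type: str) -> str:
        """Record an error and return the suggested action."""
        adapters = self._adapters_by_error[error_type]
        adapters.add(adapter_id)
        if len(adapters) >= self.max_adapters:
            # Similar errors on many similar cards: stop resetting hardware.
            return "suspect_microcode"
        # Isolated error: treat it as a true hardware error.
        return "reset_adapter"


tracker = ErrorThreshold(max_adapters=3)
print(tracker.report("HA0", "HEADER_LRC"))  # reset_adapter
print(tracker.report("HA1", "HEADER_LRC"))  # reset_adapter
print(tracker.report("HA2", "HEADER_LRC"))  # suspect_microcode
```

Counting per error type rather than across all errors is one way to keep the threshold fine-grained: an unrelated error on another card is still handled as an ordinary hardware error instead of being folded into the same count.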