In a complex system such as a computer-processor-based system, if an error is detected frequently, system reliability may be poor even if the error is due to an intermittent fault. A component having an intermittent fault which is detected frequently may eventually cause a fatal fault even though the component itself does not have a fatal fault. Such a component also lowers system reliability and requires time for recovering from the fault (e.g., correcting the error), thereby degrading system performance.
As is known in the art, large host computers and servers (collectively referred to herein as “host computer/servers”) require large capacity data storage systems. These large computer/servers generally include data processors, which perform many operations on data introduced to the host computer/server through peripherals including the data storage system. The results of these operations are output to peripherals, including the storage system.
One type of data storage system is a magnetic disk storage system. Here an array or bank of disk drives and the host computer/server are coupled together through a system interface. The interface includes “front end” or host computer/server controllers and “back-end” or disk controllers. The interface operates the controllers in such a way that they are transparent to the host computer/server. That is, data is stored in, and retrieved from, the bank of disk drives in such a way that the host computer/server merely thinks it is operating with its own local disk drive. One such system is described in U.S. Pat. No. 5,206,939, entitled “System and Method for Disk Mapping and Data Retrieval”, inventors Moshe Yanai, Natan Vishlitzky, Bruno Alterescu and Daniel Castel, issued Apr. 27, 1993, and assigned to the same assignee as the present invention.
Given the large number of disk drives in a typical implementation, there is a reasonable likelihood that one or more disk drives will experience an operational problem that either degrades drive read-write performance or causes a drive failure. This is because disk drives are complex electromechanical systems. Sophisticated firmware and software are required for the drive to operate with other components in the storage system. The drives further incorporate moving parts and magnetic heads which are sensitive to particulate contamination and electrostatic discharge (ESD). There can be defects in the media, rotational vibration effects, and failures relating to the motors, bearings, and other hardware components or connections. Some problems arise with respect to drive firmware or drive circuitry. Environmental factors such as temperature and altitude can also affect the performance of the disk drive. Thus, drives can fail, and a failure can be significant if it leaves the drive unable to perform.
Many disk drives used in data storage systems include a firmware/processor which monitors the performance and operation of the disk drive. If such firmware/processor detects a fault in such operation, it sets a bit in a register in the disk drive and places such disk drive in a bypass state (i.e., off-line) (also known as a bypass condition) for a short period of time, typically on the order of, for example, 200 milliseconds, thereby disabling its access by the host computer. More particularly, the system interface includes a diagnostic section (which may be included within the controllers) which regularly polls (i.e., inspects), typically once every 500 milliseconds, for example, the state of the bit register in each of the disk drives. In one system, whenever the diagnostic section detects that the bit register in a disk drive has been set, i.e., the disk drive is in a bypass condition, such bypass condition is reported to the system interface control section (i.e., the controllers), thereby advising the controllers to no longer access (i.e., write data to or read data from) the bypassed disk drive. It is noted that the diagnostic section, when it detects a bypass condition, i.e., a set bit, does not know whether the bypass is only temporary or permanent. That is, the diagnostic section does not know whether the disk drive will have its bypass condition removed and thereby again be operational. The polling continues and, if the disk drive bypass condition is removed, the system interface commences a rebuilding of data operation using error correction and detection codes (i.e., a data reconstruction operation). If, during the rebuilding process, a new poll indicates that the disk drive is again in a bypass condition, the system interface must restart the data rebuilding process.
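The polling behavior described above can be illustrated with a minimal simulation. This is a sketch only, not the implementation of any actual system: the `Drive` class, the `simulate` function, and the fault times and rebuild duration passed to them are hypothetical names and values introduced for illustration. Only the 200 millisecond bypass period and the 500 millisecond polling interval come from the example figures given in the text.

```python
from dataclasses import dataclass

POLL_INTERVAL_MS = 500    # example polling interval from the text
BYPASS_DURATION_MS = 200  # example bypass period from the text


@dataclass
class Drive:
    """Hypothetical drive model: each entry in fault_times_ms is a time at
    which the drive firmware sets the bypass bit for BYPASS_DURATION_MS."""
    fault_times_ms: list

    def bypass_bit(self, now_ms: int) -> bool:
        return any(t <= now_ms < t + BYPASS_DURATION_MS
                   for t in self.fault_times_ms)


def simulate(drive: Drive, total_ms: int, rebuild_ms: int) -> tuple:
    """Poll the drive and return (rebuilds started, rebuilds completed)."""
    starts = 0
    completed = 0
    accessible = True
    remaining = None  # None means no rebuild is in progress
    for now in range(0, total_ms + 1, POLL_INTERVAL_MS):
        if drive.bypass_bit(now):
            # Bypass detected: stop accessing the drive and abandon any
            # rebuild in progress, since it cannot be known at this point
            # whether the bypass is temporary or permanent.
            accessible = False
            remaining = None
        elif not accessible:
            # Bypass condition removed: commence (or restart) the rebuild.
            accessible = True
            starts += 1
            remaining = rebuild_ms
        elif remaining is not None:
            # Rebuild progresses between polls.
            remaining -= POLL_INTERVAL_MS
            if remaining <= 0:
                remaining = None
                completed += 1
    return starts, completed
```

For instance, with hypothetical faults at 0 and 2000 milliseconds and an assumed 3000 millisecond rebuild, the first rebuild is abandoned when the second bypass is detected, so two rebuilds are started but only one completes.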
Further, once the disk drive is placed in a non-accessible condition, the system interface commences the rebuilding of data operation using error correction and detection codes and using a spare disk drive in the array or bank of disk drives, sometimes referred to as a “hot spare” disk drive, to immediately and automatically replace the bypassed disk drive. Thus, once a hot spare switches into the system, the data reconstruction must be made using the hot spare before the data can be re-written from the hot spare back into the bypassed, and now perhaps physically replaced, disk drive. This process can take from 30 minutes to perhaps several days. Thus, the possibility of repeated responses to bypass condition bits set by the disk drive reduces the efficiency of the data storage system and leaves the data vulnerable to loss should a second fault occur.
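As a rough illustration of reconstructing a bypassed drive's data onto a hot spare, the sketch below uses simple XOR parity. This is an assumption made for illustration: the text says only that error correction and detection codes are used and does not name a particular code. The function names and the sample blocks are likewise hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


def rebuild_onto_hot_spare(surviving_blocks):
    """Reconstruct the missing block from the surviving data blocks and the
    parity block (assumes single XOR parity, which is an illustrative choice
    and not a scheme stated in the text)."""
    return xor_blocks(surviving_blocks)


# Hypothetical stripe: three data drives plus one parity drive.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# The drive holding data[1] is bypassed; its contents are rebuilt onto the
# hot spare from the remaining drives.
hot_spare = rebuild_onto_hot_spare([data[0], data[2], parity])
```

With single parity of this kind, a second drive fault before the rebuild completes would leave too little information to reconstruct the stripe, which is the vulnerability the paragraph above describes.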