As is known in the art, large host computers and servers (collectively referred to herein as “host computer/servers”) require large capacity data storage systems. These large host computer/servers generally include data processors, which perform many operations on data introduced to the host computer/server through peripherals including the data storage system. The results of these operations are output to peripherals, including the storage system.
One type of data storage system is a magnetic disk storage system. Here an array or bank of disk drives and the host computer/server are coupled together through a system interface. The interface includes “front end” or host computer/server controllers and “back-end” or disk controllers. The interface operates the controllers in such a way that they are transparent to the host computer/server. That is, data is stored in, and retrieved from, the bank of disk drives in such a way that the host computer/server merely thinks it is operating with its own local disk drive. One such system is described in U.S. Pat. No. 5,206,939, entitled “System and Method for Disk Mapping and Data Retrieval”, inventors Moshe Yanai, Natan Vishlitzky, Bruno Alterescu and Daniel Castel, issued Apr. 27, 1993, and assigned to the same assignee as the present invention.
As is also known in the art, many disk drives used in data storage systems include, in addition to the magnetic storage device, a firmware/processor which monitors the performance and operation of the disk drive. If such firmware/processor detects a fault in such operation, it sets a bit in a register in the disk drive and places such disk drive in a by-pass state for a short period of time, typically on the order of, for example, 200 milliseconds, thereby disabling its access by the host computer. More particularly, the system interface includes a diagnostic section (which may be included within the controllers) which regularly polls (i.e., inspects), typically once every 500 milliseconds, for example, the state of the bit register in each of the disk drives. In one system, whenever the diagnostic section detects that the bit register in a disk drive has been set, i.e., the disk drive is in a by-pass condition, such by-pass condition is reported to the system interface control section (i.e., the controllers), thereby advising the controllers to no longer access (i.e., write data to or read data from) the by-passed disk drive. It is noted that the diagnostic section, when it detects a by-pass condition, i.e., a set bit, does not know whether the by-pass is only temporary or permanent. That is, the diagnostic section does not know whether the disk drive will have its by-pass condition removed and thereby again be operational. The polling continues and, if the disk drive by-pass condition is removed, the system interface commences a data rebuilding operation using error correction and detection codes (i.e., a data reconstruction operation). If, during the rebuilding process, a new poll indicates that the disk drive is again in a by-pass condition, the system interface must re-start the data rebuilding process.
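The polling behavior described above can be illustrated with a minimal sketch. It is a simplified model under stated assumptions, not the implementation of any particular system: the names `DiskDrive`, `DiagnosticSection`, `poll`, and `simulate` are hypothetical, each list element stands for the bit value seen at one poll, and rebuild completion is not modeled.

```python
from dataclasses import dataclass


@dataclass
class DiskDrive:
    """Hypothetical model of a drive with a single by-pass bit."""
    bypass_bit: bool = False


class DiagnosticSection:
    """Hypothetical poller that tracks rebuild restarts for one drive."""

    def __init__(self, drive):
        self.drive = drive
        self.accessible = True    # controllers may read from/write to the drive
        self.rebuilding = False   # a data rebuild is in progress
        self.rebuild_starts = 0   # number of times a rebuild was (re)started

    def poll(self):
        """One polling cycle (for example, every 500 ms in the text)."""
        if self.drive.bypass_bit:
            # By-pass detected: stop accessing the drive. Any rebuild in
            # progress is abandoned and must start over later.
            self.accessible = False
            self.rebuilding = False
        elif not self.accessible:
            # By-pass removed: the drive is usable again, so begin rebuilding.
            self.accessible = True
            self.rebuilding = True
            self.rebuild_starts += 1


def simulate(bit_samples):
    """Feed a sequence of sampled bit values to the poller, one per poll."""
    drive = DiskDrive()
    diag = DiagnosticSection(drive)
    for bit in bit_samples:
        drive.bypass_bit = bit
        diag.poll()
    return diag
```

In this model, a drive whose bit is seen set, cleared, set, and cleared again on successive polls, e.g. `simulate([False, True, False, True, False])`, causes the rebuild to be started twice, which is the repeated-restart behavior noted above.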
Further, once the disk drive is placed in a non-accessible condition, the system interface commences the data rebuilding operation using error correction and detection codes and using a spare disk drive in the array or bank of disk drives, sometimes referred to as a “hot spare” disk drive, to immediately and automatically replace the by-passed disk drive. Thus, once a hot spare switches into the system, the data reconstruction must be made using the hot spare before the data can be re-written from the hot spare back into the by-passed, and now perhaps physically replaced, disk drive. This process can take from 30 minutes to perhaps several days. Thus, the possibility of repeated responses to by-pass condition bits set by the disk drive reduces the efficiency of the data storage system and leaves the data vulnerable to loss should a second fault occur.
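The two-step hot spare process described above (reconstruct onto the spare, then copy back) can be sketched as follows. The text refers only to “error correction and detection codes”; simple XOR parity is assumed here purely for illustration, and the drive layout, block contents, and the helper `xor_blocks` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (assumed parity scheme)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives plus one parity drive, one block each.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Drive 1 is by-passed: rebuild its block onto the hot spare from the survivors.
survivors = [data[0], data[2], parity]
hot_spare = xor_blocks(survivors)

# Later, copy the rebuilt block back to the (perhaps replaced) drive 1.
restored_drive_1 = bytes(hot_spare)
```

Until the rebuild onto the spare completes, the loss of any one of the surviving blocks would make the missing block unrecoverable under this assumed scheme, which is the vulnerability to a second fault noted above.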