As is known in the art, large host computers and servers (collectively referred to herein as “host computer/servers”) require large capacity data storage systems. These host computer/servers generally include data processors, which perform many operations on data introduced to the host computer/server through peripherals, including the data storage system. The results of these operations are output to peripherals, including the storage system.
One type of data storage system is a magnetic disk storage system. Here an array of disk drives and the host computer/server are coupled together through an interface. The interface includes “front end” or host computer/server controllers (or directors) and “back-end” or disk controllers (or directors). The interface operates the controllers (or directors) in such a way that they are transparent to the host computer/server. That is, data is stored in, and retrieved from, the array of disk drives in such a way that the host computer/server merely thinks it is operating with its own local disk drive. One such system is described in U.S. Pat. No. 7,007,194, entitled “Data Storage System Having Point-to-Point Configuration”, Wilson et al., issued Feb. 28, 2006, assigned to the same assignee as the present invention, and incorporated herein by reference in its entirety.
In the current practice, disk drives are installed in the array by updating a configuration file in the system, physically installing the drives in the correct locations, and performing initialization routines to properly format the drives to accept user data. Once placed into the system, these new drives are considered fully capable, operational units. If they are unable to complete the initialization commands properly, they are diagnosed as bad and the installation is considered a failure, since the physical configuration does not match the expected configuration due to the missing units.
New drives may fail the installation process for various reasons: there may have been handling damage between the factory and the customer location, the format may be incorrect, there may be a previously undetected fault within the drive, or a software bug may be present. The existing process is unable to cope with any of these potential problems, the normal recourse being to order a replacement drive for the failed unit and repeat the process once the replacement has arrived. This is a time-consuming and expensive process.
Once successfully installed, the drives will provide their expected functions through their normal lifetime. Over time, however, some of the drives will encounter errors. If the errors are serious enough, the policies in the array will choose to stop using some of these drives. The current practice for high-availability systems is to repair such a failure in a minimum amount of time, in order to minimize the time during which the affected part of the system runs “exposed”, i.e., the design level of redundancy is temporarily lost, and if another failure occurs within the repair window, the user may experience a disruption in access to this data. To minimize the repair window, the system may be configured with one or more spare drives that are available to replace any other failed drive in the system, and such a spare drive is invoked automatically and immediately upon detection of a failure. Even so, once the spare is consumed, the system must be repaired by replacing the failed drive to return the system to normal levels of redundancy and protection. As the unit cost of hardware drops, the relative cost of servicing a failure increases over time.
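By way of illustration only, the spare-drive policy described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the class `DriveArray`, its fields, and the drive names are hypothetical, and the sketch simplifies by treating the array as “exposed” only when a failed drive has no spare standing in for it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DriveArray:
    """Hypothetical model of an array with a pool of spare drives."""
    active: Set[str]
    spares: List[str] = field(default_factory=list)
    awaiting_service: List[str] = field(default_factory=list)
    uncovered: int = 0  # failed drives with no spare standing in

    def fail_drive(self, drive: str) -> Optional[str]:
        """Stop using a failed drive and invoke a spare if one is available.

        Returns the name of the spare that took over, or None if no spare
        was left (the array then runs exposed until it is serviced).
        """
        self.active.remove(drive)
        self.awaiting_service.append(drive)
        if self.spares:
            spare = self.spares.pop(0)
            self.active.add(spare)
            return spare
        self.uncovered += 1
        return None

    @property
    def exposed(self) -> bool:
        # Simplifying assumption: redundancy is lost only when a failed
        # drive has no spare covering it.
        return self.uncovered > 0

    @property
    def needs_service(self) -> bool:
        # Even when a spare has taken over, the failed drive must still be
        # replaced to restore the original level of protection.
        return bool(self.awaiting_service)


array = DriveArray(active={"d0", "d1", "d2"}, spares=["s0"])
first = array.fail_drive("d1")   # the single spare is consumed
second = array.fail_drive("d2")  # no spare remains
```

In this sketch the first failure is absorbed by the spare, yet the array still reports that service is needed; the second failure finds no spare and leaves the array exposed, which is the situation the repair window is intended to keep short.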
A further issue is that the diagnosis of a drive problem must be performed while the drive is still considered part of the system. Because the host system software places tight constraints on responses to its commands, this places severe constraints on the types of operations, the response times, and the decision time available for determining whether to continue to utilize the drive.
Another category of errors is those which may be transient in nature, or may be caused by software bugs within the drive itself. These errors may masquerade as hardware errors, and once the drive is replaced and returned to the factory, a retest may find that there is nothing permanently wrong with the drive. This process adds unnecessary cost and wear and tear, and can expose the system to other errors, since there are well-known error rates for the service actions themselves.
One way to reduce the error rate due to software problems discovered within the drive subsystem would be to update the software periodically. In a high-availability system, however, this is difficult to do, since the drive software update process requires that the drive be made unavailable to the user for a period of time while the software is upgraded.