1. Field of Invention
The present invention relates generally to the field of storage system maintenance. More specifically, the present invention is related to a method for providing deferred maintenance on storage subsystems.
2. Discussion of Prior Art
Storage needs of typical datacenters have been increasing for some time and are expected to continue to increase in the foreseeable future. Even though the capacity of individual disk drives has also been increasing, the growth in storage requirements still results in more disk drives and more storage systems in a datacenter.
The typical service model for most storage systems is an on-demand model: when a disk drive fails, one of the available spares is used to rebuild the data of the failed drive. In the meantime, a service engineer, who is also notified of the failure, schedules a visit, usually within 24 hours, to replace the failed drive with a new one, which then becomes the new spare. (In some implementations, the data is copied from the spare onto the new drive, which is then inserted into the array, thereby returning the spare to the pool.) For datacenters with many disk drives (500 or more), the probability that at least one disk drive is in a failed state at any point in time is fairly high.
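One rough way to estimate this probability is sketched below. It assumes that drives fail independently, and that each drive spends a fraction of the year in the failed state equal to its annual failure rate multiplied by the time it waits for replacement. The function name, its parameters, and the independence assumption are illustrative and are not taken from any particular storage system; the result depends strongly on the failure rate and replacement time supplied.

```python
HOURS_PER_YEAR = 24 * 365


def prob_any_failed(n_drives, annual_failure_rate, hours_to_replace):
    """Approximate probability that at least one of n_drives is failed
    (and not yet replaced) at a randomly chosen instant.

    Assumptions (illustrative only): failures are independent, and the
    fraction of time a single drive is failed is approximately
    annual_failure_rate * hours_to_replace / HOURS_PER_YEAR.
    """
    if n_drives < 0:
        raise ValueError("n_drives must be non-negative")
    # Fraction of time one drive is sitting failed, awaiting replacement.
    p_single = annual_failure_rate * hours_to_replace / HOURS_PER_YEAR
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("inputs imply a per-drive probability outside [0, 1]")
    # Complement of "every drive is healthy at this instant".
    return 1.0 - (1.0 - p_single) ** n_drives
```

Under this model the probability grows with the number of drives and with the time a failed drive waits for service, which is the effect described above.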
The on-demand maintenance model has two adverse effects on large installations: high service costs associated with service engineer visits for replacing drives, and the possibility that the wrong drive could be pulled out of the system.
In order to reduce costs associated with on-demand maintenance of disk drives, a deferred maintenance model is sometimes used. In this case, a pool of spare disk drives is created so that multiple spares are available. When a disk drive fails, one spare from the pool is used for replacement. If a second drive fails, another one from the pool can be used. Service action is required only when the number of spares falls below a minimum.
Rather than scheduling a visit for each individual drive failure, the service engineer can batch multiple replacement actions into a single visit and thereby reduce service costs. Such a scheme can work with large storage systems having many disk drives, where the pool of spares can be used for replacement across all the drives connected to the system.
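The deferred maintenance model described above can be illustrated with a minimal sketch. The class name SparePool, its methods, and the simplification that a service visit happens immediately and restores the pool to full size are assumptions made for illustration only; they are not part of any particular storage system.

```python
class SparePool:
    """Hypothetical model of a spare pool under deferred maintenance."""

    def __init__(self, total_spares, min_spares):
        if not 0 <= min_spares <= total_spares:
            raise ValueError("require 0 <= min_spares <= total_spares")
        self.total_spares = total_spares
        self.min_spares = min_spares
        self.available = total_spares
        self.service_visits = 0

    def on_drive_failure(self):
        """Consume one spare; return True if this failure triggers service."""
        if self.available == 0:
            raise RuntimeError("no spare available for rebuild")
        self.available -= 1
        # Service is required only when spares fall below the minimum.
        if self.available < self.min_spares:
            self.service_visit()
            return True
        return False

    def service_visit(self):
        """One visit replaces every failed drive and refills the pool
        (assumed here to happen immediately, which is a simplification)."""
        self.service_visits += 1
        self.available = self.total_spares


# Twelve failures with a pool of 5 spares and a minimum of 2.
deferred = SparePool(total_spares=5, min_spares=2)
for _ in range(12):
    deferred.on_drive_failure()

# On-demand maintenance behaves like a pool of 1 with a minimum of 1:
# every failure triggers a visit.
on_demand = SparePool(total_spares=1, min_spares=1)
for _ in range(12):
    on_demand.on_drive_failure()

print(deferred.service_visits, on_demand.service_visits)  # prints: 3 12
```

In this sketch the same twelve failures require three visits instead of twelve, because each visit replaces four failed drives at once.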
FIG. 1 illustrates a large storage system composed of 100-500 drives. In such a storage system, a large pool of spare drives can easily be created in order to defer maintenance actions. The spare drives can be accessed by the controllers in order to replace any of the drives that have failed within the storage system. It should be noted that the two controllers depicted in FIG. 1 are typically used in an active-active configuration, i.e., under normal conditions both controllers actively manage the drives, but should one controller fail, the other manages the entire set of drives until the failed controller is replaced. The drives are all configured using a RAID scheme so that a drive failure causes no data loss. However, a spare needs to be brought into the array containing the failed drive in order to replace it. The array remains operational during the failure and during the rebuild phase, in which the data corresponding to the failed drive is rebuilt onto the spare. If the array loses another drive before the rebuild completes, there will be data loss for RAID schemes that can tolerate only one failure, such as RAID 5 and RAID 1. Once the data has been rebuilt on the spare, the array can again tolerate the failure of another drive.
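The failure and rebuild behavior of a single-fault-tolerant array can be summarized as a small state machine. The sketch below is illustrative only: the names ArrayState and SingleFaultArray are hypothetical, and it assumes that a spare is always available so that a rebuild starts as soon as a drive fails.

```python
from enum import Enum


class ArrayState(Enum):
    OPTIMAL = "optimal"        # all drives healthy, one failure tolerable
    REBUILDING = "rebuilding"  # one drive failed, data being rebuilt on a spare
    DATA_LOSS = "data loss"    # second failure occurred before rebuild finished


class SingleFaultArray:
    """Hypothetical model of an array that tolerates one drive failure
    (for example RAID 5 or RAID 1)."""

    def __init__(self):
        self.state = ArrayState.OPTIMAL

    def drive_failed(self):
        if self.state is ArrayState.OPTIMAL:
            # The array stays operational while the spare is rebuilt.
            self.state = ArrayState.REBUILDING
        elif self.state is ArrayState.REBUILDING:
            # A second failure during the rebuild cannot be tolerated.
            self.state = ArrayState.DATA_LOSS

    def rebuild_complete(self):
        if self.state is ArrayState.REBUILDING:
            # Redundancy is restored; another failure can be tolerated.
            self.state = ArrayState.OPTIMAL
```

The window of exposure is the time spent in the REBUILDING state, which is why a spare must be available immediately even when the physical replacement of the failed drive is deferred.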
However, in installations with several storage subsystems, each having a limited number of disk drives, a large pool of spares that can be used by all the storage subsystems cannot be created, because each storage subsystem can only use disk drives that are directly connected to it. Such installations are found in environments where the storage subsystems lack scalability because of cost considerations, or lack the performance to handle large numbers of disk drives.
Such a small storage system, composed of, for example, only 10-30 drives, is depicted in FIG. 2. In such a system, creating a large pool of spares means that a large fraction of the drives is not used for active storage, which lowers utilization and increases cost.
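The effect on utilization can be seen with simple arithmetic. The helper below is a hypothetical illustration; the drive and spare counts in the usage lines are example values chosen for this sketch, not figures from any particular system.

```python
def active_fraction(total_drives, spare_drives):
    """Fraction of drives used for active storage rather than held as spares."""
    if total_drives <= 0 or not 0 <= spare_drives <= total_drives:
        raise ValueError("require total_drives > 0 and 0 <= spare_drives <= total_drives")
    return (total_drives - spare_drives) / total_drives


# The same pool of 4 spares in a small and in a large system (example values).
small = active_fraction(total_drives=20, spare_drives=4)    # 16 of 20 drives active
large = active_fraction(total_drives=500, spare_drives=4)   # 496 of 500 drives active
print(f"{small:.1%} {large:.1%}")  # prints: 80.0% 99.2%
```

A pool that costs under one percent of capacity in a 500-drive system takes a fifth of the drives out of active use in a 20-drive system.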
Accordingly, it is necessary to develop a scheme that can bring the benefits of deferred maintenance to installations where there are several storage subsystems, each having a limited number of disk drives.
Whatever the precise merits, features, and advantages of the above-cited references, none of them achieves or fulfills the purposes of the present invention.