This invention relates generally to storage systems associated with computer systems and more particularly to a method and apparatus for providing dynamic sparing in a RAID storage system. That is, the method and apparatus provide a means for dynamically converting a storage system from a RAID configuration to a mirrored configuration.
As is known in the art, computer systems generally include a central processing unit, a memory subsystem and a storage subsystem. According to a networked or enterprise model of a computer system, the storage subsystem associated with, or in addition to, a local computer system may include a large number of independent storage devices or disks housed in a single enclosure. This array of storage devices is typically connected to several computers (or hosts) via dedicated cabling or via a network. Such a model allows for the centralization of data which is to be shared among many users and also allows a single point of maintenance for the storage functions associated with the many computer systems.
One type of storage system known in the art is one which includes a number of disk storage devices configured as an array (sometimes referred to as a RAID array). Such a system may include several arrays of storage devices and thus provide storage services for many host computers. Alternately, a single storage system may store massive amounts of data for a single host computer or even a single application program. With such systems there is often a need and expectation that the data stored on the disk devices of the storage system be available twenty-four hours a day, seven days a week. Such a requirement places a heavy burden on the storage system in terms of reliability.
One method of achieving reliability is through the use of various RAID configurations. In at least one RAID configuration, part of the available storage capacity within a storage system is used to store parity information. The parity information may be generated using the stored data. The data may be spread across several disks with the parity information residing on yet another disk. The data and associated parity storage are typically known as a RAID group. With this arrangement, data associated with a failing or failed disk device may be reconstructed using the remaining data and the parity information. One of the drawbacks of such a system is the increased processing time required to maintain the parity information for each transaction. Another drawback is the loss of storage capacity due to the storage of the parity information.
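By way of illustration only, the reconstruction of data from parity described above may be sketched as follows. The sketch assumes that the parity is a byte-wise exclusive-OR (XOR) of the data blocks, which is one common choice and is not stated in the description above. The function name xor_blocks and the sample block contents are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    # zip(*blocks) groups the bytes found at the same offset in every block.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical RAID group: three data disks plus one parity disk.
data_blocks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\x0f"]
parity = xor_blocks(data_blocks)

# Assume the second data disk fails. Its contents may be rebuilt by
# XOR-ing the surviving data blocks with the parity block.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)
assert rebuilt == data_blocks[1]
```

The sketch also reflects both drawbacks noted above: the parity must be recomputed whenever any data block changes, and one block of every four holds parity rather than user data.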
Another method of achieving reliability is to provide a local or remote mirror of each storage device within the system. With such an arrangement, each time the host writes to the storage system, the data is stored in both an active and a backup storage device. Should one of the active devices fail, the backup device may be seamlessly substituted, thus providing uninterrupted service. One of the drawbacks of the mirroring solution is the additional storage devices required to provide the mirroring function.
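By way of illustration only, the mirroring arrangement may be sketched as follows. The class MirroredPair and its methods are hypothetical, and the storage devices are modeled as in-memory dictionaries purely for the purposes of the example.

```python
class MirroredPair:
    """Hypothetical model of an active device and its mirror."""

    def __init__(self):
        self.active = {}   # block number -> data on the active device
        self.backup = {}   # block number -> data on the backup device
        self.active_failed = False

    def write(self, block, data):
        # Every host write is stored on both devices while both are healthy.
        if not self.active_failed:
            self.active[block] = data
        self.backup[block] = data

    def read(self, block):
        # After a failure, reads are served from the backup device.
        source = self.backup if self.active_failed else self.active
        return source[block]

    def fail_active(self):
        self.active_failed = True


pair = MirroredPair()
pair.write(0, b"payload")
pair.fail_active()
assert pair.read(0) == b"payload"
```

Because each write is duplicated, the backup device already holds a complete copy at the moment of failure, at the cost of doubling the number of storage devices.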
Yet another method of providing storage system reliability is through the use of so-called dynamic sparing. Dynamic sparing may be thought of as a cousin of the mirror solution in that a complete copy of a failed device is substituted for the failed device. The difference between sparing and mirroring is that with a sparing solution, data is not always written to both the active and backup storage devices. Dynamic sparing operates by sensing when a particular storage device is beginning to fail. Failure may be indicated if a particular device begins to report an unacceptable number of I/O errors. When this condition is sensed by the storage system, it begins copying all data from the failing device to a backup device. The backup device, which has been idle until this point, will then replace the failing drive when all data is copied. Thus, potential storage system unavailability may be avoided.
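By way of illustration only, the dynamic sparing sequence described above may be sketched as follows. The classes Disk and SparingController, and the parameter error_threshold, are hypothetical. In particular, the fixed error count used here to decide that the number of I/O errors has become unacceptable is an assumption made for the example.

```python
class Disk:
    """Hypothetical in-memory model of a storage device."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}      # block number -> data
        self.io_errors = 0    # running count of reported I/O errors


class SparingController:
    """Hypothetical controller that swaps in an idle spare for a failing disk."""

    def __init__(self, active, spare, error_threshold=3):
        self.active = active
        self.spare = spare                      # idle until sparing begins
        self.error_threshold = error_threshold  # assumed failure criterion

    def write(self, block, data):
        # Unlike mirroring, normal writes go to the active device only.
        self.active.blocks[block] = data

    def read(self, block):
        return self.active.blocks[block]

    def report_io_error(self):
        self.active.io_errors += 1
        if self.active.io_errors >= self.error_threshold and self.spare is not None:
            self._switch_to_spare()

    def _switch_to_spare(self):
        # Copy all data from the failing device, then let the spare replace it.
        for block, data in self.active.blocks.items():
            self.spare.blocks[block] = data
        self.active, self.spare = self.spare, None


controller = SparingController(Disk("disk0"), Disk("spare0"), error_threshold=2)
controller.write(7, b"critical")
controller.report_io_error()
controller.report_io_error()   # threshold reached; the spare takes over
assert controller.active.name == "spare0"
assert controller.read(7) == b"critical"
```

Once the spare has been consumed, the controller in this sketch has no further backup device available until the failed device is repaired or replaced.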
Storage system customers typically choose one of the above schemes for achieving some level of fault tolerance. A drawback with each of the reliability solutions described above is that until the failed device is repaired or replaced, the storage system will typically be left operating in a state where no additional failures may be tolerated. That is, if a second device were to fail before the first failing device were replaced, the data would no longer be available to the host computers. This level of uncertainty may not be acceptable when critical data is being stored within a storage system.