1. Field of the Invention
This invention relates to an electronic storage array and particularly to maintaining data in an electronic storage array during multiple, concurrent drive failures within the electronic storage array.
2. Description of the Related Art
One requirement of contemporary distributed data storage systems, such as a redundant array of independent disks (RAID) storage system, is to try to maintain data availability throughout a large number of drive failure scenarios. In one scenario, it is important to maintain data access during multiple drive failures within a single volume set. A volume set is a collection of user data extents presented to an operating system as a range of consecutive logical block addresses (LBAs).
Each volume set may be protected using one or more different RAID levels. Commonly known RAID levels include, for example, RAID 0 (data striping), RAID 1 (disk mirroring), RAID 0+1 (data striping and disk mirroring), RAID 2 and 3 (parallel processing), RAID 4 (parity disk), and RAID 5 (parity striping). RAID 1, RAID 0+1, and RAID 5 are commonly employed in distributed data storage systems. However, these data storage and access structures can generally tolerate only a single drive failure and still provide complete access to the user data. If more than one drive fails at a given time, it may become extremely difficult or even impossible to recover the data from the damaged drives.
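The single-failure limit of a parity-protected stripe can be illustrated with a minimal sketch. The sketch assumes byte-wise XOR parity over one stripe, as is conventional for RAID 5. The names xor_blocks and rebuild, and the sample block contents, are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, failed):
    """Rebuild the one missing block of a stripe from the surviving blocks.

    `stripe` holds the data blocks followed by the parity block.
    `failed` is the set of block indexes that cannot be read.
    """
    if len(failed) != 1:
        # Single parity carries enough information for one missing block only.
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [block for i, block in enumerate(stripe) if i not in failed]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
stripe = data + [xor_blocks(data)]

# One failed drive: the missing block is recoverable from the survivors.
recovered = rebuild(stripe, {1})

# Two failed drives: the rebuild is refused because the data is not recoverable.
try:
    rebuild(stripe, {0, 1})
    double_failure_recoverable = True
except ValueError:
    double_failure_recoverable = False
```

With one block missing, the XOR of the surviving blocks reproduces it. With two blocks missing, the surviving blocks no longer determine either one.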
Today, several different approaches may be taken when protecting the user data in a distributed data storage system. One approach attempts to prevent a user from ever being at risk of losing data. This is accomplished by placing the volume set in a read-only mode when the volume set is in a critical state because a disk failure is detected. The user is not allowed to write data to the critical volume set while in this state, which persists until the data on the failed drive can be recovered and the failed drive can be rebuilt. The intention of this approach is to limit the amount of time that the distributed data storage system is exposed to multiple disk failures. Unfortunately, in the event of a second concurrent disk failure, the user data is lost and cannot be recovered. This is because the data from the second drive, which is required to recover the data from the first drive, becomes unavailable when the second drive fails.
Another known approach to dealing with a drive failure in a distributed data storage system is to allow the user to continue to access the data in a limited manner during multiple drive failures (as long as the failures are not complete and catastrophic failures). During the period of multiple failures, this approach attempts to keep track of the data that is in error, but still allows access to the data.
This approach presents a significant problem with regard to new data that should, but cannot, be written to the critical volume set due to the drive failure. For example, the data may be cached in the storage system controller, but cannot be written to the failed target disk within the volume set. One solution is to “pin,” or hold, the write data in the cache until the user either reads the data back or executes a specific command to clear the pinned data. Pinning the write data in the cache prevents the loss of any data that is already written to cache and, if the user is prevented from writing any additional data, will protect the volume set to the greatest possible extent. However, this approach is limited by the amount of data that may be pinned in the cache. Consequently, this approach may not work well when the amount of pinned data becomes larger than a small percentage of the overall available cache, because the system still needs to continue to operate with the other non-critical volume sets. Storing large amounts of pinned data in the cache may adversely affect non-critical volume sets that do not have failed drives.
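The cache-pressure problem described above can be sketched as follows. This is a minimal illustration, not a description of any particular controller. The WriteCache class, its pin and clear methods, and the 10 percent pin budget are all assumptions introduced here. The sketch also models only the explicit clear command, not release of pinned data on read-back.

```python
class WriteCache:
    """Toy model of a controller cache that pins undeliverable write data."""

    def __init__(self, capacity_bytes, pin_percent=10):
        # pin_percent is a hypothetical limit on how much of the cache may
        # hold pinned data before non-critical volume sets are affected.
        self.capacity = capacity_bytes
        self.pin_budget = capacity_bytes * pin_percent // 100
        self.pinned = {}  # (volume, lba) -> write data held in cache

    def pinned_bytes(self):
        return sum(len(data) for data in self.pinned.values())

    def pin(self, volume, lba, data):
        """Hold write data that cannot reach the failed drive.

        Returns False when pinning would exceed the pin budget.
        """
        replaced = len(self.pinned.get((volume, lba), b""))
        if self.pinned_bytes() - replaced + len(data) > self.pin_budget:
            return False
        self.pinned[(volume, lba)] = data
        return True

    def clear(self, volume):
        """Model the explicit user command that discards pinned data."""
        for key in [k for k in self.pinned if k[0] == volume]:
            del self.pinned[key]


cache = WriteCache(capacity_bytes=1000)               # pin budget: 100 bytes
first_ok = cache.pin("critical_vol", 0, b"x" * 60)    # fits within the budget
second_ok = cache.pin("critical_vol", 1, b"x" * 60)   # would exceed the budget
```

Once the budget is reached, further writes to the critical volume set can no longer be held, which is the limitation noted above.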
Consequently, a need exists for an apparatus, system, and process for maintaining data in an electronic storage array during multiple drive failures. Beneficially, such an apparatus, system, and process would allow read and write access to the critical volume set during a first drive failure and would allow read-only access during multiple drive failures. The read-only access would also preferably provide access for data recovery for the first failed drive even after the failure of a subsequent drive.
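The desired access behavior can be summarized in a short sketch. It assumes the access mode depends only on the number of concurrently failed drives in the volume set. The Access enumeration and the access_mode function are hypothetical names used for illustration only.

```python
from enum import Enum


class Access(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


def access_mode(failed_drives):
    """Map the number of concurrently failed drives to an access mode."""
    if failed_drives < 0:
        raise ValueError("failed_drives must be non-negative")
    # Healthy, or a first drive failure: read and write access continue.
    if failed_drives <= 1:
        return Access.READ_WRITE
    # Multiple concurrent failures: read-only access is retained.
    return Access.READ_ONLY


modes = [access_mode(n) for n in range(3)]
```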