A typical data processing system generally involves one or more storage units which are connected to a host computer either directly or through a control unit and a channel. The function of the storage units is to store data and other information (e.g., program code) which the host computer uses in performing particular data processing tasks.
Various types of storage units are used in current data processing systems. A typical system may include one or more large capacity tape units and/or disk drives connected to the system through respective control units for storing data. However, a problem exists if one of the storage units fails such that information contained in that unit is no longer available to the system. Generally, such a failure will shut down the entire computer system.
This problem has been overcome to a large extent by the use of Redundant Arrays of Inexpensive Disks (RAID) systems. RAID systems are widely known, and several different levels of RAID architectures exist, including RAID 1 through RAID 5, which are also widely known. A key feature of a RAID system is redundancy. The array contains a number of drives, and data is written to the drives in such a way that if one drive fails, the data that was written to the array can still be read. How this redundancy is accomplished depends upon the level of RAID architecture used, and is well known in the art. By way of example, a common architecture that is used is RAID 5. In a RAID 5 system, parity information is calculated from the data that is to be stored. This parity information is written along with the data to the drives in the array. If the drive on which the parity is written fails, the data is still available from the other drives. If a drive on which a portion of the data is written fails, the controller can read the remaining data and the parity information to determine what the missing data is, and recreate or reconstruct the missing data, thus making the data available.
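The parity mechanism described above can be illustrated with a minimal sketch. It assumes that parity is the byte-wise exclusive OR (XOR) of the data blocks in a stripe, which is the conventional choice for RAID 5. The function names and the sample stripe are hypothetical and serve only as illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity for one stripe: the XOR of all data blocks (assumed scheme)."""
    return xor_blocks(data_blocks)


def reconstruct_missing(surviving_blocks, parity):
    """Recreate the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical stripe of three data blocks, each on a different drive.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = compute_parity(stripe)

# Simulate losing the drive that held stripe[1] and rebuild its block.
rebuilt = reconstruct_missing([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, the same routine serves both to calculate parity and to recreate a missing block, which is why the loss of any single drive can be tolerated.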
As shown in FIG. 1, a typical RAID system 10 contains a number of separate disk drives 14, 16, 18, 20, 22, which are connected to at least one controller unit 26. It should be understood that the number of drives shown in FIG. 1 is for the purpose of discussion only, and that a RAID system may contain more or fewer disk drives than shown in FIG. 1. The controller unit 26 is connected to a host computer 30, which communicates with the controller unit 26 as if it were communicating with a single drive or other storage unit. Thus, the RAID looks like a single drive to the host computer 30. The controller unit 26 receives read and write commands, and performs the appropriate functions to read and write data to the disk drives 14, 16, 18, 20, 22, depending upon the level of RAID that is implemented in that system.
The disk drives 14, 16, 18, 20, 22 that are in a RAID are generally kept in an enclosure (not shown), which provides power and connection to the drives. The connection between the controller unit 26 and the disk drives 14, 16, 18, 20, 22 is generally a SCSI connection, although other types of connections may be used as well.
The drives 14, 16, 18, 20, 22 within an array each contain metadata. This metadata includes information regarding the RAID system, and also has information regarding the active drives in the array. This metadata is used by the controller to determine which drives are accessible, and therefore the state of the array. If one or more drives in the array have suffered a failure, the metadata contained on the remaining drives is updated to mark these failed drives as bad. If one drive is marked as bad in the metadata, the controller sets the condition of the array to critical, meaning that data can still be read and written, but that if any other drives fail the array will go off-line. To correct this problem, the failed drive must be replaced, and the array rebuilt. While the array is being rebuilt, the replacement drive remains marked as bad in the metadata and is only accessed if the requested data has already been rebuilt. The array is reconstructed by writing to the replacement drive the data that was present on the failed drive. If the failed drive contained data, this data is reconstructed using the remaining data and the parity information. If the failed drive contained parity information, this parity information is reconstructed using the data written on the other drives. Once the data and/or parity information for the replaced drive is reconstructed, the array again becomes fault tolerant.
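The rule that a replacement drive is accessed only for data that has already been rebuilt can be sketched as follows. This is a simplified model under stated assumptions: each drive is a list of equal-length blocks, one block per stripe; one block in each stripe holds XOR parity; and rebuild progress is tracked as a count of stripes completed from the start of the drive. The names `read_block` and `rebuilt_stripes` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_block(drives, target, stripe, replaced, rebuilt_stripes):
    """Read one block while drive index `replaced` is being rebuilt.

    The replacement drive is read directly only when the requested stripe
    lies below the rebuild progress mark; otherwise the block is recreated
    from the corresponding blocks on every other drive.
    """
    if target != replaced or stripe < rebuilt_stripes:
        return drives[target][stripe]
    peers = [d[stripe] for i, d in enumerate(drives) if i != replaced]
    return xor_blocks(peers)


# Hypothetical three-drive array with two stripes; drive 2 holds parity.
drives = [
    [b"\x0f", b"\x01"],
    [b"\xf0", b"\x00"],  # replacement drive: stripe 0 rebuilt, stripe 1 not yet
    [b"\xff", b"\x03"],
]

direct = read_block(drives, target=1, stripe=0, replaced=1, rebuilt_stripes=1)
recreated = read_block(drives, target=1, stripe=1, replaced=1, rebuilt_stripes=1)
```

In this sketch the second read never touches the stale block on the replacement drive; the block is recreated from the other two drives instead.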
If two or more drives are marked as bad, the controller marks the array as being off-line. This means that data cannot be read from or written to the array. In such a case, the array must be repaired and re-created. When an array is re-created, each drive receives new metadata which shows all drives as being available. The array must then be initialized and the data must be restored to the array from a backup copy, which is typically a time consuming process. This means that data from the RAID system will not be available to the host computer until the restoration from backup is complete. Additionally, any data written to the RAID system subsequent to the latest backup of the system prior to the failure will be lost. Thus, it would be advantageous to have a system which may allow a faster recovery, or which may allow a partial recovery, of data within the system.
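The way a controller might derive the array condition from the metadata can be sketched as shown below. The sketch assumes an architecture that tolerates exactly one failed drive, such as RAID 5, and represents the metadata as a simple mapping from drive identifier to a bad flag. The names `ArrayState` and `array_state` are hypothetical, and the drive identifiers merely reuse the reference numerals of FIG. 1.

```python
from enum import Enum


class ArrayState(Enum):
    FAULT_TOLERANT = "fault tolerant"
    CRITICAL = "critical"
    OFFLINE = "off-line"


def array_state(bad_flags):
    """Derive the array condition from per-drive 'bad' flags in the metadata."""
    bad_count = sum(1 for is_bad in bad_flags.values() if is_bad)
    if bad_count == 0:
        return ArrayState.FAULT_TOLERANT
    if bad_count == 1:
        # Data can still be read and written, but redundancy is lost.
        return ArrayState.CRITICAL
    # Two or more bad drives: data can be neither read nor written.
    return ArrayState.OFFLINE


# Hypothetical metadata for the five drives of FIG. 1.
metadata = {14: False, 16: False, 18: True, 20: False, 22: False}
state = array_state(metadata)
```

Note that the decision rests entirely on the flags recorded in the metadata, not on the physical condition of the drives.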
Occasionally, a failure may occur which is not a physical failure of the individual drives within the array, but a failure of a component or subsystem which connects the individual drives to the system. Such a failure can be defined as a transient failure. A transient failure may occur in several situations, including an enclosure problem, a controller problem, a SCSI interface problem, or a cabling problem, to name a few. The common element in these failures is that the disk drive itself has not failed or malfunctioned, but is marked as bad in the metadata of the remaining drives. Because the metadata of the remaining drives shows a bad drive, the array may be marked as critical, or as off-line, even though the drives marked as bad may in fact not have failed. In such a case, a user must take the same action as described above for an actual failure of a disk drive. This means reconstructing data, or restoring data from a backup copy. As described above, this can be a time-consuming process, resulting in an inability to access the array for a period of time and the possible loss of recently written data not contained in the backup copy. Thus, it would be advantageous to have a system and method for restoring an array after such a transient failure which does not require data reconstruction or restoration.
As mentioned above, transient failures may be caused by several events. One event that may cause a transient failure is an enclosure or housing problem. As mentioned above, the drives within the RAID system are typically contained in an enclosure. The enclosure contains a backplane which provides connections to the drives and provides power to the drives. A transient failure may result from a backplane problem within the enclosure. In such a case, the backplane may be damaged or have some type of short, resulting in one or more drives being marked as bad. Additionally, the enclosure may lose power during a write operation. In this case, some drives have new data while some may not have written the new data yet. If this happens, the drive(s) which have not written the new data may be marked as bad drives in the metadata.
Another event causing a transient failure may arise when there is a cabling problem. This can occur if the cables used to connect the system have a failure. For example, a cable may be inadvertently disconnected, damaged in such a way that information may no longer pass through the cable, or some type of short between conductors of the cable can occur. The drives affected by this cabling problem may be marked as bad drives in the metadata.
Another transient failure may occur if the controller unit has a failure. If such a failure occurs, one or more disk drives may be marked as bad. When the controller is repaired the drives may still be marked as bad in the metadata. This may create the necessity of recovering the array using the time consuming techniques described above.
Another transient failure may occur in the SCSI interface located in the enclosure. For example, the termination point of the SCSI connectors may have a failure, or the SCSI chip may fail. Again, in such an event, one or more drives may be marked as bad in the metadata. This may create the necessity of recovering the array as described above once the SCSI problem is corrected.
Accordingly, it would be advantageous to be able to recover a RAID system in the event of a transient failure without the need to recreate the array and restore the data to the array from backup. It would also be advantageous to be able to partially recover data from an array in which more than one drive has failed. It would also be advantageous to be able to recover from a single bad drive transient failure without having to reconstruct the data on that drive.