1. Field of the Invention
This invention relates to data storage systems, and more particularly to data storage systems including multiple disk drives.
2. Description of the Related Art
Data storage systems including multiple disk drives are well known. In general, the reliability of a system is dependent upon the failure rates of hardware system components. As each component has a limited operational lifetime, the reliability of a system requiring all components to remain operational necessarily decreases as the number of components increases. For example, the more disk drives a data storage system includes, the more likely it is that one of the disk drives will fail. When the data storage system requires all of the disk drives to remain operational, the reliability of the data storage system decreases as the number of disk drives increases. Typical redundant array of inexpensive/independent disks (RAID) data storage systems store redundant data to allow the systems to remain operational despite a disk drive failure. As typical RAID data storage systems do not require all of the disk drives to remain operational, system reliabilities are increased.
FIG. 1 is a diagram of one embodiment of a conventional RAID data storage system 10 including a disk array controller 12 coupled to five disk drives 14. In the embodiment of FIG. 1, disk array controller 12 implements a level 4 RAID including data striping and a dedicated parity disk drive storing parity information. Disk array controller 12 divides incoming data to be stored within disk drives 14 into separate data blocks, groups blocks bound for separate disk drives by "stripe," calculates an updated parity block for each updated stripe, and writes the updated data and parity blocks to disk drives 14 in stripe-by-stripe fashion. In FIG. 1, data blocks are denoted using the letter D and parity blocks are denoted using the letter P. Disk array controller 12 calculates a parity block for updated stripes as an exclusive-OR of the data within the four data blocks of the stripes, and writes the updated data blocks and the parity block to disk drives 14 in stripe-by-stripe fashion. For example, where write data includes data bound for data block D1 in FIG. 1, disk array controller 12 forms an updated data block D1, retrieves the contents of data blocks D2-D4, calculates an updated parity block P(D1-D4) using updated data block D1 and the contents of data blocks D2-D4, and writes updated data block D1 and updated parity block P(D1-D4) to disk drives 14. Parity may also be calculated by retrieving the old versions of D1 and P(D1-D4), computing the difference between the old and new versions of D1, recalculating P(D1-D4) from that difference, and writing the new versions of D1 and P(D1-D4).
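The two parity calculations just described (recomputing P(D1-D4) from all four data blocks, or updating it from the old and new versions of D1) may be illustrated by the following minimal sketch. The function names xor_blocks, parity_full, and parity_delta are hypothetical and chosen for illustration only; blocks are modeled as equal-length byte strings, which is an assumption of the sketch.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise exclusive-OR of one or more equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def parity_full(data_blocks: list[bytes]) -> bytes:
    """Parity as the exclusive-OR of every data block of the stripe."""
    return xor_blocks(*data_blocks)


def parity_delta(old_parity: bytes, old_block: bytes, new_block: bytes) -> bytes:
    """Parity updated from the difference between old and new data."""
    difference = xor_blocks(old_block, new_block)
    return xor_blocks(old_parity, difference)
```

Both approaches yield the same parity block. The first reads the unmodified data blocks of the stripe, while the second reads only the old data block and the old parity block.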
FIG. 2 is a diagram of a second embodiment of RAID storage system 10 wherein disk array controller 12 implements level 5 RAID including data striping and distributed parity information. As in the level 4 RAID embodiment of FIG. 1, disk array controller 12 divides incoming data into separate data blocks, groups the blocks bound for separate disk drives by stripe, calculates an updated parity block for each updated stripe, and writes the updated data and parity blocks to disk drives 14 in stripe-by-stripe fashion. However, instead of having a single dedicated parity drive as in FIG. 1, storage system 10 of FIG. 2 disperses the parity blocks among the five disk drives 14. This prevents the dedicated parity drive from becoming a performance bottleneck as typically occurs in the level 4 RAID embodiment of FIG. 1.
Important storage system parameters include performance, reliability, and data availability. Data striping is a software technique which improves the performance of a storage system with multiple disk drives by allowing simultaneous access to the disk drives. In addition to configuring a data storage system such that the system does not require all of multiple disk drives to remain operational, adding redundant spare hardware components may improve system reliability. For example, adding a spare disk drive to be substituted for a failed disk drive during system operation may increase system reliability.
Data availability is dependent upon system reliability, and is often increased by adding data redundancy. For example, the parity information generated in RAID levels 4 and 5 for each stripe may be used to recreate any one of the data blocks of a stripe should a disk drive storing a data block of the stripe fail. However, the generation and/or storage of parity information typically has a negative impact on system performance. As described above, the dedicated parity disk drive is typically a performance bottleneck in level 4 RAID. Dispersing the parity blocks in level 5 RAID eliminates the negative impact of the single parity disk drive on system performance. However, even in level 5 RAID, additional read accesses are usually required to calculate parity every time one or more data blocks are modified. Also, conventional level 4 and level 5 RAID systems cannot recover from multiple disk drive failures in the same parity row.
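The recreation of a single lost data block from parity may be illustrated by the following minimal sketch. The name recover_block is hypothetical, and blocks are again assumed to be equal-length byte strings. Because parity is the exclusive-OR of all data blocks of the stripe, combining the parity block with the surviving data blocks yields the missing block.

```python
def recover_block(surviving_blocks: list[bytes], parity: bytes) -> bytes:
    """Recreate the one missing block of a stripe from the rest plus parity."""
    recovered = bytearray(parity)
    for block in surviving_blocks:
        if len(block) != len(parity):
            raise ValueError("all blocks must have the same length")
        for i, value in enumerate(block):
            recovered[i] ^= value
    return bytes(recovered)
```

The sketch recovers exactly one block per stripe. If two blocks of the same stripe are lost, a single parity block does not carry enough information, which is the limitation noted above for conventional level 4 and level 5 RAID.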
The RAID techniques described above allow data storage system 10 to continue to operate despite the failure of a single disk drive. However, the likelihood of multiple disk drive failures also increases with the number of disk drives in the data storage system. It would thus be beneficial to have a storage system with multiple disk drives which allows continued system operation despite multiple disk drive failures and achieves a satisfactory balance between performance, reliability, and data availability.
A storage system is described including a two dimensional array of disk drives having multiple logical rows of drives and multiple logical columns of drives, and at least one drive array controller configured to store data in stripes (e.g., across the logical rows). A given drive array controller calculates and stores: row error correction data for each stripe of data across each one of the logical rows on one of the drives for each row, and column error correction data for column data grouped (i.e., striped) across each one of the logical columns on one of the drives for each column. The drive array controller may respond to a write transaction (i.e., a write operation) involving a particular row data stripe by calculating and storing row error correction data for the row data stripe before completing the write transaction. In this case, the drive array controller delays calculating and storing the column error correction data for each column data stripe modified by the write transaction until after completion of the write transaction.
The drive array controller may update column error correction data during idle times. Alternatively, the drive array controller may update column error correction data periodically. The drive array controller may only update column error correction data for column data stripes modified since the corresponding column error correction data was last updated. The drive array controller may maintain a table of column data stripes modified by a write transaction since the column error correction data was last updated. The row error correction data may include parity data for each row data stripe. Similarly, the column error correction data may include parity data for each column data stripe.
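One possible form of the table of modified column data stripes is sketched below. The class name DirtyColumnTable, its methods, and the (column, stripe) key are hypothetical and not taken from the description; the sketch only assumes that each column data stripe can be identified by a column index and a stripe index.

```python
from typing import Callable


class DirtyColumnTable:
    """Tracks column data stripes modified since their parity was last updated."""

    def __init__(self) -> None:
        self._dirty: set[tuple[int, int]] = set()

    def mark(self, column: int, stripe: int) -> None:
        """Record that a write transaction modified this column data stripe."""
        self._dirty.add((column, stripe))

    def is_current(self, column: int, stripe: int) -> bool:
        """True if the stored column parity reflects the stored data."""
        return (column, stripe) not in self._dirty

    def flush(self, update: Callable[[int, int], None]) -> int:
        """Update parity for every modified stripe, e.g. during idle time."""
        count = 0
        for key in sorted(self._dirty):
            update(*key)              # recalculate and store the column parity
            self._dirty.discard(key)  # clear the entry only after success
            count += 1
        return count
```

A controller might call flush during idle times or on a timer, passing a routine that recalculates and stores the parity for one column data stripe. Stripes not present in the table are skipped entirely.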
The drive array controller may be configured to recover a failed disk drive. If a given logical row of drives includes only a single failed disk drive, the drive array controller may use row error correction data to recover the failed disk drive. On the other hand, if the logical row of drives includes multiple failed disk drives, and no column stripes have been modified for the logical column including the failed disk drive since the last column error correction data update, the drive array controller may use column error correction data to recover the failed disk drive.
One embodiment of a data storage system includes multiple disk drives logically arranged to form a two-dimensional disk drive array, and a disk array controller coupled to each of the disk drives. The disk drive array has m+1 rows and n+1 columns where m ≥ 2 and n ≥ 2. Each disk drive includes q data storage regions where q ≥ 1. Each row of the disk drive array includes q row stripes, wherein each row stripe includes a different one of the q data storage regions of each of the n+1 disk drives in the same row. Similarly, each column of the disk drive array includes q column stripes, wherein each column stripe includes a different one of the q data storage regions of each of the m+1 disk drives in the same column.
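The membership of row stripes and column stripes may be illustrated by the following minimal sketch. It assumes, for simplicity, that stripe k uses data storage region k of every drive it spans; the description only requires that each stripe use a different one of the q regions of each drive. The function names and the (row, column, region) tuples are hypothetical.

```python
def row_stripe(row: int, k: int, n: int) -> list[tuple[int, int, int]]:
    """Regions forming row stripe k of a row: one per drive, n+1 drives."""
    return [(row, column, k) for column in range(n + 1)]


def column_stripe(column: int, k: int, m: int) -> list[tuple[int, int, int]]:
    """Regions forming column stripe k of a column: one per drive, m+1 drives."""
    return [(row, column, k) for row in range(m + 1)]
```

Under this assumed mapping, a given region of a given drive belongs to exactly one row stripe and exactly one column stripe.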
The disk array controller may be configured to group data to be stored in a given row stripe during a write operation, to calculate a row parity block for the given row stripe, and to calculate a column parity block for each column stripe modified during the write operation. In this case, the disk array controller is configured to either: (i) calculate the row parity block for the given row stripe before completing the write operation, and delay the column parity block calculations until after the completion of the write operation, or (ii) calculate the column parity blocks before completing the write operation, and delay the row parity block calculation until after the completion of the write operation.
The m+1 rows include a first m rows and a last row, and the n+1 columns include a first n columns and a last column. In one embodiment, the first m rows and the first n columns may include data disk drives for storing data. The last row includes n parity disk drives for storing column parity information, and the last column includes m parity disk drives for storing row parity information (e.g., as in level 3 or level 4 RAID). In this embodiment, a disk drive location in the last row and the last column may be empty. In other embodiments, parity information may be dispersed among the disk drives of the disk drive array (e.g., as in level 5 RAID), and an operational disk drive may be located in the disk drive location in the last row and the last column.
One of the data storage regions in each row stripe may be used to store a row parity block, and the remaining n data storage regions of each row stripe may be used to store data blocks. The row parity block may be dependent upon the n data blocks of the row stripe. For example, the disk array controller may calculate the row parity block for a given row stripe as an exclusive-OR of the contents of n data blocks of the row stripe.
Similarly, one of the data storage regions in each column stripe may be used to store a column parity block, and the remaining m data storage regions of each column stripe may be used to store data blocks. The column parity block may be dependent upon the m data blocks of the column stripe. For example, the disk array controller may calculate the column parity block for a given column stripe as an exclusive-OR of the contents of m data blocks of the column stripe.
The disk storage system may include a memory for storing information used to track the parity block calculations delayed by the disk array controller. The memory may be a non-volatile memory, and may reside within the disk array controller.
One method for storing data within the disk drive array described above includes grouping data to be stored in a given row stripe during a write operation. A row parity block is calculated for the given row stripe. The data and the associated row parity block are written to the corresponding disk drives of the array, thereby completing the write operation. Following completion of the write operation, a column parity block is calculated for each column stripe modified during the write operation. The column parity blocks are written to the disk drives storing parity blocks for the modified column stripes.
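The method above may be illustrated by the following minimal in-memory sketch. It assumes the dedicated parity arrangement (row parity in the last column, column parity in the last row, the corner location unused) and a single data storage region per drive (q = 1). The class GridArray and its members are hypothetical names used for illustration only.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise exclusive-OR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


class GridArray:
    """m x n data drives, plus a parity column (index n) and parity row (index m)."""

    def __init__(self, m: int, n: int, block_size: int) -> None:
        self.m, self.n, self.block_size = m, n, block_size
        zero = bytes(block_size)
        self.blocks = [[zero for _ in range(n + 1)] for _ in range(m + 1)]
        self.dirty_columns: set[int] = set()

    def write_row_stripe(self, row: int, data_blocks: list[bytes]) -> None:
        """Store data and row parity; the write completes when this returns."""
        if not 0 <= row < self.m or len(data_blocks) != self.n:
            raise ValueError("bad row index or block count")
        if any(len(b) != self.block_size for b in data_blocks):
            raise ValueError("bad block size")
        self.blocks[row][: self.n] = list(data_blocks)
        self.blocks[row][self.n] = xor_blocks(list(data_blocks))
        # Column parity is now stale for every column touched by the write.
        self.dirty_columns.update(range(self.n))

    def update_column_parity(self) -> None:
        """Deferred step: recalculate parity for modified columns only."""
        for col in sorted(self.dirty_columns):
            column_data = [self.blocks[r][col] for r in range(self.m)]
            self.blocks[self.m][col] = xor_blocks(column_data)
        self.dirty_columns.clear()
```

In this sketch a write returns as soon as the data and the row parity are stored, and update_column_parity may be invoked later, for example during idle time.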
Another method for storing data within the disk drive array described above includes grouping data to be stored in a given row stripe during a write operation. A column parity block is calculated for each column stripe modified during the write operation. The data and the column parity blocks are written to the corresponding disk drives of the array, thereby completing the write operation. Following completion of the write operation, a row parity block is calculated for the given row stripe. The row parity block is written to the disk drive storing the row parity block for the given row stripe.
One method for repairing the above described disk drive array following failure of one or more of the disk drives includes selecting a failed disk drive. A determination is then made as to whether or not data stored in the selected failed disk drive is recoverable using either: (i) data stored in other disk drives in the row in which the selected failed disk drive resides, or (ii) data stored in other disk drives in the column in which the selected failed disk drive resides. The following steps are performed if the data stored in the selected failed disk drive is recoverable: (i) recovering the data stored in the selected failed disk drive, (ii) writing the recovered data to a spare (or repaired or replacement) disk drive, and (iii) replacing the selected failed disk drive in the disk drive array with the spare (or repaired or replacement) disk drive. The above steps may be repeated until all failed disk drives have been repaired or replaced.
When the selected failed disk drive resides in a given row of the disk drive array, the data stored in the selected failed disk drive may be recoverable using data stored in other disk drives in the given row unless: (i) any of the other disk drives in the given row is not operational, or (ii) row parity data stored in any of the other disk drives in the given row is not current. Similarly, when the selected failed disk drive resides in a given column of the disk drive array, the data stored in the selected failed disk drive may be recoverable using data stored in other disk drives in the given column unless: (i) any of the other disk drives in the given column is not operational, or (ii) column parity data stored in any of the other disk drives in the given column is not current.
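The recoverability conditions and the repeated repair steps may be illustrated by the following minimal sketch, which only plans the order of recovery and does not move any data. It assumes that every drive location participates in both a row stripe and a column stripe (as when an operational drive occupies the last row and last column), and it tracks stale parity per whole row or column rather than per stripe. The function name plan_recovery and its parameters are hypothetical.

```python
def plan_recovery(failed, stale_rows=frozenset(), stale_columns=frozenset()):
    """Return (plan, unrecoverable) for a set of failed (row, column) drives.

    plan lists ((row, column), "row" or "column") in the order of recovery.
    """
    failed = set(failed)
    plan = []
    progress = True
    while failed and progress:
        progress = False
        for row, column in sorted(failed):
            # Row recovery needs current row parity and no other failure in the row.
            row_ok = row not in stale_rows and not any(
                r == row and c != column for r, c in failed
            )
            # Column recovery needs current column parity and no other failure
            # in the column.
            column_ok = column not in stale_columns and not any(
                c == column and r != row for r, c in failed
            )
            if row_ok or column_ok:
                plan.append(((row, column), "row" if row_ok else "column"))
                failed.remove((row, column))  # drive is replaced; rescan
                progress = True
                break
    return plan, failed
```

Recovering one drive may make another recoverable, which is why the scan restarts after each repair. Four failed drives at the corners of a rectangle cannot be recovered in this sketch, since each has another failure in both its row and its column.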