1. Field of the Invention
The present invention relates to data storage systems, and in particular, to a method and apparatus for supporting parity protected RAID in a clustered environment.
2. Description of the Related Art
The ability to manage massive amounts of information in large scale databases has become increasingly important in recent years. As businesses begin to rely more heavily on large scale database management systems, the consequences of hardware-related data losses intensify, and the security, reliability, and availability of those systems become paramount.
One way to increase the reliability and availability of data stored in large databases is to employ a technology known as a redundant array of inexpensive/independent disks, or RAID. This technique is described in the paper “A Case for Redundant Arrays of Inexpensive Disks (RAID),” by David A. Patterson, Garth Gibson, and Randy H. Katz, presented at the ACM SIGMOD Conference 1988, pages 109-116 (1988), which is herein incorporated by reference.
One or more RAID systems provide for fault tolerance using parity. Parity is computed from the data stored on the member drives: for example, a bit from drive 1 is XOR'd with the corresponding bit from drive 2, and the result is stored. Accordingly, parity is the XOR of the member data, and if data is lost (e.g., if a disk dies), the data can be recovered by XORing the parity with the surviving data.
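The XOR relationship described above can be illustrated with a minimal sketch. The helper name xor_blocks and the sample byte values are assumptions made for illustration only; they do not come from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized byte blocks (hypothetical helper)."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the bytes found at the same offset on every drive.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Illustrative contents of two member drives.
d0 = b"\x0f\xa0\x33"
d1 = b"\xf0\x0a\x55"

# Parity is the XOR of the member data.
parity = xor_blocks([d0, d1])

# If the drive holding d1 fails, XORing the survivor with the parity rebuilds it.
recovered_d1 = xor_blocks([d0, parity])
```

The same helper works for more than two member drives, because XOR is associative and each value cancels itself.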
At least five RAID “levels” have been defined. RAID-0 writes/interleaves data across the drives in the array, one segment at a time. This is also referred to as a “striped” configuration. Striping offers high I/O rates since read and write operations may be performed simultaneously on multiple drives. RAID-0 does not increase reliability, since it does not provide for additional redundancy.
RAID-1 writes data to two drives simultaneously. If one drive fails, data can still be retrieved from the other member of the RAID set. This technique is also known as “mirroring.” Mirroring is the most expensive RAID option, because it doubles the number of disks required, but it offers high reliability. Additionally, the cost ratio of fault tolerant storage is high.
In RAID-2, each bit (rather than bytes or groups of bytes) of a data word is interleaved/written across the drives in the array. Hamming error correcting code (ECC) is recorded on an ECC disk. When the data is read, the ECC verifies the correct data or corrects single disk errors.
In RAID-3, the data block is striped and interleaved/written across the disks in the array. Parity bits (also referred to as stripe parity) are generated when data is written to the disks, recorded on a separate dedicated parity disk, and checked on read operations. RAID-3 provides high read and write transfer rates, and a low ratio of parity disks, but can yield a transaction rate that does not exceed that of a single disk drive. The controller implementing a RAID-3 array may be implemented in hardware or software. Software RAID-3 controllers are difficult to implement, and hardware RAID-3 controllers are generally of medium complexity.
In RAID-4, each entire data block is written on a data disk. Parity for blocks of the same rank is generated for data writes and recorded on a separate dedicated parity disk. The parity data is checked on read operations. RAID-4 provides a high read data transaction rate, but can require a complex controller design. RAID-4 arrays generally have a low write transaction rate, and it can be difficult to rebuild data in the event of a disk failure.
In RAID-5, each data block is striped across the data disks in the array. Parity for blocks in the same rank is generated on write operations, and recorded in locations distributed among the disks in the array. Parity is checked during read operations. RAID-5 is similar to RAID-3, except that the parity data is spread across all drives in the array. RAID-5 offers high read transaction rates. Disk failures can compromise throughput, however, and RAID-5 controllers can be difficult to implement.
An additional RAID level is RAID-6, which is similar to RAID-5 except that either two different parity computations are performed, or the same computation is performed on overlapping subsets of the data. RAID-6 has the highest reliability, but is not widely used due to the difficulty of implementation and the double parity computations.
When shared parity protected RAID data (e.g., RAID-4 data, RAID-5 data, RAID-6 data, and their variations) is supported by multiple nodes (either with RAID adapters or software RAID) in a cluster, the RAID parity update may be incorrect when two or more nodes change different data items in the same RAID stripe at about the same time. In other words, when there is more than one master managing the storage array in a RAID, there can be problems. Data integrity may be lost if updates to parity are not synchronized properly. Cache memory holds the data between adapters and disks, including parity data. Further, cache memory is usually used in a RAID-5 controller to improve write performance by capturing all the data needed for a stride write.
For example, suppose D0 and D1 are two distinct data items in the same RAID-5 stripe and they are protected by the same parity P. Node N0 updates D0 and P; node N1 updates D1 and P. In order to update P, both N0 and N1 need to read and write P. The operations are referred to as R0/W0 and R1/W1 for the parity read/write operations performed by N0 and N1, respectively. Depending on the order in which R0, W0, R1, and W1 are performed, the result may be incorrect. For example, if the order is R0, R1, W0, W1, then N1 computes its new parity from the value it read before W0 took effect, and W1 overwrites the parity written by W0. The resulting parity does not include the D0 changes and hence is incorrect.
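The effect of the operation order can be replayed with a small simulation. This is a sketch under stated assumptions: each node is assumed to update parity by the common read-modify-write rule (new parity = parity read XOR old data XOR new data), which the text above does not spell out, and the function name run and the 4-bit sample values are hypothetical.

```python
def run(order):
    """Replay parity reads/writes by nodes N0 and N1 in the given order.

    Returns (final_parity, correct_parity). Assumes the read-modify-write
    rule: new_parity = parity_read ^ old_data ^ new_data.
    """
    old = {"D0": 0b0011, "D1": 0b0101}   # illustrative data before the updates
    new = {"D0": 0b1110, "D1": 0b1001}   # illustrative data after the updates
    parity = old["D0"] ^ old["D1"]       # parity P on disk before any update
    item_of = {"0": "D0", "1": "D1"}     # N0 updates D0, N1 updates D1
    parity_read = {}                     # parity value each node last read

    for op in order:
        kind, node = op[0], op[1]
        item = item_of[node]
        if kind == "R":
            parity_read[node] = parity
        else:  # "W": write parity derived from the value this node read earlier
            parity = parity_read[node] ^ old[item] ^ new[item]

    return parity, new["D0"] ^ new["D1"]


serial = run(["R0", "W0", "R1", "W1"])       # N0 finishes before N1 starts
interleaved = run(["R0", "R1", "W0", "W1"])  # the problematic order
```

In the serial order the final parity matches the XOR of the new data, while in the interleaved order it equals the old D0 XORed with the new D1, so the D0 changes are lost from the parity.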
One prior art solution used for ensuring and synchronizing parity update is for each participating node to maintain lock information on the other adapters/nodes. Such a solution is more fully described in co-pending U.S. patent application Ser. No. 09/127,472, entitled “Disk Arrays Using Non-Standard Sector Sizes”, by Jaishankar M. Menon, et al., Attorney Docket No. AM9-98-025, filed on Jul. 31, 1998, which application is hereby fully incorporated by reference herein. However, the lock information that each node must maintain grows with the number of nodes. Consequently, such a solution does not scale well to a larger number of nodes, and cannot handle the addition or deletion of nodes gracefully. Further, to maintain such lock information on participating nodes, the nodes must maintain the ability to communicate with each other. For example, in the prior art, SSA (serial storage architecture) RAID adapters comprise two (and only two) adapters communicating to a disk and with each other via a separate communications channel. However, in small computer system interface (SCSI) systems (a system commonly utilized in RAID systems), such inter-adapter communication does not exist in a non-proprietary way.
Accordingly, what is needed is a storage system and method for managing and updating the parity for RAID data in a SCSI system that scales well for a large number of nodes and handles node addition/deletion gracefully.