The present invention relates to performance enhancements for RAID storage systems and more particularly to a method and system for enhancing performance of non-full stripe writes by utilizing parity caching and parity logging.
A typical data processing system generally involves one or more storage units which are connected to a host computer either directly or through a control unit and a channel. The function of the storage units is to store data and other information (e.g., program code) which the host computer uses in performing particular data processing tasks.
Various types of storage units are used in current data processing systems. A typical system may include one or more large capacity tape units and/or disk drives connected to the system through respective control units for storing data. However, a problem exists if one of the storage units fails such that information contained in that unit is no longer available to the system. Generally, such a failure will shut down the entire computer system, which can create a problem for systems which require high availability.
This problem has been overcome to a large extent by the use of Redundant Arrays of Inexpensive Disks (RAID) systems. RAID systems are widely known, and several different levels of RAID architectures exist, including RAID 1 through RAID 5, which are also widely known. A key feature of a RAID system is redundancy, which is achieved through the storage of a data file over several disk drives and parity information stored on one or more drives. While the utilization of a RAID system provides redundancy, having to break a file into several parts, generate parity data, and store the file and parity on the array of disk drives can take a significant amount of time. It is therefore advantageous to have a RAID system that can process the data and generate parity information quickly to provide enhanced performance.
With reference to FIG. 1, a typical RAID system 10 contains a host computer 12, at least one controller 14, and a number of disk drives 16, 17, 18, 19. It should be understood that the number of drives shown in FIG. 1 is for the purpose of discussion only, and that a RAID system may contain more or fewer disk drives than shown in FIG. 1. Data is written to the drives 16, 17, 18, 19 in such a way that if one drive fails, the controller can recover the data written to the array. How this redundancy is accomplished depends upon the level of RAID architecture used, and is well known in the art.
The controller 14 is connected to the host computer 12, which communicates with the controller unit 14 as if it were communicating with a single drive or other storage unit. Thus, the RAID looks like a single drive to the host computer 12. The controller unit 14 receives read and write commands, and performs the appropriate functions required to read and write data to the disk drives 16, 17, 18, 19, depending upon the RAID level of the system. Typically, when the host computer 12 issues a write command, the controller unit 14 receives this command and stores the data to be written in a memory location, and sends a reply to the host computer 12 that the write is complete. Thus, even though the data may not have been written to the disk drives 16, 17, 18, 19, the host computer 12 believes that it has been written. The controller unit 14 then takes appropriate steps to process and store the data on the disk drives 16, 17, 18, 19. It is common for a controller unit 14 to receive several write commands before completing the first write command, in which case the other write commands are placed into a queue, with the data written for each command as the controller unit 14 works through the queue.
When storing data, generally, the controller 14 receives the data and breaks the data down into blocks which will be stored on the individual disk drives 16, 17, 18, 19. The blocks of data are then arranged to be stored on the drives 16, 17, 18, 19. In arranging the blocks of data, the controller 14 organizes the blocks into stripes and generates a parity block for each stripe. The data blocks are written across several drives, and the parity block for that stripe is written to one disk drive. In certain cases, the data may not be large enough to fill a complete stripe on the RAID system. This is known as a non-full stripe write. When the data sent to the controller occupies a full stripe, the data is simply written over existing data and the parity is written over the existing parity. Additionally, in certain cases, the controller may aggregate several small writes together to create a full stripe of data, which the controller treats as a full stripe of data for purposes of generating parity. However, in the case of a non-full stripe write, modifying the stripe of data requires several steps, and is a disk intensive activity.
The occurrence of non-full stripe writes is common in many applications, such as financial, reservation and retail systems, where relatively small data records are widely used and are accessed and modified at random. When an individual customer record needs to be revised, it may reside in a stripe of data that contains several other customer data records. In such a case, only a portion of the stripe needs to be modified, while the remainder of the stripe remains unaffected by the modification of the data.
With reference to FIG. 2, a flow chart representation for completing a non-full stripe write operation is shown. First, shown in block 20, the RAID controller receives new data to be written to the disk array. Next, shown in block 24, the controller reads the old data from the disk array. Then, the controller reads the old parity from the disk array, shown in block 28. The two reads of blocks 24 and 28 may be issued by the controller at the same time; however, they may finish at different times, depending upon several factors, such as other disk activity. Next, shown in block 32, new parity is created by XORing the old data and old parity with the new data. This results in new parity that is generated for the stripe of data. The controller then writes the new data to the disk array, shown in block 36. Finally, the controller writes the new parity to the disk array, shown in block 40. Like the read commands above, the two write commands may be issued by the controller at the same time, although they may finish at different times. When both writes complete, the parity and data are consistent on the disk. When the data has been written but the parity write has not completed, the parity is not consistent on the disk array. The same is true when the parity has been written but the data write has not completed.
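By way of illustration only, the parity update of blocks 24 through 40 may be sketched as follows. The sketch assumes a single stripe held in memory as a list of equally sized byte blocks; the names `xor_blocks` and `read_modify_write` are hypothetical and do not appear in the figures.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def read_modify_write(data_blocks, parity, index, new_data):
    """Non-full stripe write to one block; returns the new parity."""
    old_data = data_blocks[index]                            # block 24: read old data
    old_parity = parity                                      # block 28: read old parity
    new_parity = xor_blocks(old_data, old_parity, new_data)  # block 32: compute new parity
    data_blocks[index] = new_data                            # block 36: write new data
    return new_parity                                        # block 40: write new parity


# Hypothetical three-block stripe, two bytes per block.
blocks = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(*blocks)
parity = read_modify_write(blocks, parity, 1, b"\xaa\x55")
```

Because XORing the old data out of the old parity and the new data in yields the same result as recomputing parity over the whole stripe, only the modified block and the parity block need to be read.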
This non-full stripe write process is slow and opens the possibility for data corruption. Normal RAID input and output operations can handle any single failure, but data corruption can occur if there is a double failure. For example, a single failure may be a failure of one hard drive. In such a case, the controller will detect the hard drive failure and operate the RAID system in a critical mode, using the parity drive to generate missing data located on the failed hard drive. Likewise, a single failure may be a controller rebooting, which may occur when all of the data has not been written to the disk drives yet. In this case, the data will still be in the controller non-volatile memory, and when the controller initializes, it will detect the data in memory, compute parity and write the data and parity to ensure that the data and parity are valid on the disk drives. If the controller reboots after the data has been written, but prior to the parity being written, the parity for the stripe may be recomputed using the data in that stripe. A double failure is a situation where the controller reboots and then a hard drive fails while the parity and data are not consistent. In such a situation, customer data may be lost, because the controller will not be able to recompute the parity for a stripe due to a disk drive being unavailable. This results in what is termed the RAID 5 write hole.
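The effect of such a double failure may be illustrated with a small sketch. It assumes a three-block stripe with one-byte blocks and hypothetical values; the variable names are illustrative only.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


# Consistent stripe before the write.
d0, d1, d2 = b"\x11", b"\x22", b"\x44"
parity_on_disk = xor_blocks(d0, d1, d2)

# First failure: the new data for d1 reaches the disk, but the
# controller reboots before the matching parity is written.
d1_on_disk = b"\x99"

# Second failure: the drive holding d2 is lost. Regenerating d2 from
# the surviving blocks and the stale parity yields the wrong contents.
rebuilt_d2 = xor_blocks(d0, d1_on_disk, parity_on_disk)
lost_data = rebuilt_d2 != d2
```

Here `rebuilt_d2` differs from the original `d2` even though `d2` was never the target of the write, which is why the write hole can corrupt data unrelated to the interrupted operation.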
One method for closing the RAID 5 write hole, shown in FIG. 3, is to log all outstanding parity writes. In such a case, the controller receives data to be written to the disk array, shown in block 44. The controller next opens a write operation, shown in block 45. The controller then initiates the write operation, shown in block 47. The controller then reads the old data from the disk array, shown in block 48. Next, the controller reads the old parity from the disk array, shown in block 52. As above, the commands for reading the old data and old parity may be issued at the same time, although they may complete at different times. New parity is then computed by XORing the old data and old parity with the new data, shown in block 56. Next, the controller opens a parity log in non-volatile memory showing an outstanding data and parity write, shown in block 60. The parity log contains pointers to the location of the parity data and user data, the location in the drives where the data will be stored, the serial number for the drives being written, the serial number of the array the drives belong to, and an array offset. The controller then writes the new data and new parity to the disk array, shown in block 64. Once the data and parity writes are complete, the controller invalidates the parity log by marking the array offset with an invalid number and terminates the write operation, shown in block 68.
With reference to the dashed lines in FIG. 3, additional steps for data mirroring are now described. Once a write operation is opened, the new data is mirrored to the other controller, shown in block 46. The next mirroring operation occurs following the opening of the parity log, where the controller mirrors the parity and parity log to the other controller, shown in block 62. The next mirroring operation occurs following the data and parity write to the disk array, when a write operation termination command is mirrored to the other controller, shown in block 66.
If there is a double failure before the parity log is invalidated, the write commands may be reissued for the data and parity referenced by the parity log. Thus, even in the event of a double failure, the RAID 5 write hole is closed. In such a case, the parity log is stored in nonvolatile memory, and removed from nonvolatile memory once the parity log is invalidated. While this method is successful in closing the RAID 5 write hole by allowing recovery from a double failure, it is still a disk intensive activity. For each write operation, there are two read commands and two write commands, one each for the data and the parity. These read and write commands take significant time to complete, and consume channel bandwidth to and from the disk array.
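The parity logging method of FIG. 3 may be sketched as follows, by way of example only. The sketch models a single stripe as a dictionary and non-volatile memory as a list. The `ParityLog` fields, the `INVALID_OFFSET` value, and the `fail_before_parity` flag are assumptions made for illustration, and the log holds the data and parity directly rather than pointers to them.

```python
from dataclasses import dataclass

INVALID_OFFSET = -1  # assumed marker for an invalidated log


@dataclass
class ParityLog:
    array_offset: int
    block_index: int
    new_data: bytes
    new_parity: bytes


def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def logged_write(disk, nvram, offset, index, new_data, fail_before_parity=False):
    old_data = disk["data"][index]                      # block 48
    old_parity = disk["parity"]                         # block 52
    new_parity = xor_blocks(old_data, old_parity, new_data)  # block 56
    log = ParityLog(offset, index, new_data, new_parity)
    nvram.append(log)                                   # block 60: open parity log
    disk["data"][index] = new_data                      # block 64: data write
    if fail_before_parity:
        return                                          # simulated controller reboot
    disk["parity"] = new_parity                         # block 64: parity write
    log.array_offset = INVALID_OFFSET                   # block 68: invalidate log


def replay(disk, nvram):
    """Reissue the writes referenced by any log that is still valid."""
    for log in nvram:
        if log.array_offset != INVALID_OFFSET:
            disk["data"][log.block_index] = log.new_data
            disk["parity"] = log.new_parity
            log.array_offset = INVALID_OFFSET


disk = {"data": [b"\x01", b"\x02", b"\x04"], "parity": b"\x07"}
nvram = []
logged_write(disk, nvram, offset=0, index=1, new_data=b"\x20", fail_before_parity=True)
stale = disk["parity"] != xor_blocks(*disk["data"])
replay(disk, nvram)
```

After `replay`, the data and parity on the simulated disk are consistent again, because the log was opened before either write was issued.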
One method for enhancing the performance for non-full stripe writes is to cache the parity. Such a method, shown in FIG. 4, is useful when there are multiple non-full stripe writes to a single stripe. In such a case, the controller receives new data to be written to the disk array and stores the data in temporary nonvolatile memory, shown in block 72. The controller then opens a write operation, which contains a pointer to the location of the new data in temporary memory, shown in block 76. The write operation is placed into a queue, and remains there until any write operations that were before it in the queue are completed. Next, the write operation is initiated, shown in block 80. The controller then reads the old data from the disk array, shown in block 84. The controller then determines whether the old parity is cached, shown in block 88. This determination is made by checking the parity cache in nonvolatile memory for parity which corresponds to the stripe of data that was read in block 84. If no parity is present in the parity cache that corresponds with the stripe of data being modified, the controller reads the old parity from the disk array, shown in block 92. If parity is present in the parity cache that corresponds to the stripe of data being modified, the controller reads the old parity from the parity cache, shown in block 96. Once the old parity is read, the controller modifies the parity by XORing the old data and the old parity with the new data, shown in block 100. The controller then determines whether another write operation in the write operation queue is for the current stripe of data, shown in block 104. If there is not another write operation for the same stripe of data, the controller writes the new data and the new parity to the disk array, shown in block 108.
If there is another write operation for the current stripe of data, the controller caches the new parity, overwriting any parity previously cached for that stripe, shown in block 116. The controller then writes the new data to the disk array, shown in block 120. Finally, the controller terminates the write operation, shown in block 112.
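The decision flow of FIG. 4 may be sketched as follows, for illustration only. The sketch assumes the write operation queue is a list of `(stripe, block_index, new_data)` tuples and that the parity cache is a dictionary keyed by stripe; the function `run_queue` and the `stats` counters are hypothetical and are included only to count parity accesses to the disk array.

```python
def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def run_queue(disk, queue):
    parity_cache = {}
    stats = {"parity_reads": 0, "parity_writes": 0}
    for pos, (stripe, index, new_data) in enumerate(queue):
        old_data = disk[stripe]["data"][index]              # block 84
        if stripe in parity_cache:                          # block 88
            old_parity = parity_cache[stripe]               # block 96
        else:
            old_parity = disk[stripe]["parity"]             # block 92
            stats["parity_reads"] += 1
        new_parity = xor_blocks(old_data, old_parity, new_data)  # block 100
        # Block 104: is a later queued write aimed at the same stripe?
        more = any(s == stripe for s, _, _ in queue[pos + 1:])
        disk[stripe]["data"][index] = new_data              # block 108 or 120
        if more:
            parity_cache[stripe] = new_parity               # block 116
        else:
            disk[stripe]["parity"] = new_parity             # block 108
            stats["parity_writes"] += 1
            parity_cache.pop(stripe, None)
    return stats


disk = {0: {"data": [b"\x01", b"\x02", b"\x04"], "parity": b"\x07"}}
queue = [(0, 0, b"\x10"), (0, 1, b"\x20"), (0, 2, b"\x40")]
stats = run_queue(disk, queue)
```

With three queued writes to one stripe, the sketch reads parity from the simulated disk once and writes it once, while the data is written three times. Note that between the first and last write the parity on disk is stale and exists only in the cache.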
This method enhances the performance of non-full stripe writes by reducing the number of reads and writes of the parity for a particular stripe of data. For example, if there were three write operations for a stripe of data, the parity would be read from the disk array only once, and written to the disk array only once. This decreases the amount of time required to complete the data writes, and also consumes less channel bandwidth as compared to reading the parity from disk and writing the parity to disk for each write operation. However, when caching parity in such a manner, it becomes difficult to log the parity. This is because each parity log is associated with a write operation. When the write operation is terminated, the parity log is also closed. This can happen even though the new parity has not been written to the disk array. If there is a failure of a single hard drive, the controller can write any existing cached parity and the array will become critical. If there is a controller failure, such as a power loss which requires the controller to reboot, the controller can recover by checking the nonvolatile memory for any outstanding write operations and any parity logs. If the controller failure occurs during a write operation, a parity log may still be open. Thus, the controller can recover both the new data, and modify the parity for that stripe of data by reading the old data and old parity, XORing the new data with the old data and old parity, and write both the new data and the new parity.
However, if the controller failure occurs after a write operation is terminated, the parity log will be closed, and when the controller reboots it will not find a parity log. If the parity for that stripe of data was cached, the data and parity on the disks would not be consistent. If all of the disk drives are operating, after recovery the controller will resume normal read and write operations and will re-initiate the write operation which caused the parity to be cached, and normal caching operations would be resumed, thus the inconsistency of the parity and data would not be a problem. A problem can arise, however, if the system has a double failure, such as a controller failure and a hard drive failure. If such a double failure occurs, the system may not be able to correct itself as with a single failure. For example, if the double failure occurred, and the parity for a stripe of data was cached, the data and parity on the disks would not be consistent. Unlike the single failure of a controller situation above, when the write operation that caused the parity to be cached is initiated, the old data may not be able to be read from the disk array without using parity to regenerate a block of data in the stripe. Because the data and parity are inconsistent for that stripe, the data will not be valid. Thus, parity caching may reopen the RAID 5 write hole.
While the method of FIG. 4 is useful to increase performance of a RAID system, it has additional drawbacks. As is well understood, many RAID systems require very high availability. In order to increase availability of a RAID system, redundant controllers are commonly used in such a system. The performance of the system can also be enhanced by having two active controllers, known as an active-active system, which may employ one of a number of zoning techniques where one controller communicates primarily with a set of storage devices or host computers. In such a redundant system, if one controller fails, the remaining controller fails over and assumes control for all of the functions of the failed controller, and the RAID system may continue to operate without interruption. In order to make sure that data is not lost in such a situation, data is mirrored between the two controllers. This mirrored data includes the data that is to be written to the disk array, as well as any outstanding logs which may be present. With reference to the dashed lines of FIG. 4, additional steps for data mirroring are now described. Once a write operation is opened, shown in block 76, the data is mirrored to the other controller, shown in block 78. The next mirroring operation occurs following the computation of new parity in block 100. Once the new parity has been computed, the new parity is mirrored to the other controller, shown in block 102. The final mirroring step occurs following the write of new data to the disk array, and includes a command to the other controller to terminate the write operation, shown in block 110.
If one controller fails, the remaining controller fails over, using the mirrored data, and completes any outstanding write operations using the mirrored data. However, if parity caching is to be used to enhance performance of the RAID system, the prior art method as shown in FIG. 4 will not provide an adequate solution. If a controller fails after a write operation is terminated, the log associated with the write operation will also be terminated. Therefore, the other controller is not aware that there was a parity cache in the failed controller. Thus, if the controller containing the cached parity fails after a write operation has been terminated and parity is cached, the other controller will not recompute the parity for the associated stripe of data. As described above, if there is a single failure the system is able to recover, but in the case of a double failure the system will not be able to recover. In such instances, there may be an inconsistency between the data and the parity for that stripe. Accordingly, it would be advantageous to have a high availability redundant system in which both controllers are aware that parity is cached.
The present invention provides a system and method for parity caching while closing the RAID 5 write hole. The system includes an array of drives that stores data and parity including first data and second data. The system also includes a cache memory, and at least one controller which communicates with the array of drives and the cache memory. The controller controls the storage of the first data as part of a first write operation and controls the storing of first parity information related to the first data. During the first write operation, the controller also controls whether the first parity is written to disk or stored in the cache memory during the first write operation. The controller also controls providing a parity log related to the first data before writing the first data to disk. When the first parity is written to cache memory, the controller controls providing a parity log related to the second write operation before starting the second write operation and before terminating the first write operation. This parity log contains a pointer to the cached parity. The parity log related to the second data is provided before invalidating the parity log related to the first data. The controller controls invalidating the parity log related to the second data after completion of a data write associated with the second write operation. In one embodiment, the controller controls starting the second write operation with the reading of previously stored data, and after receiving the second data that is to be written to the array of drives. In another embodiment, the second write operation starts when a continuous, uninterrupted sequence of steps begins which results in the second data being written to the array of drives. In another embodiment, the second write operation starts at least before the modification of previously stored parity (e.g., first parity) begins to create second parity.
The method for parity caching includes storing first data on the array of drives as part of a first write operation. During the first write operation, the controller reads the existing data and parity from a first stripe of data on the array of drives. The controller then modifies the parity to provide first parity and opens a parity log associated with the first write operation. The first parity is received in non-volatile cache memory associated with the controller. The controller then determines if a second write operation, which follows the first write operation, is for the same stripe of data. If the second write operation is for the same stripe of data, the controller stores the first data on the array of disks and opens a parity log associated with the second write operation. The controller then invalidates the parity log associated with the first write operation and terminates the first write operation.
The controller then stores second data to the array of drives as part of a second write operation. During the second write operation, the controller reads the existing data from the array of disks, and the first parity from the cache memory. The controller computes the parity, and the first parity is replaced in the cache memory with second parity. The controller then checks to verify that a third write operation is for the same stripe of data, and if so caches the second parity and opens a parity log for the third write operation. If the third write operation is for a different stripe of data, the second write operation writes the second parity to the array of disks and invalidates the parity log associated with the second write operation.
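The ordering of parity log operations described above may be illustrated by the following sketch, which is a simplification under stated assumptions and not a definitive implementation. It models one stripe, keeps the logs in a dictionary standing in for non-volatile memory, and stores the parity value in the log in place of a pointer to the cached parity. The names `run_writes`, `recover`, and `stop_after` are hypothetical, and failures are simulated only at the point where a write operation has just terminated.

```python
from dataclasses import dataclass


@dataclass
class ParityLog:
    op_id: int
    parity: bytes      # stands in for a pointer to the cached parity
    valid: bool = True


def xor_blocks(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def run_writes(disk, writes, nvram, stop_after=None):
    """Apply (block_index, new_data) writes to one stripe, chaining parity logs."""
    cached = None
    for op_id, (index, new_data) in enumerate(writes):
        old_data = disk["data"][index]
        old_parity = cached if cached is not None else disk["parity"]
        new_parity = xor_blocks(old_data, old_parity, new_data)
        log = nvram.get(op_id)
        if log is None:
            log = ParityLog(op_id, new_parity)   # no earlier op opened one for us
            nvram[op_id] = log
        else:
            log.parity = new_parity              # log was opened by the previous op
        disk["data"][index] = new_data
        if op_id + 1 < len(writes):              # next write is for the same stripe
            cached = new_parity
            nvram[op_id + 1] = ParityLog(op_id + 1, cached)  # open next log first
            log.valid = False                    # only then invalidate this log
        else:
            disk["parity"] = new_parity
            log.valid = False
        if stop_after == op_id:
            return                               # simulated controller failure


def recover(disk, nvram):
    """Write the parity referenced by any still-valid log to the disk array."""
    for log in nvram.values():
        if log.valid:
            disk["parity"] = log.parity
            log.valid = False


disk = {"data": [b"\x01", b"\x02", b"\x04"], "parity": b"\x07"}
nvram = {}
writes = [(0, b"\x10"), (1, b"\x20"), (2, b"\x40")]
run_writes(disk, writes, nvram, stop_after=0)
stale_before = disk["parity"] != xor_blocks(*disk["data"])
recover(disk, nvram)
```

In this sketch the failure occurs after the first write operation has terminated with its parity cached, which is the case left uncovered by the method of FIG. 4. Because the log for the second write operation was opened before the first log was invalidated, a valid log referencing the cached parity is always present while the parity on disk is stale.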
In the event of a double failure associated with a controller and one of the drives in the array of drives, a recovery is performed using the parity log. In one embodiment, the recovery is by a second controller which accesses the parity log and determines the second parity using the parity log. In another embodiment, the second controller obtains the cached parity from a cache memory associated with the second controller.