The present invention relates to performance enhancements for redundant array of inexpensive disks (RAID) storage systems and more particularly to a method and system for enhancing performance of mirroring operations between controllers in an active-active controller pair.
A typical data processing system generally includes one or more storage units which are connected to a host computer either directly or through a control unit and a channel. The function of the storage units is to store data and other information (e.g., program code) which the host computer uses in performing particular data processing tasks.
Various types of storage units are used in current data processing systems. A typical system may include one or more large capacity tape units and/or disk drives connected to the system through respective control units for storing data. However, a problem exists if one of the storage units fails such that information contained in that unit is no longer available to the system. Generally, such a failure will shut down the entire computer system, which can create a problem for systems which require data storage systems to have high availability.
This problem has been overcome to a large extent by the use of Redundant Array of Inexpensive Disks (RAID) systems. RAID systems are widely known, and several different levels of RAID architectures exist, including RAID 1 through RAID 5, which are also widely known. A key feature of a RAID system is redundancy, which is achieved through the storage of a data file over several disk drives and parity information stored on one or more drives. If one disk drive fails, then the RAID system is generally able to reconstruct the data which was stored on the failed drive from the remaining drives in the array.
High availability is a key concern because in many applications users rely heavily on the data stored on the RAID system. In these types of applications, unavailability of data stored on the RAID system can result in significant loss of revenue and/or customer satisfaction. Employing a RAID system in such an application enhances availability of the stored data, since if a single disk drive fails, data may still be stored and retrieved from the system. In addition to the use of a RAID system, it is common to use redundant RAID controllers to further enhance the availability of such a storage system. In such a situation, two or more controllers are used in a RAID system, where if one of the controllers fails the remaining controller will assume operations for the failed controller. Such a platform enhances the availability of a RAID system because the system can sustain a failure of a controller and continue to operate. When using dual controllers, each controller may conduct independent read and write operations simultaneously, known as an active-active configuration. It can be advantageous in certain applications to use the active-active configuration, as the RAID system can support relatively high rates of data transfer between the disks and host, although employing an active-active configuration requires mirroring of data and parity between controllers to maintain redundancy, as will be described in detail below.
With reference to FIG. 1, a RAID system 100 having an active-active controller pair is described. The RAID system 100 is connected to a host computer 104 through a host channel 108. The RAID system 100 includes a first active controller 112, a second active controller 116, and a disk array 120. The disk array 120 is connected to the first active controller 112 by a first disk channel 124 and a second disk channel 128, and to the second active controller 116 by the first and second disk channels 124, 128. The disk array 120 contains a number of disk drives 132, 136, 140, 144, 148 that are used for data storage. Within the first active controller 112, there is a processor 152 and a nonvolatile random access memory (NVRAM) 156, and within the second active controller 116 there is a processor 160 and an NVRAM 164. It should be understood that the number of drives shown in FIG. 1 is for the purpose of discussion only, and that a RAID system 100 may contain more or fewer disk drives than shown in FIG. 1. Data is written to the disk array 120 in such a way that if one drive fails, data can continue to be read from and written to the disk array 120. How this redundancy is accomplished depends upon the level of RAID architecture used, and is well known in the art.
When storing data, generally, a controller receives the data and breaks the data down into blocks which will be stored on the individual disk drives 132, 136, 140, 144, 148. The blocks of data are then arranged to be stored on the drives 132, 136, 140, 144, 148. In arranging the blocks of data, the controller organizes the blocks into stripes and generates a parity block for each stripe. The data is written across several drives, and the parity for that stripe is written to one disk drive. In certain cases, the data may not be large enough to fill a complete stripe on the RAID system. This is known as a non-full stripe write. When the data sent to the controller occupies a full stripe, the data is simply written over existing data and the parity is written over the existing parity. Additionally, in certain cases, the controller may aggregate several small writes together to create a full stripe of data, which the controller treats as a full stripe of data for purposes of generating parity. However, in the case of a non-full stripe write, modifying the stripe of data requires several steps, and is a disk intensive activity.
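The full stripe case described above can be illustrated with a minimal Python sketch. It assumes equal-sized blocks, one parity block per stripe, and byte-wise XOR parity; the names `xor_blocks` and `build_full_stripe` are hypothetical and are used here only for illustration.

```python
from functools import reduce
from typing import List, Tuple


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Return the byte-wise XOR of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def build_full_stripe(
    data: bytes, block_size: int, data_drives: int
) -> Tuple[List[bytes], bytes]:
    """Split data that exactly fills one stripe into blocks and compute parity."""
    if len(data) != block_size * data_drives:
        raise ValueError("data does not occupy a full stripe")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return blocks, xor_blocks(blocks)


# Illustrative values only: four data drives with 2-byte blocks.
blocks, parity = build_full_stripe(bytes(range(8)), block_size=2, data_drives=4)
```

Because the parity is the XOR of all data blocks in the stripe, XORing the data blocks together with the parity block yields all zeros, which is the property that later allows a missing block to be rebuilt.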
The occurrence of non-full stripe writes is common in many applications, such as financial, reservation and retail systems, where relatively small data records are widely used and are accessed and modified at random. When an individual customer record needs to be revised, it may reside in a stripe of data that contains several other customer data records. In such a case, only a portion of the stripe needs to be modified, while the remainder of the stripe remains unaffected by the modification of the data.
As mentioned above, when using an active-active controller pair in a RAID system, in order to maintain redundancy, data and parity must be mirrored between the controllers in the active-active system. In such a system, when the host computer 104 sends data to be written to the disk array 120, the data is typically sent to either the first active controller 112, or the second active controller 116. Where the data is sent depends upon the location in the disk array 120 the data will be written. In active-active systems, typically one controller is zoned to a specific array of drives, or a specific area within an array of drives. Thus, if data is to be written to the array that the first active controller 112 is zoned to, the data is sent to the first active controller 112. Likewise, if the data is to be written to an array that the second active controller 116 is zoned to, the data is sent to the second active controller 116. In order to maintain redundancy between the two controllers 112, 116, the data sent to the first active controller 112 must be copied onto the second active controller 116. Likewise, the data sent to the second active controller 116 must be copied onto the first active controller 112. The data is copied between controllers because, for example, if the first active controller 112 suffers a failure, the second active controller 116 can then use the copy of the data to complete any data writes which were outstanding on the first active controller 112 when it failed. This process of copying data, as well as parity, is known as mirroring.
Mirroring in such a system is typically necessary because when the host 104 sends data to be written, the controller that receives the data stores the data in a memory location and sends a reply to the host 104 that the write is complete. Thus, even though the data may not have been written to the disk array 120, the host 104 is notified that it has been written. This is known as write-back caching. If the controller that received the data subsequently suffers a failure prior to writing the data to the disk array 120, the data can be lost. However, if the controller mirrors the data prior to sending the host 104 a reply that the data has been written, a failure of the controller can still be recovered without loss of the data. The recovery from the failure, as will be described below, is performed by the surviving controller, which takes control of the operations of the failed controller. This process of recovering from a controller failure is known as “failing over,” and the surviving controller is known to be in a “failed over” mode when performing operations of the failed controller.
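The ordering that makes write-back caching safe (cache, then mirror, then acknowledge) can be sketched as follows. This is a simplified model under stated assumptions: the `Controller` class, its dictionaries standing in for NVRAM, and the `events` list are all hypothetical and exist only to make the ordering visible.

```python
from typing import Dict, List, Optional


class Controller:
    """Toy model of one controller in an active-active pair (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.nvram: Dict[int, bytes] = {}    # this controller's pending writes
        self.mirror: Dict[int, bytes] = {}   # copies of the partner's pending writes
        self.partner: Optional["Controller"] = None
        self.events: List[str] = []

    def host_write(self, lba: int, data: bytes) -> str:
        # 1. Store the data locally.
        self.nvram[lba] = data
        self.events.append("cached")
        # 2. Mirror to the partner BEFORE replying to the host.
        if self.partner is not None:
            self.partner.mirror[lba] = data
            self.events.append("mirrored")
        # 3. Only now tell the host that the write is complete.
        self.events.append("acknowledged")
        return "write complete"

    def flush(self, disk: Dict[int, bytes]) -> None:
        # Later, write cached data to the array and release the mirror copies.
        for lba, data in sorted(self.nvram.items()):
            disk[lba] = data
            if self.partner is not None:
                self.partner.mirror.pop(lba, None)
        self.nvram.clear()


first, second = Controller("first"), Controller("second")
first.partner, second.partner = second, first
status = first.host_write(7, b"record")
```

If `first` were lost after `host_write` returned but before `flush` ran, the copy held in `second.mirror` is what would allow the acknowledged write to be completed.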
With reference now to FIG. 2, a flow chart representation of a data write is described. Initially, indicated by block 200, the first active controller 112 receives new data to be written to the disk array 120 and stores the data in NVRAM 156. The first active controller 112 next initiates a write operation, as noted by block 204. The first active controller 112 then takes steps to mirror the new data to the second active controller 116: the data is stored in the NVRAM 164 of the second active controller 116, and a mirror write operation is initiated within the second active controller 116, as indicated by block 208. The mirror write operation indicates that there is an outstanding write operation on the first active controller 112, which can be used to recover the system in the event of a failure of the first active controller 112, and will be discussed in more detail below. Once the new data has been mirrored to the second active controller 116, the first active controller 112 sends the host computer 104 an acknowledgment that the write of the new data is complete, according to block 212. Next at block 216, the first active controller 112 processes the data into blocks for storage on the disk array 120 and determines if the blocks of new data will occupy a full stripe in the disk array 120.
Referring to block 220, if the new data will not occupy a full stripe in the disk array 120, the first active controller 112 reads the old data and old parity from the disk array 120. The first active controller 112 then computes new parity by XORing the old data and old parity with the new data, and stores the new parity in its NVRAM 156, as indicated by block 224. Next, a parity log is opened on the first active controller 112, as noted by block 228. The parity log is also stored in NVRAM 156, and contains pointers to the memory storage location of the parity data and user data, the location in the drives where the data will be stored, the serial number for the drives being written, the serial number of the array the drives belong to, and an array offset. Next in block 232, the first active controller 112 mirrors a parity log message to the second active controller 116. The parity log message contains the new parity, and also includes the parity log, both of which are stored in the NVRAM 164 on the second active controller 116. Accordingly, by mirroring the parity, in the event of a failure of the first active controller 112, the second active controller 116 is able to complete the write of the new data and new parity, as will be described in more detail below. With reference to block 236, the first active controller 112 next issues write commands to write the new data and new parity to the disk array 120. Once the first active controller 112 receives acknowledgment from the disk array 120 that the data and parity writes are complete, the first active controller 112 mirrors a command to the second active controller 116 to close the mirror write operation, as indicated by block 240. Next at block 244, the first active controller 112 invalidates the parity log by marking the array offset with an invalid number. The first active controller 112 then terminates the write operation, and the data write is complete, as noted by block 248.
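The parity computation of block 224 can be sketched in Python as shown below. The sketch assumes byte-wise XOR over equal-length blocks; the function names `xor_bytes` and `updated_parity` and the sample byte values are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New parity for a non-full stripe write.

    XORing the old data out of the old parity and the new data in gives the
    same result as recomputing parity over the whole stripe, without reading
    the blocks that are not being modified.
    """
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


# Illustrative stripe with three data blocks; only d1 is rewritten.
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
old_parity = xor_bytes(xor_bytes(d0, d1), d2)
new_d1 = b"\xaa\xbb"
new_parity = updated_parity(d1, old_parity, new_d1)
```

The result matches the parity that would be obtained by XORing `d0`, `new_d1`, and `d2` directly, which is why only the old data and old parity need to be read from the array.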
If the first active controller 112 determines in block 216 that the new data will occupy a full stripe, the first active controller 112 then computes new parity by XORing all of the data blocks, as noted by block 252. The first active controller 112 then writes the data and parity to the appropriate stripe in the disk array 120, in accordance with block 256. The first active controller 112 then terminates the write operation, and the data write is complete, as noted by block 248.
With reference now to FIG. 3, recovery from a failure of a disk drive in an active-active controller pair is described. Initially, a hard disk drive fails, as indicated by block 300. When this occurs, the controllers recognize that a disk drive has failed, and begin operation in critical mode, as noted by block 304. When operating in critical mode, data continues to be written and read from the disk array, and the controllers 112, 116 compensate for the failed drive using the remaining drives and the parity. For example, if disk drive 136 fails, and the first active controller 112 needs to read data from the disk array 120, the first active controller 112 determines whether the failed drive 136 contained parity or data. If the failed disk drive 136 contained data, the first active controller 112 would read the data and parity from the remaining drives in the disk array 120, and compute the data for the failed drive 136 by XORing the remaining data with the parity. If the failed disk drive 136 contained parity, the first active controller 112 would simply read the data from the remaining drives.
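A critical mode read can be sketched as follows, assuming a single parity block per stripe and byte-wise XOR parity. The function `read_block` and the convention of marking the failed drive's block as `None` are hypothetical choices made for this illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def read_block(stripe: List[Optional[bytes]], index: int) -> bytes:
    """Read one block of a stripe while at most one drive has failed.

    `stripe` holds the data blocks and the parity block of one stripe;
    the entry for the failed drive is None.
    """
    if sum(block is None for block in stripe) > 1:
        raise ValueError("more than one block is missing; cannot reconstruct")
    wanted = stripe[index]
    if wanted is not None:
        # The requested block is on a working drive: read it directly.
        return wanted
    # The requested block was on the failed drive: XOR everything that remains.
    return xor_blocks([block for block in stripe if block is not None])


data = [b"\x01\x02", b"\x0f\x10", b"\xa0\x0b"]
parity = xor_blocks(data)

# Case 1: the failed drive held data block 1.
degraded_data = [data[0], None, data[2], parity]
# Case 2: the failed drive held the parity block.
degraded_parity = [data[0], data[1], data[2], None]
```

In the first case `read_block(degraded_data, 1)` rebuilds the lost block from the surviving data and parity; in the second case every data block is still read directly.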
With reference now to FIG. 4, recovery from a controller failure in an active-active controller pair is now described. Initially, the first active controller 112 suffers a failure, as noted by block 400. The second active controller 116 recognizes this failure, and takes control of the operations of the first active controller 112, as indicated by block 404. The second active controller 116 then checks for the existence of any outstanding parity logs, the presence of which indicates that the first active controller 112 had data writes outstanding, according to block 408. If no data writes were outstanding on the first active controller 112, the second active controller 116 continues operations, according to block 412.
If there are parity logs outstanding, the second active controller 116 then at block 416 issues a write command to write the new data and new parity associated with the parity log to the disk array 120. Once the data and parity writes have completed, the second active controller 116 invalidates the parity log, as noted by block 420. Once all of the outstanding write operations are complete, operations are continued using the second active controller 116, as indicated by block 424.
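The failover sequence of FIG. 4 can be sketched as a replay of the mirrored parity logs. This is a simplified model: the `ParityLog` fields, the use of `-1` as the invalid array offset, and the dictionary standing in for the disk array are assumptions made for illustration, and the mirrored data and parity are held inline rather than referenced by pointers.

```python
from dataclasses import dataclass
from typing import Dict, List

INVALID_OFFSET = -1  # assumption: any offset that cannot occur in a real array


@dataclass
class ParityLog:
    array_offset: int   # marked invalid once the write has completed
    data_location: int  # where the new data belongs on the array
    parity_location: int  # where the new parity belongs on the array
    new_data: bytes
    new_parity: bytes


def fail_over(mirrored_logs: List[ParityLog], disk: Dict[int, bytes]) -> int:
    """Complete the failed controller's outstanding writes; return how many."""
    replayed = 0
    for log in mirrored_logs:
        if log.array_offset == INVALID_OFFSET:
            continue  # this write had already completed before the failure
        # Block 416: write the new data and new parity to the array.
        disk[log.data_location] = log.new_data
        disk[log.parity_location] = log.new_parity
        # Block 420: invalidate the parity log.
        log.array_offset = INVALID_OFFSET
        replayed += 1
    return replayed


logs = [
    ParityLog(0, 10, 99, b"new", b"par"),            # outstanding write
    ParityLog(INVALID_OFFSET, 11, 98, b"x", b"y"),   # already completed
]
disk = {10: b"old", 99: b"old"}
count = fail_over(logs, disk)
```

Replaying is idempotent in this model: a second call to `fail_over` finds only invalidated logs and writes nothing.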
With reference now to FIG. 5, recovery from a controller failure and a disk failure in an active-active controller pair is now described. Initially, at block 500 the first active controller 112 and one disk drive suffer a failure. The second active controller 116 recognizes the failure of the first active controller 112, and takes control of the operations that were performed by the first active controller 112, as noted by block 504. When taking control of the operations, the second active controller 116 first determines whether any parity logs are outstanding on the first active controller 112, as indicated by block 508. If no parity logs were outstanding on the first active controller 112 at the time of the failure, the second active controller 116 continues operation in critical mode, according to block 512. If parity logs were outstanding, the second active controller 116 then writes the parity and data associated with the parity log to the disk array, ignoring any writes to the failed drive, as noted by block 516.
Once the data and parity writes have completed, the second active controller 116 invalidates the parity log, in accordance with block 520. Once all of the outstanding write operations with outstanding parity logs are complete, operations are continued in critical mode using the second active controller 116, as indicated by block 524.
As can be noted from the above discussion, mirroring parity between controllers in an active-active controller pair is required in order to provide redundancy to the RAID system 100. However, the parity is mirrored between controllers using the first disk channel 124 and the second disk channel 128. Thus, mirroring the full parity consumes bandwidth from these channels, and can reduce the performance of the system. This bandwidth consumption is magnified when the data writes are for small amounts of data. For example, it is common for a stripe of data to occupy a 64 Kbyte data block on each data disk in a disk array 120, and have a 64 Kbyte parity block on the parity drive. If the host computer has a 100 Kbyte data file to be written to a stripe of data, the data will be written to at least two of the drives within the disk array 120. When writing the data, the controller writing the data, for purposes of discussion the first active controller 112, will break the data into appropriate sections, called chunks, to be stored on the individual disk drives. When writing the data, the first active controller 112 writes one chunk at a time, and computes new parity for the stripe of data for each chunk. In this example, the first active controller would compute new parity for the first chunk of data, mirror the new parity to the second active controller 116, and write the new data and parity to the disk array 120. The first active controller 112 would then perform the same tasks for the second chunk of data to complete the data write operation. Thus, for a 100 Kbyte data write, the parity block is mirrored two times, giving 128 Kbytes of mirrored parity from the first active controller 112 to the second active controller 116. The amount of mirrored data grows if, as is common, the data write requires data to be written to more than two drives in the disk array. For example, if the data write is written to three drives, 192 Kbytes of parity are mirrored for the 100 Kbyte data write.
Additionally, as can be noted from the above discussion, the full parity is only required to recover from a double failure, which is a relatively infrequent event. Thus, it would be advantageous to have a method and apparatus which reduces the amount of parity that is mirrored between controllers in an active-active controller pair while still allowing for the recovery from a single failure.
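The mirrored parity figures in the example above follow from simple arithmetic, sketched here in Python. The function names are hypothetical; the 64 Kbyte chunk and parity block sizes are taken from the example, and the number of chunks actually touched depends on how the write is aligned within the stripe, so it is passed in as a parameter.

```python
import math


def min_chunks_touched(write_kbytes: int, chunk_kbytes: int = 64) -> int:
    """Smallest number of chunks a write of this size can span."""
    return math.ceil(write_kbytes / chunk_kbytes)


def mirrored_parity_kbytes(chunks_touched: int, parity_block_kbytes: int = 64) -> int:
    """Parity mirrored when the full parity block is sent once per chunk written."""
    return chunks_touched * parity_block_kbytes


best_case = mirrored_parity_kbytes(min_chunks_touched(100))  # write lands on two drives
three_drives = mirrored_parity_kbytes(3)                     # write lands on three drives
```

For the 100 Kbyte write this gives 128 Kbytes of mirrored parity in the two-drive case and 192 Kbytes in the three-drive case, so the mirrored parity exceeds the size of the host data itself.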
The present invention provides a system and method for enhancing performance related to mirroring parity. The system includes an array of drives that stores data and parity including at least first parity associated with a first write operation. The system also includes a first controller subsystem in communication with the array of drives. The first controller subsystem includes a first controller and a memory that stores logical block address (LBA) information associated with the first write operation. The system includes a second controller subsystem in communication with the array of drives. The second controller subsystem includes a second controller involved with the first write operation including storing the first parity with the array of drives. The first LBA information includes the most recent logical block address to which data is being written using the second controller. The first controller subsystem receives a parity log message that includes the first LBA information. The first LBA information is included in the parity log message when all drives in the array of drives are usable to store data in association with the first write operation, and the parity log message includes the parity when less than all of the drives in the array of drives are usable to store data in association with the first write operation.
If the second controller fails after the first LBA information is stored with the memory and before the first parity is stored on the array of drives, the first controller subsystem uses the first LBA information to provide the first parity in association with the first write operation. If the second controller fails and less than all of the drives in the array of drives are usable to store data, and the first LBA information is stored in memory, then the LBA information is used to mark the data associated with the first write operation as missing. The first controller is used to provide an indication that the second controller has failed when less than all of the drives in the array of drives are usable. In one embodiment, the LBA information is different from a parity log and different from the first parity, with each thereof associated with the first write operation.
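One way to picture the parity log message selection and its use after a controller failure is the sketch below. It is a loose illustration, not the claimed implementation: the message layout, the function names, the inclusion of the LBA in the degraded-mode message, and the idea that the surviving controller regenerates parity by XORing the data blocks found at the logged LBA are all assumptions made for this example.

```python
from functools import reduce
from typing import Callable, Dict, List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def build_parity_log_message(lba: int, parity: bytes, all_drives_usable: bool) -> Dict:
    """Choose what the writing controller sends to its partner."""
    if all_drives_usable:
        # Normal operation: send only the small LBA record, not the parity block.
        return {"lba": lba}
    # Degraded operation: the parity itself is mirrored.
    # Assumption: the LBA is kept in the message so the parity can be located.
    return {"lba": lba, "parity": parity}


def provide_parity(message: Dict, read_stripe_data: Callable[[int], List[bytes]]) -> bytes:
    """Parity the surviving controller would write for the interrupted operation."""
    if "parity" in message:
        return message["parity"]
    # Assumption: with every drive usable, parity can be regenerated from the
    # data blocks of the stripe identified by the logged LBA.
    return xor_blocks(read_stripe_data(message["lba"]))


# Hypothetical array contents: LBA 40 identifies a stripe of three data blocks.
stripes = {40: [b"\x01", b"\x02", b"\x04"]}
normal_msg = build_parity_log_message(40, b"\x07", all_drives_usable=True)
degraded_msg = build_parity_log_message(40, b"\x07", all_drives_usable=False)
```

Under these assumptions the normal-mode message carries a few bytes of address information in place of a full parity block, which is the source of the bandwidth saving on the disk channels.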
The method for enhancing performance related to mirroring parity includes controlling parity-related information being stored in the memory of the first controller subsystem, with the parity-related information being associated with a first write operation that is being conducted by the second controller subsystem. The first write operation is conducted using the second controller subsystem and includes storing parity on the array of drives, with the parity being different than the parity-related information. In one embodiment, the parity related information includes information related to the LBA to which data is being written using the second controller subsystem. In this embodiment, the LBA information is the most recent LBA to which data is being written using the second controller subsystem. In one embodiment, the parity-related information is part of a parity log message provided to the first controller subsystem. When less than all drives in the array are usable to store data, parity is stored in the memory of the first controller subsystem. In one embodiment, the parity-related information is less in amount and is stored in less time than the parity. The parity related information is different from a parity log that is related to an identifier associated with the first write operation. In another embodiment, the second controller subsystem includes a second controller, and when the second controller has failed after the parity-related information is stored in the memory and before the parity is stored with the array of drives, the parity related information is used by the first controller of the first controller subsystem to provide parity for the first write operation. In another embodiment, a second write operation is performed using the first controller subsystem, including storing parity related to the second write operation, and the parity-related information is not controlled when one drive of the array of drives has failed.