A typical data processing system generally includes one or more storage units or devices which are connected to a host computer either directly or through a control unit and a channel. The function of the storage units is to store user data and other information (e.g., program code) which the host computer uses in performing particular data processing tasks.
Various types of storage units are used in current data processing systems. A typical system may include one or more large capacity tape units and/or disk drives connected to the system through respective control units for storing data. However, a problem exists if one of the storage units fails such that information contained in that unit is no longer available to the system. Generally, such a failure will shut down the entire computer system, which can create a problem for systems which require data storage systems to have high availability.
This problem has been overcome to a large extent by the use of Redundant Array of Inexpensive Disks (RAID) systems. RAID systems are widely known, and several different levels of RAID architectures exist, including RAID 1 through RAID 5, which are also widely known. A key feature of a RAID system is redundancy, which is achieved through the storage of a data file over several disk drives and parity information stored on one or more drives. If one disk drive fails, then the RAID system is able to reconstruct the data which was stored on the failed drive from the remaining drives in the array.
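By way of illustration only, the reconstruction described above can be sketched as follows, assuming the byte-wise exclusive-OR parity used by parity-based levels such as RAID 5. The function names xor_blocks and reconstruct and the block contents are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the surviving blocks and the parity."""
    return xor_blocks(surviving_blocks + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks held on three data drives (made up)
parity = xor_blocks(data)             # block held on the parity drive

# Assume the drive holding data[1] has failed; rebuild it from the others.
rebuilt = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the parity with every surviving block leaves exactly the contents of the missing block.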
High availability is a key concern because in many applications users rely heavily on the user data stored on the RAID system. In these types of applications, unavailability of data stored on the RAID system can result in significant loss of revenue and/or customer satisfaction. Employing a RAID system in such an application enhances availability of the stored user data, since if a single disk drive fails, user data may still be stored to and retrieved from the system. In addition to the use of a RAID system, it is common to use redundant RAID controllers to further enhance the availability of such a storage system. In such a situation, two or more controllers are used in a RAID system, where if one of the controllers fails the remaining controller will assume operations for the failed controller. Such a platform enhances the availability of a RAID system because the system can sustain a failure of a controller and continue to operate. When using multiple controllers, each controller may conduct independent read and write operations simultaneously, which is known as an active-active configuration. It can be advantageous in many applications to use the active-active configuration, as such a RAID system can support higher rates of data transfer between the disks and host. However, employing an active-active configuration requires mirroring of user data and parity between controllers to maintain redundancy, such that if one controller fails, the remaining controller can continue operations of the system without any user data loss.
With reference to FIG. 1, a RAID system 10 having an active-active controller pair is described. The RAID system 10 is connected to a host computer 14 through a host channel 18. The RAID system 10 includes a first active controller 22, a second active controller 26, and a disk array 30. The disk array 30 is connected to the first active controller 22 and the second active controller 26 by a disk channel 34. The disk array 30 contains a number of disk drives 42, 46, 50, 54, 58 that are used for data storage. Within the first active controller 22, there is a processor 62 and a nonvolatile random access memory (NVRAM) 66 which has a first region 70 and a second region 74. Within the second active controller 26 there is a processor 78 and a NVRAM 82 which has a first region 86 and a second region 90. It should be understood that the number of drives shown in FIG. 1 is for the purpose of discussion only, and that a RAID system 10 may contain more or fewer disk drives than shown in FIG. 1. User data is written to the disk array 30 in such a way that if one drive fails, user data can continue to be read from and written to the disk array 30. How this redundancy is accomplished depends upon the level of RAID architecture used, and is well known in the art.
When storing user data, generally, a controller receives the user data and breaks the user data down into chunks which will be stored on the individual disk drives 42, 46, 50, 54, 58. A chunk of data is the portion of the user data that can be stored on an individual disk drive. The chunks of data are then arranged to be stored on the drives 42, 46, 50, 54, 58. In arranging the chunks of data, the controller organizes the chunks into stripes and generates parity for each stripe. The chunks of data are written across several drives, and the parity for that stripe is written to one disk drive.
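The arrangement of chunks into stripes can be illustrated with a minimal sketch. Several assumptions are made that the description above does not specify: a 4-byte chunk size, four data drives plus one parity drive, zero padding of the final stripe, and exclusive-OR parity. The name make_stripes is hypothetical.

```python
def make_stripes(user_data, chunk_size, data_drives):
    """Split user_data into stripes of `data_drives` chunks plus one parity chunk."""
    stripe_bytes = chunk_size * data_drives
    # Zero-pad so the data fills a whole number of stripes (an assumption).
    padded = user_data + bytes(-len(user_data) % stripe_bytes)
    stripes = []
    for start in range(0, len(padded), stripe_bytes):
        # One chunk per data drive.
        chunks = [padded[i:i + chunk_size]
                  for i in range(start, start + stripe_bytes, chunk_size)]
        # Parity chunk for this stripe (XOR parity is an assumption).
        parity = bytes(chunk_size)
        for chunk in chunks:
            parity = bytes(p ^ c for p, c in zip(parity, chunk))
        stripes.append((chunks, parity))
    return stripes

stripes = make_stripes(b"0123456789abcdefXYZ", chunk_size=4, data_drives=4)
```

In this sketch each tuple in stripes holds the chunks destined for the data drives and the parity chunk destined for the remaining drive; which physical drive receives the parity depends on the RAID level and is not modeled.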
As mentioned above, when using an active-active controller pair in a RAID system, in order to maintain redundancy, data and parity must be mirrored between the controllers in the active-active system. In such a system, when the host computer 14 sends user data to be written to the disk array 30, the user data is typically sent to either the first active controller 22, or the second active controller 26. Which controller the user data is sent to depends upon a number of factors, including the LUN the user data will be written to. A LUN is a SCSI term referring to a unit containing addressable storage, with each controller typically being assigned to or owning one or more of a number of LUNs. Thus, if the user data is to be written to a LUN that the first active controller 22 owns, the user data is sent to the first active controller 22. Likewise, if the user data is to be written to a LUN that the second active controller 26 owns, the user data is sent to the second active controller 26. In order to maintain redundancy between the two controllers 22, 26, the user data sent to the first active controller 22 must be copied onto the second active controller 26. Likewise, any user data sent to the second active controller 26 is copied onto the first active controller 22. The user data is copied between controllers because, for example, if the first active controller 22 suffers a failure, the second active controller 26 can then use the copy of the user data to complete any data writes which were outstanding on the first active controller 22 when it failed. This process of copying user data, as well as parity, between active controllers is known as mirroring.
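The routing of a write by LUN ownership can be sketched as a simple lookup. The ownership table below is purely hypothetical; only the controller reference numerals 22 and 26 come from FIG. 1.

```python
# Hypothetical ownership table: LUN number -> owning controller.
LUN_OWNER = {0: 22, 1: 22, 2: 26}

# In an active-active pair, each controller mirrors to the other.
PARTNER = {22: 26, 26: 22}

def route_write(lun):
    """Return (controller receiving the write, controller receiving the mirrored copy)."""
    owner = LUN_OWNER[lun]
    return owner, PARTNER[owner]

to_first = route_write(0)    # LUN 0 is owned by controller 22 in this table
to_second = route_write(2)   # LUN 2 is owned by controller 26 in this table
```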
Mirroring in such a system is typically necessary because when the host 14 sends user data to be written, the controller that receives the user data stores the user data in a memory location and sends a reply to the host 14 that the write is complete. Thus, even though the user data may not have been written to the disk array 30, the host 14 is notified that it has been written. If the controller that received the user data subsequently suffers a failure prior to writing the user data to the disk array 30, the data can be lost. However, if the controller mirrors the user data prior to sending the host 14 a reply that the data has been written, a failure of the controller can still be recovered without loss of the data, using the copy of the user data that was mirrored to the other controller.
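The ordering described above (store, mirror, then reply to the host) and the recovery it enables can be modeled with a minimal sketch. The Controller class, its method names, and the use of dictionaries to stand in for controller memory and the disk array are all assumptions made for illustration.

```python
class Controller:
    def __init__(self, name):
        self.name = name
        self.cache = {}      # own writes not yet on disk: LBA -> data
        self.mirror = {}     # copies of the partner's outstanding writes
        self.partner = None

    def host_write(self, lba, data):
        self.cache[lba] = data             # 1. store in local memory
        self.partner.mirror[lba] = data    # 2. mirror before replying
        return "write complete"            # 3. reply to the host

    def take_over(self, disk):
        """Complete the failed partner's outstanding writes from the mirrored copy."""
        disk.update(self.mirror)
        self.mirror.clear()

first, second = Controller("22"), Controller("26")
first.partner, second.partner = second, first
disk = {}

status = first.host_write(100, b"record")
# Assume the first controller fails here, before writing to the disk array.
second.take_over(disk)
```

Had step 2 been skipped, the host would hold a completion reply for data that existed only in the failed controller's memory.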
In many RAID applications, the controller(s) communicate with the host computer and disks using a Small Computer System Interface (SCSI) bus architecture. SCSI systems are widely available, with a variety of SCSI controllers available. Different components within a SCSI system communicate using either SCSI commands, or SCSI data transfers. SCSI commands are generally 10 byte or 12 byte command blocks, with command blocks of up to 16 bytes (128 bits) available in certain applications. Many SCSI controllers can process SCSI commands of up to 12 bytes without having to treat the command in a special manner. In these types of systems, SCSI commands of greater than 12 bytes are treated differently, often requiring processor attention for each individual byte of command beyond 12 bytes.
SCSI data transfers are used to transfer data between components in a SCSI system. A SCSI data transfer is generally required when a block of data needs to be transferred. When initiating a SCSI data transfer, the SCSI controller receiving the data transfer must be notified that the data is coming, in order to prepare to receive the data. Thus, a data transfer in a SCSI system is generally at least a two step process: a SCSI command is first sent to the receiving SCSI controller, followed by one or more SCSI data transfers.
When mirroring data between controllers 22, 26, several operations are performed by each controller 22, 26. With reference now to FIG. 2, a flow chart representation of a data mirroring operation is described. Initially, indicated by block 200, the host computer 14 sends user data to the first active controller 22, which opens a write operation. Next, as noted by block 204, the first active controller 22 sends a notification mirror message, which causes an interrupt, to the second active controller 26, notifying the second active controller 26 that data is going to be mirrored. An interrupt is a signal sent from a device, in this example the first active controller 22, to a processor, in this example the processor 78 portion of the second active controller 26, which causes the processor 78 to stop what it is doing and service the interrupt. When the second active controller 26 receives the interrupt, it discontinues any processing activity, and processes the notification mirror message. The notification mirror message is a SCSI command, and will be described in more detail below. Next, as indicated by block 208, the first active controller 22 sends a metadata mirror message including metadata, which causes an interrupt, to the second active controller 26. The metadata mirror message is a data transfer between controllers, and will be described in more detail below. The second active controller 26 then marks its NVRAM 82 as invalid for the data blocks specified in the metadata, as noted by block 212. Next at block 216, the first active controller 22 sends a data mirror message containing the user data, which causes an interrupt, to the second active controller 26. The data mirror message is a data transfer between controllers, which contains the actual user data. The second active controller 26 receives the user data, stores it in the appropriate NVRAM 82 location, and marks the NVRAM 82 as valid for the associated data blocks, as indicated by block 220.
At block 224, the second active controller 26 sends a status message to the first active controller 22, indicating that the data was received and stored. The first active controller 22, upon receiving the status message, sends the host computer 14 an indication that the write is complete, and terminates the write operation as noted by block 228. The first active controller 22, at block 232, then writes the user data to the disk array 30. As can be seen from this example, the controller receiving the mirrored data receives at least four interrupts: at least two interrupts associated with data transfers between controllers, and two interrupts associated with SCSI commands which indicate that mirroring operations are starting or are complete. As will be understood by those of skill in the art, a processor may disconnect during a data transfer. In such a situation, the sending controller will, after a predetermined time, resume sending the data, causing an additional interrupt. Thus, the minimum number of interrupts associated with a data transfer as described will be four interrupts, with additional interrupts possible if there are disconnects between the controllers when data is being transferred.
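The handling on the receiving controller at blocks 212, 220 and 224 can be sketched as follows. The MirrorTarget class and its method names are hypothetical, dictionaries stand in for the NVRAM 82 and its valid markers, and the interrupt mechanism itself is not modeled.

```python
class MirrorTarget:
    """Receiving side of a mirror operation (the second active controller 26)."""

    def __init__(self):
        self.nvram = {}   # block number -> mirrored data
        self.valid = {}   # block number -> True once the data has arrived

    def on_metadata(self, start_block, num_blocks):
        # Block 212: mark the blocks named in the metadata as invalid.
        for block in range(start_block, start_block + num_blocks):
            self.valid[block] = False

    def on_data(self, start_block, blocks):
        # Block 220: store the user data, then mark the blocks as valid.
        for offset, payload in enumerate(blocks):
            self.nvram[start_block + offset] = payload
            self.valid[start_block + offset] = True
        # Block 224: status returned to the sending controller.
        return "stored"

target = MirrorTarget()
target.on_metadata(8, 2)
pending = dict(target.valid)             # snapshot between metadata and data
status = target.on_data(8, [b"x", b"y"])
```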
As described above with reference to block 204, the first active controller 22 sends a notification mirror message, which causes an interrupt, to the second active controller 26. The notification mirror message serves simply to place the second active controller 26 on notice that metadata is going to be mirrored. A separate notification mirror message is required because, as described above, prior to the metadata mirror message, the second active controller 26 needs to be aware that a data transfer is coming. The notification mirror message is typically a 10 byte SCSI command, with the first byte containing a predefined hexadecimal number, for example 3C. When the second active controller 26 receives the notification mirror message and interrupt, it discontinues any operations that it was conducting in order to service the interrupt. The second active controller 26 then processes the notification mirror message and reads the hexadecimal number in the first byte. When the second active controller 26 reads the predefined hexadecimal number, 3C in this example, it recognizes this as a mirror notification command, and disregards the remaining 9 bytes of the notification mirror message. The second active controller 26 then prepares to receive the data transfer of the metadata mirror message from the first active controller 22.
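A minimal sketch of building and recognizing such a notification mirror message is given below. The value 3C and the 10 byte length come from the example above; filling the remaining nine bytes with zeros and the function names are assumptions.

```python
MIRROR_NOTIFY_OPCODE = 0x3C   # example value given in the text
CDB_LENGTH = 10               # typical length given in the text

def build_notification():
    """Build a 10-byte command block; only the first byte carries meaning here."""
    # The remaining nine bytes are zero-filled (an assumption).
    return bytes([MIRROR_NOTIFY_OPCODE]) + bytes(CDB_LENGTH - 1)

def is_mirror_notification(cdb):
    """Recognize the command from its first byte and disregard the other nine."""
    return len(cdb) == CDB_LENGTH and cdb[0] == MIRROR_NOTIFY_OPCODE

message = build_notification()
recognized = is_mirror_notification(message)
```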
With reference now to FIG. 3, a metadata mirror message 300 is now described. The metadata mirror message 300 is a SCSI data transfer, which contains a first data region 304 and a second data region 308. The first data region 304 in this example is a 512 byte structure which contains several information fields. The information fields of the first data region 304 include a 16-bit message type 312 field, which contains information which indicates that the message contains metadata. The next field is a 16-bit message length 316 field, which contains information on the length of the current message. Following the message length 316 field is a 32-bit sequence information 320 field, which in one example contains a number which is incremented during each mirror message, and is primarily used for troubleshooting purposes. The next field is a 32-bit starting logical block address (LBA) 324 field, which contains the starting LBA in the disk array 30 where the user data will be written. Following the starting LBA 324 field is a 32-bit number of blocks 328 field, which contains information on the number of blocks of user data to be mirrored. The next field is an 8-byte array serial number 332 field. In many applications, the disk array 30 contains several arrays of disk drives, with each array of disk drives having a unique array serial number. The array serial number field 332 contains the serial number of the array to which the user data is to be written. Following the array serial number field 332 is an 8-bit memory region field 336, which contains information on the region of NVRAM 82 that the data is to be stored in, the first region 86 or the second region 90. Following the memory region 336 field is an 8-bit array number 340 field. In many applications, each array of disk drives within the disk array 30 may contain several RAID arrays.
Each RAID array is given an array number, such that if a controller 22, 26 knows the array serial number and array number, it can read and write data to and from the appropriate RAID array. Following the array number field 340, is a portion 344, which may contain data, used to fill out the 512 byte transfer.
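The layout of the first data region 304 can be illustrated with a sketch using the field order and widths given above, which occupy 26 bytes and leave 486 bytes for the portion 344. Big-endian byte order, zero fill for the portion 344, the function names, and all field values in the usage lines are assumptions made for illustration.

```python
import struct

# message type (16-bit), message length (16-bit), sequence (32-bit),
# starting LBA (32-bit), number of blocks (32-bit),
# array serial number (8 bytes), memory region (8-bit), array number (8-bit)
HEADER = struct.Struct(">HHIII8sBB")   # byte order is an assumption
REGION_SIZE = 512

def pack_first_region(msg_type, length, sequence, start_lba, num_blocks,
                      serial, memory_region, array_number):
    header = HEADER.pack(msg_type, length, sequence, start_lba, num_blocks,
                         serial, memory_region, array_number)
    # Portion 344: fill out the 512-byte transfer (zero fill is an assumption).
    return header + bytes(REGION_SIZE - len(header))

def unpack_first_region(region):
    """Return the eight fields as a tuple, ignoring the filler."""
    return HEADER.unpack_from(region)

# Hypothetical values throughout.
region = pack_first_region(1, 1024, 7, 2048, 16, b"SN000001", 0, 3)
fields = unpack_first_region(region)
```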
The second data region 308 contains a 512 byte scatter/gather list 348. The scatter/gather list 348 is a list of scatter/gather elements 352. In the example of FIG. 3, the scatter/gather list 348 contains 64 scatter/gather elements 352. Each scatter/gather element 352 contains a 32-bit starting address field 356 and a 32-bit byte count field 360. The starting address field 356 contains the 32-bit address within the NVRAM 82 where the associated data will be stored. The byte count field 360 contains information on the number of bytes contained in the associated data. Thus, using the starting address field 356 and the byte count field 360, the portion of NVRAM 82 where the associated data will be stored is uniquely defined. Each scatter/gather element 352 is able to define the storage area for 16 Kbytes of user data. Thus, in the example shown, the metadata mirror message 300 can provide metadata for up to 1 Mbyte of user data. If more than 1 Mbyte of user data is sent to the controller 26 for storage, the controller 26 breaks the user data into separate pieces which are 1 Mbyte or less. The controller 26 uses the information contained in each scatter/gather element 352 to mark the appropriate portions of NVRAM 82 as invalid.
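The scatter/gather list 348 and the 1 Mbyte limit can be illustrated with a short sketch. It assumes that the user data occupies one contiguous area of NVRAM (an actual list may describe scattered areas), that unused elements are zero-filled, and that fields are big-endian; the function names are hypothetical.

```python
import struct

ELEMENT = struct.Struct(">II")       # 32-bit starting address, 32-bit byte count
MAX_ELEMENTS = 64
ELEMENT_SPAN = 16 * 1024             # each element covers up to 16 Kbytes
MAX_PER_MESSAGE = MAX_ELEMENTS * ELEMENT_SPAN   # 1 Mbyte

def build_sg_list(nvram_address, byte_count):
    """Return the 512-byte list describing one contiguous NVRAM area."""
    if byte_count > MAX_PER_MESSAGE:
        raise ValueError("split the write into pieces of 1 Mbyte or less")
    elements = []
    offset = 0
    while offset < byte_count:
        span = min(ELEMENT_SPAN, byte_count - offset)
        elements.append(ELEMENT.pack(nvram_address + offset, span))
        offset += span
    packed = b"".join(elements)
    # Zero-fill the unused elements (an assumption).
    return packed + bytes(MAX_ELEMENTS * ELEMENT.size - len(packed))

def split_write(byte_count):
    """Sizes of the pieces into which a large write is broken."""
    full, rest = divmod(byte_count, MAX_PER_MESSAGE)
    return [MAX_PER_MESSAGE] * full + ([rest] if rest else [])

sg_list = build_sg_list(0x1000, 40000)
pieces = split_write(2_500_000)
```

In this sketch a 40,000 byte write uses three of the 64 elements, while the list transferred is still 512 bytes long.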
As mentioned above, the processor on the receiving controller is interrupted at least four times for each data write that is mirrored to it. Servicing these interrupts can take a significant portion of the processing resources of a controller. Additionally, the metadata mirror message is 1 Kbyte, regardless of the size of the data write. Thus, for example, if 512 bytes of user data are received at the first active controller 22, the metadata mirror message will be 1 Kbyte, twice the size of the user data it describes. As can be seen, for small writes, mirroring a 1 Kbyte metadata message can be inefficient. Furthermore, when mirroring data, the controllers 22, 26 send the data over the disk channel 34, which typically has a maximum bandwidth of 160 Mb/s. When a RAID system is receiving many read and write requests from the host 14, the disk channel 34 can become bandwidth limited, increasing the time required to complete a read or write operation.
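The inefficiency for small writes can be made concrete with a short calculation. The sketch counts only the 1 Kbyte metadata mirror message and the user data, ignoring the bytes of the SCSI commands; the function name is hypothetical.

```python
METADATA_BYTES = 1024   # 512-byte first region + 512-byte scatter/gather list

def metadata_share(user_bytes):
    """Fraction of the mirrored bytes that is metadata rather than user data."""
    return METADATA_BYTES / (METADATA_BYTES + user_bytes)

small_write = metadata_share(512)           # a 512-byte write
large_write = metadata_share(1024 * 1024)   # a 1 Mbyte write
```

For the 512 byte write, two thirds of the bytes mirrored over the disk channel are metadata, whereas for a 1 Mbyte write the metadata is less than 0.1 percent of the bytes mirrored.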
The occurrence of small writes is common in many applications, such as financial, reservation and retail systems, where relatively small data records are widely used and are accessed and modified at random. These applications are often very active as well, with many read and write requests sent to the RAID system within relatively short time periods. In these types of applications, it would be especially advantageous to reduce the processor overhead required for servicing interrupts, and to reduce the disk channel bandwidth used for mirroring data.
Thus, it would be advantageous to have a method and apparatus which (1) reduces the number of interrupts the processor of a receiving controller receives during mirroring operations, and (2) consumes less channel bandwidth when mirroring data between controllers of an active-active controller pair.