The need to store digital files, documents, pictures, images and other data continues to increase rapidly. In connection with the electronic storage of data, systems incorporating more than one storage device have been devised. In general, using a number of storage devices in a coordinated fashion in order to store data can increase the total storage volume of the system. In addition, data can be distributed across the multiple storage devices such that data will not be irretrievably lost if one of the storage devices (or in some cases more than one storage device) fails. An additional advantage that can be achieved by coordinating operation of a number of individual storage devices is improved data access and/or storage response times. Examples of systems that can provide such advantages can be found in the various RAID (redundant array of independent disks) levels that have been developed.
High availability is a key concern because in many applications users rely heavily on the data stored on the RAID system. In these types of applications, unavailability of data stored on the RAID system can result in significant loss of revenue and/or customer satisfaction. Employing a RAID system in such an application enhances availability of the stored data, since if a single disk drive fails, data may still be stored and retrieved from the system. In addition to the use of a RAID system, it is common to use redundant RAID controllers to further enhance the availability of such a storage system. In such a situation, two or more controllers are used such that, if one of the controllers fails, the remaining controller will assume operations for the failed controller. The availability of the storage system is therefore enhanced, because the system can sustain a failure of a single controller and continue to operate. When using dual controllers, each controller may conduct independent read and write operations simultaneously. This is known as an active-active configuration. In an active-active configuration, customer data, including write-back data and associated parity data, and metadata are mirrored between the controllers.
In a system using two controllers, data sent from the host to be written to the disk array is typically sent to either the first active controller or the second active controller. Where the data is sent depends upon the location in the disk array to which the data will be written. In active-active systems, typically each controller is zoned to a specific array of drives or a specific area, such as a partition or logical unit number (LUN). Thus, if data is to be written to the array or array partition that the first active controller is zoned to, the data is sent to the first active controller. Likewise, if the data is to be written to an array or array partition that the second active controller is zoned to, the data is sent to the second active controller. In order to maintain redundancy between the two controllers, the data sent to the first active controller must be copied onto the second active controller. Likewise, the data sent to the second active controller must be copied onto the first active controller.
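The zoning and mirroring behavior described above can be illustrated with a minimal sketch. It is not taken from any particular controller implementation: the `Controller` class, the `host_write` function, the use of a LUN as the zoning unit, and the in-memory dictionaries standing in for controller cache are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical model of one controller in an active-active pair."""
    name: str
    zoned_luns: set                               # LUNs this controller is zoned to
    cache: dict = field(default_factory=dict)     # write data this controller received
    mirror: dict = field(default_factory=dict)    # copy of the partner's write data


def host_write(controllers, lun, lba, data):
    """Send a write to the controller zoned to `lun`, then copy it to the partner."""
    # The destination controller depends on where the data will be written.
    owner = next(c for c in controllers if lun in c.zoned_luns)
    partner = next(c for c in controllers if c is not owner)
    owner.cache[(lun, lba)] = data
    # Redundancy: the same data must also be held by the other controller.
    partner.mirror[(lun, lba)] = data
    return owner.name


# Illustrative zoning: controller A owns LUNs 0-1, controller B owns LUNs 2-3.
a = Controller("A", {0, 1})
b = Controller("B", {2, 3})
first_owner = host_write([a, b], lun=0, lba=64, data=b"alpha")
second_owner = host_write([a, b], lun=2, lba=100, data=b"beta")
```

In this sketch each write ends up in exactly two places, the owner's cache and the partner's mirror, which is the property that allows either controller to stand in for the other.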
When a controller in an active-active controller pair suffers a failure, the other active controller recognizes the failure and takes control of the write and read operations of the failed controller. This may include the surviving controller determining whether the failed controller had data writes outstanding. If data writes are outstanding, the surviving controller may issue a command to write the new data and parity to the target array or array partition. Furthermore, following the failure of a controller, the surviving controller can perform new write operations that would normally have been handled by the failed controller.
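The takeover sequence can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of an actual failover implementation: the `Controller` class, the `take_over` function, and the `disk` dictionary standing in for the target array are hypothetical, and parity generation is omitted.

```python
class Controller:
    """Hypothetical model of one controller in an active-active pair."""

    def __init__(self, name, zoned_luns):
        self.name = name
        self.zoned_luns = set(zoned_luns)
        self.mirror = {}    # (lun, lba) -> data mirrored from the partner
        self.alive = True


def take_over(survivor, failed, disk):
    """Complete the failed controller's outstanding writes and assume its zones."""
    failed.alive = False
    # Outstanding writes are the entries the survivor still holds in its mirror.
    outstanding = sorted(survivor.mirror.items())
    for (lun, lba), data in outstanding:
        disk[(lun, lba)] = data    # write the data to the target array (parity omitted)
    survivor.mirror.clear()
    # New writes to the failed controller's zones are now handled by the survivor.
    survivor.zoned_luns |= failed.zoned_luns
    return len(outstanding)


disk = {}
a = Controller("A", [0, 1])
b = Controller("B", [2, 3])
b.mirror[(0, 64)] = b"pending"    # a write A had accepted but not yet written to disk
completed = take_over(b, a, disk)
```

The survivor can complete the outstanding write only because the data was mirrored to it before the failure occurred.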
In a typical system, both controllers process individual host commands, including host direct memory access (DMA) operations, simultaneously. The primary controller, that is, the controller that received the host write, then updates its metadata to describe the new customer data that it has received. In particular, the metadata for a chunk of customer data can include the RAID array (LUN), logical block address (LBA) and sectors (bitmap) that are present in the chunk of customer data. In order to update the metadata maintained for the chunk of customer data by the secondary controller, the primary controller sends a message that is in addition and subsequent to the mirrored customer data. This extra message consumes bandwidth on the link between the controllers, and causes an interrupt to be generated in the secondary controller's central processing unit (CPU). In addition, because the message requires that a read-modify-write operation be performed by the CPU, the operation is slow. The secondary controller also updates its CPU memory tables or mirror hash table representing the new customer data. Accordingly, the typical process for mirroring data between paired controllers consumes both time and bandwidth.
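The per-chunk metadata and the read-modify-write update performed by the secondary controller can be sketched as follows. The field names, the one-bit-per-sector bitmap encoding, and the functions `sectors_to_bitmap` and `apply_metadata_message` are assumptions made for illustration; the text above specifies only that the metadata identifies the LUN, the LBA and the sectors present.

```python
from dataclasses import dataclass


@dataclass
class ChunkMetadata:
    """Hypothetical metadata record for one chunk of customer data."""
    lun: int                 # RAID array (LUN) the chunk belongs to
    lba: int                 # logical block address of the chunk
    sector_bitmap: int = 0   # bit i set means sector i of the chunk is present


def sectors_to_bitmap(first_sector, count):
    """Build a bitmap with `count` consecutive bits set, starting at `first_sector`."""
    return ((1 << count) - 1) << first_sector


def apply_metadata_message(table, chunk_id, lun, lba, new_bits):
    """Handle the separate metadata message on the secondary controller."""
    meta = table.get(chunk_id)                     # read the current record
    if meta is None or (meta.lun, meta.lba) != (lun, lba):
        meta = ChunkMetadata(lun, lba)             # chunk is new or has been reused
    meta.sector_bitmap |= new_bits                 # modify: merge the new sectors
    table[chunk_id] = meta                         # write the record back
    return meta


table = {}
apply_metadata_message(table, 7, lun=1, lba=4096, new_bits=sectors_to_bitmap(0, 2))
meta = apply_metadata_message(table, 7, lun=1, lba=4096, new_bits=sectors_to_bitmap(4, 1))
```

After the two messages in this example, `meta.sector_bitmap` equals `0b10011`, indicating that sectors 0, 1 and 4 are present. Each message costs a table lookup, a merge and a store on the secondary controller's CPU, in addition to the mirrored data transfer itself.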