The need to store digital files, documents, images and other data continues to increase rapidly. In connection with the electronic storage of data, systems incorporating more than one storage device have been devised. In general, using a number of storage devices in a coordinated fashion in order to store data can increase the total storage volume of the system. In addition, data can be distributed across the multiple storage devices such that data will not be irretrievably lost if one of the storage devices (or in some cases more than one storage device) fails. An additional advantage that can be achieved by coordinating operation of a number of individual storage devices is improved data access and/or storage times. Examples of systems that provide such advantages can be found in the various RAID (redundant array of independent disks) levels that have been developed.
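By way of illustration only, the following minimal sketch shows one well-known way in which data distributed across several devices can survive the loss of one device: byte-wise XOR parity, as used in parity-based RAID levels such as RAID 5. The function names `xor_blocks` and `rebuild_missing` and the sample block contents are hypothetical and are not taken from any particular system.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])

# Three data blocks, each imagined to live on a different storage device.
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is imagined to live on a fourth device.
parity = xor_blocks(data)

# Simulate losing the device that held data[1] and rebuild its contents.
recovered = rebuild_missing([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the parity block with every surviving data block yields the missing block. A single parity block of this kind tolerates the loss of only one device; tolerating more than one failure requires additional redundancy.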
High availability is a key concern because in many applications users rely heavily on the data stored on a storage system. In these types of applications, unavailability of data stored on the storage system can result in significant loss of revenue and/or customer satisfaction. Employing a RAID system in such an application enhances availability of the stored data, since if a single disk drive fails, data may still be stored to and retrieved from the system. It is common to use redundant storage system controllers to further enhance the availability of such a storage system. In such a situation, two or more controllers are used such that, if one of the controllers fails, a remaining controller will assume operations for the failed controller. The availability of the storage system is therefore enhanced, because the system can sustain a failure of a controller and continue to operate. When using dual controllers, each controller may conduct independent read and write operations simultaneously. This is known as an active-active configuration. In an active-active configuration, write-back data and associated parity data are mirrored between the controllers.
When a controller in an active-active controller pair suffers a failure, the other active controller recognizes the failure and takes control of the read and write operations of the failed controller. This may include the surviving controller determining whether the failed controller had data writes outstanding. If data writes are outstanding, the surviving controller issues a command to write the new data and parity to the target array or array partition. Furthermore, following the failure of a controller, the surviving controller can perform new write operations that would normally have been handled by the failed controller.
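The mirroring and failover behavior described above can be sketched, under simplifying assumptions, as follows. The `Controller` class, the `cached_write` and `fail_over` functions, and the dictionary standing in for the target array are all hypothetical and purely illustrative; parity handling and failure detection are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    name: str
    alive: bool = True
    owned_luns: set = field(default_factory=set)
    # Copy of the partner's write-back data not yet committed to the array,
    # keyed by (lun, block). Assumed layout, for illustration only.
    partner_mirror: dict = field(default_factory=dict)

def cached_write(partner, lun, block, data):
    """Mirror a write-back entry to the partner before it reaches the array."""
    partner.partner_mirror[(lun, block)] = data

def fail_over(survivor, failed, array):
    """Commit the failed controller's outstanding writes, then take over its LUNs."""
    failed.alive = False
    flushed = len(survivor.partner_mirror)
    for key, data in survivor.partner_mirror.items():
        array[key] = data          # write outstanding data to the target array
    survivor.partner_mirror.clear()
    survivor.owned_luns |= failed.owned_luns
    failed.owned_luns = set()
    return flushed

array = {}                                   # stands in for the disk array
ctrl_a = Controller("A", owned_luns={0})
ctrl_b = Controller("B", owned_luns={1})

# Controller A accepts a write for LUN 0 and mirrors it to controller B.
cached_write(ctrl_b, lun=0, block=7, data=b"new")

# Controller A fails before committing; controller B completes the write.
flushed = fail_over(ctrl_b, ctrl_a, array)
```

After the call, the surviving controller has committed the one outstanding write and owns both LUNs, so new writes that would normally have gone to the failed controller can be directed to the survivor.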
In order to provide fault tolerant connections between hosts and storage system controllers, whether directly or through intermediate switches, proper physical connections must be established. Typically, the connections between storage systems and hosts or other nodes should be completed in redundant pairs. In addition, each logical unit number (LUN) must be accessible to a host from either controller in a storage system controller pair. Moreover, even in systems that present unified LUNs, miswiring can leave the system vulnerable to loss of access should one of the controllers fail. The improper connection of nodes can also result in sub-optimal performance. However, establishing proper connections is prone to human error. In addition, improper connections are often not apparent until a failure of one controller in a controller pair occurs, because an improperly connected system will often function correctly during normal (non-failover) operation.
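As a purely illustrative sketch of the kind of miswiring described above, the following code checks a list of cabled paths and reports any host that lacks a path to one of the controllers in a pair. The function name `find_wiring_faults`, the host names, and the controller labels "A" and "B" are hypothetical; intermediate switches and per-LUN presentation are not modeled.

```python
def find_wiring_faults(links, controllers=("A", "B")):
    """Return {host: [controllers the host cannot reach]} for miswired hosts.

    links is an iterable of (host, controller) pairs, one per cabled path.
    """
    required = set(controllers)
    reachable = {}
    for host, controller in links:
        reachable.setdefault(host, set()).add(controller)
    return {
        host: sorted(required - seen)
        for host, seen in reachable.items()
        if required - seen
    }

links = [
    ("host1", "A"), ("host1", "B"),   # properly wired redundant pair
    ("host2", "A"), ("host2", "A"),   # both cables land on controller A
]
faults = find_wiring_faults(links)
```

In this example `host2` has two connections, so it operates normally while controller A is healthy, yet it would lose access if controller A failed. This is the situation in which the error remains hidden until failover.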