This invention relates generally to data storage systems. More particularly, the invention relates to the management of a data storage system by multiple disk array controllers in an n-way active configuration, such that a disk array controller can detect the failure of and reset one or more other disk array controllers in the data storage system.
Disk drives in all computer systems are susceptible to failures caused, for example, by temperature variations, head crashes, motor failure, controller failure, and changing supply voltage conditions. Modern computer systems typically require, or at least benefit from, a fault-tolerant data storage system that protects its data against any instance of data storage system component failure. One approach to meeting this need is to provide a redundant array of independent disks (RAID) operated by a disk array controller (controller).
A RAID system typically includes a single standalone controller, or multiple independent controllers, wherein each controller operates independently with respect to the other controllers. A controller is generally coupled across one or more input/output (I/O) buses both to a rack of disk drives and also to one or more host computers. The controller processes I/O requests from the one or more host computers to the rack of disk drives. Such I/O requests include, for example, Small Computer System Interface (SCSI) I/O requests, which are known in the art.
Such a RAID system provides fault tolerance to the one or more host computers, at a disk drive level. In other words, if one or more disk drives fail, the controller can typically rebuild any data from the one or more failed disk drives onto any surviving disk drives. In this manner, the RAID system handles most disk drive failures without interrupting any host computer I/O requests.
Consider what would happen if a controller in a single-controller system failed: the entire data storage system would become inoperable. And, although failure of a single controller in a data storage system that is being managed by multiple independent controllers will not typically render the entire RAID system inoperable, such a failure will prevent completion of the tasks that were being performed by the failed controller, and/or those tasks scheduled to be performed by the failed controller. In light of the above, it can be appreciated that it is not only desirable for a data storage system to function reliably when a disk drive failure occurs, but it is also desirable for the data storage system to function reliably with any type of failed component, including a failed controller.
To provide fault tolerance to a data storage system at a controller level, data storage systems managed by two controllers in dual active configuration were implemented. Referring to FIG. 1, there is shown data storage system 100 being managed by two controllers 102 and 104 in dual active configuration, according to the state of the art. Controllers 102 and 104 are coupled across first peripheral bus 106, for example, an optical fiber, copper coax cable, or twisted pair (wire) bus, to a plurality of storage devices, for example, disk drives 108-112, in peripheral 114. Controllers 102 and 104 are also coupled across a second peripheral bus 116, for example, an optical fiber, copper coax cable, or twisted pair (wire) bus, to one or more host computers, for example, host computer 118.
From the viewpoint of controller 102, controller 104 is its partner controller, and from the viewpoint of controller 104, controller 102 is its partner controller. To determine when a partner controller has failed, controllers 102 and 104 are connected across ping cable 120. Each respective controller 102 and 104 is responsible for sending ping messages to the other controller 102 or 104 across ping cable 120.
Receipt of a ping message by a controller 102 or 104 from its partner controller 102 or 104 informs the receiving controller 102 or 104 that the partner controller 102 or 104 is alive, and not malfunctioning from a hardware problem or another problem. Conversely, when a particular controller 102 or 104 stops receiving ping messages from its partner controller 102 or 104 for a predetermined amount of time, the particular controller 102 or 104 determines that the partner controller 102 or 104 has, in some manner, failed.
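The timeout rule described in the preceding paragraph can be illustrated with a minimal sketch. The class name `PartnerMonitor`, the `timeout_s` parameter, and the use of explicit timestamps are assumptions made for illustration only; the description above does not specify how the predetermined amount of time is chosen or how ping messages are encoded.

```python
class PartnerMonitor:
    """Tracks ping messages from one partner controller (hypothetical helper)."""

    def __init__(self, timeout_s, start_time=0.0):
        # timeout_s stands in for the "predetermined amount of time";
        # its value is an assumption, not taken from the description.
        self.timeout_s = timeout_s
        self.last_ping = start_time

    def record_ping(self, now):
        """Call when a ping message arrives across the ping cable."""
        self.last_ping = now

    def partner_failed(self, now):
        """True once no ping has been received for longer than timeout_s."""
        return (now - self.last_ping) > self.timeout_s


# Example: a ping arrives at t=1.0; the partner is presumed failed
# only after more than 3.0 seconds pass without another ping.
monitor = PartnerMonitor(timeout_s=3.0)
monitor.record_ping(now=1.0)
alive_check = monitor.partner_failed(now=2.0)
late_check = monitor.partner_failed(now=5.0)
```

In this sketch a single monitor watches a single partner, which mirrors the one ping cable of the dual active configuration.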
In the event that a controller 102 or 104 fails, the surviving controller 102 or 104 will take over the tasks that were being performed by the failed controller 102 or 104, and will also perform those tasks that were scheduled to be performed by the failed controller 102 or 104. Additionally, if the failure is of a type for which reset is an adequate solution, the surviving controller 102 or 104 will typically attempt to reset the failed controller 102 or 104 by sending it a reset signal across a reset line 122. Such reset signals are known. (It can be appreciated that, in some instances, the failed controller 102 or 104 may require replacement or repair, in which case a reset by a surviving controller 102 or 104 may be inadequate.)
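The takeover and reset sequence described above might be sketched as follows. The `Controller` class, its fields, and the `fail_over` function are hypothetical names introduced for illustration; how a real controller transfers tasks, or decides that a reset is adequate, is not specified in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical model of a disk array controller's task state."""
    name: str
    active_tasks: list = field(default_factory=list)
    scheduled_tasks: list = field(default_factory=list)
    reset_count: int = 0  # number of reset signals received

    def receive_reset(self):
        # Stands in for a reset signal arriving across the reset line.
        self.reset_count += 1


def fail_over(survivor, failed, reset_is_adequate):
    """Move the failed controller's work to the survivor, then optionally reset it."""
    # The survivor takes over tasks in progress and tasks still scheduled.
    survivor.active_tasks.extend(failed.active_tasks)
    survivor.scheduled_tasks.extend(failed.scheduled_tasks)
    failed.active_tasks.clear()
    failed.scheduled_tasks.clear()
    # A reset is attempted only for failure types where it can help
    # (the decision itself is an assumed input here).
    if reset_is_adequate:
        failed.receive_reset()
    return survivor


c102 = Controller("102", ["rebuild"], ["scrub"])
c104 = Controller("104", ["host-io"], ["flush"])
fail_over(survivor=c102, failed=c104, reset_is_adequate=True)
```

After the call, controller 102 in this sketch holds both controllers' tasks and controller 104 has received one reset signal.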
Consider that the failure of both controllers 102 and 104 would destroy the fault tolerance and functionality of data storage system 100. It would be advantageous and desirable to manage a data storage system with more than two controllers (as in the above described dual active controller configuration), such that at least two controllers could fail before such fault tolerance and functionality of a data storage system is destroyed.
A significant problem with the state of the art is that it does not provide any system, structure, or method for a controller 102 or 104 to detect the failure of, or reset, any controller 102 or 104 other than a single partner controller 102 or 104. To illustrate this, consider that ping cable 120 and reset line 122 are hardwired between controllers 102 and 104, such that respective controllers 102 and 104 can only detect the failure of and reset a partner controller 102 or 104.
For more than two controllers to manage a data storage system in active controller configuration, each respective controller would require an ability to detect the failure of and reset more than just a single other controller. According to state of the art methodologies for detecting the failure of a partner controller 102 or 104, such a controller 102 or 104 would need a backplane that accommodates more than just one respective ping cable and reset line in order to detect the failure of and reset more than just a single other controller in the data storage system. The design and implementation of such a backplane would typically add expense to the cost of a controller and a data storage system. Additionally, significant manual intervention by a human system administrator may be required to add and connect such ping cables and reset lines between the controllers, possibly even necessitating that the system be shut down during such intervention.
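To give a sense of how the wiring burden grows, the following sketch counts dedicated connections under one stated assumption: every pair of controllers is joined by its own ping cable and its own reset line, as controllers 102 and 104 are in FIG. 1. The function name `dedicated_link_counts` and the returned keys are hypothetical, and other wiring topologies would give different numbers.

```python
def dedicated_link_counts(n_controllers):
    """Count ping cables and reset lines for a fully pairwise-wired system.

    Assumes one ping cable and one reset line per pair of controllers.
    """
    if n_controllers < 2:
        raise ValueError("at least two controllers are needed for a partner link")
    pairs = n_controllers * (n_controllers - 1) // 2
    return {
        "ping_cables": pairs,
        "reset_lines": pairs,
        # Each controller needs one connection of each kind per other controller.
        "connections_per_controller": n_controllers - 1,
    }


dual = dedicated_link_counts(2)
quad = dedicated_link_counts(4)
```

Under this assumption, the dual active configuration needs one ping cable and one reset line, while four controllers would need six of each, with three connections of each kind on every controller's backplane.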
Therefore, there is a need for a data storage system that is managed by more than just two controllers in active controller configuration. There is a need for each controller in such a data storage system to be able to detect the failure of and reset more than just a single other partner controller in the data storage system. To accomplish this, it is desirable that such a controller will not require a redesign of the controller's backplane to accommodate an arbitrary number of ping cables and reset lines.