1. Technical Field
The present invention relates to multi-requester storage systems in general, and in particular to multi-requester storage systems having Active/Passive paired storage controllers. Still more particularly, the present invention provides a method for detecting potential logical unit (LUN) thrashing in a multi-requester storage system having an Active/Passive paired storage controller, and for preemptively applying ameliorative measures.
2. Description of the Related Art
It is well-known in the field of computing to have controllers for controlling resources. In many computer systems, controllers are provided in pairs that can be switched, either to balance processing load or to provide some level of redundancy in the event of a controller failure. Also, resources are presented to host applications and system software in a logical rather than a physical representation. For example, storage controllers may control storage disks to position data representing what a user sees as a “real” disk drive in some scattered form on the surfaces of several disks, perhaps to optimize the use of the disk surfaces or to optimize the time taken to seek the data. To the user, such scattering is invisible, because user programs see only a logical unit that behaves like a single disk, such as a “C-drive” on a typical personal computer. One example of such a resource controller system is a system comprising a pair of storage controllers that present disk storage to input/output (I/O) requesters in the form of logical units (LUNs).
Most Redundant Array of Independent Disks (RAID) controllers operate in pairs, each presenting an image of the same set of RAID arrays (or partitions of RAID arrays) made out of storage shared by both RAID controllers. If one of the RAID controllers fails, the RAID arrays are still accessible to requesters via the other controller and thus there is no single point of failure. The implementations of such RAID controllers can be split into two categories. Some RAID controllers are classified as Active-Active and allow simultaneous access to the same RAID array via either controller in the pair with no or virtually no degradation in performance. Other RAID controllers are classified as Active-Passive and allow only one controller to access a particular RAID array (or sometimes a partition of the RAID array) at a time. Some Active-Passive controllers require a particular sequence of commands to be issued to change which controller is active, while other controllers will automatically attempt to swap which controller is active depending on which controller receives I/O requests.
Allowing a single requester system to access storage presented by an Active-Passive pair of controllers is relatively straightforward. There are many implementations of software that can be used in the requester system for detecting the presence of an active controller and a passive controller that both present the same storage, and for presenting these controllers as a single storage device, or LUN, to the requester. Such software is also responsible for detecting when there is a problem in completing I/O requests via the active controller and for performing the necessary tasks to make the other controller active. In other words, the software automatically fails over to the other controller without actually failing an I/O request back up to higher levels of software executing on the requester system.
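The single-requester failover behavior described above can be illustrated with a minimal sketch. The class names `Controller`, `ControllerError` and `SingleLun`, and their methods, are hypothetical and are introduced here only for illustration; they do not correspond to any particular multipath driver or controller interface.

```python
class ControllerError(Exception):
    """Raised by a (hypothetical) controller that cannot complete an I/O."""


class Controller:
    """Toy model of one controller in an Active-Passive pair."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.active = False

    def submit(self, request):
        # A failed or passive controller cannot service the request.
        if not self.healthy:
            raise ControllerError(f"{self.name} has failed")
        if not self.active:
            raise ControllerError(f"{self.name} is passive")
        return f"{request} completed via {self.name}"


class SingleLun:
    """Presents the controller pair to the requester as one storage device."""

    def __init__(self, first, second):
        self.active, self.standby = first, second
        self.active.active = True

    def submit(self, request):
        try:
            return self.active.submit(request)
        except ControllerError:
            # Fail over: make the other controller active and retry once.
            # The error reaches higher-level software only if the retry
            # also fails.
            self.active.active = False
            self.standby.active = True
            self.active, self.standby = self.standby, self.active
            return self.active.submit(request)


a, b = Controller("A"), Controller("B")
lun = SingleLun(a, b)
first_result = lun.submit("read-1")   # serviced by controller A
a.healthy = False                     # simulate a failure of controller A
second_result = lun.submit("read-2")  # transparently retried on controller B
print(first_result)
print(second_result)
```

Because only one requester decides when to fail over, no coordination is needed in this sketch. That assumption no longer holds once several requesters share the same controller pair.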
However, there is a problem when multiple requester systems share storage that is presented by an Active-Passive pair of controllers, namely, that all the requesters must agree which controller is active and must coordinate when to change the active controller. Otherwise, a condition known as “LUN thrashing” occurs. In such a condition, different requesters each attempt to make a different controller the active controller of the pair, so that the role of active controller repeatedly swaps. Because the time to swap the active controller is usually orders of magnitude longer than the time to process an I/O request, the rate at which the controllers can process I/O drops dramatically. LUN thrashing can therefore present a serious performance problem, especially in large and complex, enterprise-level storage systems, such as storage-area networks (SANs).
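The cost of LUN thrashing can be illustrated with a small worked simulation of a pair that swaps automatically when the passive controller receives an I/O request. The timing values below (1 ms per I/O request and 1 s per swap) are assumptions chosen only to reflect the "orders of magnitude" relationship described above, and the function name `total_time_us` is hypothetical.

```python
def total_time_us(requests, io_us=1_000, swap_us=1_000_000, active="A"):
    """Return (elapsed microseconds, swap count) for a request sequence.

    Each entry in `requests` names the controller ("A" or "B") that a
    requester sends its I/O to. Timing values are illustrative assumptions.
    """
    elapsed = 0
    swaps = 0
    for target in requests:
        if target != active:
            # I/O arrived at the passive controller: the pair swaps roles
            # before the request can be serviced.
            elapsed += swap_us
            swaps += 1
            active = target
        elapsed += io_us
    return elapsed, swaps


# All requesters agree that controller A is active.
agreed = total_time_us(["A"] * 100)

# Two requesters disagree and alternate between controllers A and B.
thrashing = total_time_us(["A", "B"] * 50)

print("agreed:   ", agreed)
print("thrashing:", thrashing)
```

Under these assumed timings, the same 100 requests take 0.1 s when the requesters agree and roughly 99 s when they alternate, since almost every request forces a swap.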
Until recently, it has not been necessary for multiple requesters to directly access the same storage and thus the problem has not been previously addressed. However, as new direct access shared file systems (such as the IBM® Storage Tank), clustered storage appliances (such as IBM® 2145 Total Storage Virtualization Engine) and other SAN-based virtualization solutions become more common, the requirement for multi-requester access to storage has come into existence.
One solution to avoid the problem of LUN thrashing altogether is to use Active-Active RAID controllers. However, Active-Active controllers are more expensive and do not allow SAN-based virtualization solutions to work with existing Active-Passive RAID controllers.
A second solution to avoid the problem of LUN thrashing in storage systems having multiple requesters is to allow only one requester to have direct access to a particular LUN or RAID array and to forward all I/O requests from other requesters to that requester. This is effectively what network file systems such as NFS do. However, there is a significant performance penalty in forwarding all I/O requests from other requesters to a single requester. Again, this solution does not help in SAN-based virtualization systems.
A third solution is to allow the RAID controllers themselves to control which controller will be active and which controller will be passive. The disadvantage of this solution is that if both RAID controllers are functional but not all requesters can communicate with both RAID controllers because of a problem at the SAN level, the controllers may make the wrong decision as to which controller should be active, thus introducing a single point of failure.
Consequently, it would be desirable to provide an improved method for detecting potential LUN thrashing in a multi-requester storage system having Active/Passive paired storage controllers, and preemptively applying ameliorative measures.