1. Field of Art
This invention relates to shared controllable resources and more particularly relates to autonomously overriding a global resource lock of the shared controllable resources.
2. Background Technology
In a server environment where a plurality of controllable resources (e.g. storage resources such as hard drives, tape drives and optical storage drives) are shared in a joint or alternating fashion, access to a portion of the controllable resources may be made exclusive to a single resource controller (e.g. server) in order to execute a process while ensuring that the corresponding data remains consistent and accurate. Typically the server environment consists of two or more resource controllers that share requests from a plurality of connected host adapters and execute those requests upon the plurality of connected storage adapters (e.g. controllable resources). One requirement of a resource controller is the ability to coordinate concurrent processes that share a plurality of controllable resources.
Typically, when a resource controller receives a request to execute a process, the resource controller will obtain a resource lock. The resource lock gives temporary exclusive control of the controllable resources required for a resource controller to execute a certain process. The resource lock may give exclusive access to a portion of a single controllable resource, an entire controllable resource, or a portion of all the controllable resources attached to the system. If a portion of the controllable resources requested is currently in use, the resource lock request is queued until the full portion of controllable resources requested is available. Having secured the resource lock, the process is executed, followed by the release of the resource lock.
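By way of illustration only, the acquire, execute, and release sequence described above may be sketched as follows. The class name `ResourceLockManager`, its methods, and the resource names are hypothetical and do not appear in any conventional system described here. The sketch assumes a single-process model in which a waiting request blocks on a condition variable until the full set of requested resources is free; it does not enforce strict first-in, first-out ordering among waiting requests.

```python
import threading
from contextlib import contextmanager


class ResourceLockManager:
    """Hypothetical sketch: grants exclusive access to a set of resources."""

    def __init__(self):
        self._in_use = set()                 # resources currently locked
        self._cond = threading.Condition()   # waiters block here until a release

    def acquire(self, resources, timeout=None):
        """Wait until every requested resource is free, then lock them all.

        Returns True if the lock was obtained, False if the timeout expired.
        """
        wanted = frozenset(resources)
        with self._cond:
            granted = self._cond.wait_for(
                lambda: self._in_use.isdisjoint(wanted), timeout
            )
            if granted:
                self._in_use |= wanted
            return granted

    def release(self, resources):
        """Release the resources and wake any waiting requests."""
        with self._cond:
            self._in_use -= frozenset(resources)
            self._cond.notify_all()

    @contextmanager
    def locked(self, resources):
        """Secure the lock, run the process body, then release the lock."""
        self.acquire(resources)
        try:
            yield
        finally:
            self.release(resources)
```

A process would then execute inside `with manager.locked({"disk0", "disk1"}):`, so that the release follows the process even if the process raises an error.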
In the continually evolving information age, one thing remains a constant: the need for 100% availability of mission-critical data and applications. Whether it is for stock markets, corporate payroll, e-commerce, enterprise databases, medical records, internet banking, or reasons of national security, the need for availability of these mission-critical resources grows in line with the demand for increased storage capacity.
One of the biggest hindrances to low total cost of ownership in the server environment is the labor associated with managing storage-related issues. Managing storage resources and data automatically by system resources, rather than manually, helps minimize this cost. However, ensuring system-wide availability of the mission-critical data and applications continues to present a unique management challenge. Mission-critical business systems typically span host and distributed computing environments, managing many of the business processes for the success of an organization. Sharing data from business processes with the other strategic systems and applications in the environment requires a comprehensive solution. Yet, the solution should be simple enough to be incorporated autonomously with minimal administrator oversight and without unduly burdening system performance.
The dominant servers for such mission-critical applications requiring management of large-scale databases continue to be mainframes. Mainframes, such as the IBM z9-109 class of enterprise servers, are designed for high reliability, performance, broad-based connectivity options, and comprehensive enterprise storage solutions. However, despite numerous advancements in storage management, there is still room for improvement in the area of high availability of mission-critical resources. A problem exists when a resource lock for exclusive access to all of the controllable resources is given to a single resource controller and that controller fails while holding ownership of the lock.
For example, when communications between a dual cluster of resource controllers are severed, or when one of the resource controllers crashes, a protocol exists for the resource controllers to race for global ownership of all controllable resources, a global exclusion that supersedes all existing resource locks. The resource controller that wins the race takes ownership of all the controllable resources, whereas the resource controller that loses the race essentially becomes inactive, locked out from further accessing any of the controllable resources. This global exclusion cannot be cleared until either both resource controllers are rebooted and come up with full functionality, or the resource controller that lost the lock race comes back online with complete functionality and communications are restored.
Aside from the common side effects of mutual exclusion algorithms, including deadlocks, starvation, and priority inversion, a further problem exists in the case of the global exclusion algorithm. When a resource controller goes down unexpectedly (e.g. crashes) while holding the global resource lock, the other resource controller cannot come up autonomously to take over total ownership of the controllable resources. Access to all controllable resources is lost, causing a complete loss of availability of mission-critical data and applications, and resulting in increased administrative workloads and storage administration costs in order to restore system resources. For example, suppose server-A and server-B race for ownership and server-A wins the race for global exclusion of all controllable resources. Server-A then crashes and is unable to come back online due to a hardware problem. Under these circumstances, server-B is isolated and prevented from taking over ownership of the global resource lock, since the ability to release the lock is lost within the offline server-A.
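The lost lock scenario of the preceding example may be modeled, purely as an illustration, by the following sketch. The names `GlobalResourceLock`, `race`, `release`, and `lost_lock_scenario` are hypothetical, and the model assumes a simplified rule taken from the description above: only the current owner, while online, is able to release the global resource lock.

```python
class GlobalResourceLock:
    """Hypothetical model of a global lock that only its online owner can release."""

    def __init__(self):
        self.owner = None  # name of the controller holding the lock, if any

    def race(self, controller):
        """Attempt to take global ownership; the first caller wins.

        Returns True if `controller` owns the lock after the attempt.
        """
        if self.owner is None:
            self.owner = controller
        return self.owner == controller

    def release(self, controller, online):
        """Release succeeds only when the owner itself is online (assumption)."""
        if self.owner == controller and online.get(controller, False):
            self.owner = None
            return True
        return False


def lost_lock_scenario():
    """Replay the server-A / server-B example from the text."""
    lock = GlobalResourceLock()
    online = {"server-A": True, "server-B": True}

    a_won = lock.race("server-A")      # server-A wins the race
    b_won = lock.race("server-B")      # server-B loses and is locked out

    online["server-A"] = False         # server-A crashes with a hardware fault
    released = lock.release("server-A", online)  # the offline owner cannot release
    b_takeover = lock.race("server-B")           # server-B still cannot take over

    return a_won, b_won, released, b_takeover
```

In this model, `lost_lock_scenario()` returns `(True, False, False, False)`: server-A wins, server-B loses, the lock is never released, and server-B remains unable to take ownership.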
Conventional procedures are in place to resolve the lost lock scenario. One method involves bringing both resource controllers up together in order to clear the global resource lock, restore mutual access to system resources, and make the global resource lock available for a future race. However, this method requires both resource controllers to be available so that a system administrator can manually bring them back up in a fully functional condition, and in the meantime high availability of the mission-critical data and applications is not maintained. Other conventional recoveries necessitate bringing the resource controller that is the current owner of the global resource lock back online by itself. However, this requires the resource controller with global exclusion of resources to be in a fully functional condition directly following the failure that caused it to crash. The recovery merely sets up the same scenario, except that the resource controller holding the global resource lock has now recently crashed, thereby making a repetition of the lost lock scenario more likely.
From the foregoing discussion, it should be apparent that a need exists for an apparatus, system, and method that overcome the limitations of conventional lock override methods that rely on manual intervention. In particular, such an apparatus, system, and method would beneficially be independent of administrative supervision, thereby offering autonomic device-level recovery. The apparatus, system, and method would also beneficially reduce administrative workloads and maintain high availability of mission-critical data and applications.