Clustered database environments allow multiple instances of a relational database management system (RDBMS), running simultaneously on separate machines, to access a single shared database, which may also be distributed. In such systems, a request may be made to any of the machines, and the data will be retrieved from the single database. Such systems provide high availability, fault tolerance, consistency of data, load balancing, and scalability. An example of such an environment is Oracle Real Application Clusters (RAC) by Oracle Corporation, 500 Oracle Parkway, Redwood Shores, Calif.
In one implementation of a clustered DBMS environment, each of the machines, or “nodes,” includes a distributed lock manager (DLM) instance. DLM instances provide each DBMS instance with the ability to coordinate locking of and synchronize access to shared resources. DLM instances help to maintain concurrency on database resources, such as data blocks or files on a disk. Each node is connected to a set of shared disks that contains the database. Each database block is managed, or “mastered,” by a particular node in the cluster called a “master node.” If an RDBMS instance running on a first node needs to update a database block mastered by a second node, then the first node requests a lock from the master node (the second node), and the master node grants the requested lock to the first node.
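The lock request flow described above can be illustrated with a minimal sketch. The class `MasterNode`, the function `master_for`, the two lock modes, and the modulo rule for assigning blocks to masters are all hypothetical and chosen only for illustration; they are not taken from any particular DLM implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MasterNode:
    """Hypothetical stand-in for a DLM instance that masters some blocks."""
    name: str
    # block id -> (mode, set of holder names); an absent key means unlocked
    locks: dict = field(default_factory=dict)

    def request_lock(self, requester: str, block: int, mode: str) -> bool:
        """Grant a 'shared' or 'exclusive' lock if it is compatible."""
        if mode not in ("shared", "exclusive"):
            raise ValueError(f"unknown lock mode: {mode}")
        held = self.locks.get(block)
        if held is None:
            self.locks[block] = (mode, {requester})
            return True
        held_mode, holders = held
        if mode == "shared" and held_mode == "shared":
            holders.add(requester)
            return True
        # Any combination involving an exclusive lock is refused.
        return False

    def release_lock(self, requester: str, block: int) -> None:
        """Drop the requester's hold; forget the block when no holders remain."""
        held = self.locks.get(block)
        if held is None:
            return
        _, holders = held
        holders.discard(requester)
        if not holders:
            del self.locks[block]


def master_for(block: int, nodes: list) -> MasterNode:
    """Assumed mastering rule: block id modulo the number of nodes."""
    return nodes[block % len(nodes)]


nodes = [MasterNode("node1"), MasterNode("node2")]
master = master_for(7, nodes)  # block 7 is mastered by the second node
granted = master.request_lock("node1", 7, "exclusive")
```

In this sketch the first node obtains its lock only by asking the master of the block, and the master alone decides whether the request is compatible with locks already granted.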
In such an implementation, a disk may fail while the master of the failed disk continues to function. DBMS instances may then continue to successfully acquire locks on data blocks within the failed disk, causing a node that receives a lock to assume that it has access to a resource that is in fact unavailable. Likewise, the master may fail while the disk remains available, rendering a healthy shared disk unavailable for lack of a master capable of granting locks.
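Both mismatches can be shown with a small sketch in which lock granting and disk health are tracked independently. The names `Disk`, `Master`, and `access_block`, and the outcome strings, are hypothetical and serve only to illustrate the two cases.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    healthy: bool = True


@dataclass
class Master:
    alive: bool = True

    def grant_lock(self, block: int) -> bool:
        # In this sketch the master decides from its own state only and
        # never consults the health of the disk holding the block.
        if not self.alive:
            raise ConnectionError("master node is down")
        return True


def access_block(master: Master, disk: Disk, block: int) -> str:
    """Classify the outcome of one lock-then-IO attempt."""
    try:
        master.grant_lock(block)
    except ConnectionError:
        return "no lock: healthy disk unusable" if disk.healthy else "no lock"
    return "ok" if disk.healthy else "lock granted but disk unavailable"


failed_disk = access_block(Master(), Disk(healthy=False), 1)
failed_master = access_block(Master(alive=False), Disk(), 1)
```

Here `failed_disk` corresponds to a lock granted on an unavailable resource, and `failed_master` corresponds to a healthy disk that cannot be used because no lock can be granted.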
When a node in the cluster fails, it must be “fenced,” or cut off from access to shared resources. This process is called IO fencing (Input/Output fencing). The failed instance must be fenced to keep leftover write operations from making changes to shared storage once the recovery process begins. Once the failed node or DBMS instance is isolated from the cluster, other instances may assume that the state of the disk will not be changed by the failed instance.
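As a minimal sketch of the effect of fencing, the following assumes a hypothetical `SharedStorage` class that records which nodes have been fenced and rejects their writes. Real fencing mechanisms operate at the storage or hardware level; this illustration models only the resulting guarantee.

```python
class SharedStorage:
    """Hypothetical shared storage that enforces IO fencing per node."""

    def __init__(self):
        self.blocks = {}     # block id -> data
        self.fenced = set()  # names of nodes cut off from the storage

    def fence(self, node: str) -> None:
        """Cut the named node off from all further writes."""
        self.fenced.add(node)

    def write(self, node: str, block: int, data: str) -> None:
        """Apply a write unless the issuing node has been fenced."""
        if node in self.fenced:
            raise PermissionError(f"{node} is fenced from shared storage")
        self.blocks[block] = data


storage = SharedStorage()
storage.write("node1", 42, "committed value")
storage.fence("node1")  # node1 is declared failed before recovery begins

try:
    storage.write("node1", 42, "leftover write")  # arrives after fencing
    leftover_applied = True
except PermissionError:
    leftover_applied = False
```

Once `fence` has been called, surviving instances can rely on block 42 keeping the value it had when recovery began, regardless of what the failed node still attempts.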
However, IO fencing may be impeded by an instance that holds an exclusive lock on a shared resource but is not responding. Other instances, including the master of the shared resource, may not know whether the unresponsive instance is dead or alive, and therefore cannot determine whether the unresponsive node is performing IO to the data block for which it holds the exclusive lock. This is particularly troublesome when the unresponsive node may be performing a write operation to the data block, because granting a lock on the same block to a new instance may result in corruption if both instances attempt to modify the block. The unresponsive node may need to be rebooted to ensure that no pending IO operations exist before a lock is granted to another instance.
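The decision facing the master can be summarized in a short sketch. The enumeration `HolderState` and the function `can_regrant_exclusive` are hypothetical names, and the three states are a simplification of what a master may be able to observe about the current lock holder.

```python
from enum import Enum


class HolderState(Enum):
    RESPONSIVE = "responsive"      # holder answers messages from the master
    UNRESPONSIVE = "unresponsive"  # holder may be dead or may still be writing
    REBOOTED = "rebooted"          # holder was restarted; no pending IO remains


def can_regrant_exclusive(state: HolderState, lock_released: bool) -> bool:
    """Return True only when no write from the old holder can still arrive."""
    if state is HolderState.REBOOTED:
        return True
    if state is HolderState.RESPONSIVE:
        # A live holder must release the lock before it is granted again.
        return lock_released
    # Unresponsive: the master cannot rule out a write in progress.
    return False


safe = can_regrant_exclusive(HolderState.UNRESPONSIVE, lock_released=False)
```

Under these assumptions, an unresponsive holder blocks every new exclusive grant on the block until its state is resolved, which is the delay the preceding paragraph describes.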
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.