Modern database systems are often configured for high performance and high availability. In some installations, multiple computing nodes (e.g., in a clustered environment) deliver high-performance read/write application access by deploying different applications (e.g., accounts payable, accounts receivable, etc.) in a multi-instance configuration in which each computing node runs one or more concurrent instances. Often, high availability is fostered by deploying a standby database that serves applications such as report generation. One or more instances of the standby database are provisioned on a computing node different from the computing nodes used by the aforementioned read/write applications.
Database systems that support multiple concurrent applications strive to manage concurrent access by using semaphores or other forms of locks, and often the semaphores or locks are managed by a single “master” lock manager process running on one of the computing nodes.
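As an illustration of the single-master arrangement described above, the following is a minimal sketch, not any particular product's implementation. The class name `MasterLockManager`, its methods, and the resource and instance names are all hypothetical; the sketch assumes exclusive locks only and a single in-memory ownership table held by the master process.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MasterLockManager:
    """Hypothetical single 'master' lock manager (illustrative only)."""

    # Maps a resource name to the instance that currently owns its lock.
    owners: Dict[str, str] = field(default_factory=dict)

    def acquire(self, resource: str, instance: str) -> bool:
        """Grant the lock if it is free or already held by the requester."""
        holder = self.owners.get(resource)
        if holder is None or holder == instance:
            self.owners[resource] = instance
            return True
        return False  # Another instance holds the lock; the write must wait.

    def release(self, resource: str, instance: str) -> bool:
        """Release the lock only if the caller is the current owner."""
        if self.owners.get(resource) == instance:
            del self.owners[resource]
            return True
        return False

    def owner_of(self, resource: str) -> Optional[str]:
        """Return the owning instance, or None if the resource is unlocked."""
        return self.owners.get(resource)
```

Because every grant passes through this one table, the table's loss leaves the surviving instances with no shared record of who may write, which is the exposure the next paragraph describes.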
If the computing node on which the lock manager process is running fails, or if the lock manager process itself fails, then the locking mechanism that prevents conflicting writes to the database no longer performs as intended, and the database is at risk of corruption unless remedial steps are taken. Legacy remedial steps have included an immediate, forced shutdown of any instance that owns a lock. While such remedial steps often serve to prevent corruption of the database, less intrusive techniques for recovering after a failure are needed.
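The legacy remedial step can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `legacy_forced_shutdown`, the last-known ownership map, and the instance names are hypothetical, and a real system would terminate processes rather than compute a set.

```python
from typing import Dict, Set


def legacy_forced_shutdown(running: Set[str], lock_owners: Dict[str, str]) -> Set[str]:
    """Return the instances that remain after the legacy remedial step.

    `running` is the set of live instances; `lock_owners` maps each locked
    resource to the instance that owned it when the lock manager failed.
    Every instance that owns at least one lock is shut down.
    """
    owning_instances = set(lock_owners.values())
    return running - owning_instances


# Hypothetical cluster state at the moment the lock manager fails.
running = {"payables-1", "receivables-1", "reports-1"}
lock_owners = {"block-17": "payables-1", "block-42": "payables-1"}
survivors = legacy_forced_shutdown(running, lock_owners)
```

In this example a single lock-holding instance is stopped outright regardless of how much in-flight work it carries, which illustrates why the approach is considered intrusive.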
In some deployments, the lock manager process is configured to run on the same computing node as the standby database. Thus, in the event of a failure of the computing node running the standby database, both the lock manager and the standby instance need to be redeployed in order to return the cluster to its pre-defined high-availability configuration. Legacy remedial steps to return the cluster to its high-performance, high-availability configuration have included manual re-provisioning of the standby node. While such remedial steps often serve to return the cluster to its pre-defined high-availability configuration, more graceful techniques for recovering after a failed standby node are needed.