A “cluster” is the result of “clustering” computing resources together in such a way that they behave like a single resource. Clustering is often used for purposes of parallel processing, load balancing, and fault tolerance. One common example of a cluster is a set of computers, or “nodes,” that are configured so that they behave like a single computer. Each computer in the cluster has shared access to a set of resources. A resource is, generally, any item that can be shared by the computers in the cluster. A common example of a resource is a block of memory in which information is stored. The block of memory may be part of a node in the cluster or may be external to the cluster, such as a database block.
A cluster comprises multiple nodes. Each node executes an instance of a server. Each server in a cluster facilitates access to a shared set of resources on behalf of clients of the cluster. One example of a cluster is a database cluster. In a database cluster, each node executes an instance of a database server. Each database server instance facilitates access to a shared database. Among other functions of database management, a database server governs and facilitates access to the database by processing requests by clients to access data in the database.
Sometimes, an operation that a database server instance is performing might be hindered by some problem. For example, a server instance might be attempting to perform an input/output (I/O) operation on a certain block of data that resides in the database. For reasons that are unknown to the server instance, the operation might be taking much longer to return a result than the server instance expects. For example, the server instance might expect that an I/O operation will take no more than 1 minute, but 5 minutes after initiating the I/O operation, the server instance might still be waiting for a result. The I/O operation might be taking a long time to return a result simply because the database is stored on slower hardware, such as a relatively slow hard disk drive or set of disks. However, the server instance has no way of knowing that this is the reason for the unexpected delay.
For another example, a storage system layer, which is logically situated beneath the database layer in which the database server instances operate, might manage I/O operations in a manner that is completely obscured from the database server instances. In such a configuration, the database server instances might send, to the storage system layer interface, read and write requests, but the database server instances may be unaware of exactly how the storage layer fulfills those requests. In some cases, the storage layer might operate upon a redundant array of independent disks (RAID). The database's data might be distributed among several different disks in the RAID. The storage system layer obscures this fact from the database server instances so that the database server instances are spared from having to determine which of the disks contain data upon which database operations are to be performed. To the database server instances, the RAID appears to be a single device. The storage system layer handles the task of identifying which disks in the RAID contain the data upon which the server instances request operations to be performed.
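The way a storage system layer can present several disks as a single device may be illustrated with a minimal sketch. The class name `StorageLayer`, the method names, and the modulo-based placement of blocks on disks are assumptions made purely for illustration; an actual storage layer may distribute data in an entirely different manner.

```python
class StorageLayer:
    """Hypothetical storage layer that hides a multi-disk array behind one interface."""

    def __init__(self, num_disks):
        # Each disk is modeled as a dictionary mapping block id -> data.
        self._disks = [dict() for _ in range(num_disks)]

    def _disk_for(self, block_id):
        # Assumed placement rule: blocks are spread across disks by block id.
        return self._disks[block_id % len(self._disks)]

    def write(self, block_id, data):
        self._disk_for(block_id)[block_id] = data

    def read(self, block_id):
        return self._disk_for(block_id)[block_id]


# A database server instance only ever names a block; it never names a disk.
storage = StorageLayer(num_disks=4)
storage.write(10, b"row data")
data = storage.read(10)
```

In this sketch the caller has no means of learning which disk served the request, which mirrors the situation of the database server instances described above.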
Under some circumstances, some of the disks in the RAID might be mirrors of other disks in the RAID, such that certain pairs of disks contain duplicate, or redundant, data. This redundancy is often desirable so that the data will have a greater chance of being constantly available despite the potential failure of one or more of the physical disks in the RAID. The database server instances typically will be unaware of this mirroring. Thus, when agents in the storage layer determine that one of the disks in the RAID contains faulty data or has otherwise experienced some failure, the database server instances likely will not be aware of the fact that the storage layer agents are attempting to switch over the performance of an I/O operation from the faulty disk to the mirror of the faulty disk. This switch-over to the mirror may cause the I/O operation to take much longer than normal to return a result. However, the database server instances probably will not have any way of knowing that the atypical delay is due to a switch-over being performed in the storage layer. Under such circumstances, the database server instances can only tell that the I/O operation is not returning a result. The database server instances will not know the reason for this.
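The effect of a switch-over on the requester may be sketched as follows. The class `MirroredPair`, the simulated costs (0.01 seconds for a normal read, 90 seconds for a switch-over), and the `fail_primary` method are all hypothetical and chosen only to make the behavior visible; no real elapsed time is involved.

```python
class MirroredPair:
    """Hypothetical pair of mirrored disks; read() reports a simulated duration."""

    def __init__(self, normal_cost_s=0.01, switchover_cost_s=90.0):
        self.normal_cost_s = normal_cost_s
        self.switchover_cost_s = switchover_cost_s
        self._primary = {}
        self._mirror = {}
        self._primary_failed = False
        self._switched = False

    def write(self, block_id, data):
        # Every write goes to both disks, so the two hold redundant data.
        self._primary[block_id] = data
        self._mirror[block_id] = data

    def fail_primary(self):
        self._primary_failed = True

    def read(self, block_id):
        """Return (data, simulated_seconds). The caller is never told why a read was slow."""
        if not self._primary_failed:
            return self._primary[block_id], self.normal_cost_s
        cost = self.normal_cost_s
        if not self._switched:
            # The first read after the failure pays for the switch-over.
            cost += self.switchover_cost_s
            self._switched = True
        return self._mirror[block_id], cost


pair = MirroredPair()
pair.write(7, b"payload")
pair.fail_primary()
data, seconds = pair.read(7)  # correct data, but far slower than usual
```

The read still returns the correct data; the only thing the requester can observe is the unusually long duration.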
Typically, in a database system, the database server instances are configured to wait for a specified amount of time before determining that an I/O operation has failed. For example, each server instance may be configured to wait for 1 minute for a result of an I/O operation to be returned from the storage layer. If 1 minute passes without a result being returned, then the server instance that requested the performance of the operation determines that the operation has failed. This is especially unfortunate under circumstances in which the delay is due to some remedial action (such as a switch-over to a mirror) being performed in the storage layer, which would have been completed if the database server instance had just waited a little while longer than the configured time-out amount.
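A fixed time-out of this kind may be sketched with a standard `threading.Event`. The function name `wait_for_io` and the scaled-down durations (a 0.05 second time-out against an operation that completes after 0.2 seconds) are assumptions standing in for the 1 minute time-out and a slightly longer switch-over.

```python
import threading


def wait_for_io(done, timeout_s):
    """Wait until `done` is set or the time-out elapses, then report a verdict."""
    # Event.wait returns True if the event was set, False if the time-out expired.
    return "completed" if done.wait(timeout=timeout_s) else "failed"


done = threading.Event()
# Stand-in for a storage layer that finishes its remedial action after 0.2 s.
timer = threading.Timer(0.2, done.set)
timer.start()

first = wait_for_io(done, timeout_s=0.05)   # gives up before the result arrives
second = wait_for_io(done, timeout_s=2.0)   # a longer wait sees the result
timer.join()
print(first, second)
```

Under these assumptions the first call declares a failure even though the operation would have completed had the caller waited somewhat longer, which is the situation the preceding paragraph describes.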
Introducing a further complication, different storage vendors provide storage equipment that operates at different speeds. The amount of time required by one vendor's RAID to perform a switch-over might differ significantly from the amount of time required by another vendor's RAID to perform such a switch-over. Because of this great variance among vendors' storage facilities, it is difficult for the designer or administrator of a database cluster to determine precisely how long each server instance should wait for a result of an operation before determining that the operation has timed out and failed.
In a database cluster, the consequences of determining that an operation has timed out can be severe. As is mentioned above, multiple server instances in the cluster share access to the same set of disks. Because of this sharing, locking mechanisms are usually implemented in the cluster to ensure that no server instance is able to access data that another server instance is currently modifying. Thus, a particular server instance may, in some cases, obtain a lock on a resource (such as a particular data block in the database) before modifying the data contained in that resource. Depending on the kind of lock, other server instances might be prevented from accessing (at least in certain ways) the resource while the particular server instance holds the lock on the resource. Under such circumstances, the other server instances are required to wait for the particular server instance to release the lock on the resource before those other server instances can continue their work. Server instances that are waiting for the particular server instance to release a lock are said to be “blocked.”
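The notion of a blocked server instance may be illustrated with a small lock table. The class `LockTable`, its method names, the instance names, and the restriction to exclusive locks are assumptions for illustration only; real cluster lock managers support several kinds of locks and are considerably more elaborate.

```python
from collections import defaultdict, deque


class LockTable:
    """Hypothetical table of exclusive locks shared by the instances of a cluster."""

    def __init__(self):
        self.holder = {}                   # resource -> instance holding the lock
        self.waiters = defaultdict(deque)  # resource -> instances waiting, in order

    def acquire(self, instance, resource):
        """Return True if the lock is granted, False if the instance is now blocked."""
        current = self.holder.get(resource)
        if current is None:
            self.holder[resource] = instance
            return True
        if current == instance:
            return True
        if instance not in self.waiters[resource]:
            self.waiters[resource].append(instance)
        return False

    def release(self, instance, resource):
        """Release the lock and hand it to the longest-waiting instance, if any."""
        if self.holder.get(resource) != instance:
            raise ValueError("instance does not hold this lock")
        del self.holder[resource]
        if self.waiters[resource]:
            next_holder = self.waiters[resource].popleft()
            self.holder[resource] = next_holder
            return next_holder
        return None

    def blocked_instances(self):
        return {inst for queue in self.waiters.values() for inst in queue}


locks = LockTable()
locks.acquire("instance-1", "block-42")            # granted
granted = locks.acquire("instance-2", "block-42")  # not granted: instance-2 is blocked
```

Until `instance-1` releases the lock, `instance-2` remains in the set returned by `blocked_instances()`.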
If a server instance process remains blocked for an amount of time that exceeds some specified threshold (e.g., 70 seconds), then monitoring agents in the database system may determine that the server instance has become stuck. In response to determining that a process is stuck in this manner, the monitoring agents attempt to determine the reason why the server instance process is stuck. In some cases, the monitoring agents might be able to resolve the problem by causing the lock for which the server instance is waiting to be released. However, if the monitoring agents determine that the server instance process is stuck while performing a read operation (which should not require the server instance to obtain a lock), then the monitoring agents might not be able to remedy the situation while permitting the server instance process to continue. A complication arises from the possibility that the stuck server instance process might currently be holding locks on other resources, and that still other server instances in the cluster might be waiting for those locks to be released. If the stuck server process is allowed to wait indefinitely, then all of those other server instances will be forced to wait also. This could, in some situations, result in a massive gridlock of the entire database cluster.
In order to prevent such massive gridlock from occurring, under circumstances in which the monitoring agents cannot solve the stuck-server problem by forcing the release of some set of locks (e.g., because the stuck server instance is stuck while performing a read operation that does not require the server instance to obtain a lock), the monitoring agents may instruct the server instance process to terminate itself gracefully. Upon receiving such a termination instruction, the server instance process exits of its own volition and ceases executing. As the server instance process terminates itself, the server instance process also releases the locks that it might be holding on other resources in the database, so that other server instances that are waiting for those locks can finally obtain those locks and continue performing their work.
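The decision that the monitoring agents make may be sketched as follows. The 70 second threshold is the example value given above; the class `ServerProcess`, the function `handle`, and the `"lock"` and `"read"` labels are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field

STUCK_THRESHOLD_S = 70.0  # example threshold from the text


@dataclass
class ServerProcess:
    name: str
    waiting_on: str            # "lock" or "read" (hypothetical labels)
    waited_s: float
    held_locks: set = field(default_factory=set)
    running: bool = True

    def terminate_gracefully(self):
        """Release every held lock, stop running, and return the released locks."""
        released = set(self.held_locks)
        self.held_locks.clear()
        self.running = False
        return released


def handle(process):
    """Return (action, released_locks) as a monitoring agent might decide."""
    if process.waited_s <= STUCK_THRESHOLD_S:
        return "not stuck", set()
    if process.waiting_on == "lock":
        # The agent may be able to cause the awaited lock to be released.
        return "release awaited lock", set()
    # Stuck on a read: no lock release can help, so instruct graceful termination.
    return "terminated", process.terminate_gracefully()


stuck = ServerProcess("instance-1", "read", waited_s=75.0, held_locks={"block-7"})
action, released = handle(stuck)
```

In this sketch the terminated process gives up its lock on `block-7`, so an instance waiting for that lock could then proceed.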
Unfortunately, even the termination of a server instance process in this manner will sometimes fail to solve the problem that is afflicting the cluster as a whole. If the terminated server instance was unduly delayed only because the storage layer was busy performing a switch-over to a mirror, as discussed above, then other remaining server instances that attempt to access the same data that the terminated server instance was accessing also will become stuck. Assuming that these other server instances experience a similar delay and are handled in the same manner as the previously terminated server instance, each of these other server instances also will be terminated. If the data blocks involved in the switch-over are very popular, such that those data blocks are often the subject of read operations in the cluster, and if the switch-over operation lasts long enough, then the undesirable result might be the termination of all, or nearly all, of the server instances in the cluster.
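The cascade may be illustrated with a small simulation in which time is represented by plain numbers rather than real delays. The function name `simulate_cascade`, the instance names, the read start times, and the 300 second switch-over are all hypothetical; the 70 second threshold is the example value used earlier.

```python
def simulate_cascade(read_start_times, switchover_end_s, stuck_threshold_s=70.0):
    """Return the instances terminated while a switch-over is in progress.

    read_start_times maps an instance name to the time (in seconds) at which it
    starts reading an affected block. A read returns only once the switch-over
    ends; an instance that waits longer than the threshold is terminated.
    """
    terminated = []
    for name, start in sorted(read_start_times.items(), key=lambda kv: kv[1]):
        wait = max(0.0, switchover_end_s - start)
        if wait > stuck_threshold_s:
            terminated.append(name)
    return terminated


# Four instances read a popular block during a switch-over that ends at t = 300 s.
starts = {"instance-1": 0.0, "instance-2": 30.0, "instance-3": 100.0, "instance-4": 250.0}
lost = simulate_cascade(starts, switchover_end_s=300.0)
```

Under these assumed numbers, three of the four instances wait longer than 70 seconds and are terminated; only `instance-4`, which begins its read 50 seconds before the switch-over ends, survives.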