1. Field of the Invention
Embodiments of the invention generally relate to redundant computer systems. More specifically, this disclosure relates to a method and apparatus for preventing concurrency violations among resources in redundant computer systems, such as clustered computer systems.
2. Description of the Related Art
Computer systems and their components are subject to various failures. These failures are generally related to devices, resources, applications, or the like. Many different approaches to fault-tolerant computing are known in the art. Fault tolerance is the ability of a system to continue to perform its functions even when one or more components of the system have failed. Fault-tolerant computing is typically based on replication of components (i.e., redundancy) and on ensuring equivalent operation among the replicated components. Fault-tolerant systems are typically implemented by replicating hardware and/or software (generally referred to as resources), for example, by providing pairs of servers, one primary and one secondary. Such a redundant system is often referred to as a server cluster, clustered computer system, clustered environment, or the like. A server in a clustered environment is generally referred to as a node or cluster node. The failover of resources in the clustered system is handled by clustering software that is distributed among the cluster nodes.
In a clustered environment, a resource should be active (referred to as “online”) on only one of the cluster nodes. To remain aware of the resource state on all the cluster nodes, the clustering software periodically performs offline monitoring of the resources on the cluster nodes where such resources are supposed to be offline. If the clustering software finds a resource to be online when such resource should be offline (for example, due to an accidental or manual start of the resource by a user), the clustering software deactivates the resource (takes the resource offline). A resource that is online on more than one cluster node results in a “concurrency violation.”
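By way of illustration only, a single offline-monitoring pass of the kind described above may be sketched as follows. The sketch is a simplified assumption rather than the behavior of any particular clustering software: the names Node, owner, and offline_monitor_pass are hypothetical, and taking a resource offline is modeled merely as removing it from an in-memory set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node; `online` holds the resources running on it."""
    name: str
    online: set = field(default_factory=set)


def offline_monitor_pass(nodes, owner):
    """Run one monitoring pass over all nodes.

    `owner` maps each resource to the one node on which it is supposed
    to be online. Any resource found online on another node is treated
    as a concurrency violation and is taken offline there.
    Returns the list of (node name, resource) violations that were found.
    """
    violations = []
    for node in nodes:
        # sorted() copies the set, so it is safe to modify node.online below.
        for resource in sorted(node.online):
            if owner.get(resource) != node.name:
                violations.append((node.name, resource))
                node.online.discard(resource)  # take the resource offline
    return violations


# Example: "db" belongs on node1 but was also started manually on node2.
nodes = [Node("node1", {"db"}), Node("node2", {"db"})]
owner = {"db": "node1"}
found = offline_monitor_pass(nodes, owner)
```

In this example, the pass reports a single violation for the resource on node2 and leaves the resource online only on node1. Clustering software of the kind described would repeat such a pass at each monitoring interval.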
Conventionally, clustering software polls for concurrency violations at fixed intervals. Such an approach, however, delays the response to concurrency violations. For example, if a resource is accidentally started by the user without using the clustering software, then the clustering software may take a few minutes to detect, report, and act on the concurrency violation. During this time interval, there is a risk of data corruption on the cluster nodes because the resource is online concurrently on more than one node. Accordingly, there exists a need in the art for a method and apparatus for handling concurrency violations, for example, in a clustered environment.