“Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a user, the nodes in a cluster appear collectively as a single computer, or entity.
Clustering is often used in relatively large multi-user computer systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Load balancing is often also used to ensure that tasks are distributed fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
Clusters typically handle computer tasks through the performance of “jobs” or “processes” within individual nodes. In some instances, jobs being performed by different nodes cooperate with one another to handle a computer task. Such cooperative jobs are typically capable of communicating with one another, and are typically managed in a cluster using a logical entity known as a “group.” A group is typically assigned some form of identifier, and each job in the group is tagged with that identifier to indicate its membership in the group. Many cluster management operations are also handled through the use of a group of cooperative jobs, often referred to as a cluster control group.
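By way of illustration only, the tagging of jobs with a group identifier might be sketched as follows. The `Job` record, the `members_of` helper, and the identifier `"CCG"` for a cluster control group are hypothetical names chosen for this sketch and are not drawn from any particular clustering system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    """Hypothetical record for a job running on a node."""
    node: str
    name: str
    group_id: Optional[str] = None  # tag indicating group membership, if any


def members_of(jobs, group_id):
    """Return the jobs tagged with the given group identifier."""
    return [job for job in jobs if job.group_id == group_id]


def demo_groups():
    jobs = [
        Job("node-1", "ccg-agent", "CCG"),   # assumed cluster control group member
        Job("node-2", "ccg-agent", "CCG"),
        Job("node-1", "db-server", "DB"),
        Job("node-2", "batch-report"),       # not a member of any group
    ]
    return [job.node for job in members_of(jobs, "CCG")]
```

In this sketch, membership is determined solely by the tag carried on each job, so jobs on different nodes are treated as one logical group whenever they share an identifier.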
Member jobs in a group typically communicate with one another using an ordered message-based scheme, where the specific ordering of messages sent between group members is maintained so that every member sees messages sent by other members in the same order as every other member, thus ensuring synchronization between nodes. Requests for operations to be performed by the members of a group are often referred to as “protocols,” and it is typically through the use of one or more protocols that tasks are cooperatively performed by the members of a group.
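The ordering property described above may be illustrated with a minimal sketch. The discussion above does not state how ordering is maintained; the sketch assumes one common approach, a central sequencer that stamps each message with a global sequence number, and the names `Sequencer`, `Member`, and `demo_ordered_delivery` are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


class Sequencer:
    """Assumed central sequencer: stamps each message with a global number."""

    def __init__(self):
        self._counter = count()

    def stamp(self, payload):
        return next(self._counter), payload


@dataclass
class Member:
    """Group member that delivers messages strictly in sequence order."""
    name: str
    next_seq: int = 0
    pending: dict = field(default_factory=dict)    # seq -> payload, held back
    delivered: list = field(default_factory=list)  # payloads in delivery order

    def receive(self, seq, payload):
        self.pending[seq] = payload
        # Deliver every held message that is now contiguous with prior deliveries.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


def demo_ordered_delivery():
    sequencer = Sequencer()
    stamped = [sequencer.stamp(p) for p in ("join", "update", "leave")]
    a, b = Member("A"), Member("B")
    for seq, payload in stamped:            # A receives in send order
        a.receive(seq, payload)
    for seq, payload in reversed(stamped):  # B receives in reverse order
        b.receive(seq, payload)
    return a.delivered, b.delivered
```

Although member B receives the messages in reverse order, it holds each one until its predecessors have arrived, so both members deliver the same sequence.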
Clustered computer systems place a high premium on maximizing system availability. As such, automated error detection and recovery are extremely desirable attributes in such systems. One potential source of errors is the loss of resources used in the management and operation of a clustered computer system, e.g., memory address ranges and input/output (I/O) devices.
Especially for system-critical applications that demand high availability, managing resources, and in particular recovering lost resources, can substantially improve the reliability of an application that uses those resources. In some situations, resources are transferred between nodes and other entities in a cluster, and it is often during these transfers that the risk of losing a resource is greatest. To avoid the exposure of two entities owning the same resource, a clustered computer system typically requires that the entity giving up the resource release ownership before the resource is transferred. Therefore, if a failure occurs after the giving entity releases ownership but before the receiving entity takes ownership, the resource may be lost.
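The window in which a resource may be lost can be illustrated with a minimal sketch. The `OwnershipTable` class, the `transfer` function, and the `fail_between` flag used to simulate a failure are all hypothetical; they model only the release-before-take rule described above, not any actual cluster implementation.

```python
class TransferFailure(Exception):
    """Raised to simulate a failure in the middle of a transfer."""


class OwnershipTable:
    """Hypothetical record of which entity owns each resource."""

    def __init__(self):
        self._owner = {}  # resource -> owning entity, or None if unowned

    def assign(self, resource, entity):
        self._owner[resource] = entity

    def release(self, resource, entity):
        if self._owner.get(resource) != entity:
            raise ValueError(f"{entity} does not own {resource}")
        self._owner[resource] = None

    def take(self, resource, entity):
        if resource not in self._owner:
            raise KeyError(resource)
        if self._owner[resource] is not None:
            raise ValueError(f"{resource} is already owned")
        self._owner[resource] = entity

    def owner_of(self, resource):
        return self._owner[resource]

    def unowned(self):
        return sorted(r for r, owner in self._owner.items() if owner is None)


def transfer(table, resource, giver, taker, fail_between=False):
    """Release first, then take, so two entities never own the resource at once."""
    table.release(resource, giver)
    if fail_between:
        # Simulated failure inside the window: no entity owns the resource.
        raise TransferFailure(resource)
    table.take(resource, taker)


def demo_lost_resource():
    table = OwnershipTable()
    table.assign("addr-range-3", "node-A")
    table.assign("io-device-7", "node-A")
    transfer(table, "addr-range-3", "node-A", "node-B")
    try:
        transfer(table, "io-device-7", "node-A", "node-B", fail_between=True)
    except TransferFailure:
        pass
    return table.owner_of("addr-range-3"), table.unowned()
```

In this sketch the first transfer completes normally, while the second fails after the release and leaves the resource with no owner, which is the lost-resource condition described above.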
It would be highly desirable in many clustered computer systems to be able to recover lost resources so that such resources can be used by other entities. However, conventional systems have not provided any reliable manner of recovering lost resources. In addition, other types of resource-related actions, e.g., transferring resources between entities or types of entities, may also present similar risks. For example, for resources such as virtual address ranges, it may be desirable to shift resources between different entities, or between different types of entities.
Therefore, a significant need exists in the art for a manner of performing resource actions on the resources in a clustered computer system, and in particular, a manner of effectively managing resources with reduced risk of resource conflicts and other potential errors.