1. Technical Field
This invention relates in general to automated management of cluster systems, and more particularly, to integrating automated node fencing into quorum services of a cluster infrastructure for providing automated failure and recovery services at the cluster infrastructure level and for reporting a consistent, reliable view of cluster node health to distributed applications.
2. Description of the Related Art
Computer clusters or cluster environments are groups of interconnected computing elements, or nodes, associated in such a way as to facilitate interoperability and management. The nodes in a cluster may work in tandem to provide better performance and higher availability than are typically available from a single computer. One or more nodes in a cluster may access one or more resources, and one or more nodes in a cluster may share a particular resource.
A cluster infrastructure may provide for organizing the nodes of a cluster into domains, where a quorum service manages a configuration or membership database indicating the role of each node in each domain as either active or on standby. The cluster infrastructure may provide quorum services for maintaining a membership status of each node in a domain. In addition, when a network failure partitions a cluster, in which nodes normally communicate directly with one another over dedicated network connections, into two sub-domains whose nodes cannot communicate across the partition, the quorum services may control which partition retains the quorum and is allowed to continue operating an application after the failure occurs. Distributed applications running atop a cluster infrastructure may request that the cluster infrastructure provide a health status of each node for use by the distributed applications in safe control and failover of shared resources; however, the cluster health status reported by quorum services alone merely indicates the membership status of a node within a quorum. Node quorum membership status alone may be insufficient to guarantee safe management of shared resources when partitions occur within a cluster environment, because the network failures may prevent cross-node communication between the partitions. As a result, a programmer may insert code into distributed applications to manage a network failure by attempting to block one or more nodes from accessing shared resources.
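By way of illustration only, the choice of which partition retains the quorum may be sketched as follows. The sketch assumes a strict-majority rule, which the description above does not specify, and the function name retains_quorum and the node names are hypothetical.

```python
from typing import Set


def retains_quorum(partition: Set[str], cluster: Set[str]) -> bool:
    """Return True if the partition holds a strict majority of cluster nodes.

    Assumption: quorum is decided by strict majority. A real quorum
    service may use a different rule (for example, a tie-breaker device).
    """
    members = partition & cluster  # ignore nodes not configured in the cluster
    return 2 * len(members) > len(cluster)


# Hypothetical five-node cluster split by a network failure.
cluster = {"n1", "n2", "n3", "n4", "n5"}
side_a = {"n1", "n2", "n3"}
side_b = {"n4", "n5"}

a_keeps_quorum = retains_quorum(side_a, cluster)
b_keeps_quorum = retains_quorum(side_b, cluster)
```

Under this assumed rule, at most one partition can retain the quorum, and an even split leaves neither partition with it. The result says only which nodes remain members; it does not by itself stop a node in the losing partition from touching a shared resource.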
In one example, the programmer may insert code to attempt to fence a node prior to processing a failover to prevent corrupting shared resources, where the node fencing may direct power or I/O controls to prevent one node from accessing a shared resource even when cross-node communication is not available. In particular, node fencing separates nodes which may have access to a shared resource from nodes which must not have access to the shared resource.
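The application-level pattern of fencing a node prior to processing a failover may be sketched, by way of illustration only, as follows. All names (SharedResource, fence_then_failover, fake_power_off) are hypothetical, and the fence callback merely stands in for whatever power or I/O control is actually available.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


class FencingError(RuntimeError):
    """Raised when a node could not be confirmed as fenced."""


@dataclass
class SharedResource:
    name: str
    allowed_nodes: Set[str] = field(default_factory=set)


def fence_then_failover(
    resource: SharedResource,
    failed_node: str,
    standby_node: str,
    fence: Callable[[str], bool],
) -> None:
    """Fence failed_node first; move access to standby_node only on success."""
    # The fence callback is an assumption: it returns True only when the
    # node is confirmed unable to reach the shared resource.
    if not fence(failed_node):
        raise FencingError(f"could not fence {failed_node}; failover aborted")
    resource.allowed_nodes.discard(failed_node)
    resource.allowed_nodes.add(standby_node)


def fake_power_off(node: str) -> bool:
    """Stand-in for a power or I/O control; always reports success."""
    return True


disk = SharedResource("shared-disk", {"node-a"})
fence_then_failover(disk, "node-a", "node-b", fake_power_off)
```

The ordering is the point of the sketch: access passes to the standby node only after the fence is confirmed, and an unconfirmed fence aborts the failover without changing which nodes may access the resource.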