1. Technical Field
The present invention relates in general to cluster system management and in particular to management of very large scale clusters. Still more particularly, the present invention relates to partially distributing cluster configuration information for managing a very large scale cluster.
2. Description of the Related Art
A cluster system, also referred to as a cluster multiprocessor system (CMP) or simply as a “cluster,” is a set of networked data processing systems with hardware and software shared among those data processing systems, typically but not necessarily configured to provide highly available and highly scalable application services. Cluster systems are frequently implemented to achieve high availability, an alternative to fault tolerance for mission-critical applications such as aircraft control and the like. Fault tolerant data processing systems rely on specialized hardware to detect hardware faults and switch to a redundant hardware component, regardless of whether the component is a processor, memory board, hard disk drive, adapter, power supply, etc. While providing seamless cutover and uninterrupted performance, fault tolerant systems are expensive, due to the redundant hardware requirement, and fail to address software errors, a more common source of data processing system failure.
High availability utilizes standard hardware, but provides software allowing resources to be shared system-wide. When a node, component, or application fails, an alternative path to the desired resource is quickly established. The brief interruption required to reestablish availability of the resource is acceptable in many situations. The hardware costs are significantly less than those of fault tolerant systems, and backup facilities may be utilized during normal operation. An example of the software utilized for these purposes is the HACMP (High Availability Cluster Multiprocessing) for AIX® (Advanced Interactive Executive) software available from International Business Machines Corporation of Armonk, N.Y. and the RS6000 SP software available from International Business Machines Corporation.
The cluster system management problem is a special class of the general system management problem, with additional resource dependency and management policy constraints. In particular, the maintenance of cluster configuration information required for system management poses a special problem. The cluster configuration information required for system management is typically stored in a database, which is either centralized or replicated to more than one data processing system for high availability. The data processing system which manages a centralized cluster configuration database becomes a potential bottleneck and a single point of failure.
To avoid the problems of a centralized cluster configuration database, the database may be replicated and maintained on a number of data processing systems within the cluster. In a small cluster, the system configuration and status information may be readily replicated to all data processing systems in the cluster for use by each data processing system in performing system management functions such as failure recovery and load balancing. Full replication provides a highly available cluster configuration database and performs adequately as long as the cluster size remains small (2 to 8 data processing systems). In a very large cluster, however, the costs associated with full replication are prohibitively high.
In order to keep a distributed database in a consistent state at all times, a two-phase commit protocol may be utilized. For a fully replicated database (i.e., every data processing system has a copy), 2N messages must be exchanged for each write operation, where N is the number of data processing systems in the cluster. Thus, while the size of a cluster configuration/status database grows linearly with respect to cluster size, access time to the database grows either linearly or logarithmically with respect to cluster size. Moreover, when bringing up a cluster, the number of events (and therefore the amount of status information which needs to be updated) grows linearly with respect to cluster size. Hence, the time or cost required to bring up a cluster with a fully replicated distributed cluster configuration database grows on the order of N^2. The complexity of cluster system management may thus be characterized as being on the order of N^2. For very large scale cluster systems (over 1,000 data processing systems), full replication of the cluster configuration database becomes unwieldy.
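The quadratic growth described above can be illustrated with a small cost model. The following is a minimal sketch, not a measurement: it takes the 2N-messages-per-write figure from the text and assumes, purely for illustration, one status update per data processing system during bring-up. The function names and the `events_per_node` parameter are hypothetical.

```python
def write_messages_full_replication(n: int) -> int:
    """Messages exchanged per write when all n systems hold a replica (2N figure)."""
    return 2 * n


def bringup_messages(n: int, events_per_node: int = 1) -> int:
    """Total messages to bring up an n-system cluster.

    Assumption: the number of status updates grows linearly with n
    (events_per_node updates per system), and each update is one
    fully replicated write.
    """
    updates = events_per_node * n
    return updates * write_messages_full_replication(n)


for n in (8, 1000):
    print(n, bringup_messages(n))
# prints:
# 8 128
# 1000 2000000
```

Under these assumptions, growing the cluster from 8 to 1,000 systems (a factor of 125) multiplies the bring-up message count by 15,625, which is 125 squared.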
Another critical issue in highly available cluster systems is how to handle network partitions. Network partitions occur if a cluster is divided into two or more parts, where data processing systems in one part cannot communicate with data processing systems in another part. When a network partition occurs, it is crucial not to run multiple copies of the same application, especially a database application such as the cluster configuration database, from these (temporarily) independent parts of the cluster. A standard way of handling this problem is to require that a cluster remain offline unless it reaches quorum. The definition of quorum varies. In some implementations, a majority quorum is employed and a portion of the cluster is said to have reached quorum when the number of active servers in that portion is at least N/2+1. A different scheme may require a smaller number of servers to be active to reach quorum as long as the system can guarantee that at most one portion of the cluster can reach quorum. In a very large scale cluster, the condition for quorum tends to be too restrictive. A majority quorum is used herein, although the invention is applicable to other forms of quorum.
Thus, when a network partition occurs, only the portion of the cluster (if any) which contains the majority of the data processing systems in the cluster may run applications. Stated differently, no services are provided by the cluster unless more than one half of the data processing systems within the cluster are online.
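The majority quorum rule may be sketched as follows. This is an illustrative example only: the function names are hypothetical, and N/2+1 is interpreted here with integer division, which is an assumption the text does not state explicitly.

```python
from typing import Optional, Sequence


def has_majority_quorum(active: int, total: int) -> bool:
    """True when a cluster portion with `active` systems out of `total` has quorum.

    Assumption: N/2+1 is evaluated with integer division.
    """
    return active >= total // 2 + 1


def portion_with_quorum(portion_sizes: Sequence[int], total: int) -> Optional[int]:
    """Return the index of the portion that reached quorum, or None.

    `total` is the configured cluster size N; systems that are down
    simply do not appear in any portion.
    """
    for index, size in enumerate(portion_sizes):
        if has_majority_quorum(size, total):
            return index  # at most one portion can hold a majority
    return None


print(portion_with_quorum([600, 400], total=1000))       # 0
print(portion_with_quorum([500, 500], total=1000))       # None
print(portion_with_quorum([400, 300, 300], total=1000))  # None
```

The last two cases show why the condition can be restrictive in a very large cluster: an even split, or a split into three or more portions, may leave every portion without quorum and therefore offline.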
It would be desirable, therefore, to provide a mechanism for maintaining a distributed database containing cluster configuration information without incurring the costs associated with full replication. It would further be advantageous for the mechanism to be scalable and applicable to clusters of any size, even those larger than 1,000 data processing systems. It would further be advantageous to permit cluster portions to continue providing services after a network partition even if a quorum has not been reached.
It is therefore one object of the present invention to provide an improved method and apparatus for cluster system management.
It is another object of the present invention to provide an improved method and apparatus for management of very large scale clusters.
It is yet another object of the present invention to provide a method and apparatus for partially distributing cluster configuration information for managing a very large scale cluster.
The foregoing objects are achieved as is now described. A cluster system is treated as a set of resource groups, each resource group including a highly available application and the resources upon which it depends. A resource group may have between 2 and M data processing systems, where M is small relative to the cluster size N of the total cluster. Configuration and status information for the resource group is fully replicated only on those data processing systems which are members of the resource group. A configuration object/database record for the resource group has an associated owner list identifying the data processing systems which are members of the resource group and which may therefore manage the application. A data processing system may belong to more than one resource group, however, and configuration and status information for the data processing system is replicated to each data processing system which could be affected by failure of the subject data processing system, that is, any data processing system which belongs to at least one resource group also containing the subject data processing system. The partial replication scheme of the present invention allows resource groups to run in parallel, reduces the cost of data replication and access, is highly scalable and applicable to very large clusters, and provides better performance after a catastrophe such as a network partition.
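The replication rule for an individual data processing system can be illustrated with a short sketch. The data layout below (a dictionary mapping a resource group name to its owner list) and all names are hypothetical and chosen only for illustration; they are not taken from the described embodiment.

```python
from collections import defaultdict
from typing import Dict, List, Set


def replication_targets(owner_lists: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each system to the set of systems holding its configuration and status.

    A system's information is replicated to every system that shares at
    least one resource group with it (the set includes the system itself).
    """
    targets: Dict[str, Set[str]] = defaultdict(set)
    for members in owner_lists.values():
        for node in members:
            targets[node].update(members)
    return dict(targets)


# Hypothetical cluster: three resource groups over five systems.
owner_lists = {
    "rg_a": ["n1", "n2"],
    "rg_b": ["n2", "n3"],
    "rg_c": ["n4", "n5"],
}
targets = replication_targets(owner_lists)
print(sorted(targets["n2"]))  # ['n1', 'n2', 'n3']
print(sorted(targets["n4"]))  # ['n4', 'n5']
```

In this example, information for n2 reaches three systems because n2 belongs to two resource groups, while information for n4 never leaves its own group. The number of replicas per record is bounded by group membership rather than by the cluster size N.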
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.