1. Technical Field
The present invention relates generally to a distributed data processing system and in particular to a method and apparatus for managing a server system within a distributed data processing system. Still more particularly, the present invention relates to a method and apparatus for managing a clustered computer system.
2. Description of Related Art
A clustered computer system is a type of parallel or distributed system that consists of a collection of interconnected whole computers and is used as a single, unified computing resource. The term "whole computer" in the above definition is meant to indicate the normal combination of elements making up a stand-alone, usable computer: one or more processors, an acceptable amount of memory, input/output facilities, and an operating system. Another distinction between clusters and traditional distributed systems concerns the relationship between the parts. Modern distributed systems use an underlying communication layer that is peer-to-peer. There is no intrinsic hierarchy or other structure, just a flat list of communicating entities. At a higher level of abstraction, however, they are popularly organized into a client-server paradigm. This results in a valuable reduction in system complexity. Clusters typically have a peer-to-peer relationship.
Three technical trends explain the popularity of clustering. First, microprocessors are increasingly fast. The faster microprocessors become, the less important massively parallel systems become. It is no longer necessary to use super-computers or aggregations of thousands of microprocessors to achieve suitably fast results. A second trend that has increased the popularity of clustered computer systems is the increase in high-speed communications between computers. The introduction of such standardized communication facilities as Fibre Channel Standard (FCS), Asynchronous Transfer Mode (ATM), the Scalable Coherent Interface (SCI), and switched Gigabit Ethernet is raising inter-computer bandwidth from 10 Mbits/second through hundreds of Mbytes/second and even Gigabytes per second. Finally, standard tools have been developed for distributed computing. The requirements of distributed computing have produced a collection of software tools that can be adapted to managing clusters of machines. Some, such as the Internet communication protocol suite (called TCP/IP and UDP/IP), are so common as to be ubiquitous de facto standards. High-level facilities built on that base, such as intranets, the Internet, and the World Wide Web, are similarly becoming ubiquitous. In addition, other tool sets for multisystem administration have become common. Together, these are an effective base to tap into for creating cluster software.
In addition to these three technological trends, there is a growing market for computer clusters. In essence, the market is asking for highly reliable computing. Another way of stating this is that the computer networks must have "high availability." For example, if the computer is used to host a web-site, its usage is not necessarily limited to normal business hours. In other words, the computer may be accessed around the clock, for every day of the year. There is no safe time to shut down to do repairs. Instead, a clustered computer system is useful because if one computer in the cluster shuts down, the others in the cluster automatically assume its responsibilities until it can be repaired. There is no down-time exhibited or detected by users.
Businesses need "high availability" for other reasons as well. For example, business-to-business intranet use involves connecting businesses to subcontractors or vendors. If the intranet's file servers go down, work by multiple companies is strongly affected. If a business has a mobile workforce, that workforce must be able to connect with the office to download information and messages. If the office's server goes down, the effectiveness of that work force is diminished.
A computer system is highly available when no replaceable piece is a single point of failure, and overall, it is sufficiently reliable that one can repair a broken part before something else breaks. The basic technique used in clusters to achieve high availability is failover. The concept is simple enough: one computer (A) watches over another computer (B); if B dies, A takes over B's work. Thus, failover involves moving "resources" from one node to another. A node is another term for a computer. Many different kinds of things are potentially involved: physical disk ownership, logical disk volumes, IP addresses, application processes, subsystems, print queues, collections of cluster-wide locks in a shared-data system, and so on.
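The failover concept described above can be illustrated with a minimal sketch. The `Node` class, the `fail_over` function, and the resource names are hypothetical and chosen only for illustration; they do not represent any particular cluster product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node that owns a set of named resources."""
    name: str
    alive: bool = True
    resources: set = field(default_factory=set)


def fail_over(watcher, watched):
    """If the watched node has died, move all of its resources to the watcher.

    Returns the sorted list of resources that were moved.
    """
    if watched.alive:
        return []
    moved = sorted(watched.resources)
    watcher.resources |= watched.resources
    watched.resources.clear()
    return moved


# Computer A watches computer B; B dies, so A takes over B's work.
node_a = Node("A", resources={"ip:10.0.0.1"})
node_b = Node("B", resources={"disk:D", "app:db"})
node_b.alive = False
moved = fail_over(node_a, node_b)
```

In this sketch the move is a simple set transfer; a real system would also have to bring each resource on-line on the new node in a suitable order.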
Resources depend on one another. The relationship matters because, for example, it will not help to move an application to one node when the data it uses is moved to another. Actually, it will not even help to move them both to the same node if the application is started before the necessary disk volumes are mounted. In modern cluster systems such as IBM HACMP and Microsoft "Wolfpack", the resource relationship information is maintained in a cluster-wide data file. Resources that depend upon one another are organized as a resource group and are stored as a hierarchy in that data file. A resource group is the basic unit of a failover.
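The ordering constraint described above (disk volumes mounted before the application starts) amounts to a topological ordering of the dependency hierarchy. The following sketch assumes a hypothetical resource group expressed as a Python dictionary and uses the standard-library `graphlib` module (Python 3.9 or later); it is not the data-file format used by HACMP or MSCS.

```python
from graphlib import TopologicalSorter

# Hypothetical resource group: each resource maps to the resources it depends on.
resource_group = {
    "physical_disk": set(),
    "logical_volume": {"physical_disk"},
    "ip_address": set(),
    "application": {"logical_volume", "ip_address"},
}


def online_order(group):
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(group).static_order())


def offline_order(group):
    """Take resources off-line in the reverse of the on-line order."""
    return list(reversed(online_order(group)))


bring_up = online_order(resource_group)
take_down = offline_order(resource_group)
```

Because the whole group is ordered together, moving the group as a unit preserves the dependencies on whichever node receives it.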
With reference now to the figures, and in particular with reference to FIG. 1, a pictorial representation of a distributed data processing system in which the present invention may be implemented is depicted.
Distributed data processing system 100 is a network of computers in which the present invention may be implemented. Distributed data processing system 100 contains one or more public networks 101, which are the medium used to provide communications links between various devices, client computers, and server computers connected within distributed data processing system 100. Public network 101 may include permanent connections, such as Token Ring, Ethernet, 100 Mb Ethernet, Gigabit Ethernet, FDDI ring, ATM, and high speed switch, or temporary connections made through telephone connections. Client computers 130 and 131 communicate with server computers 110, 111, 112, and 113 via public network 101.
Distributed data processing system 100 optionally has its own private communications networks 102. Communications on network 102 can be done through a number of means: standard networks just as in public network 101, shared memory, shared disks, or anything else. In the depicted example, a number of servers 110, 111, 112, and 113 are connected both through the public network 101 and through private networks 102. Those servers make use of the private network 102 to reduce the communication overhead resulting from heartbeating each other and running membership and n-phase commit protocols.
In the depicted example, all servers are connected to a shared disk storage device 124, preferably a RAID device for better reliability, which is used to store user application data. Data are made highly available in that when a server fails, the shared disk partition and logical disk volume can be failed over to another node so that data will continue to be available. The shared disk interconnection can be SCSI bus, Fibre Channel, or IBM SSA. Alternatively, each server machine can also have a local data storage device 120, 121, 122, and 123. FIG. 1 is intended as an example, and not as an architectural limitation for the processes of the present invention.
Referring to FIG. 2a, Microsoft's first commercially available clustering product, the Microsoft Cluster Server (MSCS) 200, code-named "Wolfpack", is designed to provide high availability for NT Server-based applications. The initial MSCS supports failover capability in a two-node 202, 204, shared disk 208 cluster.
Each MSCS cluster consists of one or two nodes. Each node runs its own copy of Microsoft Cluster Server. Each node also has one or more Resource Monitors that interact with the Cluster Service. These monitors keep the Cluster Service "informed" as to the status of individual resources. If necessary, the Resource Monitor can manipulate individual resources through the use of Resource DLLs. When a resource fails, Cluster Server will either restart it on the local node or move the resource group to the other node, depending on the resource restart policy, the resource group failover policy, and the cluster status.
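The restart-or-move decision described above can be sketched as a small policy function. The parameter names, the restart threshold, and the returned action strings are assumptions made for illustration; the text does not specify how MSCS encodes its restart and failover policies.

```python
def on_resource_failure(restart_count, max_restarts, group_may_fail_over, peer_is_up):
    """Choose an action for a failed resource.

    All four parameters are hypothetical stand-ins for the resource restart
    policy, the resource group failover policy, and the cluster status.
    """
    if restart_count < max_restarts:
        return "restart_on_local_node"
    if group_may_fail_over and peer_is_up:
        return "move_group_to_other_node"
    return "leave_offline"


first_failure = on_resource_failure(0, 3, True, True)
restarts_exhausted = on_resource_failure(3, 3, True, True)
no_peer_available = on_resource_failure(3, 3, True, False)
```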
The two nodes in an MSCS cluster heartbeat 206 each other. When one node fails, i.e., fails to send a heartbeat signal to the other node, all its resource groups will be restarted on the remaining node. When a cluster node is booted, the cluster services are automatically started under the control of the event processor. In addition to its normal role of dispatching events to other components, the event processor performs initialization and then tells the node manager, also called the membership manager, to join or create the cluster.
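Heartbeat-based failure detection of the kind described above is commonly implemented with a timeout on the last heartbeat received. The sketch below assumes a hypothetical `HeartbeatMonitor` class and an arbitrary five-second timeout; timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node, now):
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("node_202", now=0.0)
monitor.record_heartbeat("node_204", now=0.0)
monitor.record_heartbeat("node_202", now=4.0)  # node_204 has gone silent
suspects = monitor.failed_nodes(now=6.0)
```

A node that appears in `suspects` would be treated as failed, and its resource groups would be restarted on the remaining node.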
The node manager's normal job is to create a consistent view of the state of cluster membership, using heartbeat exchange with the other node managers. It knows who they are from information kept in its copy of the cluster configuration database, which is actually part of the Windows NT registry (but updated differently, as described below). The node manager initially attempts to contact the other node; if it succeeds, it tries to join the cluster, providing authentication (password, cluster name, its own identification, and so on). If there is an existing cluster and for some reason the new node's attempt to join is rebuffed, then the node and the cluster services located on that node will shut down.
However, if nobody responds to a node's requests to join up, the node manager tries to start up a new cluster. To do that, it uses a special resource, specified like all resources in the configuration database, called the quorum resource. There is exactly one quorum resource in every cluster. It is typically a disk; if it is, it is highly preferable to have it mirrored or otherwise fault tolerant, as well as multi-ported with redundant adapter attachments, since otherwise it will be a single point of failure for the cluster. The device used as a quorum resource can be anything with three properties: it can store data durably (across failures); the other cluster node can get at it; and it can be seized by one node to the exclusion of all others. SCSI and other disk protocols like SSA and FC-AL allow for exactly this operation.
The quorum resource is effectively a global control lock for the cluster. The node that successfully seizes the quorum resource uniquely defines the cluster. The other node must join with that one to become part of the cluster. This addresses the problem of a partitioned cluster: it is possible for internal cluster communication to fail in a way that breaks the cluster into two parts that cannot communicate with each other. The node that controls the quorum resource is the cluster, and there is no other cluster.
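The join-or-form decision and the role of the quorum resource as a global lock can be sketched as follows. The `QuorumResource` class stands in for a device that one node can seize to the exclusion of all others; here it is modelled with an in-process `threading.Lock`, which is an assumption made purely for illustration (a real quorum device would rely on something like a disk reservation). The function name and return strings are likewise hypothetical.

```python
import threading


class QuorumResource:
    """Stand-in for a device that exactly one node can seize at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_seize(self, node):
        """Attempt a non-blocking, exclusive seizure; return True on success."""
        if self._lock.acquire(blocking=False):
            self.owner = node
            return True
        return False


def start_node(node, peer_responded, quorum):
    """Decide whether a booting node joins, forms, or must defer."""
    if peer_responded:
        return "join_existing_cluster"
    if quorum.try_seize(node):
        return "form_new_cluster"
    # Another node already holds the quorum resource, so that node is the
    # cluster; this node may only join it, never form a second cluster.
    return "must_join_quorum_owner"


# Partitioned case: neither node can reach the other, both try to form.
quorum = QuorumResource()
first = start_node("node_202", peer_responded=False, quorum=quorum)
second = start_node("node_204", peer_responded=False, quorum=quorum)
```

Only one of the two partitioned nodes can obtain the lock, which is why a partition cannot produce two independent clusters.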
Once a node joins or forms a cluster, the next thing it does is update its configuration database to reflect any changes that were made while it was away. The configuration database manager can do this because, of course, changes to that database must follow transactional semantics consistently across all the nodes and, in this case, that involves keeping a log of all changes stored on the quorum device. After processing the quorum resource's log, the new node starts to acquire resources. These can be disks, IP names, network names, applications, or anything else that can be either off-line or on-line. They are all listed in the configuration database, along with the nodes they would prefer to run on, the nodes they can run on (some may not connect to the right disks or networks), their relationship to each other, and everything else about them. Resources are typically formed into and managed as resource groups. For example, an IP address, a file share (sharable unit of a file system), and a logical volume might be the key elements of a resource group that provides a network file system to clients. Dependencies are tracked, and no resource can be part of more than one resource group, so sharing of resources by two applications is prohibited unless those two applications are in the same resource group.
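Bringing a stale configuration database up to date from a change log can be sketched as a replay of the entries the node has not yet applied. The log format used here (a sequence number, a key, and a value, with `None` meaning deletion) is an assumption made for illustration; the text does not describe the actual log layout on the quorum device.

```python
def replay_log(config, log, last_applied):
    """Apply every log entry newer than `last_applied` to `config`.

    `log` holds hypothetical (sequence, key, value) tuples; a value of None
    deletes the key. Returns the sequence number of the last entry applied.
    """
    for sequence, key, value in sorted(log, key=lambda entry: entry[0]):
        if sequence <= last_applied:
            continue  # this node already saw the change before it went away
        if value is None:
            config.pop(key, None)
        else:
            config[key] = value
        last_applied = sequence
    return last_applied


# The returning node last saw entry 1; entries 2 to 4 happened while it was away.
local_config = {"group1.owner": "node_202"}
quorum_log = [
    (1, "group1.owner", "node_202"),
    (2, "group1.owner", "node_204"),
    (3, "share.tmp", "created"),
    (4, "share.tmp", None),
]
high_water_mark = replay_log(local_config, quorum_log, last_applied=1)
```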
The new node's failover manager is called upon to figure out what resources should move (fail over) to the new node. It does this by negotiating with the other node's failover manager, using information like the resource's preferred nodes. When they have come to a collective decision, any resource groups that should move to this node from the other node are taken off-line on that node; when that is finished, the Resource Manager begins bringing them on-line on the new node.
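One simple way to picture the outcome of this negotiation is a deterministic rule that places each resource group on its most preferred node that is both on-line and able to host it. The sketch below is such a rule under stated assumptions; the function, the group names, and the data layout are hypothetical, and the actual negotiation protocol between failover managers is not described in the text.

```python
def choose_owner(preferred_nodes, possible_nodes, online_nodes):
    """Pick the first preferred node that is on-line and able to host the group.

    Falls back to any on-line possible node (in name order), else None.
    """
    for node in preferred_nodes:
        if node in possible_nodes and node in online_nodes:
            return node
    for node in sorted(possible_nodes):
        if node in online_nodes:
            return node
    return None


# Hypothetical configuration database entries for two resource groups.
groups = {
    "file_share_group": {
        "preferred": ["new_node", "old_node"],
        "possible": {"new_node", "old_node"},
    },
    "print_group": {
        "preferred": ["old_node"],
        "possible": {"new_node", "old_node"},
    },
}
online = {"new_node", "old_node"}

placement = {
    name: choose_owner(entry["preferred"], entry["possible"], online)
    for name, entry in groups.items()
}
# Groups that should be taken off-line on the old node and moved.
groups_to_move = sorted(name for name, owner in placement.items() if owner == "new_node")
```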
Every major vendor of database software has a version of their database that operates across multiple NT Servers. IBM DB2 Extended Enterprise Edition runs on 32 nodes. IBM PC Company has shipped a 6-node PC Server system that runs Oracle Parallel Servers. There is no adequate system clustering software for those larger clusters.
In a 6-node Oracle Parallel Servers system, those six nodes share the common disk storage. Oracle uses its own clustering features to manage resources and to perform load balancing and failure recovery. Customers that run their own application software on those clusters need system clustering features to make their applications highly available.
Referring to FIG. 2b, DB2 typically uses a shared-nothing architecture 210 where each node 212 has its own data storage 214. Databases are partitioned and database requests are distributed to all nodes for parallel processing. To be highly available, DB2 uses failover functionality from system clustering. Since MSCS supports only two nodes, DB2 must either allocate a standby node 216 for each node 212, as shown, or allow mutual failover between each pair of MSCS nodes, as shown in FIG. 2c. In the latter arrangement, two nodes 212, 212a are mutually coupled to two data storages 214, 214a. The standby arrangement doubles the cost of a system, and the mutual-failover arrangement suffers performance degradation when a node fails. Because database access is distributed to all nodes and is processed in parallel, the node that runs both its own DB2 instance and the failed-over instance becomes the performance bottleneck. In other words, if node 212a fails, then node 212 assumes its responsibilities and accesses data on both data storages, running the tasks of both instances in parallel.
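The trade-off between the two arrangements can be made concrete with a small calculation. The sketch assumes four database partitions, equal work per DB2 instance, and a parallel request that completes only when the busiest node finishes; these are simplifying assumptions for illustration, not measurements.

```python
def parallel_request_time(instances_per_node, time_per_instance=1.0):
    """A parallel request finishes when the busiest node finishes."""
    return max(instances_per_node) * time_per_instance


partitions = 4

# Dedicated standby per node: twice the machines, no slowdown after a failure.
standby_machines = 2 * partitions
standby_layout_after_failure = [1, 1, 1, 1]  # a standby replaced the failed node

# Mutual failover in pairs: no extra machines, but the surviving partner
# of the failed node now runs two instances.
mutual_machines = partitions
mutual_layout_after_failure = [2, 1, 1]

standby_time = parallel_request_time(standby_layout_after_failure)
mutual_time = parallel_request_time(mutual_layout_after_failure)
```

Under these assumptions the standby arrangement needs eight machines for four partitions, while the mutual-failover arrangement needs four but takes roughly twice as long per parallel request after a single failure.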
Therefore, it would be advantageous to have an improved method and apparatus for managing a cluster computer system. Such an improvement should allow support of a failover from one node to another node chosen from a group of many nodes.
The present invention provides a method and apparatus for managing clustered computer systems and extends small cluster systems to very large clusters. The present invention extends cluster manager functionality to manage the larger cluster but otherwise preserves its ease-of-use characteristics. When discussed in this application, a "multi-cluster" or "IBMCS cluster" refers to a cluster of more than one other cluster or node. In one embodiment, a multi-cluster is a cluster of one or more MSCS clusters where the MSCS clusters can consist of one or more nodes.
The system clustering product extends small clusters to multi-clusters of two or more nodes. Further, the present cluster system supports resource group failover between any two nodes in a larger cluster of two or more nodes. The present system also preserves the application state information across the entire cluster in the case of failure events. Also, the present system does not change the implementation of small clustering systems and does not require application vendors to make any modification to their present clustering code in order to run in this system's environment. Instead, the present system provides an implementation of the cluster API DLL that is binary compatible with the existing cluster API DLL.
A multi-cluster normally contains more than one pair of small clusters. The multi-cluster manager can configure a multi-cluster and the multiple clusters within. Resources in a multi-cluster are managed by each individual cluster under the supervision of Cluster Services. There is no need to modify the resource API and the cluster administrator extension API. The multi-cluster manager can use any cluster administrator extension DLL that is developed for the individual cluster as it is without modification.
Applications, whether they are enhanced for an individual cluster or not, can readily take advantage of multi-cluster system clustering features. Instead of mutual failover between one pair of nodes, the multi-cluster allows an application failover between any two nodes in a large cluster. The present invention allows a cluster to grow in size by adding an individual cluster either with a pair of nodes or a single node. The fact that the present invention can support a three-node cluster is very attractive to many customers who want to further improve availability of their mission-critical applications over a two-node cluster.
Applications, such as DB2 Extended Enterprise Edition, that use clusters can readily take advantage of multi-cluster system clustering features. DB2/EEE exploits MSCS features by dividing nodes into pairs and allowing mutual failover between each pair of nodes, as discussed above in reference to FIG. 2c. The present invention can either improve DB2 availability by supporting N-way failover or improve DB2 performance characteristics by supporting an N+1 model with one standby node. In the most common event of a single node failure, the DB2/EEE instance on the failed node will be restarted on the standby node and maintain the same performance in the N+1 mode. System management policy and recovery services are expressed in a high-level language that can be modified easily to tailor to special requirements from application vendors. For example, this allows DB2/EEE to be integrated with a multi-cluster better than with an individual cluster.
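The difference between N-way failover and the N+1 model can be sketched with a target-selection rule that restarts the failed instance on the least-loaded surviving node. The rule, the function name, and the node names are assumptions made for illustration; the text does not state how the failover target is actually chosen.

```python
def pick_failover_target(instances_per_node, failed_node):
    """Return the surviving node running the fewest instances.

    Ties are broken by node name so the result is deterministic.
    """
    survivors = {
        node: count
        for node, count in instances_per_node.items()
        if node != failed_node
    }
    if not survivors:
        return None
    return min(sorted(survivors), key=lambda node: survivors[node])


# N+1 model: three active nodes plus one idle standby.
n_plus_one = {"node1": 1, "node2": 1, "node3": 1, "standby": 0}
n_plus_one_target = pick_failover_target(n_plus_one, failed_node="node2")

# N-way failover without a standby: some active node must take the instance.
n_way = {"node1": 1, "node2": 1, "node3": 1}
n_way_target = pick_failover_target(n_way, failed_node="node2")
```

With a standby present, the failed instance lands on the idle node and no surviving node runs more than one instance, which is consistent with the performance being maintained after a single failure in the N+1 mode.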