The present invention relates to computing systems of a type in which multiple processor units are arranged as a cluster of communicatively interconnected nodes, each node comprising one or more processor units. In particular, the invention relates to maintaining and distributing to each node configuration data identifying particular characteristics of the cluster and its elements in a fault tolerant manner.
In today's industry, there are certain computing environments, such as those of stock exchanges, banks, telecommunications companies, and other mission critical applications, that tolerate poorly even a momentary loss of computing facilities. For this reason such environments have, for many years, relied on fault tolerant and highly available computer systems. The architectures of such systems range from simple hot-standby arrangements (i.e., a back-up computer system stands ready to take over the tasks of a primary computer system should it fail) to complex architectures which employ dedicated (and replicated) portions of the computing hardware. These latter systems may be most effective in providing continuous availability, since they have been designed with the goal of surviving any single point of hardware failure, but they carry a price premium due to the increased cost of component replication. But, even with component replication, the architecture is still susceptible to a single point of failure: the operating system. One approach to the problem of a single operating system is to employ a distributed operating system.
Distributed operating systems allow collections of independent machines, referred to as nodes, to be connected by a communication interconnect, forming a "cluster" which can operate as a single system or as a collection of independent processing resources. Fault tolerance can be provided by combining hardware fault detection with the distribution of the operating system in the cluster. High availability is achieved by distributing the system services and providing for takeover of a failed node by a backup node. With this approach, the system as a whole can still function even with the loss of one or more of the nodes that make up the cluster. Therefore, the operating system will no longer be a single point of failure. Since the operating system is providing the high availability and fault tolerance, it is no longer necessary to incorporate replicated hardware components to the extent previously used, although their use is not precluded. This can alleviate the price premium of fault tolerant hardware.
Recently, the clustering concept has been extended to computing architectures in which groups of individual processor units form the nodes of the cluster. This approach allows each node, having two or more processor units, to operate as a symmetric multiprocessing (SMP) system capable of exploiting the power of multiple processor units through distribution of the operating system, and thereby to balance the system load of the SMP node. In addition, it may be possible for an SMP-configured node to reduce downtime because the operating system of the node can continue to run on the remaining processors in the event of failure of one processor.
However, in order to employ multiple SMP nodes in a cluster, and have them able to operate efficiently as a single processing environment, there should be available configuration data that provides a description of the cluster. That description will provide, for example, information such as how many nodes make up the cluster, the composition of each node, the address of each processor unit of a node, the processes running on or available to the node(s), the users of the cluster and their preferences, and the like. Further, this configuration data should remain consistent, accurate, and continuously updated across the cluster members, and this requirement introduces areas of attack on the fault tolerant and high availability aspects of the cluster. Improper retention and/or distribution of the configuration data can leave it vulnerable to corruption by viruses, hackers, or even inadvertent, but well-meaning, corruption by a system administrator who makes an erroneous change. In addition, the configuration data should remain consistent across all nodes to allow all cluster members to agree, e.g., as to which nodes (and the processor units they contain) are located where. Changes to the configuration data used by one node should also be made to the configuration data of the other nodes. Thus, distribution of such changes must be resistant to faults.
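By way of illustration only, the kind of configuration data described above, and the requirement that every node hold the same copy of it, might be sketched as follows. The names used here (Node, ClusterConfig, apply_to_all, replicas_consistent), the version counter, and the digest comparison are all hypothetical and are assumed for the purpose of the example; they are not taken from, and do not describe, the invention itself.

```python
import copy
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical description of one cluster node."""
    node_id: str
    processor_addresses: List[str]              # address of each processor unit
    processes: List[str] = field(default_factory=list)


@dataclass
class ClusterConfig:
    """Hypothetical per-node copy of the cluster configuration data."""
    version: int = 0                            # incremented on every change
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        """Record a new node and advance the version counter."""
        self.nodes[node.node_id] = node
        self.version += 1

    def digest(self) -> str:
        """Hash of the canonical JSON form, used to compare copies."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def apply_to_all(replicas: List[ClusterConfig], node: Node) -> None:
    """Make the same change to every copy, giving each its own Node object."""
    for replica in replicas:
        replica.add_node(copy.deepcopy(node))


def replicas_consistent(replicas: List[ClusterConfig]) -> bool:
    """True when every copy of the configuration data has the same digest."""
    return len({replica.digest() for replica in replicas}) <= 1


if __name__ == "__main__":
    replicas = [ClusterConfig() for _ in range(3)]
    apply_to_all(replicas, Node("node-a", ["cpu-0", "cpu-1"]))
    print(replicas_consistent(replicas))

    # A change made to only one copy leaves the cluster members in disagreement.
    replicas[0].add_node(Node("node-b", ["cpu-0"]))
    print(replicas_consistent(replicas))
```

The sketch only detects disagreement between copies; it does not address how a change is distributed when a node fails part-way through the update, which is the fault-resistance concern raised above.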
As reliance on computer systems continues to permeate our society, and as more services move on-line, twenty-four by seven operation and accessibility will become critical. Therefore, fault tolerance and high availability are, and will continue to be, exceedingly important. Being able to offer the same level of fault tolerance and high availability in software via clustering, as can be achieved with fault tolerant hardware, will be very attractive. Highly available, fault tolerant, and scalable systems can then be created from commodity components and still achieve the same level of reliability and performance as much more costly dedicated fault tolerant (FT) hardware.