The present invention relates generally to distributed computing systems, and specifically to partitioning of clusters used in distributed computing applications.
Computer clusters are widely used to enable high availability of computing resources, coupled with the possibility of horizontal growth, at reduced cost by comparison with collections of independent systems. Clustering is also useful in disaster recovery. A wide range of clustering solutions are currently available, including 390 Sysplex, RS/6000 SP, HACMP, PC Netfinity and AS/400 Cluster, all offered by IBM Corporation, as well as Tandem Himalaya, Hewlett-Packard Mission Critical Server, Compaq TruCluster, Microsoft MSCS, NCR LifeKeeper and Sun Microsystems Project Cascade. An AS/400 Cluster, for example, supports up to 128 computing nodes, connected via any Internet Protocol (IP) network. A developer of a software application can define and use groups of physical or logical computing entities (such as files, devices or processes) to run the application within the cluster environment.
Cluster applications must generally maintain consistency among all of the entities participating in the application. When a failure occurs in a cluster environment, however, the failure may result in the cluster being divided into two or more disconnected partitions. If all of these disconnected partitions continue running the application, inconsistencies may arise, for example, inconsistencies in a database that is replicated and updated by different cluster entities. These inconsistencies may be impossible to resolve when the partitions are again merged after recovery from the failure. For this reason, cluster applications typically allow only one partition to run. The partition that is selected to run is known as the primary partition or primary component. All other partitions are blocked from proceeding with the application. Following recovery from the failure, the entities in these other partitions are merged back with the primary partition and are again available to the application.
Distributed group communication systems (GCSs) enable applications to exchange messages within groups of cluster entities in a reliable, ordered manner. For example, the OS/400 operating system kernel for the above-mentioned AS/400 Cluster includes a GCS in the form of middleware for use by cluster applications. This GCS is described in an article by Goft et al., entitled "The AS/400 Cluster Engine: A Case Study," presented at the International Group Communications Conference IGCC 99 (Aizu, Japan, 1999), which is incorporated herein by reference. The GCS ensures that if a message addressed to the entire group is delivered to one of the group members, the message will also be delivered to all other live and connected members of the group, so that group members can act upon received messages and remain consistent with one another. The GCS also informs the application of the identities of the current connected set of members in the group.
"Ensemble" is a GCS that was developed at Cornell University, as were its predecessors, "ISIS" and "Horus." Ensemble is described in the "Ensemble Reference Manual," by Hayden (Cornell University, 1997), and in an article entitled "High Performance Replicated Distributed Objects in a Partitionable Environment," by Friedman et al. (Technical Report 97-1639, Computer Science, Cornell University, 1997), both of which are incorporated herein by reference. Ensemble supports multiple concurrent partitions, of which no more than one can be primary. All group members know if they are in the primary partition and are allowed to take actions that can change their state only if they are in the primary partition. The primary partition (or primary view) must include a majority of a predefined set of group members. An Ensemble protocol known as "PRIMARY" is used to detect the primary partition based on this criterion.
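By way of illustration only, the majority criterion described above may be sketched as follows. The sketch is written in Python and is not taken from Ensemble or its PRIMARY protocol; the function name and its parameters are hypothetical, and the sketch shows only the test itself, not how the members of a partition come to agree on its composition.

```python
def is_primary_by_majority(component, predefined_members):
    """Return True if `component` holds a strict majority of the predefined set.

    Hypothetical helper: `component` is the set of members that can currently
    reach one another, and `predefined_members` is the fixed set against
    which the majority is measured.
    """
    predefined = set(predefined_members)
    present = set(component) & predefined
    # Strict majority: more than half of the predefined members are present.
    return 2 * len(present) > len(predefined)


if __name__ == "__main__":
    members = {"n1", "n2", "n3", "n4", "n5"}
    print(is_primary_by_majority({"n1", "n2", "n3"}, members))  # larger side
    print(is_primary_by_majority({"n4", "n5"}, members))        # smaller side
```

Because two disjoint partitions cannot both hold more than half of the same predefined set, a test of this kind admits at most one primary partition at a time.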
It is an object of some aspects of the present invention to provide improved methods and systems for enabling computer applications running on a cluster of participating entities to deal with partitioning of the cluster.
It is a further object of some aspects of the present invention to provide tools for use in an application program to handle partitioning of a cluster on which the application is running and to distribute information regarding partition status.
In preferred embodiments of the present invention, a group communication system (GCS) for use in a group of computing entities provides partitioning support to software applications running in the group. The partitioning support offers a choice of partitioning strategies by means of which the entities in the group, typically comprising processes running on a cluster of computing nodes linked by a network, determine whether or not they are in the primary component when the cluster is partitioned. Preferably, the GCS includes an application program interface (API), which is used by a developer of a software application to select the desired strategy. When a change in group membership occurs while the application is running, each group member determines whether or not the group member is in the primary component using a protocol of the GCS based on the selected strategy.
The present invention thus facilitates definition of how the entities in the group are to behave in response to partitioning and membership changes, and relieves application developers of the need to program such behavior in detail at the application level. In the absence of the type of tools provided by the present invention, which are not offered by clustering solutions known in the art, it is difficult to program an application-level partitioning solution, and in most cases the application must simply stop running when a partition occurs. Whereas the Ensemble GCS, described in the Background of the Invention, can provide limited partitioning support, Ensemble allows no choice of strategies and rigidly designates the majority component as the primary one. By contrast, the API and middleware partitioning support provided by preferred embodiments of the present invention enable the developer simply to select the strategy that is most appropriate to the needs of the particular application. Preferably, the API offers a range of selections, which can be expanded by the application developer if desired.
Although preferred embodiments described herein are based on a GCS, it will be appreciated that the principles of the present invention may similarly be implemented in substantially any distributed computing environment in which there is a mechanism for partitioning and keeping track of membership of entities in a computing group or cluster. As noted above, such entities may comprise either physical or logical entities. Furthermore, different partitioning strategies can be selected for different applications, even when the different applications are running concurrently on the same cluster of nodes.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for controlling operation of a computer software application running on a given computing entity, which is a member of a group of mutually-linked computing entities running the application within a distributed computing system, the method including:
selecting a partitioning strategy for the application from among a plurality of available strategies;
receiving a message at the given computing entity indicative of a change in membership of the group; and
determining in accordance with the selected partitioning strategy whether the given computing entity belongs to a primary component of the group following the change in membership, such that running of the software application on the given entity is restricted if the entity does not belong to the primary component.
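Purely as a non-limiting sketch, the three steps recited above (selecting a strategy, receiving a membership message, and determining primary status) might be arranged as follows in Python. The class name, the method names and the representation of a strategy as a function of the new component are all assumptions made for illustration and do not describe any particular GCS interface. The sketch also assumes that the group is fully connected, and hence primary, before the first membership change.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical representation: a strategy decides, from the set of members
# in the local component, whether that component is the primary one.
Strategy = Callable[[FrozenSet[str]], bool]


class GroupMember:
    """Illustrative group member whose updates are gated by primary status."""

    def __init__(self, name: str, strategy: Strategy) -> None:
        self.name = name
        self.strategy = strategy   # step 1: strategy selected for the application
        self.in_primary = True     # assumption: group starts fully connected
        self.state: Dict[str, str] = {}

    def on_membership_change(self, component: Iterable[str]) -> bool:
        # Step 2: a membership message reports the new local component.
        # Step 3: the selected strategy decides whether it is primary.
        self.in_primary = self.strategy(frozenset(component))
        return self.in_primary

    def update(self, key: str, value: str) -> bool:
        # Running of the application is restricted outside the primary
        # component; here, state-changing updates are simply refused.
        if not self.in_primary:
            return False
        self.state[key] = value
        return True
```

In this arrangement the application code calls only `update`, while the decision of whether the member belongs to the primary component is left entirely to the strategy supplied at initialization.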
Preferably, selecting the partitioning strategy includes selecting a strategy for the application using an application program interface, wherein selecting the strategy most preferably includes selecting one of a plurality of predefined strategies.
In a preferred embodiment, selecting the partitioning strategy includes designating one of the computing entities as a monarch entity, such that the given computing entity belongs to the primary component if the given computing entity belongs to the same component of the group as the monarch entity.
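The monarch criterion may be illustrated by the following minimal Python sketch. The function name and its parameters are hypothetical and are chosen only for this illustration.

```python
def is_primary_by_monarch(component, monarch):
    """Return True if the designated monarch is in the local component.

    Hypothetical helper: `component` is the set of entities in the same
    component as the given entity (including the entity itself), and
    `monarch` is the identifier of the designated monarch entity.
    """
    return monarch in set(component)


if __name__ == "__main__":
    # The group {n1, n2, n3, n4} splits in two; n1 is the monarch.
    print(is_primary_by_monarch({"n1", "n2"}, monarch="n1"))
    print(is_primary_by_monarch({"n3", "n4"}, monarch="n1"))
```

Since the monarch can be in only one component at a time, at most one component satisfies this test, regardless of the relative sizes of the components.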
In another preferred embodiment, selecting the partitioning strategy includes selecting a dynamic voting strategy such that following the change in membership, the given computing entity is determined to belong to the primary component if the given computing entity belongs to a component of the group containing more than half of the entities of a previous primary component of the group defined before the change in membership.
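The dynamic voting criterion may likewise be sketched in Python as follows. The function name and its parameters are hypothetical, and the sketch assumes that each entity already knows the composition of the previous primary component; how that knowledge is maintained and agreed upon is outside the scope of the sketch.

```python
def is_primary_by_dynamic_voting(component, previous_primary):
    """Return True if `component` holds more than half of the previous primary.

    Hypothetical helper: `component` is the set of entities in the local
    component after the membership change, and `previous_primary` is the
    primary component defined before the change.
    """
    previous = set(previous_primary)
    survivors = set(component) & previous
    # Strictly more than half of the previous primary must be present.
    return 2 * len(survivors) > len(previous)


if __name__ == "__main__":
    primary = {"n1", "n2", "n3", "n4", "n5"}
    # First partition: {n1, n2, n3} holds 3 of 5 and becomes primary.
    print(is_primary_by_dynamic_voting({"n1", "n2", "n3"}, primary))
    # Second partition, measured against the new primary {n1, n2, n3}:
    # {n1, n2} holds 2 of 3 and remains primary.
    print(is_primary_by_dynamic_voting({"n1", "n2"}, {"n1", "n2", "n3"}))
```

Unlike a majority measured against a fixed set, this test is measured against the most recent primary component, so a primary component may persist through successive partitions. When a previous primary component splits exactly in half, neither half satisfies the test.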
In still another preferred embodiment, selecting the partitioning strategy includes selecting a strategy such that the application continues to run on all of the computing entities substantially without restriction notwithstanding any change in membership.
Preferably, receiving the message includes receiving an indication of a partitioning of the group of entities into two or more components due to a failure in the system. In a preferred embodiment, selecting the partitioning strategy includes selecting a strategy such that there will be no primary component following the partitioning of the group, whereby running of the application is restricted on all of the computing entities following the partition.
Preferably, the computing entities include computer nodes, mutually-linked by a network, and receiving the indication includes receiving an indication of a failure in communications over the network. Further preferably, selecting the partitioning strategy includes initializing group communication system middleware responsive to the selected partitioning strategy, wherein receiving the message includes receiving a membership message from the middleware.
There is also provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network so as to run a computer software application in accordance with a partitioning strategy selected for the application from among a plurality of available strategies, such that when a given one of the nodes receives a message indicative of a change in membership of the group, the given node determines in accordance with the selected partitioning strategy whether the given node belongs to a primary component of the group following the change in membership, wherein running of the software application on the given node is restricted if the node does not belong to the primary component.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a computer software product for controlling operation of an application running on a given computing entity, which is a member of a group of mutually-linked computing entities running the application within a distributed computing system, the product including a computer-readable medium in which computer program instructions are stored, which instructions, when read by the given computing entity, cause the entity to select a partitioning strategy for the application from among a plurality of available strategies, such that when a message is received at the given computing entity indicative of a change in membership of the group, the computing entity determines in accordance with the selected partitioning strategy whether the given computing entity belongs to a primary component of the group following the change in membership, such that running of the software application on the given entity is restricted if the entity does not belong to the primary component.
Preferably, the product is a middleware package, which includes a group communication system. Most preferably, the product includes an application program interface, with which the computer software application communicates.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: