The present invention relates generally to distributed computing systems, and specifically to distributed computing applications driven by group communication systems.
Computer clusters are widely used to enable high availability of computing resources, coupled with the possibility of horizontal growth, at reduced cost by comparison with collections of independent systems. Clustering is also useful in disaster recovery. A wide range of clustering solutions are currently available, including 390 Sysplex, RS/6000 SP, HACMP, PC Netfinity and AS/400 Cluster, all offered by IBM Corporation, as well as Tandem Himalaya, Hewlett-Packard Mission Critical Server, Compaq TruCluster, Microsoft MSCS, NCR LifeKeeper and Sun Microsystems Project Cascade. An AS/400 Cluster, for example, supports up to 128 computing nodes, connected via any Internet Protocol (IP) network. A developer of a software application can define and use groups of physical computing entities (such as computing nodes or other devices) or logical computing entities (such as files or processes) to run the application within the cluster environment. In the context of the present patent application and in the claims, such entities are also referred to as group members, and the term "entity" is used to refer interchangeably to physical and logical computing entities.
Distributed group communication systems (GCSs) enable applications to exchange messages within groups of clustered entities in a reliable, ordered manner. For example, the OS/400 operating system kernel for the above-mentioned AS/400 Cluster includes a GCS in the form of middleware for use by cluster applications. This GCS is described in an article by Goft et al., entitled "The AS/400 Cluster Engine: A Case Study," presented at the International Group Communications Conference IGCC 99 (Aizu, Japan, 1999), which is incorporated herein by reference. The GCS ensures that if a message addressed to the entire group is delivered to one of the group members, the message will also be delivered to all other live and connected members of the group. In this way, the group members can act upon received messages and remain consistent with one another. Another function performed by the GCS is to verify periodically that all of the members are "alive," i.e., functioning and able to perform their part in the distributed application. When one of the members fails a liveness check, or when members leave or join the group for some other reason, the GCS notifies the other members of the membership change.
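By way of illustration only, the delivery and membership-notification behavior described above can be modeled with a minimal in-memory sketch in Python. The class `ToyGroup` and its methods are hypothetical names introduced for this example. They do not represent the interface of the AS/400 Cluster Engine or of any other GCS, and the sketch ignores networking and failure detection entirely.

```python
class ToyGroup:
    """Hypothetical in-memory stand-in for a GCS; not a real GCS API."""

    def __init__(self):
        # member name -> list of events delivered to that member, in order
        self.logs = {}

    def join(self, name):
        self.logs[name] = []
        # All current members (including the new one) learn the new view.
        self._notify(("view", sorted(self.logs)))

    def fail(self, name):
        # Stands in for a member failing its liveness check.
        del self.logs[name]
        self._notify(("view", sorted(self.logs)))

    def multicast(self, sender, body):
        self._notify(("msg", sender, body))

    def _notify(self, event):
        # Every current member receives every event, in the same order.
        for log in self.logs.values():
            log.append(event)


group = ToyGroup()
for name in ("A", "B", "C"):
    group.join(name)
group.multicast("A", "x=1")
group.fail("C")              # remaining members are told of the change
group.multicast("B", "x=2")
```

In this sketch, members A and B observe the same sequence of messages and membership changes from the moment B joins, which is the consistency property that the paragraph above attributes to the GCS.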
Another well-known GCS is "Ensemble," which was developed at Cornell University, as were its predecessors, "ISIS" and "Horus." Ensemble is described in the "Ensemble Reference Manual," by Hayden (Cornell University, 1997), which is incorporated herein by reference.
GCSs known in the art require that a group member be a single software process, i.e., an instance of a program actively running on a computer, although some GCSs allow the process to have multiple threads. All of the messages that a member process receives are enqueued in a predefined queue, and are later dequeued by that process and handled by handlers that the process invokes. Limiting GCS membership to single processes makes the semantics of a group member relatively intuitive. For instance, one of the key functions of a GCS is discovering failures of group members. When a group member is a process, then a simple process existence check can be made. When a group member is not necessarily a single process, the task of checking group member liveness becomes complicated. Furthermore, when a group member process sends out a message, it is easy to relate the process to the particular member. The relation becomes less clear if there are multiple processes that can send a message on behalf of the group member, especially when several group members can reside on a common node. Likewise, it is simple for the GCS to deliver a message to a group member if there is only a single process or a single inbound queue to receive the message. When the member is not a single process (or not a process at all), however, it may not be clear to whom the message should be delivered.
Ordered handling of GCS messages by group members also becomes difficult if the members are not all single processes. Message delivery to a single process is serializable, since the process deals with one message at a time. Therefore, handing messages to a process in a certain order makes it easy to ensure that the messages will be handled by the process in the same order. On the other hand, if a group member includes multiple processes, this serializability is no longer guaranteed. Messages are still handed to the group member in a certain order, but handling of the messages, if carried out by several different processes, can terminate in a different relative order.
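The loss of serializability described above can be illustrated with a small deterministic simulation. The following Python sketch is an assumption-laden model rather than a description of any actual GCS: the function `completion_order`, the worker model and the handling durations are all hypothetical.

```python
import heapq


def completion_order(durations, workers):
    """Return message indices in the order in which their handling finishes.

    Messages are handed out in index order, each to the worker that becomes
    free first. durations[i] is the (hypothetical) time needed for message i.
    """
    free_at = [(0, w) for w in range(workers)]  # (time worker is free, worker id)
    heapq.heapify(free_at)
    finished = []
    for msg, cost in enumerate(durations):
        start, worker = heapq.heappop(free_at)
        end = start + cost
        finished.append((end, msg))
        heapq.heappush(free_at, (end, worker))
    return [msg for _, msg in sorted(finished)]


durations = [3, 1, 2]                                  # message 0 is the slowest
single = completion_order(durations, workers=1)        # one process
multi = completion_order(durations, workers=3)         # three processes
```

With a single worker the messages finish in the order in which they were handed over, whereas with three workers message 0, although handed over first, finishes last.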
It is an object of some aspects of the present invention to provide a group communication system (GCS) in which membership is not restricted to single processes.
It is a further object of some aspects of the present invention to provide tools for use in distributed applications that afford the application designer greater flexibility in defining application structure and group membership.
It is yet a further object of some aspects of the present invention to provide a group structure and communication protocols that enhance the performance of distributed applications.
In preferred embodiments of the present invention, a group communication system (GCS), for use within a group of clustered computing entities, provides a flexible group membership model. Flexible group members (FGMs) are defined generally as computing entities that take part in a distributed application. Such entities may include substantially any combination of processes, threads and callback functions (referred to herein as callbacks). Members may also comprise objects, such as files, that at least at some times during their existence have no active processes or threads, or they may comprise multiple simultaneous processes.
When a member joins the group, a communication protocol is established between the member and the GCS, which is not necessarily dependent on any particular process being executed by the member. The protocol defines how the GCS is to check the liveness of the particular member and how messages are to be conveyed by the GCS to and from the member, in the appropriate, ordered manner. Thus, while the framework of the protocol is fixed, the details vary from member to member and are generally determined only at the time that the member joins the group. These details may also be changed during a life cycle of group membership.
In some preferred embodiments of the present invention, the communication protocol between the GCS and each of the FGMs is based on a unique token, which is created by the GCS when the member joins the group. This token is used by processes and other components of the FGM whenever they need to communicate with the GCS. It enables the GCS to identify unambiguously the member that is sending the message, even when multiple processes are running as a single member, or when multiple members are resident on a single node, or when a member can run in different processes at different instants in time.
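A minimal sketch of such token-based identification is given below in Python, by way of example only. The class `TokenRegistry`, the message dictionaries and the use of random hexadecimal strings as tokens are assumptions made for this illustration; the text above does not specify how tokens are generated or represented.

```python
import secrets


class TokenRegistry:
    """Hypothetical GCS-side table mapping tokens to group members."""

    def __init__(self):
        self._member_by_token = {}

    def join(self, member_name):
        # Create a token that is unique to this member when it joins.
        token = secrets.token_hex(16)
        while token in self._member_by_token:
            token = secrets.token_hex(16)
        self._member_by_token[token] = member_name
        return token

    def identify(self, token):
        # Resolve the sending member from the token alone; None if unknown.
        return self._member_by_token.get(token)


registry = TokenRegistry()
tok_a = registry.join("member-A")
tok_b = registry.join("member-B")  # may reside on the same node as member-A

# Two different processes of member-A send messages carrying the same token.
msg_from_proc1 = {"token": tok_a, "sender_pid": 101, "body": "update"}
msg_from_proc2 = {"token": tok_a, "sender_pid": 202, "body": "commit"}
senders = [registry.identify(m["token"]) for m in (msg_from_proc1, msg_from_proc2)]
```

Because the sender is resolved from the token rather than from the process identifier, both messages are attributed to the same member even though they originate from different processes.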
In these and other preferred embodiments of the present invention, each FGM has a set of callbacks, i.e., functions that are defined and called when required for performing distinct services. Such callbacks are defined to handle particular, respective types of messages handed to the FGM by the GCS. Preferably, a different callback is defined for each type of GCS message. Alternatively, specific callbacks may be defined for only some of the GCS message types, while another general callback simply passes the messages through to a process that is being executed by the member or to a queue served by the process. Further alternatively, for a group member that comprises only a single process, the set of callbacks may consist only of such a general, "pass-through" callback, whereby the communication protocol between this member and the GCS emulates GCS protocols known in the art.
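The callback arrangement described above may be sketched as follows in Python, purely as an illustration. The class `FlexibleMember`, its method names and the message type strings are hypothetical and are not taken from any actual GCS interface.

```python
from collections import deque


class FlexibleMember:
    """Hypothetical FGM holding per-type callbacks plus a general fallback."""

    def __init__(self):
        self.inbox = deque()   # queue served by the member's process, if any
        self._callbacks = {}   # message type -> callback

    def register(self, msg_type, callback):
        self._callbacks[msg_type] = callback

    def pass_through(self, msg_type, payload):
        # General callback: hand the message on to the process's queue unchanged.
        self.inbox.append((msg_type, payload))

    def deliver(self, msg_type, payload):
        # Called by the GCS: use the specific callback if one is defined,
        # otherwise fall back to the pass-through callback.
        callback = self._callbacks.get(msg_type)
        if callback is None:
            return self.pass_through(msg_type, payload)
        return callback(payload)


member = FlexibleMember()
seen = []
member.register("membership_change", seen.append)
member.deliver("membership_change", {"joined": "member-B"})
member.deliver("data", b"hello")   # no specific callback, so passed through
```

A member that registers no specific callbacks at all behaves, in this sketch, like a conventional single-process member whose messages all arrive on one inbound queue.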
In preferred embodiments of the present invention, each FGM registers a liveness check with the GCS when it joins the group. The liveness check depends on the nature of the particular member. For example, if the member comprises a process, the liveness check preferably verifies process existence using the process identifier (as in GCSs known in the art) or otherwise verifies the responsiveness of the process to a given function invoked by the GCS. On the other hand, if the member comprises a data structure or a device, the liveness check verifies the responsiveness or functionality of the data structure or device. For example, a distributed application can use the GCS to implement a shared memory as a group member, which can be updated by several processes. The liveness of this group member, which includes no active processes, may be given by the availability of a file in the memory and/or internal consistency of the file, as determined by the application.
Most preferably, the GCS liveness check invokes one of the callbacks of the FGM, most preferably a callback that is defined specifically for this purpose. Alternatively, if a "pass-through" callback is used for a group member comprising a single process, as defined above, the liveness check is similarly carried out in a manner emulating GCS protocols known in the art.
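As a hedged example, a liveness callback for a member that includes no active process, such as the file-backed shared memory mentioned above, might be sketched in Python as follows. The function `make_file_liveness_check`, the table `liveness_checks` and the choice of valid JSON as the consistency criterion are assumptions made for this illustration; the actual criterion is determined by the application.

```python
import json
import os
import tempfile


def make_file_liveness_check(path):
    """Build a liveness callback for a member that is a file, not a process.

    The member counts as live if the file can be opened and parses as JSON.
    JSON validity stands in for whatever consistency rule the application uses.
    """
    def is_alive():
        try:
            with open(path, "r", encoding="utf-8") as fh:
                json.load(fh)
            return True
        except (OSError, ValueError):
            return False
    return is_alive


liveness_checks = {}  # hypothetical GCS-side table: member name -> callback

with tempfile.TemporaryDirectory() as tmp:
    shared = os.path.join(tmp, "shared_state.json")
    with open(shared, "w", encoding="utf-8") as fh:
        json.dump({"counter": 1}, fh)

    # The member registers its liveness callback when it joins the group.
    liveness_checks["shared-memory"] = make_file_liveness_check(shared)
    alive_before = liveness_checks["shared-memory"]()

    os.remove(shared)  # the file becomes unavailable
    alive_after = liveness_checks["shared-memory"]()
```

The check makes no reference to any process identifier, so the member can be judged live or failed regardless of which processes, if any, are currently updating the file.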
The GCS of the present invention thus overcomes limitations of systems known in the art, allowing application developers substantially greater flexibility in building distributed applications based on the GCS by generalizing the definition of a group member. Furthermore, by using callbacks, rather than processes, the present GCS reduces its consumption of system resources and increases the efficiency of message handling, since fewer context switches and memory copies are required. This low-overhead mode of message handling also tends to increase the availability of group members for carrying out application tasks.
Although preferred embodiments described herein are based on a GCS, it will be appreciated that the principles of the present invention may similarly be implemented in substantially any distributed computing environment in which there are mechanisms for ordered message handling and keeping track of "liveness" of entities in a computing group or cluster. As noted above, such entities may comprise either physical or logical entities.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for distributing messages among a group of member computing entities, which are mutually-linked in a distributed computing system, including conveying a sequence of messages to all of the member entities in the group in accordance with a communication protocol such that the messages are delivered in a uniform order to all of the member entities, at least one of which member entities does not include a process for at least some time during an existence of the group.
Preferably, the at least one of the member entities that does not include a process includes a data structure or, alternatively or additionally, a device.
Further preferably, conveying the sequence of messages in accordance with the communication protocol includes delivering the messages using a group communication system. Most preferably, the member entities take part in a distributed application running on the system.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a method for distributing messages among a group of member computing entities, which are mutually-linked in a distributed computing system, including conveying a sequence of messages to all of the member entities in the group in accordance with a communication protocol such that the messages are delivered in a uniform order to all of the member entities, at least one of which member entities includes a plurality of processes.
There is also provided, in accordance with a preferred embodiment of the present invention, a method for transferring messages among a group of member computing entities, which are mutually-linked in a distributed computing system and which communicate in accordance with a communication protocol such that a sequence of messages is delivered to all of the member entities in the group in an order that is uniform among all of the member entities, the method including:
assigning a unique token to each of the member entities;
receiving a message, in accordance with the protocol, sent by one of the member entities, the message including the respective token; and
processing the message responsive to the token.
Preferably, processing the message includes identifying the member entity sending the message based on the token.
In a preferred embodiment, the member entity sending the message includes a plurality of objects, and receiving the message sent by the member entity includes receiving respective messages from two or more of the plurality of objects, wherein the respective messages include the same token.
There is further provided, in accordance with a preferred embodiment of the present invention, in a distributed computing system, in which a group of mutually-linked member computing entities communicate in accordance with a communication protocol such that a sequence of messages is delivered to all of the member entities in the group in an order that is uniform among all of the member entities, a method for processing the messages received by the member computing entities, including:
defining for each of the member entities at least one callback function to be invoked when messages of a predetermined type are handed to the entity;
receiving a message of the predetermined type; and
invoking the callback function to handle the message.
Preferably, defining the at least one callback function includes defining a plurality of different callback functions for a corresponding plurality of different message types.
In a preferred embodiment, defining the at least one callback function includes defining a liveness check function, and receiving the message of the predetermined type includes receiving a periodic liveness check message, and invoking the callback function includes invoking the liveness check function responsive to the liveness check message so as to determine whether any of the member entities is unable to carry out its part in the application.
In another preferred embodiment, defining the at least one callback function includes defining different callback functions for different ones of the member entities. Alternatively, multiple member entities may share the same callback function or functions.
Preferably, invoking the callback function includes aborting delivery of the message if the callback function does not return within a specified time. Further preferably, invoking the callback function includes waiting for the callback function invoked by a first message in the sequence to return before invoking a callback function to handle a second, subsequent message, whereby the uniform order is maintained. Alternatively or additionally, invoking the callback function includes acquiring a permission required for the callback.
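The timeout and ordering behavior described above may be sketched in Python as follows, by way of example only. The function `deliver_in_order` and the status strings are hypothetical. The sketch also assumes that "aborting delivery" means only that the caller stops waiting, since a Python thread cannot be forcibly terminated; a real implementation might reclaim the stalled callback differently.

```python
import threading


def deliver_in_order(messages, callback, timeout):
    """Invoke callback on each message, one at a time.

    The next callback is not started until the previous one has returned
    or its delivery has been aborted for exceeding timeout seconds.
    Returns a list of (message, status) pairs.
    """
    results = []
    for msg in messages:
        worker = threading.Thread(target=callback, args=(msg,), daemon=True)
        worker.start()
        worker.join(timeout)          # wait for the callback to return
        if worker.is_alive():
            # Assumption: "abort" here only means we stop waiting.
            results.append((msg, "aborted"))
        else:
            results.append((msg, "handled"))
    return results


handled = []
release = threading.Event()


def callback(msg):
    if msg == "slow":
        release.wait(5)   # simulates a callback that does not return in time
        return
    handled.append(msg)


outcome = deliver_in_order(["m1", "slow", "m2"], callback, timeout=0.2)
release.set()  # let the stalled callback finish so that no thread lingers
```

Because each callback is joined before the next is started, messages that are handled at all are handled in the order in which they were delivered.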
In yet another preferred embodiment, defining the at least one callback function includes defining a callback such that when messages of the predetermined type are received, they are passed to a designated process run by the member entity.
There is moreover provided, in accordance with a preferred embodiment of the present invention, in a distributed computing system, in which a group of mutually-linked member computing entities take part in a distributed computing application, a method for checking liveness of the member computing entities, including:
defining a respective liveness function for each member entity that joins the group, at least two of the member entities having different liveness functions; and
periodically invoking the liveness functions of the member entities so as to determine whether any of the entities is unable to carry out its part in the application.
Optionally, some of the member entities may share the same liveness function.
Preferably, invoking the liveness function includes conveying a liveness check request from a group communication system.
In a preferred embodiment, defining the liveness function includes defining a liveness function for at least one of the member entities substantially without reference to execution of any process by the at least one of the member entities.
In another preferred embodiment, defining the liveness function includes, for at least one of the member entities, defining a message that is sent to the member entity when the function is invoked. Preferably, periodically invoking the liveness functions includes sending the message and determining whether the member entity responds within a predetermined period of time.
In still another preferred embodiment, defining the liveness function includes, for at least one of the member entities, defining a function that checks the existence of a process with a specified identification.
In yet another preferred embodiment, defining the liveness function includes, for at least one of the member entities, defining a function that checks whether a queue held by the member entity is full.
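Two of the liveness functions mentioned above, together with one round of their periodic invocation, may be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the process existence check relies on the POSIX convention of sending signal 0 and is therefore guarded against use on other platforms, and a full queue is assumed to indicate that the member cannot carry out its part.

```python
import os
import queue


def pid_liveness(pid):
    """Liveness function: does a process with this identifier exist? (POSIX only)"""
    if os.name != "posix":
        raise NotImplementedError("signal-0 probing is POSIX-specific")

    def check():
        try:
            os.kill(pid, 0)       # signal 0 performs error checking only
        except ProcessLookupError:
            return False
        except PermissionError:
            return True           # process exists but belongs to another user
        return True
    return check


def queue_not_full_liveness(q):
    """Liveness function: the member is treated as live while its queue has room."""
    return lambda: not q.full()


def run_liveness_round(liveness_functions):
    """One round of the periodic check: return the names of failed members."""
    return [name for name, check in liveness_functions.items() if not check()]


inbox = queue.Queue(maxsize=2)
liveness_functions = {
    "process-member": pid_liveness(os.getpid()),
    "queue-member": queue_not_full_liveness(inbox),
}
failed_first = run_liveness_round(liveness_functions)

inbox.put("a")
inbox.put("b")                    # the queue is now full
failed_second = run_liveness_round(liveness_functions)
```

The two members in this sketch have different liveness functions, yet the periodic round treats them uniformly, since each function simply reports whether its member is able to carry out its part.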
There is still further provided, in accordance with a preferred embodiment of the present invention, in a distributed computing system, in which a group of mutually-linked member computing entities take part in a distributed computing application, a method for checking liveness of one of the member computing entities, including periodically invoking a liveness function so as to determine whether the entity is unable to carry out its part in the application, substantially without reference to execution of any process by the member entity.
In a preferred embodiment, invoking the liveness function includes, for at least one of the member entities, checking whether a queue held by the member entity is full.
There is furthermore provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network in accordance with a communication protocol such that when a sequence of messages is transmitted in the group, it is delivered to member computing entities on all of the nodes in a uniform order, at least one of which member entities does not include a process for at least some time during an existence of the group.
There is additionally provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network in accordance with a communication protocol such that when a sequence of messages is transmitted in the group, it is delivered to member computing entities on all of the nodes in a uniform order, at least one of which member entities includes a plurality of processes.
There is also provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network using a group communication system such that when a sequence of messages is transmitted in the group, it is delivered to member computing entities on all of the nodes in an order that is uniform among all of the member entities,
wherein a unique token is respectively assigned to each of the member entities, such that whenever a message is passed from any one of the member entities to the group communication system, it includes the respective unique token.
There is further provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network in accordance with a communication protocol such that when a sequence of messages is transmitted in the group, it is delivered to member computing entities on all of the nodes in a uniform order, and such that for each of the member entities, at least one callback function is defined, to be invoked when messages of a predetermined type are handed to the entity.
There is moreover provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network and configured to support member computing entities that take part in a distributed computing application running on the apparatus,
wherein a respective liveness function is defined for each member entity that joins the group, at least two of the member entities having different liveness functions, which liveness functions are invocable so as to determine whether any of the entities is unable to carry out its part in the application.
There is yet further provided, in accordance with a preferred embodiment of the present invention, distributed computing apparatus, including:
a computer network; and
a group of computer nodes, mutually-linked by the network and configured to support member computing entities that take part in a distributed computing application running on the apparatus,
wherein a respective liveness function is defined for at least one of the member entities so as to determine whether the entity is unable to carry out its part in the application, substantially without reference to execution of any process by the member entity.
There is still further provided, in accordance with a preferred embodiment of the present invention, a computer software product for distributing messages among a group of member computing entities, which are mutually-linked in a distributed computing system, the product including a computer-readable medium having program instructions stored therein, which instructions, when read by a computer, cause the computer to convey a sequence of messages to all of the member entities in the group in accordance with a communication protocol such that the messages are delivered in a uniform order to all of the member entities, at least one of which member entities does not include a process for at least some time during an existence of the group.
Preferably, the product includes group communication system middleware.
There is also provided, in accordance with a preferred embodiment of the present invention, a computer software product for distributing messages among a group of member computing entities, which are mutually-linked in a distributed computing system, the product including a computer-readable medium having program instructions stored therein, which instructions, when read by a computer, cause the computer to convey a sequence of messages to all of the member entities in the group in accordance with a communication protocol such that the messages are delivered in a uniform order to all of the member entities, at least one of which member entities includes a plurality of processes.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a computer software product for transferring messages among a group of member computing entities, which are mutually-linked in a distributed computing system and which communicate in accordance with a communication protocol such that a sequence of messages is delivered to all of the member entities in the group in an order that is uniform among all of the member entities, the product including a computer-readable medium having program instructions stored therein, which instructions, when read by a computer, cause the computer to assign respective unique tokens to the member entities, such that messages including the respective tokens are sent by the member entities in accordance with the protocol and are processed responsive to the tokens.
There is further provided, in accordance with a preferred embodiment of the present invention, a computer software product for use in a distributed computing system in which a group of mutually-linked member computing entities communicate in accordance with a communication protocol such that a sequence of messages is delivered to all of the member entities in the group in an order that is uniform among all of the member entities, the product including a computer-readable medium having program instructions stored therein, which instructions, when read by a computer, cause the computer to define for each of the member entities at least one callback function, to be invoked when messages of a predetermined type are handed to the entity, such that when a message of the predetermined type is received by the member entity, the callback function is invoked to handle the message.
There is moreover provided, in accordance with a preferred embodiment of the present invention, a computer software product for use in a distributed computing system in which a group of mutually-linked member computing entities take part in a distributed computing application, the product including a computer-readable medium having program instructions stored therein, which instructions, when read by a computer, cause the computer to define a respective liveness function for each member entity that joins the group, such that at least two of the member entities have different liveness functions, and to periodically invoke the liveness functions of the member entities so as to determine whether any of the entities is unable to carry out its part in the application.
There is furthermore provided, in accordance with a preferred embodiment of the present invention, a computer software product for use in a distributed computing system in which a group of mutually-linked member computing entities take part in a distributed computing application, the product including a computer-readable medium having program instructions stored therein, which instructions, when read by a computer, cause the computer to periodically invoke a liveness function so as to determine whether a given one of the member computing entities is unable to carry out its part in the application, substantially without reference to execution of any process by the given member entity.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: