The present invention relates generally to a computing system using clustering principles, and more particularly to transmission of multicast (i.e., broadcast) messages to members of the system.
Many segments of today's financial and business communities (e.g., stock exchanges, banks, telecommunications companies) require computing environments that are fault tolerant and provide high availability. Downtime in these environments can be extremely costly and is not lightly tolerated. There exist a number of different approaches to providing fault tolerance and high availability. One enjoying increasing popularity, however, is the use of a distributed operating system with a collection of independent processing environments, referred to as nodes, that are connected via some form of communication interconnect to form a "cluster," which can operate as a single system or as a collection of independent processing resources. Improved fault tolerance follows from the distributed nature of the operating system, and high availability is achieved by distributing the system services and providing for their failover. With this approach, the system as a whole can still function even with the loss of one or more of the nodes that make up the system.
Regardless of how such processing system clusters are used, it is often advantageous to keep each of the processing elements of such systems up to date as to, for example, the system's configuration (e.g., what elements are located where, etc.). This, in turn, often requires that each node be capable of transmitting (i.e., broadcasting) messages to the other nodes of the cluster system. Often, such "multicast" transmissions are sent point-to-point, that is, from a sender node to a first node, then to a second node, and so on until all target nodes have been addressed. This multicast transmission procedure can require considerable processing time, increase the messaging traffic on the communication medium of the cluster (particularly when the message is intended for every node in the cluster), and impose unacceptable constraints and limitations on system performance. Some procedures require continued retransmission of the message until it is acknowledged by the intended receiver node. The constant retransmission of the message to all non-responding nodes further increases traffic on the network, thereby degrading overall network performance as well as occupying processor time and other cluster resources.
More important, however, is the need to identify a failure to receive a message, i.e., for the intended receiver to determine in some way that a message was sent but not received. For example, if a sender node and multiple receiver nodes are interconnected by a routing network, it is to be expected that messages will occasionally be lost and fail to arrive at one or more of the intended receivers (in the case of multicast transmissions). Thus, if the sender node transmits a multicast message that is received by some, but not all, of the intended receivers, those receivers that did not receive the message may well be missing needed information, and its absence can inhibit or impede system operation or the proper operation of other nodes of the system.
It can be seen, therefore, that there is a need for a more efficient method of multicast transmission in a multiple processor or cluster system that also checks for and supplies any missing multicast messages.