MOM enables multiple computer programs to exchange discrete messages with each other over a communications network. MOM is characterized by ‘loose coupling’ of message producers and message consumers, in that the producer of a message need not know details about the identity, location or number of consumers of a message. Furthermore, when an intermediary message server is employed, message delivery can be assured even when the ultimate consumers of the message are unavailable at the time at which it is produced. This can be contrasted with Connection Oriented Middleware, which requires a computer program to have details of the identity and network location of another computer, in order that it can establish a connection to that computer before exchanging data with it. To establish a connection, both computers must be available and responsive during the entire time that the connection is active.
This invention pertains specifically to the case where an intermediary message server is employed to store and distribute messages to consumers. Although the producers and consumers (collectively referred to as clients) are loosely coupled with each other when communicating via MOM, the intermediary message servers are normally required to communicate with these clients in a connection-oriented fashion. Thus, permitting senders and receivers to communicate without both being available at the same time requires the server to be available at all times. Furthermore, all clients that may wish to exchange messages must be connected to the same server, or to different servers that are capable of working together to achieve the equivalent functionality of a single server, i.e. to serve as a single logical server. MOM is often used in systems in which a large number of servers have to serve as one logical server, as one of the reasons for employing MOM is to alleviate the requirement of defining a priori which programs may exchange data with each other. This means that large organizations that use MOM for computer applications distributed throughout the organization, or organizations that use MOM to provide service to the general public over the Internet, must be ready to accommodate many thousands of programs communicating through a single logical server.
This invention pertains specifically to the case in which a MOM server is realized as a cluster of multiple computers. In the context of this document, we will define a cluster as a group of computers that work together to provide a single service with higher capacity and higher reliability than can be achieved using a single computer. In order to ensure high reliability in the case of the failure of one or more machines in the cluster, the messages held by the server, and their associated state information, must be stored redundantly on multiple computers. This ensures that the data is still available to the cluster if one computer fails.
The invention pertains to a reliable cluster that uses a primary/backup style of redundant storage. In this case, for some subset of the messages processed by the message server cluster, one computer acts as the primary node. The primary node is responsible for storing the messages and actively delivering them to message consumers. One or more other computers act as backup nodes for that primary. The backup nodes are responsible for storing an identical copy of the data stored on the primary node, but they will not actively undertake to deliver stored messages to consumers. If the primary node fails, the backup node(s) must detect this failure and ensure that exactly one backup is promoted to the role of primary and begins actively delivering messages to consumers. The backup(s) identify the failure of the primary through the fact that they are no longer able to communicate with it. In order to guarantee the proper behavior of the messaging system, it is important that exactly one node in the cluster act as the primary node at any one time for a given subset of messages. The exact meaning of “proper behavior” will be described below.
In addition to the failure of individual computers, another type of failure that can occur in such a cluster is a network partition. The term “network partition” refers to the situation in which the data network that connects the computers in the cluster is split into two or more separate sub-networks. Each of these separate sub-networks is referred to as a partition. Each partition functions correctly except for the fact that the computers in the partition cannot communicate with the computers in the other partitions. The symptom of a network partition is the same as that of node failures, namely that one or more computers become unavailable for communication. For this reason, it is, in the general case, not possible for a backup node to distinguish between an event in which its corresponding primary node fails, and an event in which the network becomes partitioned and the corresponding primary node continues to function but is in a different partition.
This gives rise to a fundamental dilemma in the field of primary/backup style server reliability. If a primary node becomes separated from a backup node by a network partition, but the corresponding backup node assumes that it has failed, then the backup becomes a primary node. This results in the cluster having two primary nodes for the same set of messages at the same time. If both of these primaries are in contact with message consumers, then it will no longer be possible to guarantee proper behavior of the message server. If, on the other hand, the primary node fails, but the corresponding backup node assumes that it is in a different network partition, then the backup node will not become primary and the reliability of the message server cluster is not achieved, as no messages will then be delivered.
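The first horn of this dilemma can be illustrated with a minimal Python sketch. It is not part of the invention; it models only the naive policy in which a backup promotes itself whenever it cannot reach a primary. The names `Node`, `naive_failover`, `active_primaries` and the `reachable` callback are hypothetical and introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str            # "primary" or "backup"
    alive: bool = True


def naive_failover(nodes, reachable):
    """Promote any live backup that cannot reach a live primary.

    reachable(a, b) is a hypothetical callback that returns True when
    node a can currently communicate with node b.
    """
    for node in nodes:
        if not node.alive or node.role != "backup":
            continue
        sees_primary = any(
            other.alive and other.role == "primary" and reachable(node, other)
            for other in nodes
        )
        if not sees_primary:
            node.role = "primary"


def active_primaries(nodes):
    """Names of all live nodes currently acting as primary."""
    return [n.name for n in nodes if n.alive and n.role == "primary"]


# Case 1: the primary really has failed, so promotion is correct.
crashed = [Node("A", "primary", alive=False), Node("B", "backup")]
naive_failover(crashed, reachable=lambda a, b: True)
print(active_primaries(crashed))        # ['B']

# Case 2: both nodes are alive but separated by a partition.
partitioned = [Node("A", "primary"), Node("B", "backup")]
naive_failover(partitioned, reachable=lambda a, b: False)
print(active_primaries(partitioned))    # ['A', 'B']
```

From the point of view of backup B the two cases look identical, yet the second leaves two primaries active for the same set of messages.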
Message server cluster implementations according to the state of the art assume that failure to communicate with one or more computers indicates that these computers have failed. This is reasonable if one considers that computer failures occur more often than network partitions. In the case of a primary/backup reliability scheme, this leads to incorrect system behavior during network partitions, due to the fact that two computers can become the primary node for the same set of messages at the same time. This invention is unique in that it provides a means to guarantee proper behavior of a clustered messaging system that uses primary/backup reliability, even during network partitions. It can do this without needing to determine whether a communication failure is due to computer failure or network partitioning. As such, the invention does not provide a novel means for discovering the nature of the failure, but rather provides primary/backup style high availability that is robust in that it guarantees proper behavior without the need to handle the two types of failure in different ways.
It is important to define the behavior that the message server cluster must exhibit in order to be considered correct. This invention guarantees correct behavior of a messaging system as defined in version 1.0.2 of the specification of the Java Message Service (JMS) application programming interface (API) published by Sun Microsystems Inc. The definition of this interface is available at http://java.sun.com/products/jms/docs.html. The key aspects of this behavior are:
        Guaranteed Message Delivery: JMS defines two message delivery modes: persistent and non-persistent. It is permissible to lose non-persistent messages in the event of system failures. A JMS compliant messaging system must, however, guarantee that persistent messages can always be recovered after a system failure and that these will eventually be delivered to the intended recipient(s). Computer programs that use a messaging system to send messages with a persistent delivery mode should not need to take any additional measures to ensure that the message is successfully delivered to the appropriate recipient(s). It is one object of the invention to provide guaranteed persistent message delivery even in the case of network partitions separating the computers that comprise the message server cluster. (The invention also prevents the loss of non-persistent messages in the event of network partitioning, even though such loss would actually be permissible.)
        Guaranteed One Time Only Delivery: JMS defines two messaging domains: point-to-point and publish/subscribe. Point-to-point messages must be delivered exactly one time to exactly one eligible recipient. These generally correspond to actions, such as depositing money in a bank account, which are not permitted to be executed more than once. Publish/subscribe messages must be delivered exactly once to each eligible subscriber. Such messages generally contain information that may be disseminated to any number of recipients, but must be delivered exactly one time to each recipient. (In both delivery modes, the client has the option to specify that duplicate deliveries to one consumer are permissible.) If a message consumer processes a message and then terminates unexpectedly before acknowledging the receipt of the message, then the message is considered undelivered according to JMS. It is a further object of the invention to guarantee both correct point-to-point and correct publish/subscribe message delivery in spite of network partitions that separate the computers comprising a message server cluster.
        In Order Message Delivery: One client of a JMS messaging system may have multiple producers and consumers. These producers and consumers are grouped together into one or more sessions, where each session contains zero or more producers and zero or more consumers. All messages produced within one session must be delivered to consumers in the same order in which they were produced. The JMS specification explicitly identifies failure conditions in which it is not possible to assure both guaranteed delivery and in order delivery: when a message is delivered to a consumer, and that consumer fails before acknowledging receipt of the message, but after subsequent messages produced in the same session have been delivered to other consumers. In this case, it is permissible to deliver the message in question to another consumer, although it is out of order.
        Transactions: As mentioned in the previous point, any number of the producers and consumers of one client may be grouped together into one session. Each session may optionally be specified as “transacted”. The client must instruct a transacted session to commit all messages produced and consumed since the last commit before the delivery of these messages becomes effective. All production and consumption of messages that occur between successive commits must succeed or fail as a single unit. This means that despite any system failure that might occur, there are only two permissible outcomes for the set of messages produced and consumed within one transacted session between two successive commits: 1) the consumption of all received messages is verified to the messaging system at the time of commit, and the produced messages become available for delivery to consumers at the time of commit, or 2) all produced messages are aborted, and all consumed messages are refused, effectively rolling back the session to the state that it was in immediately after the previous commit. Thus, if two messages are issued within the same transaction, one instructing the withdrawal of a sum of money from a bank account, and the other instructing the deposit of the same amount into another bank account, then it is never possible for the withdrawal to occur without the corresponding deposit also occurring. It is thus another object of the invention to provide correct transaction semantics in spite of network partitions that may occur before a transaction is successfully committed.
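The all-or-nothing outcome required of a transacted session can be illustrated with a small in-memory Python sketch. This is a toy model under simplifying assumptions (a single process, no failures, no persistence); it is not the JMS API, and the names `ToyQueue` and `ToyTransactedSession` are hypothetical.

```python
from collections import deque


class ToyQueue:
    """A trivial in-memory queue standing in for a message destination."""

    def __init__(self):
        self.messages = deque()


class ToyTransactedSession:
    def __init__(self):
        self._pending_sends = []      # (queue, message) pairs not yet visible
        self._pending_receives = []   # (queue, message) pairs not yet verified

    def send(self, queue, message):
        # Produced messages stay invisible to consumers until commit.
        self._pending_sends.append((queue, message))

    def receive(self, queue):
        if not queue.messages:
            return None
        message = queue.messages.popleft()
        self._pending_receives.append((queue, message))
        return message

    def commit(self):
        # Outcome 1: consumption is verified and produced messages
        # become available for delivery.
        for queue, message in self._pending_sends:
            queue.messages.append(message)
        self._pending_sends.clear()
        self._pending_receives.clear()

    def rollback(self):
        # Outcome 2: produced messages are aborted and consumed messages
        # are put back at the front of their queues in original order.
        for queue, message in reversed(self._pending_receives):
            queue.messages.appendleft(message)
        self._pending_sends.clear()
        self._pending_receives.clear()


withdrawals, deposits = ToyQueue(), ToyQueue()
session = ToyTransactedSession()
session.send(withdrawals, "withdraw 100 from account X")
session.send(deposits, "deposit 100 into account Y")
session.rollback()    # neither message ever becomes visible
session.send(withdrawals, "withdraw 100 from account X")
session.send(deposits, "deposit 100 into account Y")
session.commit()      # both messages become visible together
```

The difficulty addressed by the invention is preserving exactly these two outcomes when the session state is replicated across nodes that may become separated by a network partition, which this single-process sketch does not attempt to model.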
In addition to the above, we intend for this invention to provide one additional aspect of behavior, which is not specified by JMS, but is critical to fulfilling the basic purpose of a messaging system:
        The messaging system is at all times available to accept messages from message producers: JMS does not explicitly state any requirements regarding availability of the messaging system. A server based messaging system is, however, intended to relieve computer programs of the need to implement store and forward messaging functionality themselves. A messaging system cannot fulfill this intention without providing some guarantee of availability. Moreover, being available to accept messages from producers at all times is more critical in this respect than being available to distribute messages to consumers. The JMS specification, and messaging systems in general, do not guarantee a minimum delivery time from producer to consumer. This is outside the control of the messaging system since it cannot assume that there are consumers available to receive messages at all times. For this reason, message consumers must be designed in such a way that they are robust with respect to delays in message delivery. On the other hand, consider a simple message producer that interactively collects order information from a human user, packages that as a message, sends it to the message system, confirms to the user that the order has been placed, and only then is ready to process another order. If the message system is not ready to take responsibility for the message during this cycle, then either: 1) the user must wait an indefinite amount of time until the message system is available before he receives confirmation that the order was placed, or 2) the message producer must provide reliable, recoverable storage of the message until the message system is available. Both of these options defeat the purpose of using a message system in such a scenario. Therefore we consider the ability to accept produced messages at any time to be the paramount availability criterion for a message server. It is thus yet another object of the invention to ensure that a clustered message server is always available to accept messages from message producers, and to guarantee proper delivery of those messages, even when the cluster is subject to network partition.
This invention provides robustness to network partitioning specifically for the clustered message server described in patent application Ser. No. 09/750,009, “Scaleable Message System”. For details about this messaging system, reference is made to the publication of that application, which is incorporated herein by reference. Only a brief description will be presented here. The scalable message system is depicted in Drawing 1. It consists of two or more logical nodes. Each node may run on a separate computer, or multiple nodes may run on the same computer. There are two types of node: Client Managers (CM) and Message Managers (MM). Message Manager nodes are responsible for storage and distribution of messages. They are not directly accessible by Clients. Clients must connect to Client Manager nodes, and the Client Manager nodes relay Client requests to the Message Manager nodes via a reliable, atomic multicast message bus. The message bus allows data to be sent from one node to several other nodes at one time, without consuming more network resources than are required to send the same data to only one machine. Individual Client Manager nodes do not contain state information that is critical to the continued function of the system, so the failure of one or more Client Manager nodes can be tolerated without the need for a backup node. If a Client Manager node fails, the Clients that were connected to it automatically reconnect to another Client Manager node and continue normal operation. Drawing 1 shows a cluster of interconnected Client Managers and Message Managers.
Message Manager nodes contain state information that is critical to the continued function of the system. For this reason their state information must be stored redundantly on multiple nodes. There are two types of Message Manager nodes: primary and backup. At any one time, each of the primary Message Manager nodes is responsible for storing and distributing a subset of the messages being transferred by the messaging system. Together, all of the primary Message Manager nodes are responsible for the complete set of messages being transferred by the messaging system at any one time. For each primary Message Manager, there may be any number of backup Message Manager nodes. Each backup Message Manager node is responsible for maintaining an exact copy of the state of its corresponding primary Message Manager node, and must be ready to assume the role of primary Message Manager if the original primary Message Manager fails. Each primary Message Manager node interacts closely with its corresponding backup Message Manager nodes, but has little interaction with other Message Manager nodes. For this reason, each group of one primary Message Manager node and its corresponding backup Message Manager nodes is referred to as a sub-cluster.
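The sub-cluster arrangement can be pictured with the following Python sketch. It is illustrative only: the text above does not state how messages are divided among the primary Message Managers, so the assignment by a checksum of the destination name is purely an assumption, and `SubCluster`, `promote` and `sub_cluster_for` are hypothetical names.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class SubCluster:
    """One primary Message Manager and its backup Message Managers."""
    primary: str
    backups: list = field(default_factory=list)

    def promote(self, backup):
        """Make `backup` the new primary after the old primary is lost."""
        self.backups.remove(backup)
        self.primary = backup


def sub_cluster_for(destination, sub_clusters):
    # Assumption: each destination name maps to exactly one sub-cluster,
    # so that the primaries together cover the complete set of messages.
    index = zlib.crc32(destination.encode("utf-8")) % len(sub_clusters)
    return sub_clusters[index]


cluster = [
    SubCluster("MM1", ["MM1-backup-a", "MM1-backup-b"]),
    SubCluster("MM2", ["MM2-backup-a"]),
]
owner = sub_cluster_for("orders", cluster)
print(owner.primary)
```

Whatever the assignment rule, the property that matters is that each message is owned by exactly one sub-cluster and that each sub-cluster has exactly one primary at a time.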
The message server cluster described above is composed of two different types of nodes, and there are some advantages to locating these nodes on different physical computers. It is important to note that this type of message server cluster could be realized with a single type of node by combining the functionality of the Client Manager and the Message Manager into a single node type. Our ability to describe the invention is, however, facilitated by using the model in which these node types are physically separate, and this model will be used throughout this document.
The invention relies heavily on the concept of a synchronous network view. The view is the list of machines with which a given node can communicate over the data network at a certain point in time. The invention assumes that all nodes that are in the view of a given node themselves possess the same view; that is, if A is in the view of B, and C is not in the view of B, then C is not in the view of A. Thus, when there is no network partition, all nodes possess all other nodes in their view, and when the network is partitioned, all of the nodes in a partition possess the same view, and the view does not contain any nodes that are in other partitions. In the preferred embodiment, the responsibility for detecting the view and reporting changes in the view is delegated to the message bus that provides multicast communications within the cluster. The preferred embodiment uses the product iBus//MessageBus from Softwired AG (www.softwired-inc.com) for this purpose, as it provides both reliable, atomic, in order multicast communication and view management.
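The consistency property assumed of views can be stated as a short Python check. This is a sketch for illustration only and says nothing about how iBus//MessageBus computes or reports views; the function names are hypothetical, and for simplicity each node is assumed to appear in its own view.

```python
def views_from_partitions(partitions):
    """Give every node a view equal to the set of nodes in its own partition."""
    return {node: frozenset(part) for part in partitions for node in part}


def is_synchronous(views):
    """True if every node in a node's view holds exactly the same view."""
    return all(
        views.get(member) == view
        for view in views.values()
        for member in view
    )


# No partition: every node has every node in its view.
whole = views_from_partitions([{"A", "B", "C"}])

# Partitioned network: A and B share one view, C has another.
split = views_from_partitions([{"A", "B"}, {"C"}])

# A violation of the assumption: B can see C, but A cannot.
inconsistent = {
    "A": frozenset({"A", "B"}),
    "B": frozenset({"A", "B", "C"}),
    "C": frozenset({"B", "C"}),
}

print(is_synchronous(whole), is_synchronous(split), is_synchronous(inconsistent))
```

The third case is precisely the situation excluded by the rule that if A is in the view of B and C is not in the view of B, then C is not in the view of A.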