Clustering of servers is becoming increasingly important in a wide variety of contexts, for reasons of increased functionality, higher levels of service and availability, and support for server failover. Many businesses that employ computer systems require such connectivity between servers in order to ensure the durability and improved services of the network, intranet or website employed. As referred to herein, clustering refers to a group of one or more servers (usually called “nodes”) that work together and generally represent themselves as a single virtual server to the network. In other words, when a client connects to a set of clustered servers, it sees only a single server rather than a plurality. When one node fails, that node's responsibilities are taken over by another node, thereby boosting the reliability of the system.
Traditionally, all services on such a cluster have been deployed homogeneously on all of the servers in the cluster. This has satisfied most demands, in that when one server fails, another server is providing the same services, and thus a client can still access those services. However, sometimes there is a set of stateful services that need to run on only one server in the cluster at any given time, with the ability to automatically migrate the service in the event of server failure. For example, the Java Message Service (JMS) subsystem guarantees that user-generated client subscriber identifiers (ids) are unique within the cluster. In order to honor such requirements, a JMS or similar service that runs on only one node in the cluster is required. These types of services are, for the purposes of this disclosure, referred to as “singleton services”, by which it is meant that the service has a single active instance in the cluster.
A singleton service should be migrated in the event of a hosting server failure. With a traditional approach, migratable singleton services were manually targeted to a server in the cluster, and the administrator also performed the migration manually. This type of resolution is lacking in that it is complex, time consuming and tedious for system administrators. In addition, the downtime of the service can be quite lengthy.
A new approach is desired, one which would automatically target and distribute singleton services across the servers in the cluster, in addition to migrating them automatically in the event of server failures. However, there are two problems that make it difficult to provide such automation. First, when a server becomes temporarily frozen or disconnected from the cluster and is mistakenly judged to have failed, the service may be migrated to a new server, and the original server may subsequently rejoin the cluster. In that instance, two servers would be providing the singleton service. Second, if a failed server is incorrectly assumed to be alive, the service is never migrated, and none of the servers in the cluster would be providing the singleton service.
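The two problems can be illustrated with a minimal sketch. It assumes a naive heartbeat-timeout failure detector, which is a simplification chosen for illustration and is not taken from this disclosure. The names Server, judged_failed, active_hosts and both scenario functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    last_heartbeat: float = 0.0    # time the cluster last heard from this server
    hosts_singleton: bool = False  # whether this server is running the service


def judged_failed(server, now, timeout):
    """Assumed naive detector: failed once the last heartbeat is older than timeout."""
    return now - server.last_heartbeat > timeout


def active_hosts(servers):
    """Names of all servers currently running the singleton service."""
    return [s.name for s in servers if s.hosts_singleton]


def scenario_false_failure(timeout=5.0):
    """A is frozen, not dead, but is judged failed: two hosts result."""
    a = Server("A", last_heartbeat=0.0, hosts_singleton=True)
    b = Server("B")
    now = 6.0                      # A has been silent longer than the timeout
    b.last_heartbeat = now
    if judged_failed(a, now, timeout):
        b.hosts_singleton = True   # service migrated; A was never told to stop
    a.last_heartbeat = 7.0         # A unfreezes and rejoins, still hosting
    return active_hosts([a, b])


def scenario_false_alive(timeout=5.0):
    """A has crashed but is still assumed alive: no host remains."""
    a = Server("A", last_heartbeat=0.0, hosts_singleton=True)
    b = Server("B")
    a.hosts_singleton = False      # A crashes; the service stops running
    now = 3.0                      # still within the timeout window
    b.last_heartbeat = now
    if judged_failed(a, now, timeout):
        b.hosts_singleton = True   # not reached: A is not yet judged failed
    return active_hosts([a, b])


print(scenario_false_failure())
print(scenario_false_alive())
```

In this sketch the first scenario ends with both servers hosting the service and the second ends with neither hosting it, which are the two outcomes an automatic migration scheme must rule out.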