“Clustering” generally refers to a computer system organization where multiple computers, or nodes, are networked together to cooperatively perform computer tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image—that is, from the perspective of a user, the nodes in a cluster appear collectively as a single computer, or entity.
Clustering is often used in relatively large multi-user computer systems where high performance and reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes to prevent individual nodes from becoming overloaded and therefore maximize overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
Clusters typically handle computer tasks through the performance of “jobs” or “processes” within individual nodes. In some instances, jobs being performed by different nodes cooperate with one another to handle a computer task. Such cooperative jobs are typically capable of communicating with one another, and are typically managed in a cluster using a logical entity known as a “group.” A group is typically assigned some form of identifier, and each job in the group is tagged with that identifier to indicate its membership in the group.
Member jobs in a group typically communicate with one another using an ordered message-based scheme, where the specific ordering of messages sent between group members is maintained so that every member sees messages sent by other members in the same order as every other member, thus ensuring synchronization between nodes. Requests for operations to be performed by the members of a group are often referred to as “protocols,” and it is typically through the use of one or more protocols that tasks are cooperatively performed by the members of a group. One type of protocol, for example, is a membership change protocol, which is used to update the membership of a particular group of member jobs, e.g., when a member job needs to be added to or removed from a group.
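The ordered messaging and group membership described above can be illustrated with a minimal sketch. It assumes a single hypothetical sequencer that stamps each message with a sequence number; `Member`, `Group`, `join`, `stamp`, and `receive` are illustrative names, not part of any particular cluster product, and real systems may achieve ordering by other means.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """A job tagged with the identifier of the group it belongs to."""
    name: str
    group_id: str
    next_seq: int = 0                               # next sequence number to deliver
    pending: dict = field(default_factory=dict)     # messages that arrived early
    delivered: list = field(default_factory=list)   # messages in delivery order

    def receive(self, seq, payload):
        # Buffer the message, then deliver every message that is now in order.
        self.pending[seq] = payload
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


class Group:
    """Hypothetical group that assigns one global order to all messages."""

    def __init__(self, group_id):
        self.group_id = group_id
        self.members = []
        self.next_seq = 0

    def join(self, name):
        # Simplified membership change: the new member is tagged with the
        # group identifier and starts at the current position in the order.
        member = Member(name, self.group_id, next_seq=self.next_seq)
        self.members.append(member)
        return member

    def stamp(self, payload):
        # Assign the next sequence number to a message.
        seq = self.next_seq
        self.next_seq += 1
        return seq, payload


group = Group("G1")
a = group.join("job-a")
b = group.join("job-b")
m0 = group.stamp("update X")
m1 = group.stamp("update Y")

# The messages reach the two members in different arrival orders.
a.receive(*m0)
a.receive(*m1)
b.receive(*m1)
b.receive(*m0)

# Both members nevertheless deliver them in the same order.
print(a.delivered == b.delivered)
```

Buffering early arrivals until the missing sequence number shows up is what lets every member see the same order regardless of network timing.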
Conventional clustered computer systems also typically rely on some form of cluster infrastructure software that is resident on each node in the cluster, and that provides various support services that group members utilize in connection with performing tasks. Cluster infrastructure software is roughly analogous to an operating system on a non-clustered computer system. Whereas an operating system manages the execution of software applications and provides a programming interface through which applications can invoke various support functions (e.g., to interact with attached I/O devices, to display certain information to a user, etc.), cluster infrastructure software manages the execution of group members and provides a programming interface through which jobs can invoke various cluster-related support functions (e.g., to pass messages between group members, to change group membership, etc.).
Cluster infrastructure software, like all software, may be upgraded from time to time. Each release of cluster infrastructure software is therefore typically associated with a “version” that distinguishes that release from prior releases of the software. Upgrades to cluster infrastructure software may be desirable, for example, to provide “bug fixes” that correct errors found in previous versions of the software. In other instances, upgrades may be desirable to add new support services to the software, e.g., to add new functions and capabilities.
Whenever an operating system on a computer is upgraded to a version that provides a new function, software applications resident on that computer can often recognize the new version of the operating system and, as a result, take advantage of the new function. In some instances, such applications may themselves need to be upgraded as well; in other instances, such applications may have been initially developed to work with a later version of an operating system, yet made capable of working with earlier versions as well. For this reason, many software applications are capable of detecting the version of the operating system of the computer upon which they are installed and adapting their functionality accordingly.
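As a rough illustration of an application adapting to the detected version, the following sketch selects between a new function and a fallback. The version threshold and the returned labels are hypothetical, and the version is assumed to be already available as a `(major, minor)` tuple.

```python
# Hypothetical: the first operating system version that offers the new function.
NEW_FUNCTION_MIN_VERSION = (2, 0)


def select_behavior(os_version):
    """Pick a code path from a (major, minor) operating system version."""
    if os_version >= NEW_FUNCTION_MIN_VERSION:
        return "use_new_function"
    return "use_fallback"


print(select_behavior((1, 0)))
print(select_behavior((2, 1)))
```

Comparing versions as tuples rather than strings avoids ordering mistakes such as treating "10.0" as lower than "9.0".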
Likewise, in a clustered computer system, group members are typically capable of detecting the version of cluster infrastructure software on the nodes upon which such group members reside. However, since cluster infrastructure software must be resident on each node of a cluster, the version of the cluster infrastructure software may differ from node to node. To accommodate any differences in functionality, many clustered computer systems require that, when a group is formed, the members select the lowest version of the cluster infrastructure software installed in the system as the current “cluster version” used by the group. Moreover, once the group is created, the cluster version used by the group is set, and cannot thereafter be changed without restarting the group. In addition, whenever a new member joins the group, the member is informed of the current cluster version, and thus selects the same cluster version as that used by the other members of the group.
Thus, for example, if a clustered computer system has three nodes where the cluster infrastructure software is version 2.0, and one node where the cluster infrastructure software is version 1.0, the cluster version used by a newly created group will be version 1.0. Likewise, a new member added to the group will be informed that the cluster version used by that group is version 1.0, regardless of the cluster version capable of being used by the new member.
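The conventional selection rule in this example can be sketched as follows. This is a minimal illustration, assuming dotted version strings and hypothetical node names; `parse_version`, `Group`, and `join` are illustrative and do not correspond to any specific cluster infrastructure API.

```python
def parse_version(text):
    """Turn a dotted version string such as "2.0" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


class Group:
    """Hypothetical group whose cluster version is fixed at creation."""

    def __init__(self, node_versions):
        # Conventional rule: adopt the lowest version installed on any node.
        self._cluster_version = min(node_versions.values(), key=parse_version)

    @property
    def cluster_version(self):
        # Read-only: the version cannot change without recreating the group.
        return self._cluster_version

    def join(self, member_version):
        # A new member is told the group's version and adopts it,
        # regardless of the version the member itself could support.
        return self._cluster_version


# Three nodes at version 2.0 and one node at version 1.0 (node names assumed).
nodes = {"node1": "2.0", "node2": "2.0", "node3": "2.0", "node4": "1.0"}
group = Group(nodes)
print(group.cluster_version)   # 1.0
print(group.join("2.0"))       # 1.0
```

Because the version is computed once in the constructor and exposed only through a read-only property, the sketch mirrors the limitation described above: changing the cluster version means creating a new group.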
The process of upgrading the operating system used in a computer is relatively straightforward and well known. Prior to upgrading an operating system, all applications running on the computer are typically shut down. A new version of the operating system is then installed, and the computer is restarted. The applications are thereafter restarted. When these applications are restarted, they may detect the new version of the operating system, and thereafter utilize any new functions made available by the new version.
Conventional clustered computer systems handle cluster infrastructure software upgrades in a similar manner, and require that all groups end and then restart under the new cluster infrastructure version. However, ending groups for the purpose of upgrading the cluster infrastructure software is inconsistent with a primary goal of a clustered computer system, namely that of maintaining constant availability. Individual nodes may leave or join a cluster at any given time, and the group members residing thereon may leave or join their respective groups while a cluster remains active; nevertheless, the cluster version used by a group cannot be changed until that group is shut down and restarted.
Given the overriding desire to maximize system availability in a clustered computer system, there is a significant need to eliminate as many instances as possible where groups need to be shut down. Therefore, a significant need exists in the art for a manner of upgrading the version of cluster infrastructure software used by a group in a clustered computer system without having to shut down the group.