In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users. At the same time, the cost of computing resources has consistently declined, so that information which was too expensive to gather, store, and process a few years ago is now economically feasible to manipulate via computer. The reduced cost of information processing drives increasing productivity in a snowballing effect, because product designs, manufacturing processes, resource scheduling, administrative chores, and many other tasks are made more efficient.
Early computer systems were isolated machines, in which data was input manually or from storage media, and output was generated to storage media or in human-perceptible form. While useful in their day, these systems were extremely limited in their ability to access and share information. As computers became more capable, and the ability to store vast amounts of digital data became prevalent, the desirability of communicating with other computer systems and sharing information became manifest. This demand for sharing information led to a growth of computer networks, including the Internet. It is now rare to find a general purpose computer system having no access to a network for communicating with other computer systems, although many special-purpose digital devices still operate in isolated environments.
This evolution of isolated computers to networked devices and shared information has proceeded to cloud computing and clustering. A “cloud” and a “cluster” are related, not necessarily mutually exclusive, vehicles for networked computing. A “cloud” is a collection of computing hardware and software resources which are accessible on demand from a remote location to perform useful work on behalf of a client. The client contracts to obtain virtualized computing services from a cloud provider, without any specification of the particular physical computer systems which will provide the contracted service. This virtualization enables a cloud provider to re-allocate the physical computer resources as convenient, without involvement of the client. Cloud computing has thus been analogized to an electric utility, in which the customer purchases electric power without any knowledge of, or concern for, how the power is generated.
A “cluster” generally refers to a computer system organization in which multiple computers, also called “nodes”, are networked together to cooperatively perform computing tasks. An important aspect of a computer cluster is that all of the nodes in the cluster present a single system image; that is, from the perspective of a user, the nodes in a cluster appear collectively as a single computer, or entity.
Clustering is often used in relatively large multi-user computer systems where high performance and/or reliability are of concern. For example, clustering may be used to provide redundancy, or fault tolerance, so that, should any node in a cluster fail, the operations previously performed by that node will be handled by other nodes in the cluster. Clustering is also used to increase overall performance, since multiple nodes can often handle a larger number of tasks in parallel than a single computer otherwise could. Often, load balancing can also be used to ensure that tasks are distributed fairly among nodes, preventing individual nodes from becoming overloaded and thereby maximizing overall system performance. One specific application of clustering, for example, is in providing multi-user access to a shared resource such as a database or a storage device, since multiple nodes can handle a comparatively large number of user access requests, and since the shared resource is typically still available to users even upon the failure of any given node in the cluster.
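By way of illustration only, the load balancing and fault tolerance described above can be sketched in a few lines of Python. The function names `assign` and `fail_over`, the least-loaded placement rule, and the representation of a cluster as a dictionary of task lists are all assumptions made for this sketch; real cluster software may balance load and recover from failures quite differently.

```python
from typing import Dict, List


def assign(tasks: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Give each task to the node that currently holds the fewest tasks.

    Hypothetical least-loaded policy; ties go to the node listed first.
    """
    placement: Dict[str, List[str]] = {node: [] for node in nodes}
    for task in tasks:
        target = min(nodes, key=lambda n: len(placement[n]))
        placement[target].append(task)
    return placement


def fail_over(placement: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new placement in which surviving nodes take over the failed node's tasks."""
    survivors = [node for node in placement if node != failed]
    if not survivors:
        raise RuntimeError("no surviving node can take over the work")
    result = {node: list(placement[node]) for node in survivors}
    for task in placement[failed]:
        target = min(survivors, key=lambda n: len(result[n]))
        result[target].append(task)
    return result


placement = assign(["t1", "t2", "t3", "t4"], ["n1", "n2"])
after_failure = fail_over(placement, "n1")
```

In this sketch, the tasks held by the failed node are simply redistributed among the survivors, so the work remains available at the cost of a heavier load on each remaining node.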
Clusters typically handle computing tasks through the performance of “jobs” or “processes” within individual nodes. In some instances, jobs being performed by different nodes cooperate with one another to handle a computing task. Such cooperative jobs are typically capable of communicating with one another, and are typically managed in a cluster using a logical entity known as a “group.” A group is typically assigned some form of identifier, and each job in the group is tagged with that identifier to indicate its membership in the group.
Member jobs in a group typically communicate with one another using an ordered message-based scheme, where the specific ordering of messages sent between group members is maintained so that every member sees messages sent by other members in the same order as every other member, thus ensuring synchronization between nodes. Requests for operations to be performed by the members of a group are often referred to as “protocols”, and it is typically through the use of one or more protocols that tasks are cooperatively performed by the members of a group. One type of protocol, for example, is a membership change protocol, which is used to update the membership of a particular group of member jobs, e.g., when a member job needs to be added to or removed from the group.
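The ordered messaging and membership change behavior described above can be pictured with a small in-process sketch. It assumes a single sequencer that stamps every message with an increasing number before delivery, which is only one of several ways to obtain a common order; the class names `Group` and `Member`, the `JOIN` and `LEAVE` payloads, and the method names are hypothetical and are not taken from any particular cluster product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Member:
    """A job in a group; records (sequence, sender, payload) in delivery order."""
    name: str
    delivered: List[Tuple[int, str, str]] = field(default_factory=list)

    def deliver(self, seq: int, sender: str, payload: str) -> None:
        self.delivered.append((seq, sender, payload))


class Group:
    """Hypothetical group that orders all messages through one sequence counter."""

    def __init__(self, group_id: str) -> None:
        self.group_id = group_id  # identifier that marks membership in the group
        self.members: Dict[str, Member] = {}
        self._next_seq = 0

    def _broadcast(self, sender: str, payload: str) -> int:
        # Stamp the message once, then deliver it to every current member,
        # so that all members observe the same sequence of messages.
        seq = self._next_seq
        self._next_seq += 1
        for member in self.members.values():
            member.deliver(seq, sender, payload)
        return seq

    def send(self, sender: str, payload: str) -> int:
        if sender not in self.members:
            raise ValueError(f"{sender} is not a member of {self.group_id}")
        return self._broadcast(sender, payload)

    def join(self, member: Member) -> None:
        # Membership change is itself an ordered message.
        self.members[member.name] = member
        self._broadcast(member.name, "JOIN")

    def leave(self, name: str) -> None:
        self._broadcast(name, "LEAVE")
        del self.members[name]


group = Group("g1")
a, b = Member("a"), Member("b")
group.join(a)
group.join(b)
group.send("a", "x")
group.send("b", "y")
```

Because membership changes pass through the same sequence as ordinary messages, every member that is present for a given change sees it at the same point in the message stream. A distributed implementation would need to obtain this ordering without a single in-process counter.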
Clustered computer systems also typically rely on some form of cluster infrastructure software that is resident on each node in the cluster, and that provides various support services that group members utilize in connection with performing tasks. Cluster infrastructure may be integrated with an operating system, or may be a separate software product or module executing above low-level operating system kernel functions such as dispatching and address translation, but it typically executes at a level below the applications it supports. It manages the execution of group members and provides a programming interface through which jobs can invoke various cluster-related support functions (e.g., to pass messages between group members, to change group membership, etc.).
Cluster infrastructure software, as well as other software executing in a cluster, may be upgraded from time to time, each release of new software being associated with a “version” that distinguishes it from prior releases of the same software. Upgrades may be released to correct errors, security exposures, and the like in the software, or to add new functions and capabilities. Upgrading software in a cluster generally requires that the upgraded software version be installed on each individual system or “node” of the cluster.
For various reasons of consistency, a cluster may be architecturally constrained to execute a single respective common version of each software application or cluster infrastructure software installed on the multiple systems of the cluster, or to disable newly added functions and capabilities until the function or capability is available on a sufficient number of systems of the cluster. It is desirable to manage upgrades so that, when a software upgrade becomes available, all systems of the cluster are upgraded to a common version. However, an upgrade is often a disruptive process, in which the functions provided by the software are temporarily unavailable to users. If all systems of a cluster are simultaneously halted, suspended or otherwise interrupted while new software is loaded, the cluster's functions will be unavailable for some period of time. This is often unacceptable to a business operating or using a cluster which is intended to be continuously available.
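The version constraint described above can be made concrete with a brief sketch. It assumes that versions are comparable `(major, minor)` tuples, that the common version of a cluster is the lowest version installed on any of its systems, and that a new function is enabled once a configurable number of systems can support it; the function names `common_version` and `feature_enabled` are hypothetical.

```python
from typing import Dict, Tuple

Version = Tuple[int, int]  # assumed (major, minor) representation


def common_version(node_versions: Dict[str, Version]) -> Version:
    """Highest version that every system can run: the minimum installed version."""
    return min(node_versions.values())


def feature_enabled(introduced_in: Version,
                    node_versions: Dict[str, Version],
                    required_nodes: int) -> bool:
    """Enable a function only when enough systems run a version that provides it."""
    capable = sum(1 for version in node_versions.values() if version >= introduced_in)
    return capable >= required_nodes


versions = {"n1": (2, 0), "n2": (2, 1), "n3": (2, 1)}
```

With these example values, the cluster as a whole remains at version (2, 0) until the last system is upgraded, while a function introduced in version (2, 1) is enabled if two capable systems suffice and remains disabled if all three are required.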
It is possible to centrally manage the software upgrade process by upgrading systems one at a time, or at least not all simultaneously, in such a manner that, at any given time, one or more systems are available to provide essential functions, and to switch operation to the upgraded software version when it has been loaded on a sufficient number of systems to provide essential functions (this number being referred to as a “quorum”). However, due to interleaved dependencies among multiple software products, this process can be very complex, and any central upgrade manager must be correspondingly complex. The upgrade manager will typically itself need to be upgraded from time to time to take into account all software dependencies. Furthermore, as a result of multiple dependencies, central management may have the collateral effect of delaying an upgrade, or of causing unneeded software to be loaded and/or upgraded in some systems merely to support the upgrade of another software product.
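A deliberately simplified sketch of such a one-at-a-time upgrade follows. It assumes integer version numbers, a single software product, and a fixed upgrade order, and it ignores the software dependencies that make the real process complex; the function name `rolling_upgrade` and the log messages are hypothetical.

```python
from typing import Dict, List


def rolling_upgrade(nodes: Dict[str, int], new_version: int, quorum: int) -> List[str]:
    """Upgrade systems one at a time and switch versions once a quorum is reached.

    `nodes` maps a system name to its installed version and is updated in place.
    Returns a log of the events in the order they occurred.
    """
    if not 1 <= quorum <= len(nodes):
        raise ValueError("quorum must be between 1 and the number of systems")

    log: List[str] = []
    upgraded = 0
    switched = False
    for name in sorted(nodes):
        if nodes[name] < new_version:
            # Only this one system is unavailable while it is upgraded.
            log.append(f"offline {name}")
            nodes[name] = new_version
            log.append(f"online {name}")
        upgraded += 1
        if not switched and upgraded >= quorum:
            # Enough systems now hold the new version to provide essential functions.
            log.append(f"switch to v{new_version}")
            switched = True
    return log


cluster = {"a": 1, "b": 1, "c": 1}
events = rolling_upgrade(cluster, new_version=2, quorum=2)
```

In this sketch at most one system is offline at any moment, and operation switches to the new version as soon as the quorum of two upgraded systems is reached, before the third system has been upgraded.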
Conventional multi-system management tools are overly complex and/or do not always optimally manage upgrade processes in a clustered or other complex multi-server environment. With the growth in clustering and other forms of shared and distributed use of computing resources, a need exists for improved techniques for managing software upgrades among multiple systems, and in particular, for managing upgrading of multiple computer systems of a continuous availability cluster consistent with all software dependencies.