The use of clusters of computer systems is becoming more widespread. As used herein, a cluster refers to a collection of computer systems that can function independently and that are connected by a communications network and optionally by multi-hosted disks.
A desirable attribute of a cluster is that it provides high availability. High availability can be provided by using fault-tolerant systems, but such systems can be expensive. Clusters take advantage of commodity (e.g., off-the-shelf) hardware to provide high availability at a lower cost.
Generally speaking, a cluster is created out of a collection of computer systems running a common cluster operating system. Prior Art FIG. 1 is a block diagram showing the infrastructure of a cluster operating system according to one embodiment of the prior art. Here, the cluster comprises a client node 110 coupled to a server node 120; however, a cluster will typically include many such nodes. The client node 110 and the server node 120 may be co-located on a single computer system or they may reside on different computer systems in communication with each other.
In a cluster operating system such as that illustrated by Prior Art FIG. 1, software is organized into components, typically objects and specifically CORBA (Common Object Request Broker Architecture) objects; however, the objects in a cluster operating system need not be strictly CORBA-compliant. The public interface for each server (implementation) object is provided by an Interface Definition Language (IDL) specification. An IDL compiler generates a unique identifier (a “TypeID”) for each interface. An object and communication subsystem, called the Object Request Broker (ORB), provides the functionality to connect clients and servers. Client access to server objects is via method invocations using CORBA object references. The IDL compiler also generates both server-side and client-side support for method invocations.
On the server side, another software component—the server handler—supports operations on the implementation object. Each object is associated with a handler on the server side and a handler on the client side (the client handler). The handler marshals and unmarshals references to the implementation object, among other functions.
Objects do not change locations; instead, object references are passed from node to node. Another server-side software component—the server xdoor (extended door)—encapsulates information about the server object supported by this xdoor and information about the location of object references on other nodes for this specific server object. Essentially, an xdoor is a mechanism by which a thread in one domain may place a call to an object in another domain.
On the client side, a similar infrastructure is used. A client xdoor encapsulates server location information. A client handler supports infrastructure operations. A proxy acts as a representative of the implementation object on the server. The proxy forwards method invocations to the implementation object on the server.
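The invocation path described above (proxy, client handler, client xdoor, server xdoor, server handler, implementation object) can be sketched as follows. This is a minimal illustration only, not the cluster operating system's actual code: all class and method names are hypothetical, marshaling is omitted, and a direct in-process reference stands in for the server location that a real client xdoor would hold.

```python
class Implementation:
    """Server-side implementation object (hypothetical example service)."""

    def greet(self, name):
        return f"hello, {name}"


class ServerHandler:
    """Server handler: supports operations on the implementation object."""

    def __init__(self, impl):
        self._impl = impl

    def dispatch(self, method, args):
        return getattr(self._impl, method)(*args)


class ServerXdoor:
    """Server xdoor: entry point for calls arriving from another domain."""

    def __init__(self, handler):
        self._handler = handler

    def incoming(self, method, args):
        return self._handler.dispatch(method, args)


class ClientXdoor:
    """Client xdoor: encapsulates server location information."""

    def __init__(self, server_xdoor):
        # Assumption: an object reference stands in for a network location.
        self._server_xdoor = server_xdoor

    def request(self, method, args):
        return self._server_xdoor.incoming(method, args)


class ClientHandler:
    """Client handler: supports infrastructure operations on the client."""

    def __init__(self, xdoor):
        self._xdoor = xdoor

    def invoke(self, method, args):
        return self._xdoor.request(method, args)


class Proxy:
    """Client-side representative of the implementation object."""

    def __init__(self, handler):
        self._handler = handler

    def __getattr__(self, method):
        # Only forward public names; private lookups fail normally.
        if method.startswith("_"):
            raise AttributeError(method)

        def forward(*args):
            return self._handler.invoke(method, args)

        return forward


def make_proxy(impl):
    """Wire the server-side and client-side pieces together."""
    server_xdoor = ServerXdoor(ServerHandler(impl))
    return Proxy(ClientHandler(ClientXdoor(server_xdoor)))
```

A client would then call `make_proxy(Implementation()).greet("node")`, and the call would travel through each layer before reaching the implementation object.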
As mentioned above, a desirable attribute of a cluster is its high availability. To maintain high availability during a software change across the cluster—for example, a change to the cluster operating system—rolling upgrades are commonly used.
A typical rolling upgrade works as follows. One computer system in the cluster is taken out of service and new software is installed. The computer system is then returned to service. The process is repeated for each computer system in the cluster. Thus, clusters can continue to provide high availability during software changes because only one computer system is shut down to install software, leaving other systems in the cluster up and running. On larger clusters that can withstand multiple systems being out of service, multiple computer systems can be upgraded at the same time.
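The rolling upgrade procedure just described can be sketched as a simple loop. The `Node` type, the `rolling_upgrade` function, and the `batch_size` parameter are hypothetical names introduced for illustration; setting a version string stands in for the actual software installation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_service: bool = True


def rolling_upgrade(nodes, new_version, batch_size=1):
    """Upgrade nodes one batch at a time.

    Returns the smallest number of nodes that were in service at any
    point during the upgrade.
    """
    min_in_service = len(nodes)
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        # Take the batch out of service.
        for node in batch:
            node.in_service = False
        min_in_service = min(min_in_service,
                             sum(n.in_service for n in nodes))
        # Install the new software (stand-in: update the version string).
        for node in batch:
            node.version = new_version
        # Return the batch to service before moving on.
        for node in batch:
            node.in_service = True
    return min_in_service
```

With the default `batch_size=1`, a three-node cluster never drops below two nodes in service; a larger `batch_size` corresponds to upgrading several systems at the same time on a cluster that can tolerate it.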
A disadvantage to the rolling upgrade approach is that, until the rolling upgrade is completed, different systems in the cluster will be using different versions of software. This may lead to contention or compatibility issues when the systems try to communicate.
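To make the mixed-version window concrete, the following sketch lists the pairs of systems that would be running different software versions at a given moment. The function name and the mapping of node names to version strings are assumptions made for illustration; the sketch only identifies where a mismatch exists and does not model the resulting contention or compatibility failures.

```python
from itertools import combinations


def mixed_version_pairs(versions):
    """Return pairs of node names whose software versions differ.

    `versions` maps a node name to the version it is currently running.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(versions), 2)
        if versions[a] != versions[b]
    ]
```

For example, partway through a rolling upgrade of three nodes, `mixed_version_pairs({"node1": "2.0", "node2": "1.0", "node3": "1.0"})` would report the two pairs that involve `node1`, and the result is empty again once every node runs the same version.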
Another approach for implementing software changes across a cluster involves de-activating (quiescing) the software components to be changed, substituting new software components, and then resuming activities with the new software components. However, a quiescing operation needs to overcome issues such as time constraints for completing the software change and deadlock resulting from attempts to use the quiescent software. This type of approach also has to resolve a wide range of both inter-node and intra-node interactions, and does not address the issues associated with the use of different versions of software on different computer systems.
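The quiesce, substitute, and resume sequence can be sketched on a single node as follows. This is an illustrative sketch under stated assumptions, not a description of any particular prior-art system: the class and method names are hypothetical, calls made while the component is quiescent are rejected rather than blocked (one possible way to sidestep deadlock), and a timeout models the time constraint on completing the change.

```python
import threading


class QuiescableComponent:
    """Wraps a callable so that it can be swapped while no call is in flight."""

    def __init__(self, impl):
        self._impl = impl
        self._cond = threading.Condition()
        self._active = 0        # number of calls currently in flight
        self._quiesced = False

    def call(self, *args):
        with self._cond:
            if self._quiesced:
                # Assumption: reject instead of blocking, so a caller
                # never waits on a component that is itself waiting.
                raise RuntimeError("component is quiesced")
            self._active += 1
            impl = self._impl
        try:
            return impl(*args)
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify_all()

    def replace(self, new_impl, timeout):
        """Quiesce, substitute new_impl, and resume.

        Returns True if the swap happened, or False if in-flight calls
        did not drain within `timeout` seconds.
        """
        with self._cond:
            self._quiesced = True
            drained = self._cond.wait_for(
                lambda: self._active == 0, timeout)
            if drained:
                self._impl = new_impl
            self._quiesced = False   # resume with old or new software
            return drained
```

Even in this simplified form, the sketch handles only intra-node coordination; it does nothing about other nodes that may still be running a different version of the component.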
Thus, a method and/or system for implementing software changes across a cluster (or network) of computer systems that does not engender the problems described above, or that is able to overcome those problems, would be advantageous. The present invention provides a novel solution to those problems.