The invention relates to a method and mechanism for performing rolling software upgrades to a distributed computing system.
Over time, many types of software applications will undergo some sort of change. These changes may occur for a variety of reasons. For example, given the complexity of modern software applications, it is well known that most software contains coding errors or “bugs” that will need correction. One reason to upgrade from an earlier version to a later version of a software application is to correct errors or bugs that may exist in the earlier version. Another reason for changing a software application is to introduce improvements to the operation or functionality of the software application.
A “rolling upgrade” refers to the process of performing software upgrades to a live existing software installation in a distributed environment in which the individual instances, nodes, or entities of the distributed system (referred to herein as “members”) are upgraded in a staggered manner. This form of upgrade ensures availability of the application during software upgrades, and thus minimizes or eliminates planned downtime while contributing to high availability goals. As used herein, the term member may encompass either a single instance/node/entity or a collection of such instances/nodes/entities.
At each member, there are numerous ways to upgrade a software application from an earlier version to a later version. A common approach is for a software developer to create patches and patch sets that are applied to a copy of the software binary or executable. Another common approach is to create a new object having the same location reference. Tools are often provided to perform the software upgrades or installations.
Performing an upgrade or change to an existing software application typically requires a shutdown of the member, the software enterprise, or both. For example, the upgrade can be performed by shutting down the software, implementing the upgrade, and then bringing the member back up so that availability is restored.
With modern software, it can be anticipated that software developers will provide upgrades and changes on an ongoing basis. In fact, many IT (“information technology”) departments will periodically schedule planned events to perform upgrades to their software installations. These events could result in significant planned downtimes. It is desirable to limit the effects of these downtimes as much as possible since they could affect the availability of mission critical systems, potentially resulting in productivity and financial losses for organizations.
If the system being upgraded is a distributed system having multiple independent members where the software is located in the members' local directories, then in one approach, the upgrade can be performed individually at each member so that other members do not suffer downtime while the affected member is being upgraded. However, problems arise with this approach if it is implemented in networked and shared filesystem environments in which multiple members operate with the same shared software installations. Examples of this type of configuration include multiple members accessing the same software installation on a shared filesystem using the NFS (network file system) mechanism, or on a cluster file system such as the Oracle Cluster File System (OCFS) available from Oracle Corporation of Redwood Shores, Calif. With this type of architecture, since the application files are shared, performing a rolling upgrade could result in all members being shut down during the upgrade process, resulting in total unavailability of the systems during the downtime.
For operating system (OS) upgrades, one approach for handling this is provided in the Tru64/TruCluster system, which offers OS-level support for performing rolling upgrades on its Cluster File System. The TruCluster model uses tagged files and kernel parameters to support multi-versioning and version switching. However, this approach may result in inefficient performance, involving 2n−1 reboots of the networked members when performing the OS upgrades, where n is the number of members being upgraded.
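To make the reboot count concrete, the following minimal sketch simply evaluates the 2n−1 expression stated above for a few cluster sizes. The function name `reboot_count` is hypothetical and is used for illustration only; the sketch does not model how the TruCluster system schedules its reboots.

```python
def reboot_count(n: int) -> int:
    """Return 2n - 1, the reboot count stated above for n upgraded members.

    The name and signature are illustrative assumptions, not part of any
    TruCluster tool or interface.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * n - 1


if __name__ == "__main__":
    # Evaluate the expression for a few example cluster sizes.
    for members in (1, 2, 4, 8):
        print(f"{members} member(s): {reboot_count(members)} reboot(s)")
```

For example, a four-member cluster would incur seven reboots under this expression.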
Therefore, to address these and other problems, what is described herein is an improved method and mechanism for performing rolling upgrades, e.g., to shared software installations in a distributed environment. The present approach eliminates or minimizes extraneous downtime when performing a rolling upgrade, thereby improving performance and availability for users of the shared software installation. In one embodiment, a rolling upgrade is performed by defining a private symbolic link for each member that is upgraded to reference the upgraded version of the shared software installation. This approach can be performed upon any computing system, whether a single-node system (e.g., a multi-instance application on a single computer) or a multi-node system (e.g., a cluster or network of stations).
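The private symbolic link idea can be illustrated with a minimal sketch, assuming a POSIX filesystem in which each member resolves the shared installation through its own link. The directory names (`app-1.0`, `app-2.0`), the member names, and the functions `switch_private_link` and `demo` are all hypothetical; stopping and restarting a member is indicated only by comments, and this is not presented as the implementation of the embodiment.

```python
import os
import tempfile


def switch_private_link(link_path: str, new_target: str) -> None:
    """Repoint one member's private symlink at new_target.

    A temporary link is created and then renamed over the old one, so the
    member's link always resolves to either the old or the new version.
    """
    tmp_link = link_path + ".new"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # replaces the old link itself


def demo() -> list:
    """Upgrade three hypothetical members one at a time.

    Returns, for each step, the version every member's link points to.
    """
    with tempfile.TemporaryDirectory() as shared_root:
        old_version = os.path.join(shared_root, "app-1.0")
        new_version = os.path.join(shared_root, "app-2.0")
        os.mkdir(old_version)
        os.mkdir(new_version)

        # Each member starts with a private link to the old version.
        links = []
        for name in ("member1", "member2", "member3"):
            link = os.path.join(shared_root, name)
            os.symlink(old_version, link)
            links.append(link)

        history = []
        for link in links:
            # A real system would stop only this member here (assumption).
            switch_private_link(link, new_version)
            # ...and restart it here, while the other members keep running.
            history.append([os.path.basename(os.readlink(l)) for l in links])
        return history


if __name__ == "__main__":
    for step, versions in enumerate(demo(), start=1):
        print(f"after step {step}: {versions}")
```

In this sketch both versions coexist in the shared location, and only the member whose link is being switched is affected at any given step; the remaining members continue to resolve whichever version their own links reference.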
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.