1. Field of the Invention
The present invention relates to the field of fault-tolerant computer systems. More specifically, the present invention relates to the problem of upgrading a running software process without reducing the level of failure protection provided by redundant copies of the software process during the upgrade. In particular, the present invention avoids introducing a single point of failure during the upgrade of fault-tolerant software that usually employs a single backup copy of the software to provide protection against failure of the primary copy during normal operation.
2. Description of Prior Art
Fault-tolerant computer systems use a variety of techniques to provide highly-available systems for use in safety-critical or mission-critical environments. Many systems use software fault-tolerance, in which redundant copies of a software process are run. One copy is designated the primary copy of the software process; the others are backup copies. The primary copy replicates its internal state to the backup copies so that one of the backup copies can take over as the primary copy if the primary copy fails.
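The primary/backup arrangement described above can be illustrated with a minimal sketch. It is an illustrative model only, not taken from any cited system: the names Copy, RedundantProcess, update and fail_primary are hypothetical, internal state is reduced to a dictionary, and replication is assumed to be synchronous.

```python
from dataclasses import dataclass, field


@dataclass
class Copy:
    """One running copy of the software process (hypothetical model)."""
    name: str
    state: dict = field(default_factory=dict)
    alive: bool = True


class RedundantProcess:
    """A primary copy plus zero or more synchronized backup copies."""

    def __init__(self, primary, backups):
        self.primary = primary
        self.backups = list(backups)

    def update(self, key, value):
        # The primary applies the change, then replicates it to every backup
        # (assumed synchronous for simplicity).
        self.primary.state[key] = value
        for backup in self.backups:
            backup.state[key] = value

    def fail_primary(self):
        # Simulate a primary failure and promote the first backup, if any.
        self.primary.alive = False
        if not self.backups:
            raise RuntimeError("service lost: no synchronized backup available")
        self.primary = self.backups.pop(0)
        return self.primary


# 1:1 redundancy: one primary ("A") and one backup ("B").
proc = RedundantProcess(Copy("A"), [Copy("B")])
proc.update("sessions", 3)
new_primary = proc.fail_primary()
```

After the simulated failure, copy B holds the replicated state and serves as the primary, but no backup remains until a replacement is started and synchronized.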
A key feature of highly-available systems is the ability to replace the running version of a software process without interrupting the service provided by that software process. This can be achieved using software fault-tolerance. For example, in a system that uses one primary copy and one backup copy, known as 1:1 redundancy, the upgrade can be achieved as follows. The backup copy of the software process is stopped and replaced with the new version of the software, which is allowed to synchronize with the primary copy. A failure of the primary copy is then forced so that the new backup copy becomes the primary. A replacement for the old primary copy is then started using the new version of the software. Once this replacement has synchronized with the (new) primary copy, the upgrade is complete and normal operation has been restored using the new version of the software.
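The 1:1 upgrade sequence described above can be traced with a short sketch. This is a simplified model under stated assumptions: versions are represented by the hypothetical labels "v1" and "v2", the function name prior_art_upgrade is illustrative, and the state is recorded only at the end of each step, so the time a newly started copy spends synchronizing is not modelled separately.

```python
def prior_art_upgrade(old="v1", new="v2"):
    """Return (step, primary_version, synchronized_backup_versions) per step."""
    primary, backups = old, [old]
    trace = [("normal operation", primary, list(backups))]

    backups.clear()                  # the old backup copy is stopped
    trace.append(("old backup stopped", primary, list(backups)))

    backups.append(new)              # new-version backup has synchronized
    trace.append(("new backup synchronized", primary, list(backups)))

    primary = backups.pop(0)         # forced failure of the old primary
    trace.append(("forced switchover", primary, list(backups)))

    backups.append(new)              # replacement backup has synchronized
    trace.append(("upgrade complete", primary, list(backups)))
    return trace


trace = prior_art_upgrade()
# Steps after which no synchronized backup copy exists.
unprotected = [step for step, _, backups in trace if not backups]
```

In this model the list unprotected contains the steps "old backup stopped" and "forced switchover", which are the points at which a failure of the then-primary copy would leave no copy able to take over.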
However, in the 1:1 redundancy case this approach to software upgrade compromises the fault-tolerance of the system by introducing a single point of failure from the point when the original backup copy is stopped until the upgrade is complete. If a hardware or software fault encountered during this time causes the then-primary copy of the software process to fail, there is no active and synchronized backup copy that can take over the function of the failed primary. This lack of failure protection may be unacceptable in some environments, such as telecommunications equipment.
The failure protection across an upgrade can be improved by running more than one backup copy, but this requires more system processor and memory resources for the additional backup copies and slows normal operation by requiring replication of internal state to more than one backup.
References
See U.S. Pat. No. 5,751,574, Loebig; U.S. Pat. No. 5,410,703, Nilsson et al.; and U.S. Pat. No. 4,954,941, Redman.
The present invention avoids the reduction in failure protection during software upgrade of a redundant system by dynamically starting an additional backup copy of the new software version as the first step of the upgrade operation. This ensures that the failure protection during the upgrade operation is at least as good as that provided during normal operation. Though the present invention is of most use for 1:1 redundant systems, where failure protection is lost during the upgrade, it may also be applied to systems using more backup copies or, indeed, to a system where no backup copy is provided during normal operation but it is necessary to achieve software upgrades without impacting system operation.
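The effect of starting an additional backup copy first can be illustrated with a minimal sketch for the 1:1 case. Only the first step is taken from the description above; the ordering of the remaining steps, the function name extra_backup_upgrade and the version labels "v1" and "v2" are assumptions made for illustration and are not asserted to be the sequence used by the invention.

```python
def extra_backup_upgrade(old="v1", new="v2"):
    """Return (step, primary_version, synchronized_backup_versions) per step."""
    primary, backups = old, [old]
    trace = [("normal operation", primary, list(backups))]

    # First step, as described above: an additional backup running the new
    # version is started and synchronizes while the old backup stays active.
    backups.append(new)
    trace.append(("additional new backup synchronized", primary, list(backups)))

    # The remaining ordering is an assumption made for this sketch.
    backups.remove(old)              # the old backup copy is stopped
    trace.append(("old backup stopped", primary, list(backups)))

    backups.append(new)              # a second new-version backup synchronizes
    trace.append(("second new backup synchronized", primary, list(backups)))

    primary = backups.pop(0)         # forced failure of the old primary
    trace.append(("forced switchover", primary, list(backups)))
    return trace


trace = extra_backup_upgrade()
# Smallest number of synchronized backups seen at any recorded step.
min_backups = min(len(backups) for _, _, backups in trace)
```

Under these assumptions min_backups is 1, the same level of protection as during normal 1:1 operation, and the sequence ends with a new-version primary and one new-version backup.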
The present invention has the following advantages over the prior art:
The present invention does not compromise the fault-tolerance coverage provided for a software process while that software process is being upgraded to a replacement version.
The present invention does not require use of more than one backup copy of a software process during normal operation, which avoids the performance impact of replicating internal state to more than one backup during normal operation.
The present invention is not tied to any specific hardware or operating system and can be deployed in a heterogeneous distributed computer system.