Many computer-controlled systems, such as telecommunication switching systems, aviation air traffic control, and banking and financial services, impose stringent reliability and data availability constraints on computer platforms. Many of these applications require a system to be available 24 hours a day, 7 days a week (24×7 system availability). However, the 24×7 requirement cannot be achieved by hardware fault tolerance alone. The software for these applications is usually very complex and, as a result, likely to contain faults or bugs.
Updating software programs to fix bugs or to add new features is a routine aspect of the evolution of software-controlled systems. Traditionally, the availability and reliability of computer-controlled applications have been improved either by tolerating software faults on-line or by taking the system off-line to remove software faults.
Even if a computer program has been designed to be bug-free and is in fact bug-free, updating the program to add new features or to accommodate new hardware can affect system reliability. Without an on-line update mechanism, an application process typically has to be shut down during a software update and cannot render the services provided by the application while it is down. As a result, system availability might be lost during a software update.
A challenge in achieving 24×7 system availability is to provide the ability to perform on-line software updates so that the services provided by the application program need not be interrupted while the software update is in progress. A number of checkpointing libraries and tools exist that can checkpoint data between two processes. However, these tools assume that the data structures in two different versions of a computer program are identical; hence, they cannot be used to update software where a later software version uses data structures different from those of an earlier version.
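The limitation of layout-blind checkpointing noted above can be illustrated with a minimal sketch. The record layouts, field names, and function names below are hypothetical, invented only for this illustration; they are not taken from any particular checkpointing library. The old version stores a 16-bit count, while the new version is assumed to widen the count to 32 bits and add a flags field, so bytes checkpointed by the old version cannot simply be reinterpreted by the new one.

```python
import struct

# Hypothetical layouts, invented for illustration only.
# Old version: id (int32), count (uint16) -> 6 bytes.
V1_RECORD = struct.Struct("<iH")
# New version: id (int32), count widened to uint32, new flags (uint16) -> 10 bytes.
V2_RECORD = struct.Struct("<iIH")


def checkpoint_v1(record_id, count):
    """Serialize the old version's state as raw bytes, as a layout-blind tool would."""
    return V1_RECORD.pack(record_id, count)


def naive_restore_v2(blob):
    """Reinterpret the bytes with the new layout, assuming both layouts are identical."""
    return V2_RECORD.unpack(blob)


def converting_restore_v2(blob, default_flags=0):
    """Decode with the old layout, then map each field into the new layout."""
    record_id, count = V1_RECORD.unpack(blob)
    return (record_id, count, default_flags)


blob = checkpoint_v1(42, 7)

try:
    naive_restore_v2(blob)
    naive_ok = True
except struct.error:
    # The 6-byte checkpoint does not match the 10-byte new layout.
    naive_ok = False

converted = converting_restore_v2(blob)
```

In this sketch the naive restore fails outright because the sizes differ. When two layouts happen to have the same size but different field meanings, a naive restore could instead succeed silently with wrong values, which is arguably worse. The converting restore shows the extra per-field knowledge that a version-aware transfer would need.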