The present invention relates generally to fault-tolerant data processing architectures that use pairs of processes to continue operation in the face of failure of a process or a processor on which a process is running. More particularly, the invention relates to replacement of process pairs with a different version without having to stop operation of the system or the process pair.
Today's computing industry includes the concept of continued availability, promising a processing environment can be ready for use 24 hours a day, 7 days a week, 365 days a year. This promise is based upon a variety of fault tolerant architectures and techniques, among them being the clustered multiprocessor architectures and paradigms described in U.S. Pat. Nos. 4,817,091 and 5,751,932 to detect and continue in the face of errors or failures, or to quickly halt operation before the error can spread.
The quest for enhanced fault tolerant environments has resulted in the development of the "process pair" technique, described in both of the above identified patents. Briefly, according to this technique, application software ("process") may run on the multiple processor system ("cluster") under the operating system as "process-pairs" that include a primary process and a backup process. The primary process runs on one of the processors of the cluster while the backup process runs on a different processor, and together they introduce a level of fault-tolerance into the execution of an application program. Instead of running as a single process, the program runs as two processes, one in each of the two different processors of the cluster. If one of the processes or processors fails for any reason, the second process continues execution with little or no noticeable interruption of service. The backup process may be active or passive. If active, it will actively participate in receiving and processing periodic updates to its backup state in response to checkpoint messages from the corresponding primary process of the pair. If passive, the backup process may do nothing more than receive the updates, and see that they are stored in locations that match the locations used by the primary process. The content of a checkpoint message can take the form of a complete state update, or one that communicates only the changes from the previous checkpoint message. Whatever method is used to keep the backup up-to-date with its primary, the result should be the same so that in the event the backup is called upon to take over operation in place of the primary, it can do so from the last checkpoint before the primary failed or was lost.
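The two checkpoint styles mentioned above (a complete state update versus only the changes since the previous checkpoint) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names `Primary` and `Backup`, the dictionary-based state, and the message layout do not come from the referenced patents, and a passive backup is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical passive backup: it only stores what the primary checkpoints."""
    state: dict = field(default_factory=dict)

    def on_checkpoint(self, message: dict) -> None:
        if message["kind"] == "full":
            # Complete state update: replace the stored copy entirely.
            self.state = dict(message["state"])
        elif message["kind"] == "delta":
            # Incremental update: apply only what changed since the last checkpoint.
            self.state.update(message["changed"])
            for key in message["removed"]:
                self.state.pop(key, None)
        else:
            raise ValueError(f"unknown checkpoint kind: {message['kind']!r}")


@dataclass
class Primary:
    """Hypothetical primary that remembers what it last sent to its backup."""
    state: dict = field(default_factory=dict)
    _last_sent: dict = field(default_factory=dict)

    def full_checkpoint(self) -> dict:
        self._last_sent = dict(self.state)
        return {"kind": "full", "state": dict(self.state)}

    def delta_checkpoint(self) -> dict:
        # Keys that are new or whose value differs from the last checkpoint.
        changed = {k: v for k, v in self.state.items()
                   if k not in self._last_sent or self._last_sent[k] != v}
        # Keys that existed at the last checkpoint but are gone now.
        removed = [k for k in self._last_sent if k not in self.state]
        self._last_sent = dict(self.state)
        return {"kind": "delta", "changed": changed, "removed": removed}
```

Whichever message kind is used, the backup ends up holding the same state as the primary at the moment of the checkpoint, which is the property the takeover relies on.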
Unfortunately, there are times when the application program, and therefore its process pair instantiations, must be updated, modified, and/or changed. When this occurs, it has been the practice to stop the system or stop the application, thereby diminishing the availability of the system. Thus, in order to provide continued availability in the face of changes and upgrades to the application software, there is needed a method of performing updates on-line; that is, without stopping the system or the application program being updated (assuming the application program is running as a process pair).
The present invention takes advantage of the fault-tolerant capability of process pairs, i.e., the continued capacity to provide service as long as at least one member of the pair exists. Thus, even if one member of the pair fails or is lost, the other remains in service and available for use in the on-line replacement approach of the present invention. The invention therefore provides a simple and inexpensive method of replacing software that uses the process pair model.
Broadly, the invention involves purposely stopping one of the processes, restarting the stopped process with a different version of the application program, and then repeating the action for the other process. In this manner, the code being executed is updated without ever stopping the process pair, since at least one of the two processes is always available.
According to a preferred embodiment of the invention, the method proceeds generally as follows. First, the primary process of the process pair to be replaced is notified to begin the replacement and told where the replacement application is located. The primary process responds to this request by first sending its backup process a request that the backup process stop operation. When the backup process has stopped, a new backup process (the "replacement backup process") is created from the replacement application. Since the primary process and the replacement backup process are derived from different application programs, the code they are executing is different. Accordingly, the primary and replacement backup processes perform a handshake routine to check that the two code versions are sufficiently compatible to continue with and complete the replacement. If the check confirms sufficient compatibility, the replacement backup process is then updated by the primary process with all the state necessary for it to take over the function and operation of the primary. The roles of the replacement backup and the primary processes are then switched so that the replacement backup process becomes the primary process, and the former primary becomes the backup process. During this phase of the replacement, the primary process, and now the replacement backup process, have functioned normally to respond to requests and perform whatever operations were expected, as if the replacement had never been started.
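The text does not say how the handshake decides that two code versions are "sufficiently compatible". As one hedged possibility only, the sketch below assumes each version advertises the set of checkpoint formats it can both emit and read, and treats the versions as compatible when they share at least one format. The names `CodeVersion`, `checkpoint_formats`, and `handshake` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class CodeVersion:
    """Hypothetical description of one build of the application."""
    name: str
    checkpoint_formats: FrozenSet[int]  # formats this build can emit and read (assumed)


def handshake(primary: CodeVersion, replacement_backup: CodeVersion) -> Optional[int]:
    """Return the newest checkpoint format both sides support, or None.

    Because the roles are switched later, each side must be able to read
    checkpoints produced by the other, so a shared format is required.
    """
    common = primary.checkpoint_formats & replacement_backup.checkpoint_formats
    return max(common) if common else None
```

With this assumed scheme, a `None` result means the replacement should not proceed; any integer result names the format the pair would use for the state update that follows.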
The steps described above are then repeated, this time with the replacement backup process (now acting as primary) performing the functions described above for the primary, and the original primary process functioning as the backup. The repetition concludes with replacement of the original primary process with a replacement primary process, yielding a process pair that corresponds to the replacement application.
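The overall two-pass sequence can be sketched as a single-threaded simulation. This is an illustration under stated assumptions, not the patented implementation: `Process`, `ProcessPair`, `half_replace`, and `replace_online` are hypothetical names, the compatibility check is passed in as a caller-supplied function, and what happens after a failed handshake (which this passage does not describe) is reduced to raising an error while the primary keeps running.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Process:
    version: str
    state: dict = field(default_factory=dict)
    running: bool = True


@dataclass
class ProcessPair:
    primary: Process
    backup: Process

    def half_replace(self, new_version: str,
                     compatible: Callable[[str, str], bool]) -> None:
        """Replace the current backup with new code, then switch roles."""
        # 1. The primary asks its backup to stop.
        self.backup.running = False
        # 2. A replacement backup is created from the replacement application.
        replacement = Process(version=new_version)
        # 3. Handshake: check that the two code versions can work together.
        if not compatible(self.primary.version, replacement.version):
            replacement.running = False
            # Recovery is not covered here; the primary simply keeps serving.
            raise RuntimeError("versions incompatible; replacement abandoned")
        # 4. The primary brings the replacement backup fully up to date.
        replacement.state = dict(self.primary.state)
        self.backup = replacement
        # 5. Roles are switched: the replacement becomes the primary.
        self.primary, self.backup = self.backup, self.primary


def replace_online(pair: ProcessPair, new_version: str,
                   compatible: Callable[[str, str], bool]) -> None:
    """Run the sequence twice so both members end up on the new version."""
    pair.half_replace(new_version, compatible)   # replaces the original backup
    pair.half_replace(new_version, compatible)   # replaces the original primary
```

At every point in the simulation the process held in `pair.primary` is running, which mirrors the claim that the pair as a whole is never stopped.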
Since these steps may be readily executed by the processes themselves, and are difficult or impossible to execute from an external process on behalf of the process pair, a utility (command facility) assumes the responsibility for relaying a user command to start an on-line replacement to the target process pair. The command facility operates to respond to a user command to initiate the replacement by simply sending a message to the primary process of the pair to be replaced, requesting the replacement, and awaiting its completion. The actual replacement, however, is left to the involved process pair. Consequently, in addition to steps 1 through 5, the process pair must recognize the request from the command facility, and respond to the command facility concerning the success of the replacement.
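The division of labor between the command facility and the process pair can be sketched with in-process queues standing in for interprocess messages. The message fields (`type`, `location`, `reply_to`), the function names, and the use of Python's `queue` module are all assumptions made for illustration; the source does not specify a message format or transport.

```python
import queue
from typing import Callable


def command_facility(primary_inbox: queue.Queue, replacement_location: str,
                     timeout: float = 5.0) -> bool:
    """Relay a user's replacement command to the primary and await the outcome."""
    reply_box: queue.Queue = queue.Queue()
    primary_inbox.put({
        "type": "REPLACE",                 # hypothetical message type
        "location": replacement_location,  # where the replacement application lives
        "reply_to": reply_box,
    })
    # The facility does none of the replacement work itself; it only waits.
    reply = reply_box.get(timeout=timeout)
    return reply["success"]


def primary_serve_one(inbox: queue.Queue,
                      do_replacement: Callable[[str], bool]) -> None:
    """Primary side: recognize the request, replace, and report back."""
    request = inbox.get()
    if request["type"] == "REPLACE":
        success = do_replacement(request["location"])
        request["reply_to"].put({"success": success})
```

In this sketch `do_replacement` stands for whatever the process pair does internally; the command facility learns only whether it succeeded.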
In an alternate embodiment of the invention, the steps of switching the roles of the primary and replacement backup processes and repeating the sequence to then replace the primary process are not used. Rather, the (old) primary process commits suicide after the replacement backup process is in place and updated. Loss of one member of the process pair will cause the replacement backup process to assume the role of the primary process. It will, according to conventional protocol, create a backup for itself, using the replacement application. This method of replacement is faster and less complex. Since only one request need be sent to the process pair to initiate replacement, the process pair would not go through two switches unless specifically designed to do so. On the other hand, the advantage of keeping the steps of switching roles and repeating the replacement process is twofold. First, the two-step method allows a break in the replacement where a user can try out the new version and then choose to back out of the replacement if any problems exist. Second, a switch allows the new primary to stabilize before starting the second half of the replacement, in which it will stop its backup and therefore lose its safety net, whereas having the old primary commit suicide leaves the new primary without a safety net immediately.
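For comparison, the alternate embodiment can be sketched in the same simulated style. The names `Process` and `replace_by_primary_exit` are hypothetical, the compatibility handshake is omitted for brevity, and the takeover and backup creation that a real system would perform through its conventional failure-handling protocol are collapsed into two ordinary statements.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class Process:
    version: str
    state: dict = field(default_factory=dict)
    running: bool = True


def replace_by_primary_exit(primary: Process, backup: Process,
                            new_version: str) -> Tuple[Process, Process]:
    """Single-pass replacement: no role switch, no second pass."""
    # The old backup is stopped and a replacement backup is created and updated.
    backup.running = False
    replacement = Process(version=new_version, state=dict(primary.state))
    # The old primary deliberately terminates itself.
    primary.running = False
    # Loss of the primary makes the replacement backup take over as primary.
    new_primary = replacement
    # Until the next statement runs, the new primary has no backup at all.
    new_backup = Process(version=new_version, state=dict(new_primary.state))
    return new_primary, new_backup
```

The comment before the last assignment marks the window the text warns about: the new primary starts serving on untried code with no safety net until its own backup exists.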