1. FIELD OF THE INVENTION
The present invention relates to a system and method for automatically converting non-fault tolerant software programs into fault tolerant software programs. The method can generally be implemented on any computer system that supports primary and backup software processes. Since the non-fault tolerant to fault tolerant conversion process can be implemented automatically, the programmer may write non-fault tolerant programs and the user can use the resulting programs in a fault tolerant manner without either being aware of or understanding the techniques and mechanisms used to achieve fault tolerance.
2. DESCRIPTION OF RELATED ART
A fault tolerant program is one whose normal functioning is not disrupted by the failure of a single CPU. One known technique for achieving fault tolerance employs a redundant process pair. The redundant fault tolerant process pair may consist of a primary application process running on one CPU and a backup application process, in standby mode, configured to run on another CPU. The primary process actually does the work that the program is supposed to be doing; the backup process is dormant and does not actually run while the primary process is functioning properly. The backup process simply waits to be notified that the primary process has failed. While the backup process is dormant, checkpointing techniques periodically synchronize the backup process with the primary process. Previously known checkpointing techniques send messages, containing information about changes in the state of the primary process, from the primary process to the dormant backup process. Immediately after each checkpoint, the primary and backup processes are in the same state. Therefore, if the primary process fails, the backup process is started and simply begins executing at the instruction immediately following the most recent checkpoint.
FIG. 1 is a flow chart diagram of an exemplary software program. The specific commands shown in FIG. 1 are not of any particular significance, except to the extent the commands illustrate the methodology for implementing fault tolerance using checkpoints. The left hand side of FIG. 1 illustrates the flow of process or method steps for a primary process. The right hand side of FIG. 1 illustrates, abstractly, a dormant backup process 11 including the same application program code as the primary process, but resident on a different CPU than the primary process. At step 10, the primary process is loaded onto a first processor, a backup process is loaded onto a separate second processor, and the primary process transfers initial checkpointing information to the backup process so that both processes are synchronized. Step 12 represents, generically, a series of process steps that may be carried out by the user program. Of course, these steps would be different for different application programs. They are not directly germane to achieving fault tolerance and, accordingly, these steps are indicated abstractly at reference numeral 14. At step 16, the program executes an input instruction, Read (file 1, K). Subsequently, the primary process executes a checkpoint instruction 18. As previously mentioned, a dormant backup process 11 exists on a separate CPU. The checkpointing command 18 transmits process state information from the primary process on one CPU to the backup process on the other CPU to again synchronize the states of the primary and backup processes. Upon receipt of the state information, the backup process 11 is altered to conform to the state of the primary process. Other than incorporating the new state information, the backup process 11 does not execute any instructions while the primary process continues to function properly.
After the checkpointing instruction 18, the primary process executes subsequent instructions 20, 22 and eventually encounters a third checkpoint command 24. The checkpointing process is repeated so that the primary and backup processes are again synchronized. The arrows at reference numerals 26, 28 and 30 indicate the transmission of checkpointing information to the backup process 11. If the primary process fails at any point, for example at step 22, the backup process re-executes all commands beginning immediately after the most recent checkpoint, in this example checkpoint 18.
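The checkpoint and takeover sequence described above can be illustrated with a minimal Python sketch. It is a single-process simulation under stated assumptions, not the mechanism of any particular system: the names Backup, run and demo, the step list, and the fail_before parameter are all hypothetical, and the "CPU failure" is simulated by simply stopping the primary's loop.

```python
import copy


class Backup:
    """Dormant backup: stores the latest checkpoint and executes nothing."""

    def __init__(self):
        self.state = None
        self.resume_at = 0

    def receive_checkpoint(self, state, resume_at):
        # Conform to the primary's state; remember where to resume.
        self.state = copy.deepcopy(state)
        self.resume_at = resume_at


def run(steps, state, backup, start=0, fail_before=None):
    """Execute steps[start:]; return False if the simulated failure occurs."""
    for i in range(start, len(steps)):
        if i == fail_before:
            return False  # simulated failure of the primary's CPU
        name, action = steps[i]
        if name == "checkpoint":
            backup.receive_checkpoint(state, i + 1)
        else:
            action(state)
    return True


def demo():
    # Hypothetical program: read, checkpoint, compute, write, checkpoint.
    steps = [
        ("read", lambda s: s.update(k=5)),
        ("checkpoint", None),
        ("compute", lambda s: s.update(total=s["k"] * 2)),
        ("write", lambda s: s["log"].append(s["total"])),
        ("checkpoint", None),
    ]
    backup = Backup()
    primary_state = {"log": []}
    backup.receive_checkpoint(primary_state, 0)  # initial synchronization

    # The primary fails just before the write step (index 3).
    finished = run(steps, primary_state, backup, fail_before=3)
    assert not finished

    # Takeover: resume right after the most recent checkpoint.
    new_primary_state = backup.state
    run(steps, new_primary_state, backup, start=backup.resume_at)
    return new_primary_state
```

In this sketch the compute step runs twice, once on the failed primary and once on the backup after takeover, because it lies between the most recent checkpoint and the point of failure.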
Typically, supervisory programming (or the primary process itself) is responsible for creating the backup process and checkpointing the initial state of the primary process to the backup process. Subsequently, the primary process must execute appropriate commands to request system messages about the state of the backup process and its associated CPU. If any of these system messages indicate that the backup process or associated CPU has failed, then the primary process must restart and reinitialize a backup process and checkpoint the entire current state of the primary process. Similarly, when the backup process takes over after failure of the primary, it (as the new primary) must create a new backup process and checkpoint the entire program state.
With traditional fault tolerant programming techniques, the programmer must know where to place checkpoints in the software program and must also know what things to checkpoint. Two basic kinds of information that must be checkpointed are the contents of memory and the state of open files. Memory checkpoints update the memory accessed by the backup process to reflect changes in the state of memory of the primary process since the last checkpoint. File checkpoints do the same for files. That is, file checkpoints put the primary and backup processes at the same logical position in the open file. File checkpoints also synchronize sequence numbers (known as "syncids") that are uniquely associated with each file access. As will be described in greater detail below, syncids are used to detect when a new primary process, which has taken over from a failed original primary process, duplicates a message that was previously issued by the original primary process.
As will be apparent from the foregoing discussion, the basic idea is that each checkpoint contains all modifications to memory and files that have occurred since the last checkpoint. Therefore, the programmer of software designed to run on a traditional system faces numerous difficulties in selecting what information must be checkpointed between the primary and backup processes and in tracking the modifications to the state of the primary process between checkpoints. For complex systems, this can be an extremely difficult and error-prone procedure. Adding to this difficulty is the fact that, for a variety of reasons, there may be modifications to the state of the memory associated with the primary CPU that the code issuing the checkpoint does not know about. These changes are known as "hidden state." Hidden states generally occur when a program calls on the facilities of some preexisting software module that does not make information about changes in state available to the calling procedure. Currently popular software design methodologies favor this type of information hiding because it generally simplifies the design and coding of a software program. Unfortunately, however, hidden states do not mix well with fault tolerant programming. When the execution of such a module alters state information in a way that is not apparent at the interface it presents to the user program, the user program cannot know the state has been modified and, therefore, such modifications cannot be checkpointed between the primary and backup processes.
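The hidden state problem can be made concrete with a short Python sketch. Everything in it is hypothetical: SequenceModule stands in for a preexisting module whose internal counter is not exposed at its interface, and the checkpoint is modeled as a plain copy of the state the calling program can see.

```python
class SequenceModule:
    """Hypothetical library module that hides its internal counter."""

    def __init__(self):
        self._next = 0  # hidden state: not visible at the interface

    def next_id(self):
        self._next += 1
        return self._next


def primary_then_takeover():
    # Primary: obtains two ids, then checkpoints what it can see.
    module_on_primary = SequenceModule()
    visible = {"ids": []}
    visible["ids"].append(module_on_primary.next_id())
    visible["ids"].append(module_on_primary.next_id())
    checkpoint = {"ids": list(visible["ids"])}  # counter is NOT included

    # Primary fails. The backup's copy of the module was never updated,
    # because the calling code could not checkpoint the hidden counter.
    module_on_backup = SequenceModule()
    restored = {"ids": list(checkpoint["ids"])}
    restored["ids"].append(module_on_backup.next_id())
    return restored["ids"]
```

After takeover the module hands out an id that the primary had already used, even though every piece of state visible to the calling program was checkpointed correctly.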
Traditional fault tolerant programming also presents the programmer with the difficulty of how to detect duplicate requests. When a backup process takes over, it resumes execution at the instruction immediately following the most recent checkpoint from the primary process. The backup process cannot know how far past the last checkpoint the primary process was when it failed. For each operation that the primary process completed after the last checkpoint, but before failure, the backup process must either be able to repeat the operation harmlessly, or it must be able to determine that the primary completed that operation. The fault tolerant program must, therefore, ensure that every request the backup process might repeat is either harmless to repeat or can be detected as a duplicate.
One technique for detecting duplicate requests uses syncids. For example, when the primary process fails after a write operation but before a checkpoint, the backup process will take over immediately following the previous checkpoint. Therefore, the write operation will be repeated. However, because of the checkpointing, the syncid of the write request will be the same when the backup executes the write operation as when the primary process first executed it. The disk process will detect that this is a duplicate write request because the write operation will repeat a previously used syncid. Therefore, instead of reexecuting the write command, the disk process will return the same reply to the duplicate request that it did the first time it executed the write operation.
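A minimal Python sketch of this duplicate detection follows. It assumes, for illustration only, that the requester passes the syncid explicitly with each write and that the disk process remembers just the most recent request; the names DiskProcess, write and failover_demo, and the syncid value 7, are hypothetical.

```python
class DiskProcess:
    """Hypothetical disk process that remembers its most recent request."""

    def __init__(self):
        self.blocks = []
        self.last_syncid = None
        self.last_reply = None

    def write(self, syncid, data):
        if syncid == self.last_syncid:
            # Duplicate request: do not write again, replay the saved reply.
            return self.last_reply
        self.blocks.append(data)
        self.last_syncid = syncid
        self.last_reply = ("ok", len(self.blocks))
        return self.last_reply


def failover_demo():
    disk = DiskProcess()
    checkpointed_syncid = 7  # syncid value shared at the last checkpoint

    # Primary writes, then fails before it can checkpoint again.
    first_reply = disk.write(checkpointed_syncid, "record")

    # Backup resumes after the checkpoint and repeats the same write
    # with the same syncid, because that is the value it was given.
    second_reply = disk.write(checkpointed_syncid, "record")
    return first_reply, second_reply, len(disk.blocks)
```

In this sketch the record is stored once, and the backup receives the same reply the primary received.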
"Syncdepth" refers to the number of requests that a process will recognize as a duplicate. If a process supports a syncdepth of one, that means it will recognize a duplicate of the most recent request; syncdepth 2 means that the process will recognize duplicates of the two most recent requests. The syncid is said to "roll over" when a requestor exceeds the syncdepth between checkpoints. Therefore, to maintain fault tolerance, programers must write programs that track the number of calls to every process and then checkpoint the backup process immediately before roll over.
As will be apparent from the above discussion, in any software program, particularly the more complicated programs, there may be many different failure modes. Thus, testing fault tolerant programs can be extremely difficult, complicated and time consuming. As a result, many fault tolerant programs are insufficiently tested. A number of schemes have been suggested to simplify the creation and testing of fault tolerant programs; however, these generally do not solve the fundamental problem: software programmers seeking to create fault tolerant programs must obtain knowledge sufficient to decide what and when to checkpoint, and must design tests that thoroughly explore all of the various failure modes that the program may encounter. As this is difficult, costly and time consuming, there is a great need for a system and process that not only addresses the checkpointing problem, but also makes it unnecessary to test application program code for fault tolerance.