The present invention relates to a checkpoint acquisition accelerating apparatus suitably applicable to a computer including a cache memory having a snoop function for maintaining data coherency, and to a computer system with a checkpoint recovery mechanism using such an apparatus.
This application is based on Japanese Patent Application No. 08-234321, filed Sep. 4, 1996, the content of which is incorporated herein by reference.
In order to improve the reliability of a computer system, a checkpoint is acquired in a main memory at regular intervals during normal data processing, and if the computer detects a fault, normal data processing is resumed by rolling back to the most recent checkpoint. This method is called the checkpoint/recovery method and can be roughly classified into the three types described below.
(1) A method used mainly with a database management system using two computers, in which, if one of the computers goes out of order, the other computer takes over the database processing in order to prevent loss of data and maintain data integrity.
(2) A method in which an application program is executed in duplicate on different computers as a primary process and a shadow process, respectively. If the primary process goes out of order because of a hardware failure, the shadow process takes over the role of the primary process. To the user, the execution of the application program appears to continue without interruption.
(3) A method in which, if a fault occurs in a computer, the computer manages to avoid running out of order. The fault is substantially transparent to the user, and the application program seems to be executed as if the fault had not occurred.
According to the third checkpoint/recovery method (3), normal data processing is resumed from the most recent checkpoint, and therefore the checkpoint must be stored in a memory that survives a fault. Such a memory is called a stable memory; a duplicated memory is one example.
In a computer of the third checkpoint and recovery type (3), as shown in FIG. 1, the normal data processing of each processor is temporarily suspended at regular time intervals to perform the checkpoint acquisition (t1), upon completion of which the normal data processing is resumed from the point of interruption (t2). If any fault occurs (t3), the processors perform the fault recovery. When the fault recovery is completed, the normal data processing is resumed (t4) after restoring the main memory to the state of the most recent checkpoint (t2).
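The t1 to t4 sequence above can be illustrated with a minimal sketch. It is only a toy model, not the apparatus of the invention: the class name `CheckpointedMachine` and its methods are hypothetical, main memory is modelled as a dictionary, and the stable memory is assumed to be a plain copy that a fault never damages.

```python
import copy


class CheckpointedMachine:
    """Toy model: working main memory plus a checkpoint in stable memory."""

    def __init__(self, memory):
        self.memory = dict(memory)                 # working main memory
        self.stable = copy.deepcopy(self.memory)   # most recent checkpoint

    def write(self, addr, value):
        # Normal data processing: only the working memory changes.
        self.memory[addr] = value

    def acquire_checkpoint(self):
        # t1 to t2: processing is suspended while the state is saved.
        self.stable = copy.deepcopy(self.memory)

    def recover(self):
        # t3 to t4: discard everything written since the last checkpoint.
        self.memory = copy.deepcopy(self.stable)


machine = CheckpointedMachine({"x": 0, "y": 0})
machine.write("x", 1)
machine.acquire_checkpoint()   # checkpoint now holds x=1, y=0
machine.write("y", 5)          # update made after the checkpoint
machine.recover()              # fault detected: roll back
print(machine.memory)          # {'x': 1, 'y': 0}
```

The write to `y` is lost because it happened after the most recent checkpoint, while the write to `x` survives because it was captured at the checkpoint.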
The relation among the cache memory, the main memory and the checkpoint in the checkpoint/recovery method is described below.
(Normal data processing)
To cope with a fault, some means is required for restoring the main memory to the state of the most recent checkpoint.
(Checkpoint acquisition)
All the updated data stored in the cache memory is written back to the main memory.
(Restoration from a fault)
The data in the main memory that has been updated since the most recent checkpoint must be restored to its state at that checkpoint.
A specific example of a fault tolerant computer employing the checkpoint/recovery method is disclosed in Philip A. Bernstein, "Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing", IEEE Computer, Vol. 21, No. 2, 1988.
In this Sequoia computer, when a processor updates data during the period of normal data processing, the updated data is held in the cache memory and is not written back to the main memory. When checkpoint acquisition starts, the updated data stored in the cache memory is written back to the main memory. If a fault occurs in the computer, the cache memory is invalidated so that normal data processing can be resumed from the state of the most recent checkpoint. This mechanism can be summarized as follows in terms of the above-mentioned relation among the cache memory, the main memory and the checkpoint.
(Normal data processing)
The data updated by the processor is not written back to the main memory before checkpoint acquisition starts.
(Checkpoint acquisition)
The updated data stored in the cache memory is all written back to the main memory.
(Restoration of the main memory)
All that is required is to invalidate the cache memory.
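The three phases summarized above can be sketched as follows. This is an illustrative model only, not Sequoia's actual hardware design: the class `DeferredWriteBackCache` and its method names are hypothetical, and the sketch assumes the cache never has to evict an updated line between checkpoints.

```python
class DeferredWriteBackCache:
    """Toy cache that holds all updates until checkpoint acquisition."""

    def __init__(self, main_memory):
        self.main_memory = main_memory   # always holds the last checkpoint
        self.lines = {}                  # addr -> cached value
        self.dirty = set()               # addresses updated since checkpoint

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.main_memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Normal data processing: the update stays in the cache.
        self.lines[addr] = value
        self.dirty.add(addr)

    def acquire_checkpoint(self):
        # Checkpoint acquisition: write back every updated line.
        for addr in sorted(self.dirty):
            self.main_memory[addr] = self.lines[addr]
        self.dirty.clear()

    def recover(self):
        # Restoration: invalidating the cache is sufficient, because
        # main memory was never touched after the last checkpoint.
        self.lines.clear()
        self.dirty.clear()


main = {0x10: 1, 0x20: 2}
cache = DeferredWriteBackCache(main)
cache.write(0x10, 99)
value_before_checkpoint = main[0x10]   # still 1: update is only in cache
cache.acquire_checkpoint()             # main[0x10] becomes 99
cache.write(0x20, 77)                  # update after the checkpoint
cache.recover()                        # fault: drop all cached data
value_after_recovery = cache.read(0x20)  # 2, the checkpoint value
```

Because main memory is modified only during checkpoint acquisition in this model, it always equals the most recent checkpoint, which is why recovery reduces to invalidation.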
Also, the Sequoia computer comprises a special cache memory for realizing the checkpoint/recovery method. The reason is that an ordinary cache memory of the write-through type or copy-back type cannot be controlled to perform the operation in which "the data updated by a processor during the normal data processing is not written back to the main memory before the beginning of a checkpoint acquisition". Therefore, a special cache memory is required.
A second specific example of a fault-tolerant computer employing the checkpoint/recovery method is disclosed in U.S. Pat. No. 4,740,969 entitled "Method & Apparatus for Recovering from Hardware Faults". In this specific example, the following processes are executed.
(Normal data processing)
When data are loaded from the main memory to the cache memory, the data and the address thereof are stored into a log memory.
(Checkpoint acquisition)
Not described.
(Restoration from a fault)
The main memory is restored to the state of the most recent checkpoint using the above-mentioned address and data.
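The logging and restoration steps above can be sketched as follows. This is a simplified model under stated assumptions, not the mechanism of the cited patent: the class `LoggedMemorySystem` and its methods are hypothetical, the cache is assumed to be write-allocate, and the log is replayed newest entry first so that the oldest logged value for each address is the one that remains. Checkpoint acquisition is left out because it is not described.

```python
class LoggedMemorySystem:
    """Toy model: every cache fill is recorded in a log memory."""

    def __init__(self, main_memory):
        self.main_memory = main_memory
        self.cache = {}
        self.log = []   # (address, data) pairs in load order

    def load(self, addr):
        # Cache fill: record the address and the data being loaded.
        if addr not in self.cache:
            data = self.main_memory[addr]
            self.log.append((addr, data))
            self.cache[addr] = data
        return self.cache[addr]

    def store(self, addr, value):
        self.load(addr)            # assumed write-allocate: fill, then update
        self.cache[addr] = value

    def write_back(self, addr):
        # Evict the line and update main memory.
        self.main_memory[addr] = self.cache.pop(addr)

    def restore(self):
        # Replay newest first so the oldest logged value wins.
        for addr, data in reversed(self.log):
            self.main_memory[addr] = data
        self.log.clear()
        self.cache.clear()


main = {0: "a", 1: "b"}
system = LoggedMemorySystem(main)
system.load(1)              # a read-only load is logged as well
system.store(0, "A")
system.write_back(0)        # main[0] is now "A"
system.store(0, "AA")       # refill logs the post-checkpoint value "A"
system.write_back(0)
entries_logged = len(system.log)   # 3 entries for 2 addresses
system.restore()            # main is back to {0: "a", 1: "b"}
```

Note that the log receives an entry on every fill, including loads of data that is never updated, so it can hold more entries than there are modified addresses.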
In this way, the Sequoia computer requires a special cache memory for checkpoint and recovery, and thus poses the problem that it can hardly keep pace with the rapid evolution of processor technology.
The method disclosed in U.S. Pat. No. 4,740,969 poses the problem that the amount of data logged during normal data processing is too large, since the address and data are stored every time data is transferred from the main memory to the cache memory.