1. Field of the Invention
The present invention relates to a computer system having a check-point and restart function, and more particularly, to a computer system in which a special process for taking check points is provided, so that no lock-run-out sequence needs to be performed at restart time, and which can therefore be constructed at low cost.
The invention also relates to a computer system which comprises a multi-processor system having a check-point and restart function and in which, even if some of the processors incorporated in the multi-processor system fail, the remaining processors can continue to operate.
2. Description of the Related Art
A conventional computer system will be described which executes processing while taking check points in the normal state and which, when a failure takes place, is restarted from the last check point taken before the failure, thereby recovering from the failure.
In this computer system, processing is executed while check points of the system are taken during normal operation. When a failure or the like occurs, the system is returned to the last check point taken, and processing is resumed from that point.
These check points are taken in the following cases:
(1) where taking of a check point is explicitly instructed in the code.
(2) where a predetermined time period has passed since the last check point was taken.
(3) where an event (interrupt) occurs which demands taking of a check point.
The conditions described above can occur at an arbitrary time during execution of a program. Conventionally, the check points are taken at the time when any of these conditions occurs, i.e., at an arbitrary time during execution of a program.
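The three conditions listed above can be sketched as a single decision function. This is only an illustrative sketch: the class name `CheckpointTrigger`, the field `interval_s`, and the returned labels are hypothetical and do not come from the described system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointTrigger:
    """Decides whether a check point is due (hypothetical helper)."""
    interval_s: float              # assumed period for condition (2)
    last_checkpoint: float = 0.0   # time at which the last check point was taken

    def should_checkpoint(self, now: float,
                          explicit_request: bool = False,
                          event_pending: bool = False) -> Optional[str]:
        # (1) the code explicitly instructs that a check point be taken
        if explicit_request:
            return "explicit"
        # (2) a predetermined period has passed since the last check point
        if now - self.last_checkpoint >= self.interval_s:
            return "timer"
        # (3) an event (interrupt) demands a check point
        if event_pending:
            return "event"
        return None


trigger = CheckpointTrigger(interval_s=10.0, last_checkpoint=100.0)
reason = trigger.should_checkpoint(now=111.0)  # the timer condition holds here
```

Because any of the three conditions may become true at any moment, a function of this kind would be consulted at arbitrary points of program execution, which is the property the conventional method relies on.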
FIG. 1 shows a state in which check point processing is performed partway through normal processing executed by a processor. At time t1, check point processing ((2) in FIG. 1) is performed in the interrupt processing ((1) in FIG. 1) that accompanies the occurrence of an event demanding that a check point be taken during execution of a program.
At time t2, the check point processing ((4) in FIG. 1) is performed during the timer interrupt processing ((3) in FIG. 1), which is started upon the lapse of a predetermined time after the last check point was taken. That is, the check points are taken during arbitrary processing.
FIG. 2 shows a state in which a failure occurs partway through processing that is executed while taking check points, and the processing is re-executed from the last check point. If a failure occurs after check points are taken at time t1 and time t2 ((1) in FIG. 2), the processing is executed again from the check point taken last, i.e., the one taken at time t2 ((2) in FIG. 2).
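The behavior of FIG. 2 can be sketched as follows. The sketch assumes that the whole state fits in a dictionary and that a check point is a deep copy of it; the class `CheckpointedTask` and its fields are hypothetical and stand in for the system state only for illustration.

```python
import copy


class CheckpointedTask:
    """Toy task whose state can be saved and restored (hypothetical)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        self._saved = copy.deepcopy(self.state)  # initial check point

    def run_step(self, value):
        # One unit of normal processing.
        self.state["step"] += 1
        self.state["total"] += value

    def take_checkpoint(self):
        # Save a snapshot; only the most recent one is kept.
        self._saved = copy.deepcopy(self.state)

    def restart(self):
        # Roll back to the last check point taken.
        self.state = copy.deepcopy(self._saved)


task = CheckpointedTask()
task.run_step(1)
task.take_checkpoint()   # check point at time t1
task.run_step(2)
task.take_checkpoint()   # check point at time t2
task.run_step(3)         # work performed after t2
task.restart()           # a failure occurs; processing resumes from t2
```

After `restart()`, the work performed after time t2 is discarded and must be executed again, while everything up to t2 is preserved.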
In general, however, processing includes "a processing portion to be treated as a unit" with regard to the rollback performed when a failure occurs. One such processing portion is known as a "lock-run-out region".
The lock-run-out region is a block in which a check point can be taken but which, if the system is restarted from a check point taken inside the block, must be "run out" in the failure recovery processing before the processing returns to a regular condition. It is the block in which a spin lock is held.
A process which holds a spin lock cannot be preempted. When a spin lock is taken, care must be taken that no dead lock occurs. Normally, for convenience, the system is designed such that a leveled lock class is assigned to each spin lock and such that, when another spin lock is to be taken while a spin lock is already held, only a spin lock whose lock class is lower than the lowest level among the lock classes of the spin locks presently held can be acquired. By controlling the acquisition of spin locks in this manner, the order of taking locks in each processor is guaranteed.
For example, in the case where the levels of the lock classes are set as shown in FIG. 3, where "process A" and "process D", both accompanied by lock operations, are executed at the same time, and where both locks must be held simultaneously, each processor must follow the order in which the lock of the "process D" (level L5) is acquired first and the lock of the "process A" is then acquired.
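The lock-class rule can be sketched as a small checker that rejects any acquisition violating the ordering. The names `LockClassChecker` and `LockOrderError` are hypothetical, and the numeric levels 5 and 3 are used only as sample values, since the level of the "process A" lock is not stated in the text.

```python
class LockOrderError(Exception):
    """Raised when a lock is requested out of lock-class order."""


class LockClassChecker:
    """Tracks the lock levels held by one processor (hypothetical helper)."""

    def __init__(self):
        self.held = []  # levels of the spin locks currently held

    def acquire(self, level):
        # A new lock must have a level lower than every lock already held.
        if self.held and level >= min(self.held):
            raise LockOrderError(
                f"level {level} is not lower than held level {min(self.held)}")
        self.held.append(level)

    def release(self, level):
        self.held.remove(level)


checker = LockClassChecker()
checker.acquire(5)   # higher-level lock first
checker.acquire(3)   # then a lower-level lock: permitted
```

Requesting the locks in the opposite order (level 3 first, then level 5) raises `LockOrderError` in this sketch. Since every processor acquires locks in descending level order, no cycle of waiting can form among processors that are all running.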
The reason why it is necessary to make locks run out will now be explained with reference to FIGS. 4 and 5.
FIG. 4 shows an example in which a dead lock occurs since lock-run-out is not executed.
Here, it is assumed that a check point is taken in a situation in which a process T0 and a process T1 are being executed in processors (0) and (1), respectively, and in which the process T0 holds spin locks L5 and L3 while the process T1 holds the spin lock L4.
Further, consider a case in which a permanent failure thereafter occurs in the processor (0). In this case, the processor (1) is the only processor which operates normally, and therefore the processes T0 and T1 must both be executed by the processor (1). The spin locks currently held by the processes T0 and T1 can be identified. However, it is not possible to predict how the processes T0 and T1 will behave thereafter, i.e., which spin locks these processes will acquire next.
Suppose then that recovery is executed and the process T0, which currently holds the spin lock of the lowest level, is dispatched to the processor (1). Suppose further that this process T0 releases the spin lock L3 which it has already acquired, and thereafter attempts to newly acquire the spin lock L4. However, the spin lock L4 has already been acquired by the process T1, which had been executing before the failure occurred and which cannot run while the process T0 occupies the processor (1). The process T0 can therefore never acquire this spin lock, and a dead lock occurs. This problem arises because the order of acquiring spin locks is guaranteed only per processor, and that guarantee is broken when one of the processors fails.
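The situation of FIG. 4 can be expressed as a simple check. This sketch is a deliberate simplification and not a general dead lock detector: the function `spins_forever` is hypothetical, and it assumes that a lock owner can make progress only if a second working processor is available to run it.

```python
def spins_forever(lock_owner, running_process, wanted_lock, working_processors):
    """Return True if running_process would spin on wanted_lock indefinitely.

    lock_owner maps a lock level to the process that holds it.
    Assumption: a spinning process is never preempted, so the owner of the
    wanted lock can release it only when it runs on another processor.
    """
    owner = lock_owner.get(wanted_lock)
    if owner is None or owner == running_process:
        return False  # the lock is free, or already held by the requester
    return working_processors < 2


# State after T0 has released L3: T0 still holds L5, T1 holds L4.
owners = {5: "T0", 4: "T1"}
deadlocked = spins_forever(owners, "T0", wanted_lock=4, working_processors=1)
```

With only the processor (1) remaining, the check reports that T0 would spin indefinitely; with two working processors, T1 could run and release L4.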
A lock-run-out function has been known as a method for solving this problem. Before the system is restarted from the check point, this function releases all the spin locks that were held when the check point was taken, and brings all the processes into a condition that does not depend on any particular processor. The function operates by the following procedure.
(1) Among the spin locks which were held when the check point was taken, the one of the lowest level is identified, and the process holding it is selected.
(2) The processor is treated as the processor which had been executing the selected process, and the process is executed until it releases that spin lock.
(3) In the processing of releasing the spin lock, it is checked whether any process still holding a spin lock exists.
(4) If such a process exists, the processing is repeated from step (1). If not, the lock-run-out processing is terminated.
Specifically, if spin locks are held as shown in FIG. 5A, the process T0 is selected first (since the level of L3 is the lowest), and the process T0 is executed until the spin lock L3 is released.
Next, the process T1 is selected, which holds L4, now the lock of the lowest level (as shown in FIG. 5B). After the process T1 releases L4, the process T0 is selected, which holds L5 (as shown in FIG. 5C), and the lock-run-out is thereby completed. After the lock-run-out is completed, the system executes the restart.
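The lock-run-out procedure and the example of FIGS. 5A to 5C can be sketched as the loop below. The function `lock_run_out` and the callback `run_until_release` are hypothetical names; the sketch also assumes that a process acquires no further spin locks while it is being run out, which a real system could not take for granted.

```python
def lock_run_out(held_at_checkpoint, run_until_release=lambda proc, level: None):
    """Release every spin lock held at the check point, lowest level first.

    held_at_checkpoint maps a process name to the lock levels it held.
    run_until_release stands in for dispatching the process until it
    releases the given lock (assumed to succeed in this sketch).
    Returns the order in which (process, level) pairs were released.
    """
    held = {p: list(levels) for p, levels in held_at_checkpoint.items()}
    order = []
    # Steps (3) and (4): repeat while any process still holds a spin lock.
    while any(held.values()):
        # Step (1): select the process holding the lowest-level lock.
        proc, level = min(
            ((p, lv) for p, lvs in held.items() for lv in lvs),
            key=lambda pair: pair[1],
        )
        # Step (2): execute that process until it releases the lock.
        run_until_release(proc, level)
        held[proc].remove(level)
        order.append((proc, level))
    return order


# State of FIG. 5A: T0 holds L5 and L3, T1 holds L4.
order = lock_run_out({"T0": [5, 3], "T1": [4]})
# order == [("T0", 3), ("T1", 4), ("T0", 5)]
```

The resulting order matches the sequence in the text: T0 runs until L3 is released, T1 runs until L4 is released, and T0 runs again until L5 is released, after which no spin lock remains and the restart can proceed.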
In order to realize the lock-run-out processing executed by this procedure, the processing of releasing a spin lock must be arranged so as to call a special dispatch mechanism during the lock-run-out processing.
Thus, in a conventional method of taking check points, processing portions such as the lock-run-out region must be identified in the software (OS: operating system), and the special dispatch mechanism described above must be provided in order to protect each processing portion that is to be treated as a unit, as described above.
As a result, the manufacturing cost of the computer system unavoidably increases, and the software which can be installed on the system is limited.