1. Field of the Invention
This invention relates to a data processing system, network, and data processing method which increase reliability by executing processes using programs in a plurality of versions.
2. Description of the Prior Art
In systems, such as industrial systems, traffic control systems, and power plant systems such as a nuclear power plant, where ever-changing data is processed and the system is controlled based on the processing result, the safety of the system must be maintained under any condition.
This means that reliability is vital to data processing system devices such as computers or computer networks which are used in those systems. In particular, system errors have significant effects on those devices. System errors are caused by hardware errors or program bugs. Recently, as hardware reliability increases, program reliability has become more important. However, as programs become large and complicated, it is virtually impossible to create error-free programs.
To solve this problem, software techniques have been proposed which make a program appear free of errors even when it contains errors.
One widely accepted technique is what we call a multiversioning method. In this method, a computer is put in the multiversioning mode so that the programs in the computer run in that mode, which enables the system to continue normal operation even if a system error occurs. However, running a program in the multiversioning mode merely requires that a plurality of copies of the program be created. Consequently, if the program has one or more bugs, all of the copies stop because of the same bug, causing the computer or a part of the system functions to stop. To solve this problem, the methods given below have been proposed:
(1) N versions program method
In this method, a plurality of designers create programs which perform the same function using different procedures. Thus, a plurality of programs, each a different version, are created to perform the same function. This "N versions program method" allows the plurality of programs to be run in the computer concurrently. These programs are driven by a program called a driver, which behaves just like an operating system (OS), and are synchronized by the driver each time they reach pre-defined checkpoints. When the majority of the programs produce the same result, that result is selected as the correct output.
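The majority selection performed by the driver can be illustrated with a minimal Python sketch. The function `run_n_versions` and the three sample versions are hypothetical names introduced only for this illustration; the sketch assumes a single checkpoint at the end of each version and results that can be compared for equality.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_n_versions(versions, value):
    """Run every version on the same input and return the majority result."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        # Collecting all results plays the role of the checkpoint:
        # the driver waits here until every version has finished.
        results = list(pool.map(lambda version: version(value), versions))
    winner, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among the versions")
    return winner


# Three hypothetical versions of the same function (squaring a number).
def square_by_multiply(x):
    return x * x


def square_by_power(x):
    return x ** 2


def square_with_bug(x):
    return x * x + 1  # deliberately faulty version


result = run_n_versions([square_by_multiply, square_by_power, square_with_bug], 7)
print(result)  # 49: two of the three versions agree
```

The faulty version is outvoted, so its error does not reach the output. The driver nevertheless has to wait for all versions before it can vote.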
(2) Recovery block method
This method is described below using program B and its alternate programs B' and B".
In this recovery block method, checkpoints, at which a predetermined amount of processing ends, are provided for program B and alternate programs B' and B", and the test (acceptance test) is made to check if the execution result of processing matches the desired value. First, program B is run, and the acceptance test is executed at a checkpoint to check if the execution result is acceptable. If the execution result of program B is acceptable, processing continues; otherwise, alternate program B' is started.
When the execution result is rejected, alternate program B' is started to perform alternate processing. At this time, the internal status at the preceding successful checkpoint, that is, the checkpoint data accepted by the acceptance test at the preceding checkpoint, is passed to alternate program B' for use in alternate processing. The result of this alternate processing is then checked by the acceptance test and, if it is rejected, alternate program B" is started. This processing is repeated until the execution result is accepted by the acceptance test or until there are no more alternate programs. Therefore, if the execution result of alternate program B" is also rejected, program B is determined to be unreliable.
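The sequence described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_recovery_block`, the three sample programs, and the acceptance test `accept` are hypothetical, the checkpoint data is a small list, and a program that raises an exception is treated the same as one whose result is rejected.

```python
import copy


def run_recovery_block(programs, acceptance_test, checkpoint):
    """Try program B, then each alternate, from the same checkpoint data."""
    for program in programs:
        # Each attempt starts from the last accepted checkpoint data.
        state = copy.deepcopy(checkpoint)
        try:
            result = program(state)
        except Exception:
            continue  # a crash is handled like a rejected result
        if acceptance_test(result):
            return program.__name__, result
    raise RuntimeError("every program was rejected by the acceptance test")


checkpoint = [3, 1, 2]


def program_b(data):
    return sorted(data)[:-1]  # deliberate bug: drops the largest element


def program_b_prime(data):
    return sorted(data)


def program_b_double_prime(data):
    result = []
    for item in data:  # simple insertion sort as a second alternate
        position = 0
        while position < len(result) and result[position] <= item:
            position += 1
        result.insert(position, item)
    return result


def accept(result):
    """Acceptance test: same length as the input and in ascending order."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(checkpoint) and in_order


chosen, output = run_recovery_block(
    [program_b, program_b_prime, program_b_double_prime], accept, checkpoint
)
print(chosen, output)  # program_b_prime [1, 2, 3]
```

Because the alternates run one after another, each rejection adds the full running time of the next alternate before an accepted result is available.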
(3) Self-checking method
In the recovery block method described above, alternate programs B' and B" are started only after program B fails the acceptance test. In the self-checking method, by contrast, alternate programs B' and B" are run concurrently with program B. Note that, in the self-checking method, alternate program B' takes over the processing of program B and outputs data to external programs only after the acceptance test of program B fails.
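A minimal Python sketch of this behavior follows. The names `run_self_checking`, `program_b`, `program_b_prime`, and `accept` are hypothetical, and the sketch assumes that each program runs as a thread and that only the first program, in priority order, whose result passes the acceptance test is allowed to output.

```python
from concurrent.futures import ThreadPoolExecutor


def run_self_checking(programs, acceptance_test, value):
    """Run all programs concurrently; output the first accepted result."""
    with ThreadPoolExecutor(max_workers=len(programs)) as pool:
        # Program B and its alternates all start at the same time.
        futures = [pool.submit(program, value) for program in programs]
        # Results are examined in priority order: B first, then B', then B".
        for program, future in zip(programs, futures):
            try:
                result = future.result()
            except Exception:
                continue  # a crash is handled like a failed acceptance test
            if acceptance_test(result):
                return program.__name__, result
    raise RuntimeError("no program passed the acceptance test")


def program_b(x):
    return x  # deliberate bug: negative inputs are not negated


def program_b_prime(x):
    return -x if x < 0 else x


def accept(result):
    """Acceptance test: an absolute value is never negative."""
    return result >= 0


chosen, output = run_self_checking([program_b, program_b_prime], accept, -5)
print(chosen, output)  # program_b_prime 5
```

The result of the alternate is already being computed when program B fails the test, but the alternate still outputs only after that failure is detected.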
3. Problems to Be Solved by the Invention
The methods described above have the following problems. In the "N versions program method", when a plurality of programs in different versions are run concurrently, the system must wait, at each checkpoint, for the slowest program to end. Therefore, during daily operation, the overall system performance is determined by the processing performance of the slowest program.
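The effect of waiting at each checkpoint can be shown with a short Python calculation. The function `synchronized_time` and the per-checkpoint times are hypothetical values chosen only for illustration, not measurements.

```python
def synchronized_time(times_per_version):
    """Total time when all versions must meet at every checkpoint."""
    # zip(*...) groups the times of all versions for the same checkpoint interval.
    return sum(max(interval) for interval in zip(*times_per_version))


# Hypothetical time units spent by each version between successive checkpoints.
times = {
    "fast": [1, 1, 1],
    "medium": [2, 1, 2],
    "slow": [3, 3, 3],
}

total = synchronized_time(list(times.values()))
print(total)  # 9, the same as the slowest version running alone
```

In this example the fast version would need only 3 time units on its own, yet the synchronized system needs 9.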
In the "recovery block method" or "self-checking method", an alternate program takes over processing only after the program fails the acceptance test. This take-over processing requires time and delays program processing. In addition, since an alternate program usually places emphasis on fewer bugs rather than on performance, program B' is slower than program B during concurrent operation. The advantage of concurrent operation is therefore lost. An attempt to run alternate programs B' and B" concurrently with, and as fast as, program B will result in the disadvantage associated with the "N versions program method".
Conventional program high-reliability methods are intended for increasing software reliability rather than for detecting and recovering from hardware failures. There is a method in which the same program is run in other computers concurrently so that the program keeps running even when an error occurs in one of the computers. However, if it is difficult to determine whether the error is a software error or a hardware error, the conventional software high-reliability method does not solve the problem; that is, when a hardware error occurs in a system where this method is employed, control is passed to a poorer-performance alternate program and, as a result, the performance is degraded.
Even if it is possible to determine whether a system error is a hardware error or a software error, the program for that determination must always be active. In addition, there is a possibility that a hardware error and a software error occur at the same time. This makes the determination and the subsequent take-over processing more difficult.