This invention relates to an error recovery system for use in an electronic digital computer system comprising a plurality of processors.
A conventional electronic computer system disclosed in U.S. Pat. No. 3,248,697 issued to Howard C. Montgomery and assigned to International Business Machines Corporation uses a plurality of processors, each of which allows an instruction to be retried in order to recover from an error in that processor. This system is effective in removing transient errors but is ineffective against long-lived or physical errors.
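By way of illustration only, instruction retry may be sketched as follows. The names `ProcessorError`, `execute_with_retry`, `make_faulty`, and the retry limit are assumptions introduced for this sketch and do not appear in the cited patent; the sketch models an instruction as a callable and is not a description of the patented circuitry.

```python
class ProcessorError(Exception):
    """Hypothetical stand-in for an error detected during an instruction."""


def execute_with_retry(instruction, max_retries=3):
    """Run `instruction` (a zero-argument callable), retrying on ProcessorError.

    The error is re-raised once `max_retries` retries have been used up,
    which is what happens with a long-lived error.
    """
    for attempt in range(1 + max_retries):
        try:
            return instruction()
        except ProcessorError:
            if attempt == max_retries:
                raise  # retries exhausted: the error is not transient


def make_faulty(failures):
    """Build a hypothetical instruction that fails `failures` times, then returns 42."""
    state = {"left": failures}

    def instruction():
        if state["left"] > 0:
            state["left"] -= 1
            raise ProcessorError("detected error")
        return 42

    return instruction
```

In this sketch, an instruction that fails twice and then succeeds is recovered by retry, whereas one that keeps failing exhausts the retries, which corresponds to the long-lived errors that retry alone cannot remove.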
In another conventional electronic computer system, disclosed by J. A. Arulpragasam in British Patent Specification No. 1,163,859, a pair of error recovery units is coupled to a pair of processors and a main memory. On occurrence of an error in one of the processors, this system transfers the status data held in that processor to the main memory through one of the error recovery units. Thereafter, the other processor takes over the operation of the faulty processor by accessing the status data stored in the main memory. Because either processor can take over the operation of the other, this system can also recover from long-lived errors in each processor.
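The takeover described above may be sketched, under stated assumptions, as follows. The classes `Processor` and `ErrorRecoveryUnit`, the function `take_over`, and the choice of a program counter and registers as the status data are hypothetical and serve only to illustrate the flow of status data through the main memory; they are not taken from the cited specification.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical processor model; its status data are a counter and registers."""
    name: str
    program_counter: int = 0
    registers: dict = field(default_factory=dict)


class ErrorRecoveryUnit:
    """Hypothetical unit that copies a processor's status data into main memory."""

    def __init__(self, main_memory):
        self.main_memory = main_memory

    def save_status(self, processor):
        # Store a copy so later changes in the processor do not alter the record.
        self.main_memory[processor.name] = {
            "program_counter": processor.program_counter,
            "registers": dict(processor.registers),
        }


def take_over(survivor, failed_name, main_memory):
    """Load the failed processor's stored status data into the surviving processor."""
    status = main_memory[failed_name]
    survivor.program_counter = status["program_counter"]
    survivor.registers = dict(status["registers"])
```

For example, if a processor named "A" fails with its program counter at 120, its recovery unit stores the status data under "A" in the main memory, and a call such as `take_over(b, "A", main_memory)` lets the other processor resume from the same point.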
With this system, the status data must be accompanied by control signals peculiar to the status data when they are sent from one processor to the main memory. Without the control signals, the other processor cannot take over the stored status data and therefore cannot carry out the recovery operation. However, each processor cannot always produce the control signals.
More particularly, each processor executes a plurality of programs, some of which relate to tasks carried out in the processor while the others relate to a kernel or nucleus of the computer system. Among the latter programs, a dispatcher program and an interruption handler allow no interruption to take place unless the system itself has broken down. This means that a processor can produce no control signals when an error occurs during execution of the dispatcher program or the interruption handler. Accordingly, the conventional computer system cannot recover from an error that takes place during execution of the dispatcher program or the interruption handler.
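The limitation described above may be illustrated by the following minimal sketch. It assumes, purely for illustration, that control signals can be modeled as a record that is absent whenever the error occurs in a program that permits no interruption. The names `NON_INTERRUPTIBLE`, `control_signals_for`, `store_status`, and `try_take_over` are hypothetical.

```python
# Programs assumed, for this sketch, to permit no interruption while they run.
NON_INTERRUPTIBLE = {"dispatcher", "interruption_handler"}


def control_signals_for(running_program):
    """Return hypothetical control signals, or None if none can be produced."""
    if running_program in NON_INTERRUPTIBLE:
        return None  # no interruption takes place, so no signals are produced
    return {"status_valid": True}


def store_status(main_memory, processor_name, running_program, status_data):
    """Store status data together with whatever control signals were produced."""
    main_memory[processor_name] = {
        "status_data": status_data,
        "control_signals": control_signals_for(running_program),
    }


def try_take_over(main_memory, failed_name):
    """Return the stored status data, or None when control signals are missing."""
    record = main_memory[failed_name]
    if record["control_signals"] is None:
        return None  # the other processor cannot take over the status data
    return record["status_data"]
```

In this sketch, an error during an ordinary task leaves status data that the other processor can take over, whereas an error during the dispatcher program or the interruption handler leaves status data without control signals, so that the takeover fails.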