(1) Field of the Invention
The present invention relates to an error recovery system for use in an electronic digital computer system comprising a plurality of processors and, in particular, to such an error recovery system for use in a tightly coupled multiprocessor system.
(2) Description of the Prior Art
A tightly coupled multiprocessor system is known in the prior art as one type of electronic digital computer system. It comprises a main memory for storing a plurality of programs and a plurality of processors for processing the programs. Each program comprises a succession of instructions. One known tightly coupled multiprocessor system, the ACOS 1500 manufactured by NEC Corporation, is disclosed by M. Baba et al. in NIKKEI ELECTRONICS No. 373, issued by Nikkei McGraw-Hill Co. on July 15, 1985, under the title "A large computer ACOS 1500 having an increased processing speed by use of two-level cache and an improvement of pipeline processing" (Reference 1).
On occurrence of an error or fault during execution of an instruction in one of the processors in ACOS 1500, the processor is made to retry execution of the instruction in order to recover from the error, as disclosed in Reference 1. When the error is intermittent or transient, the retry succeeds, and the processor continues to be used in the computer system. When the error is a long-lived, hardware, or physical error, the retry is not completed successfully or ends in failure. In that case, the processor is placed in a checking condition and another of the processors is made to take over execution of the instruction by transferring status data from the faulty processor to the other processor through the main memory.
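The retry-then-takeover behavior described above may be illustrated by the following minimal sketch. It is a software simulation under stated assumptions, not the ACOS 1500 mechanism itself: the names `Processor`, `ProcessorError`, `run_with_recovery`, `faults_remaining`, and `max_retries` are hypothetical, the main memory is modeled as a plain dictionary, and the status data is reduced to the instruction being executed.

```python
from dataclasses import dataclass, field


class ProcessorError(Exception):
    """Raised when a simulated processor detects an error (hypothetical)."""


@dataclass
class Processor:
    name: str
    # Assumption: number of upcoming executions that will fail.
    # 1 models a transient error; a large value models a hard error.
    faults_remaining: int = 0
    checking: bool = False          # True once placed in the checking condition
    status: dict = field(default_factory=dict)

    def execute(self, instruction: str) -> str:
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            raise ProcessorError(f"{self.name} failed on {instruction}")
        return f"{self.name} executed {instruction}"


def run_with_recovery(instruction, primary, spare, main_memory, max_retries=1):
    """Execute on `primary`, retry on error, then hand over to `spare`."""
    primary.status = {"instruction": instruction}
    for _ in range(1 + max_retries):        # first attempt plus retries
        try:
            return primary.execute(instruction)
        except ProcessorError:
            continue                        # retry the same instruction
    # Retry ended in failure: isolate the faulty processor and transfer
    # its status data to the other processor through the main memory.
    primary.checking = True
    main_memory["status"] = dict(primary.status)
    spare.status = dict(main_memory["status"])
    return spare.execute(spare.status["instruction"])
```

In this sketch, a processor created with `faults_remaining=1` recovers on the first retry and remains in use, whereas a processor created with `faults_remaining=5` exhausts its single retry, is marked as checking, and the instruction is completed by the other processor.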
An instruction fetched in a processor is executed by an executing means in that processor under control of a microprogram comprising a succession of microsteps. In ACOS 1500, the microprogram has at least one predetermined checkpoint in the microsteps. When an error occurs in one of the processors, the microprogram is restarted from the last checkpoint preceding the error occurrence, as disclosed in Reference 1. When the microprogram restart succeeds, the processor continues to be used as a normal processor in the system.
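The restart of a microprogram from its last checkpoint may be illustrated by the following minimal sketch. It is an assumption-laden simulation rather than the microcode of Reference 1: the function `run_microprogram` and its parameters `microsteps`, `checkpoints`, and `fault_at` are hypothetical, the error is assumed to be transient (it occurs once), and microstep 0 is treated as an implicit checkpoint. A real processor would also restore the state saved at the checkpoint; the sketch only replays the microsteps.

```python
def run_microprogram(microsteps, checkpoints, fault_at=None):
    """Run microsteps in order, restarting from the last checkpoint on error.

    microsteps  -- list of microstep names
    checkpoints -- set of microstep indices that are checkpoints
    fault_at    -- index at which a one-time (transient) error occurs, or None
    Returns the trace of microsteps executed, including the error marker.
    """
    trace = []
    last_checkpoint = 0      # assumption: the first microstep is a checkpoint
    pc = 0                   # index of the microstep about to be executed
    fault_pending = fault_at
    while pc < len(microsteps):
        if pc in checkpoints:
            last_checkpoint = pc
        if pc == fault_pending:
            fault_pending = None                   # transient: occurs once
            trace.append(f"error@{microsteps[pc]}")
            pc = last_checkpoint                   # restart from checkpoint
            continue
        trace.append(microsteps[pc])
        pc += 1
    return trace


# Checkpoints at microsteps 0 and 2; a transient error strikes at microstep 3.
trace = run_microprogram(["m0", "m1", "m2", "m3"], {0, 2}, fault_at=3)
```

Here the error at `m3` causes execution to resume at `m2`, the last checkpoint before the error, so `m0` and `m1` are not repeated.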
However, once a processor has encountered an error, another error tends to occur in that processor even after the retry has completed successfully, which results in the system going down.
British Patent Specification No. 1,163,859 (Reference 2) by J. A. Arulpragasam discloses an error recovery system which, on occurrence of an error in one of the processors, makes another processor take over execution of the instruction being executed in the faulty processor by transferring status data from the faulty processor to the other processor through the main memory.
U.S. Pat. No. 4,443,849 (Reference 3) by Ohwada, assigned to Nippon Electric Co., Ltd., discloses an error recovery system for transferring the status data from the faulty processor to another processor not through the main memory but through an additional storage.
However, References 2 and 3 are silent as to the microprogram restart.