Methods for detecting and diagnosing faults have been employed since the first generations of computers, although primarily in the form of dedicated redundancy. An example is the early stored-program-controlled switching machines used in the telephone industry, in which all major components, including peripherals, memory, and processors, are duplicated. The processors in these machines execute the same tasks in synchronism and continually compare output results to detect failures when they occur. When the outputs of the processors differ, complex testing routines are immediately called into operation to identify the faulty processor.
Later systems have employed triple and greater redundancy in conjunction with majority voting schemes to both detect faults and identify faulty components at the same time. U.S. Pat. No. 4,583,224, entitled FAULT TOLERABLE REDUNDANCY CONTROL, issued to Gotoh Yoshimi et al., is illustrative of this type of system.
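The difference between duplicated comparison and majority voting can be illustrated with a minimal sketch. It is not taken from any of the cited systems; the function names `duplex_compare` and `majority_vote` are hypothetical, and processor outputs are assumed to be simple hashable values. With two outputs, a mismatch reveals that a fault exists but not which unit is faulty; with three or more, the minority outputs point to the suspect units.

```python
from collections import Counter


def duplex_compare(output_a, output_b):
    """Return True when two duplicated units disagree (fault detected).

    A mismatch alone cannot say which unit is faulty; separate test
    routines would be needed for that.
    """
    return output_a != output_b


def majority_vote(outputs):
    """Vote over the outputs of redundant units.

    Returns (voted_value, suspect_indices). If no value holds a strict
    majority, returns (None, all indices), since no unit can be trusted.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


# Two units: the fault is detected but not located.
print(duplex_compare(5, 7))      # True
# Three units: the fault is detected and unit 1 is identified.
print(majority_vote([5, 7, 5]))  # (5, [1])
```

Under these assumptions, the third unit is what allows detection and identification to happen in one step, at the cost of additional dedicated hardware.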
The capabilities and speed of processing components are increasing dramatically as their size and cost decrease. These rapid advances, along with the ever-growing need for faster computations, have led to the advent of multiprocessor systems consisting of large numbers of processing elements. The processors of these systems collectively are able to perform sophisticated tasks at accelerated rates. The growing complexity of multiprocessor systems and their vital applications make the ability to detect and diagnose faults a critical issue. The dedicated redundancy fault detection techniques of the past are clearly inadequate and too expensive for use in present multiprocessor systems.
A general summary of the art of fault-tolerant computing in multiprocessor systems is given in FAULT DETECTION AND CORRECTION IN ARRAY COMPUTERS FOR IMAGE PROCESSING, W. R. Moore, IEEE Proceedings, Vol. 129, No. 6, November 1982. One detailed approach to the above problem is disclosed in U.S. Pat. No. 4,356,546, entitled FAULT-TOLERANT MULTI-COMPUTER SYSTEM, issued to Freedman et al. In this teaching, each task to be executed is assigned to more than one processor of the system. In relevant part, it appears that each processor executing a task reports its results to all other processors of the system; if at least three processors execute the same task, majority voting is used, among other techniques, to detect faulty processors. It is not entirely clear what is done if and when only two processors execute the same task. Thus, it is seen that this technique requires the use of multiple processors for every system task. While useful and effective, this technique therefore reduces the overall capacity of the system.
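The capacity cost of assigning every task to several processors can be made concrete with a rough calculation. The sketch below is illustrative only: the function name `effective_capacity` is hypothetical, and it assumes that every task is replicated on the same number of processors, that all tasks take equal time, and that the overhead of reporting results between processors is ignored.

```python
def effective_capacity(num_processors, replication):
    """Number of distinct tasks that can run at once when each task
    is executed by `replication` processors (simplifying assumptions:
    uniform replication, equal task lengths, no reporting overhead)."""
    if replication < 1 or num_processors < replication:
        raise ValueError("need at least `replication` processors")
    return num_processors // replication


# Twelve processors with no replication run twelve tasks at once;
# with each task assigned to three processors, only four.
print(effective_capacity(12, 1))  # 12
print(effective_capacity(12, 3))  # 4
```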