As the speed of microprocessors continues to approach the performance level of mainframe computers, there is increasing interest in developing micro-structure special purpose machines to off-load some of the well-established mainframe applications, such as database processing. Although massively parallel structures provide much higher system availability than mainframes, a significant criticism of such systems is that adequate hardware error detection capability is unavailable in today's microprocessors. Since microprocessor chip space is crucial to the performance of VLSI chips, it is impractical to employ totally self-checking circuits, such as those commonly used in mainframe processors. Several attempts have been made to address this error detection issue.
Perhaps the most common technique used in fault-tolerant systems is the simple physical replication of processing hardware. Advances in very large scale integrated (VLSI) circuits and the advent of very inexpensive microprocessors make hardware replication an even more desirable, practical approach to implementing a fault-tolerant system. Highly reliable digital processing is achieved in various computer architectures employing redundancy. For example, triple modular redundancy (TMR) systems employ three CPUs to execute the same instruction stream, along with separate main memory units and separate I/O devices. The CPUs duplicate functions so that if one element fails, the system can continue to operate. (For further information on TMR systems, see U.S. Pat. No. 4,965,717, entitled "Multiple Processor System Having Shared Memory With Private-Write Capability," and/or a W. McGill et al. article, entitled "Fault Tolerance in Continuous Process Control," IEEE Micro, pp. 22-33, December 1984.) A drawback to TMR system processing is that the individual replicated modules must operate in instruction cycle synchrony, which means that they must share a common clock to closely synchronize the replicated processes. Because a single clock drives each replicated process, clock failure is devastating to operation of the system, i.e., it constitutes a single point of failure.
Another approach to improving fault tolerance within a multiprocessor system environment is to use a software data collection and voting technique, for example, following processor completion of each instruction (see Wensley et al., "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control," Proc. IEEE, Vol. 66, No. 10, pp. 1240-1255, October 1978). The advantage of such a software data collection and voting approach is that each processor executes an identical application task but is driven by an independent clock, so that only loose synchronization between processors is maintained. Each processor's resultant data is broadcast to the other processors, where the data is then voted on using a predefined software routine. Because this approach utilizes the actual resultant data from each processor, and is typically implemented subsequent to each instruction execution, a significant drawback is the extensive communications overhead required for its implementation. The technique essentially ignores the cost of this communication and computation overhead.
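The collect-and-vote step described above can be illustrated with a minimal sketch. It is not the SIFT routine itself: the function names `vote` and `messages_per_round`, the processor identifiers, and the strict-majority rule are assumptions chosen for illustration only.

```python
from collections import Counter


def messages_per_round(n_processors):
    """Messages exchanged when every processor broadcasts to every other one."""
    return n_processors * (n_processors - 1)


def vote(results):
    """Majority-vote over one round of broadcast results.

    `results` maps a (hypothetical) processor id to the value it produced.
    Returns (agreed_value, ids_that_disagreed).
    """
    counts = Counter(results.values())
    value, count = counts.most_common(1)[0]
    # Require a strict majority; otherwise no value can be trusted.
    if 2 * count <= len(results):
        raise ValueError("no strict majority among processor results")
    dissenters = sorted(pid for pid, v in results.items() if v != value)
    return value, dissenters


if __name__ == "__main__":
    # Three loosely synchronized processors; P2 has produced a bad result.
    round_results = {"P0": 8000, "P1": 8000, "P2": 8001}
    print(vote(round_results))
    print(messages_per_round(3))
```

Since every result is exchanged in full after every voting point, the message count in this sketch grows with the square of the number of processors, which is the overhead the paragraph above refers to.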
Thus, a novel approach to error detection in a multiprocessor distributed computing system is needed, in particular one that can operate asynchronously and that minimizes communication and computation overhead between and among the multiple processors of the system. The solution presented herein utilizes signature collection and analysis.
Signature analysis with shift registers has been used in the testing and manufacturing of VLSI chips for many years (see, for example, R. A. Frohwerk, "Signature Analysis: A New Digital Field Service Method," Hewlett-Packard Journal, pp. 2-8, May 1977). Briefly, in this environment signature analysis is used to compress a large stream of data to be tested into a simple unique signature, thereby reducing the complexity of testing computations. This signature analysis concept is adopted and modified pursuant to the present invention for the multiple processing system environment.
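As a rough illustration of shift-register signature compression, the sketch below feeds a bit stream through a 16-bit linear feedback shift register and returns the final register contents as the signature. The register width, the tap positions, and the names `to_bits` and `signature` are assumptions made for this example and are not taken from the present description. Note also that a fixed-width signature cannot be strictly unique: two different streams can, with low probability, compress to the same value.

```python
def to_bits(data):
    """Expand a bytes object into a list of bits, most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def signature(bits, taps=(16, 12, 9, 7), width=16):
    """Compress a bit stream into a `width`-bit signature.

    Each input bit is XORed with the register bits at the (assumed) tap
    positions, and the result is shifted into the low end of the register.
    """
    mask = (1 << width) - 1
    reg = 0
    for bit in bits:
        feedback = bit & 1
        for tap in taps:
            feedback ^= (reg >> (tap - 1)) & 1
        reg = ((reg << 1) | feedback) & mask
    return reg


if __name__ == "__main__":
    stream = to_bits(b"resultant data from one processor")
    good = signature(stream)

    corrupted = list(stream)
    corrupted[5] ^= 1  # inject a single-bit error
    bad = signature(corrupted)

    print(hex(good), hex(bad), good != bad)
```

Comparing two 16-bit signatures takes the place of comparing the full data streams, which is why this kind of compression reduces the amount of information that has to be examined.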