This invention relates to fault-tolerant signal processing machines and methods.
Many signal processing applications, both digital and analog, employ a number of identical linear processors processing multiple signals in parallel. For example, radar and sonar signal processing requires linear processing on large amounts of incoming data at a rapid rate, often in real time. Massive computational requirements often lead to highly parallel multiprocessor architectures. Often each of the processors performs an identical linear processing task. Typical linear processing operations are filtering and the computation of transforms such as Fourier transforms. As the number of processors increases, so too does the frequency of hardware failure, making fault-tolerant systems desirable.
In conventional N-modular fault-tolerant computer designs, N copies of processors, memory, and I/O units are driven with the same program and the same data. Voter circuitry compares the outputs of the identical units to verify that all units are operating correctly. With 100% redundancy (N=2) combined with frequent software checkpoints, the voters can detect a failure, abort the current operation, and force a restart from the last checkpoint, using different processing hardware. With 200% redundancy (N=3), also known as Triple Modular Redundancy, the voters can immediately detect and correct a single transient or permanent failure without any visible effect on the outside world, or on performance. The disadvantage of schemes of this type, however, is that large amounts of hardware must be dedicated to monitoring potential faults, and very tight processor synchronization is required.
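The voting behavior described above can be illustrated with a minimal sketch. The function name `vote` and the representation of each unit's output as a single comparable value are assumptions made for illustration only; actual voter circuitry is implemented in hardware and is not specified here.

```python
from collections import Counter


def vote(outputs):
    """Majority-vote over the outputs of N redundant units (hypothetical helper).

    Returns (majority_value, indices_of_disagreeing_units).
    Raises RuntimeError when no strict majority exists, i.e. a fault is
    detected but cannot be masked (the N=2 mismatch case).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("fault detected, no majority: restart from checkpoint")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# N=3 (Triple Modular Redundancy): a single bad unit is masked and located.
tmr_value, tmr_faulty = vote([5, 9, 5])

# N=2: a mismatch can only be detected, not corrected.
try:
    vote([5, 6])
    duplex_detected = False
except RuntimeError:
    duplex_detected = True
```

With three units the vote both masks and locates the single faulty unit; with two units a disagreement can only trigger a restart, which is why the N=2 scheme depends on checkpoints.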
Other ideas for reducing the hardware overhead dedicated to fault tolerance include incorporating periodic self-test or using a few extra processors and memories as Roving Emulators to double-check selected computations. Unfortunately, although these systems use less than 100% redundancy, they do not detect or mask all faults, and a failure may go undetected for some time after it occurs.
The huge hardware requirements for traditional fault-tolerant designs are disturbing because such high levels of redundancy are not required in related problem areas such as error-free communication over noisy channels. In these problems, Shannon's channel coding theorem guarantees that virtually error-free communication can be achieved using only a small amount of overhead for error coding. The trick is to exploit a model of the channel noise process, coding the data and spreading it across the channel bandwidth so that any noise spike may destroy a portion of many data bits, but no entire bit. A reliable decoder is able to reconstruct any missing information in the original signal by exploiting the small amount of redundancy inserted by the reliable coder. Such error coding ideas can be conveniently extended to protect data transmission over possibly faulty buses or networks, and to protect data storage in possibly faulty memory cells. Unfortunately, such error coding ideas cannot be conveniently extended to protect against hardware failures in arithmetic units, or in other systems where the output is not simply a copy of the input.
Happily, if we restrict the class of applications, it is possible to apply low-redundancy error coding ideas to protect computation. In particular, Huang and Abraham, and Jou, have suggested single error detection and correction techniques for matrix operations on linear or mesh-connected processor arrays using a weighted checksum approach. This invention presents a scheme for use in a fault-tolerant multiprocessor system, using weighted checksums to protect a variety of linear signal processing computations.
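The weighted checksum idea can be illustrated with a minimal sketch, which is not the specific scheme of the invention. It assumes exact integer arithmetic, a hypothetical 3-tap filter `fir` standing in for the linear processing task, two extra checksum processors with weights (1, 1, ..., 1) and (1, 2, ..., N), and at most one fault confined to a working processor. Because the operation is linear, the filtered weighted sum of the inputs equals the weighted sum of the filtered outputs, so any mismatch exposes an error, and the ratio of the two mismatches identifies the faulty processor.

```python
def fir(x, taps=(1, 2, 1)):
    """Hypothetical linear processor: a causal 3-tap FIR filter."""
    return [sum(t * x[i - j] for j, t in enumerate(taps) if i - j >= 0)
            for i in range(len(x))]


def weighted_sum(vectors, weights):
    """Element-wise weighted sum of equal-length vectors."""
    return [sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(vectors[0]))]


def locate_and_correct(outputs, check_a, check_b):
    """Detect, locate, and correct a single faulty working-processor output.

    check_a and check_b are the checksum-processor outputs for weights
    (1, ..., 1) and (1, 2, ..., N). Returns (faulty_index_or_None, outputs).
    """
    n = len(outputs)
    ones, ramp = [1] * n, list(range(1, n + 1))
    # Syndromes: zero when the outputs agree with the checksums.
    s_a = [y - c for y, c in zip(weighted_sum(outputs, ones), check_a)]
    s_b = [y - c for y, c in zip(weighted_sum(outputs, ramp), check_b)]
    if not any(s_a) and not any(s_b):
        return None, outputs
    # An error e in processor k gives s_a = e and s_b = (k + 1) * e.
    for k in range(n):
        if all(b == (k + 1) * a for a, b in zip(s_a, s_b)):
            fixed = [list(y) for y in outputs]
            fixed[k] = [y - e for y, e in zip(outputs[k], s_a)]
            return k, fixed
    raise ValueError("syndromes do not match a single working-processor fault")


inputs = [[1, 2, 3, 4], [0, 1, 0, 1], [5, 5, 5, 5]]
outputs = [fir(x) for x in inputs]
check_a = fir(weighted_sum(inputs, [1, 1, 1]))
check_b = fir(weighted_sum(inputs, [1, 2, 3]))

faulty = [list(y) for y in outputs]
faulty[1][2] += 7  # inject a fault into processor 1
k, fixed = locate_and_correct(faulty, check_a, check_b)
```

In this sketch three working processors are protected by two checksum processors rather than by the six additional processors that triplication would require. With floating-point or analog data, the exact equality tests would have to be replaced by comparisons against a tolerance, and faults in the checksum processors themselves would need separate handling.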