1. Field of the Invention
This invention relates to data processing systems and, more particularly, to a method and apparatus for synchronizing multiple processors to ensure data integrity or to provide fault tolerance.
2. Description of the Relevant Art
As the use of microprocessors becomes more prevalent, the need to ensure data integrity and to provide for tolerance of hardware or other faults within the system has become more critical. One method of ensuring data integrity is to provide two processors which execute the same program and which compare the results to detect errors. Once an error is detected, appropriate error routines may be invoked. Methods for providing fault tolerance include having three or more processors which execute the same program and which use a majority vote of the results in order to tolerate one or more malfunctioning processors. Alternatively, two pairs of processors may be configured so that, if an error in one pair is detected, that pair is disabled and processing continues in the other pair. In all of the foregoing configurations, the processors must be synchronized to allow results to be compared or voted.
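The comparison and majority-vote configurations described above can be illustrated with a minimal sketch. The function names `compare_pair` and `majority_vote` are hypothetical and are not taken from any of the cited systems. The sketch assumes each processor's result is already available as an ordinary value, which sets aside the synchronization problem that the rest of this section discusses.

```python
from collections import Counter


def compare_pair(a, b):
    """Duplex check: return the agreed result, or raise on a miscompare."""
    if a != b:
        # In a real system this is where an error routine would be invoked.
        raise ValueError(f"miscompare: {a!r} != {b!r}")
    return a


def majority_vote(results):
    """Return the value produced by a strict majority of the processors."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among results")
    return value


# Three processors, one of which is faulty: the fault is masked by the vote.
voted = majority_vote([7, 7, 9])
```

With two processors an error can only be detected. With three, a single faulty result is outvoted, so processing can continue.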
Known systems which ensure data integrity and provide fault tolerance may be divided roughly into four categories: 1) lock step systems which use non-fault-tolerant clocks; 2) lock step systems which use fault tolerant clocks; 3) systems which vote bus interface signals; and 4) systems which implement synchronization and voting through software. An example of a lock step processor without fault tolerant clocking is given in U.S. Pat. No. 4,453,215, issued Jun. 5, 1984, to Robert Reid. This patent discloses a multiple processing system using a single non-fault-tolerant clock, and the lock stepped processors must have identical bus cycle timing in order to avoid a miscompare error. The system clock thus is a single element which can effect failure of the entire system. Additionally, some processors require a very high frequency clock source, and each clock source must be synchronized, typically within one-half of a clock period. This is difficult, if not impossible, to accomplish with high-frequency clocks. Furthermore, there are restrictions on the type of processor which may be used in such systems due to the strict nature of the lock stepping. Some processors have cache memories on the processor chip, and the memories may have memory locations which are inaccessible as a result of manufacturing defects. These defects cause individualized cache misses and retries which, in turn, cause the processors to get out of step. Finally, in some processors there is no way to deterministically initialize the entire internal state.
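The sensitivity of lock stepping to a single extra cycle can be shown with a small model. This is only an illustrative sketch: the trace format, the `IDLE` marker, and the names `bus_trace` and `first_miscompare` are assumptions made for the example and do not describe the system of the cited patent.

```python
def bus_trace(ops, stall_at=None):
    """Return bus activity per clock cycle.

    If stall_at is given, one idle cycle is inserted before that operation,
    standing in for a hypothetical cache miss or retry on one processor.
    """
    trace = []
    for i, op in enumerate(ops):
        if i == stall_at:
            trace.append("IDLE")
        trace.append(op)
    return trace


def first_miscompare(trace_a, trace_b):
    """Return the first cycle at which the traces differ, or None if identical."""
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return cycle
    if len(trace_a) != len(trace_b):
        return min(len(trace_a), len(trace_b))
    return None


ops = ["RD A", "RD B", "WR C"]
healthy = bus_trace(ops)
retried = bus_trace(ops, stall_at=1)  # same program, one extra cycle
miscompare_cycle = first_miscompare(healthy, retried)
```

Both processors perform the same operations in the same order, yet a cycle-by-cycle comparison reports a miscompare at the cycle where the stall occurs.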
An example of a lock step system which uses fault tolerant clocking is disclosed in Smith, T. Basil, "High Performance Fault Tolerant Real Time Computer Architecture," PROC. 16th Annual Symposium on Fault Tolerant Computing, pp. 14-19, July 1986. This system alleviates some of the problems of lock step systems without fault tolerant clocks by generating multiple low speed clocks using a fault tolerant clocking circuit. The fault tolerant clocking circuit is independent of the processor clock. The disadvantage of this scheme is that, while each processor may have a different clock speed, there is only one I/O clock speed, and the slow I/O clock does not readily scale to faster processors. As a result, all I/O must be synchronized to the slow clock, which severely degrades I/O performance. Finally, the fault tolerant clock is difficult to prove correct because voting of the clock signals is done using an analog system with feedback, and thus Boolean equations cannot be used to verify its operation.
Synchronization schemes in which bus interface signals are voted are presented in Davies, D. and Wakerly, J., "Synchronization and Matching in Redundant Systems," IEEE Transactions on Computers, pp. 531-539, June 1978, and McConnel, S. and Siewiorek, D., "Synchronization and Voting," IEEE Transactions on Computers, pp. 161-164, February 1981. These papers suggest systems which run from independent clocks and which use voting of interface signals for synchronization. For example, a plurality of processors may be connected to a corresponding plurality of memories. When a processor makes a request to its associated memory, the memory waits until it detects a request from at least one other processor to its associated memory before it acknowledges its own received request. Similarly, the processors will not recognize the memory acknowledgements until acknowledgements from at least two memories are received. While this method avoids the need for fault tolerant clocks, and it may possibly tolerate "extra" clock cycles (as a result of error retries, variations in cache hit rates, asynchronous logic, etc.), it has drawbacks. For example, the waiting requirement adversely affects the timing of many speed-critical bus interface signals. This may have an enormous impact on performance. Additionally, neither paper addresses the difficult problem of synchronizing unsolicited external interrupts. If the processors sample the interrupt line at different times, one may detect the interrupt signal and others may not.
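The request and acknowledge rule in the example above can be sketched as two predicates. This is a simplified model under stated assumptions: the names `memory_may_ack` and `processor_may_proceed` are hypothetical, signals are represented as lists of booleans indexed by processor or memory number, and timing is ignored entirely.

```python
def memory_may_ack(own_index, requests_seen):
    """A memory acknowledges its own request only after it has also
    detected a request from at least one other processor."""
    others = sum(
        1 for i, seen in enumerate(requests_seen) if seen and i != own_index
    )
    return requests_seen[own_index] and others >= 1


def processor_may_proceed(acks_seen):
    """A processor recognizes the acknowledgement only after at least
    two memories have acknowledged."""
    return sum(acks_seen) >= 2


# Processor 0 has issued its request; processors 1 and 2 have not yet done so.
early = memory_may_ack(0, [True, False, False])
# Processor 1 catches up, so memory 0 may now acknowledge.
ready = memory_may_ack(0, [True, True, False])
```

The sketch makes the performance cost visible: the fastest processor is always held until a second one arrives, so every bus transaction includes a wait.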
Systems in which synchronization and voting are both implemented through software include software implemented fault tolerance (SIFT), disclosed in Weinstock, Charles B., "SIFT: Design and Implementation," PROC. 10th Annual Symposium on Fault Tolerant Computing, pp. 75-77, October 1980, the August Systems Series 300 disclosed in Frison, S. G. and Wensley, John H., "Interactive Consistency and Its Impact On the Design of TMR Systems," PROC. 12th Annual Symposium on Fault Tolerant Computing, pp. 228-233, June 1982, and an experimental system disclosed in Yoneda, T., et al., "Implementation of Interrupt Handler for Loosely Synchronized TMR Systems," PROC. 15th Annual Symposium on Fault Tolerant Computing, pp. 246-251, June 1985. These systems use independent clocks, and hence they can tolerate "extra" clock cycles. However, they require an extra layer of software which performs voting and synchronization by exchanging messages between the processors. Standard systems software may not be used. Instead, complex software is required to assure that all processors respond identically to interrupts, and this extra software severely degrades performance of the system.