1. Field of the Invention
The present invention relates to fault-tolerant computer systems, and more particularly to methods and fault-tolerant circuits for synchronizing multiple asynchronous digital signals, such as reset signals, in such systems.
2. Description of Related Art
A Triple Modular Redundant (TMR) computer system is a type of fault-tolerant computer system designed to continue operating despite a failed component or connection. In a TMR system, three or more identical processors work synchronously on the same task, with their outputs compared or "voted" by hardware or software to provide a majority answer as output. The TMR system continuously monitors each processor output so that, when a discrepancy between processor outputs is discovered, the failed processor may be disabled and operation of the remaining processors continued in a "fail operational" mode.
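The voting step described above can be sketched as a bitwise two-of-three majority function. This is only an illustration under stated assumptions: the names `majority_vote` and `disagreeing_channels` are hypothetical, processor outputs are modeled as plain integers, and nothing here is taken from a particular TMR implementation.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three processor output words.

    Each output bit is 1 when at least two of the three inputs have a 1
    in that position.
    """
    return (a & b) | (a & c) | (b & c)


def disagreeing_channels(a: int, b: int, c: int) -> list:
    """Indices (0, 1, 2) of processors whose output differs from the vote.

    Hypothetical helper: a channel flagged here is a candidate for being
    disabled so the remaining processors can continue operating.
    """
    voted = majority_vote(a, b, c)
    return [i for i, word in enumerate((a, b, c)) if word != voted]


# Processor 2 has produced a faulty word; the vote masks it.
voted = majority_vote(0b1010, 0b1010, 0b0110)   # 0b1010
suspects = disagreeing_channels(0b1010, 0b1010, 0b0110)  # [2]
```

If all three words differ, the bitwise vote may match none of them, in which case every channel is flagged and no single failed processor can be identified from this check alone.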
When only two of the processors remain in operation, the system is considered to be operating in a "pair" mode. In pair mode, the processor outputs are compared so that when a discrepancy is discovered, the system may be shut down, or "failed safe." "Pair" systems are similar to TMR systems operating in "pair" mode in that only two identical processors work synchronously on the same task and their outputs are compared to permit "fail safe" system shutdown in the event of a discrepancy.
Synchronized, or lockstepped, operation of processors in a Pair or TMR system is predicated on synchronization of the processors. If the system is truly redundant, a separate source of clock signals is provided for each processor. All processors are synchronized initially upon start-up, and may be periodically synchronized thereafter to correct for signal timing divergence which may occur during operation of the system. Given a bounded displacement of clock edges ("skew") of each processor relative to those of the other processors, all processors must start on "the same" clock edge. For example, if clock skew is approximately 1/4 of a full cycle of the clock signal, it is not acceptable to have one processor start 3/4, 1 or 1 1/4 cycles earlier or later than the other(s).
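The arithmetic in the example above can be checked with a short sketch. It assumes start times are expressed in clock cycles and that "the same" edge means the spread of start times does not exceed the skew bound; the function name `started_on_same_edge` is hypothetical.

```python
from fractions import Fraction


def started_on_same_edge(start_times, max_skew):
    """True when all start times (in clock cycles) lie within the skew bound."""
    return max(start_times) - min(start_times) <= max_skew


skew = Fraction(1, 4)  # skew bound of about 1/4 cycle, as in the text

# A start offset equal to the skew bound is acceptable.
ok = started_on_same_edge([Fraction(0), Fraction(1, 4)], skew)

# Offsets of 3/4, 1 and 1 1/4 cycles are not.
bad = [
    started_on_same_edge([Fraction(0), offset], skew)
    for offset in (Fraction(3, 4), Fraction(1), Fraction(5, 4))
]
```

Exact fractions are used so that the comparison at the boundary is not affected by floating-point rounding.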
In addition to clock skew, the TMR system must be able to cope with early or late arrival or non-arrival of one or more signal input events. It is known in the art that if a transition event is presented to the inputs of two or more clocked circuits (e.g., edge-latched D-type flip-flops), the corresponding event appearing at their outputs may differ in time (signal timing divergence). If the signal timing divergence becomes as great as one clock period, cycle skipping can occur. This is especially likely if the input changes near the minimum setup time to a clock edge, and the clock signals are skewed by a finite amount or the logic delays of the flip-flops differ slightly. Thus, even a single event can cause discrepancies when multiple clocks, or distributed versions of a single clock, are used in a TMR or Pair system. There is an even greater likelihood of discrepancy when multiple events or multiple copies of a single event must be coordinated, as in a TMR or Pair system. Another problem which can occur is metastability of a flip-flop, where its output becomes indeterminate when its input changes near the minimum setup time to a clock edge, for example. It is also known that if a component, wire or solder connection fails, an expected event may not arrive when and where expected.
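The signal timing divergence described above can be illustrated with a simplified timing model. The sketch assumes ideal flip-flops whose rising edges occur at `clock_offset + k * period`, measures time in clock periods, and ignores metastability and differing logic delays; `capture_edge` and its parameters are hypothetical names, not part of any cited design.

```python
import math


def capture_edge(event_time, clock_offset, period=1.0, setup=0.0):
    """Return (k, edge_time) for the first rising edge that captures the event.

    The event is captured by the first edge at or after
    event_time + setup, where edge k occurs at clock_offset + k * period.
    """
    k = math.ceil((event_time + setup - clock_offset) / period)
    return k, clock_offset + k * period


# One input transition at t = 0.1, presented to two flip-flops whose
# clocks are skewed by a quarter of a period.
a = capture_edge(0.1, clock_offset=0.0)    # (1, 1.0)
b = capture_edge(0.1, clock_offset=0.25)   # (0, 0.25)

# The same event appears at the two outputs 0.75 period apart and on
# different edge numbers.
divergence = abs(a[1] - b[1])              # 0.75
```

In this model an event arriving at t = 0.5 is captured on edge 1 by both clocks, so the divergence arises only when the transition falls between the skewed edges, which is the case the paragraph above singles out.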
Prior art systems which address certain aspects of synchronizing digital signals are known. For example, U.S. Pat. No. 4,232,387 to Cottatellucci discloses a data transmission system in which a single received signal is split, phase-shifted and recombined to derive a synchronization waveform for use in decoding the received signal. Further, U.S. Pat. No. 4,302,831 to Zemanek discloses a data transmission system in which a time interval, derived from a received initial synchronization word, is adjusted to keep step with phase variations in the remainder of a message arriving on a single line. Phase comparison is employed as a step in determining whether and how much to adjust the period of a clock generator.
Also known, from U.S. Pat. No. 4,348,762 to Shiun et al., is a circuit for correcting clock pulses used to read data. A plurality of groups of clock pulses of different phase is generated, one of such groups is used to read data, and a switch responsive to misreading of data with the aid of such group selects another group or groups of pulses until correct data reading is accomplished.
The disclosures mentioned above do not, however, deal with the problems of multiple logical signals and multiple, skewed clock signals encountered in TMR and Pair systems. Nor do they address the need for fault-tolerant production of the digital signal in TMR and Pair systems.
Systems achieving fault tolerance by means of majority voting are known in the art. For example, U.S. Pat. No. 4,375,683 to Wensley discloses a fault-tolerant computational system having a voter circuit which receives data inputs from several computation devices and produces an output in agreement with a majority of the inputs. A clock circuit counts pulses from the clocks of the computation devices and employs majority voting to produce a single signal for synchronizing the data output of the computational devices. Such a system does not, however, serve to resynchronize data in the voter circuit, nor does it provide multiple fault-tolerant copies of the voted output.
In addition, U.S. Pat. No. 4,583,224 to Ishii et al. discloses a fault-tolerable redundant control system in which majority logic is used with error-detection logic to detect faults. The Ishii et al. disclosure does not, however, address the synchronization of signals associated with skewed clocks.
U.S. Pat. No. 4,330,826 to Whiteside et al. discloses a synchronizer module for each processor of a fault-tolerant multiple computer system in which a sampling period is timed, the majority vote of the samples from the processors is taken, and the sampling period is adjusted so that its end will approximately coincide with the end of the sampling periods of the other processors. While the synchronizer module permits late starting of one or more processors in the system, and is intended to identify processors which are out of synchronization with the system for fault-detection purposes, it does not appear to synchronize the logical signals of multiple processors within the bounds of the skew between clock signals of the processors.
Further, U.S. Pat. No. 4,589,066 to Lam et al. discloses fault-tolerant synchronization for multiple processor systems in which majority voting is used to determine whether synchronizing pulses arrive within a predetermined time window defined by a counter, indicating synchronization between multiple processors. Since the synchronization is linked to a time window, it has the disadvantage of being less fine-grained than may be desired.
U.S. Pat. No. 4,644,498 to Bedard et al. discloses a fault-tolerant real time clock for a TMR system. Voted master clock pulses in each of several subcircuits are counted to produce real time clock pulses which are in turn majority-voted to produce voted real time clock pulses. U.S. Pat. No. 4,683,570 to Bedard et al. further discloses use of majority voting logic to detect and indicate failure of a clock circuit. Rather than addressing the mutual synchronization of logical control signals within the bounds of the skew between the clock signals of multiple processors, these disclosures are concerned with generation of voted clocks and with detection of a failure in the voted clock circuitry.
Synchronization circuits which address the possibility of a metastable flip-flop state are also known. U.S. Pat. No. 4,498,176 to Wagner discloses error-free synchronization of asynchronous pulses in which a flip-flop circuit compares its outputs with a predetermined voltage to determine whether the circuit is in a metastable state and temporarily inhibits its outputs if a metastable state is present. This arrangement does not, however, address the synchronization of multiple signals or offer fault-tolerance.
Another arrangement, disclosed in U.S. Pat. No. 4,700,346 to Chandran et al., employs a single clock to synchronize the skewed leading edges of a true-complement signal pair. Several D-type flip-flop stages in two halves of the circuit are driven by the single clock, the second of the stages preventing metastable states from reaching the synchronizer output. This arrangement does not address the problems arising from use of multiple, skewed clocks.
None of the above-referenced patents is understood to teach methods or circuits for synchronizing logical control signals in fault-tolerant modular redundant systems which provide for: (a) synchronization of each logical control signal to the clock signal of the processor to which it is to be applied; (b) mutual synchronization of the logical control signals to be applied to the processors, within the bounds of the skew between the clock signals of the processors; and (c) fault-tolerant production of the logical control signals. It is broadly an object of the present invention to provide such methods and circuits.
It is further an object of the present invention to provide apparatus and methods for mutually synchronizing multiple, possibly asynchronous, digital input signals within the limits of bounded skew between clock signals associated respectively with the input signals. The methods and apparatus of the present invention may be used to synchronize logical control signals (such as processor reset signals, clock synchronization signals, external direct memory access (DMA) request signals, interrupt signals, and the like) for use in fault-tolerant modular redundant computer systems.
Yet another object of the present invention is to provide such apparatus and methods which are themselves redundant and fault-tolerant.
These and other objects of the invention will become apparent to those skilled in the art from the description which follows.