This invention relates to a method and apparatus for producing a synchronized fault tolerant output from asynchronous inputs and in particular to a synchronized fault tolerant reset requiring only three modules. The reset outputs are not only fault tolerant and synchronized to each other, but also to the system clock.
Some systems improve reliability by using redundancy. Replication of elements in a parallel fashion may assure that a particular task is carried through to completion even though one of the replicated parallel elements becomes impaired. Such a redundant system is tolerant of faults.
In a fault tolerant computer system employing redundant microprocessor modules, it is desirable that the microprocessors perform their operations in synchronization with respect to the other parallel microprocessors. Such lockstep operation is achieved by providing the processors within the redundant system with local clocks derived from a tightly synchronized fault tolerant clock source and providing each processor with a fault tolerant reset signal synchronized to the local processor clock. The reset signal could be generated in response to a power-up signal (cold start), a manual reset signal (warm start), or by software command. The local processor clocks may be provided by the tightly synchronized fault tolerant clock disclosed in U.S. Pat. No. 4,984,241 describing a 36 MHz clock having a synchronization between modules measured to be better than (less than) one nanosecond.
FIG. 1 shows a typical reset circuit for a three module redundant processing system. The three modules are designated 2A, 2B and 2C respectively. A majority voter 4 in each module receives asynchronous reset signals from each module's corresponding reset signal source. According to the majority of inputs, the majority voter outputs a majority output signal to a latch 6 which will latch this signal to its output upon receiving a latching transition of the local clock signal. The output of the latch is a voted reset signal synchronized (by the latch) to the local clock for controlling the module's microprocessor 8. The output state of the latch in each of the modules should be the same and the processors should operate in lockstep, plus or minus whatever time offsets may exist between the local clocks of the individual modules.
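The voter-and-latch arrangement of FIG. 1 can be sketched in software as follows. This is only an illustrative model under simplifying assumptions: `True` stands for an asserted reset regardless of the electrical polarity, all clocks are ideal, and the names `majority`, `Latch` and `reset_sources` are hypothetical and do not appear in the figures.

```python
def majority(a: bool, b: bool, c: bool) -> bool:
    """Two-of-three majority vote. True stands for 'reset asserted'."""
    return (a and b) or (b and c) or (a and c)


class Latch:
    """Edge-triggered latch: copies its input to its output on a clock edge."""

    def __init__(self) -> None:
        self.q = False  # output before the first clock edge (assumed deasserted)

    def clock_edge(self, d: bool) -> bool:
        self.q = d
        return self.q


# Reset sources of modules 2A, 2B and 2C; here 2C has not asserted its reset.
reset_sources = (True, True, False)

# Every module votes on the same three inputs and latches on its local clock.
latches = {name: Latch() for name in ("2A", "2B", "2C")}
voted = {name: latch.clock_edge(majority(*reset_sources))
         for name, latch in latches.items()}
print(voted)
```

With ideal timing, all three latches report the same voted reset even though one source disagrees, which is the behavior the circuit of FIG. 1 is intended to provide.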
However, because of time offsets which may exist between the various local clocks, variations in propagation delays between various modules, and asynchronous power supply start ups, metastability and "cycle skipping" may occur. Additional problems may arise due to finite rise times associated with signal transitions and variation in threshold values between different latches.
Cycle skipping describes an undesirable phenomenon with respect to a redundant processing system wherein some of the processors are out of step. This happens when some of the processors begin execution on one clock cycle and others begin on a subsequent clock cycle. As a result, the processors are "out of step" with the first processor one (or more) step(s) ahead of the subsequent processors.
FIG. 2 illustrates cycle skipping when multiple resets under transition are clocked by the same clock edge. The output of the majority voter will indicate a reset when a second reset signal is received at the input of the voter, assuming the reset signal is a low logic level. As shown in FIG. 2, the asynchronous reset signals occur before the transition of the clock, and therefore the output of the latch should correctly indicate a reset signal. But if the propagation delay of the majority voter is large enough, the output of the majority voter will not provide a reset signal to the latch until after the clock transition, whereby an out-of-step output signal occurs.
A small time shift in the transition of the output of the majority voter with respect to a clock transition can impact the responsiveness of corresponding latches. Small differences in propagation delays between the respective modules may cause the latch 6A to receive its majority voter output signal before the latch 6B receives its majority voter output signal. Thus, when a capturing clock transition arrives, latch 6A latches and forwards a reset signal to its microprocessor 8A while latch 6B does not. As a result, cycle skipping occurs and the microprocessors will not operate in lockstep.
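The effect of unequal voter propagation delays can be illustrated with a small timing check. The sketch below is hypothetical: the picosecond values are invented for illustration, setup time and rise time are ignored, and `captures_reset` is not a name taken from the figures.

```python
def captures_reset(second_reset_ps: int, voter_delay_ps: int,
                   clock_edge_ps: int) -> bool:
    """True if the voter output has switched to 'reset' by the clock edge.

    The voter output changes once the second reset input has arrived and the
    voter's own propagation delay has elapsed.
    """
    return second_reset_ps + voter_delay_ps <= clock_edge_ps


SECOND_RESET_PS = 8_000   # arrival of the second asynchronous reset (assumed)
CLOCK_EDGE_PS = 10_000    # common capturing clock edge (assumed)

# Assumed voter propagation delays seen by latches 6A and 6B.
voter_delay_ps = {"6A": 1_500, "6B": 2_500}

latched = {name: captures_reset(SECOND_RESET_PS, delay, CLOCK_EDGE_PS)
           for name, delay in voter_delay_ps.items()}
print(latched)
```

Under these assumed numbers latch 6A captures the reset on this edge and latch 6B does not, so the two microprocessors would leave reset on different clock cycles.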
This problem is further compounded when the various modules have local clocks which are slightly offset in time with respect to one another as shown in FIG. 3. At very high clocking rates, a minor offset becomes quite significant and the requirements for obtaining a fault tolerant reset become stringent. Assuming that there is no offset between the transitions of the outputs of the majority voters, if the transitions of the local clock signals of the various modules occur at about the same time as the transition of the majority voter output, cycle skipping may result, with modules having an early local clock edge reporting a non-reset condition and those receiving a late local clock edge capturing and reporting a reset condition. When the modules skip, those processors which fail to receive a reset condition begin executing out of step on a subsequent clock edge or may not begin execution at all.
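The clock-offset case can be illustrated by computing, for each module, the index of the first local clock edge that occurs at or after the voter output transition. The sketch is a simplified model: the period of 27,778 ps approximates a 36 MHz clock, while the offsets of plus or minus 400 ps and the transition time are assumed values chosen so that the transition falls between the module clock edges.

```python
PERIOD_PS = 27_778  # approximately one period of a 36 MHz clock


def first_capturing_edge(transition_ps: int, clock_offset_ps: int,
                         period_ps: int = PERIOD_PS) -> int:
    """Index k of the first local edge (offset + k * period) that occurs at
    or after the voter output transition. Uses integer ceiling division."""
    return -(-(transition_ps - clock_offset_ps) // period_ps)


# Voter outputs of all modules switch at the same instant (no voter offset).
TRANSITION_PS = 4 * PERIOD_PS + 100

# Assumed local clock offsets: one early, one nominal, one late module.
clock_offset_ps = {"early": -400, "nominal": 0, "late": 400}

edges = {name: first_capturing_edge(TRANSITION_PS, offset)
         for name, offset in clock_offset_ps.items()}
print(edges)
```

In this example the late module captures the reset on edge 4, while the early and nominal modules have already passed edge 4 and capture it one cycle later on edge 5, which is the cycle skipping described above.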
Another problem (illustrated in FIG. 4) may arise when two input transitions are offset in time with respect to one another and a third module produces a malicious failure within this offset time interval. In a simple majority voting structure, the voted output follows the faulty input during this interval, and multiple voted resets result.
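The following sketch illustrates why a simple level-voting majority produces multiple voted resets in this situation. The sampled waveforms are invented for illustration, and `1` stands for an asserted reset regardless of the electrical polarity.

```python
def majority(a: int, b: int, c: int) -> int:
    """Two-of-three majority vote on logic levels (1 = reset asserted)."""
    return (a & b) | (b & c) | (a & c)


# Sampled input levels, one entry per time step (assumed waveforms).
good_a = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]   # first good input asserts at t = 2
good_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # second good input asserts at t = 8
faulty = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]   # malicious input toggles in between

voted = [majority(a, b, c) for a, b, c in zip(good_a, good_b, faulty)]

# Count how many separate times the voted reset becomes asserted.
rising_edges = sum(1 for prev, cur in zip(voted, voted[1:])
                   if prev == 0 and cur == 1)
print(voted, rising_edges)
```

While only one good input is asserted, the vote is decided entirely by the faulty input, so the voted output asserts four separate times in this example instead of once.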
FIG. 5a shows the timing requirements associated with a flip-flop's setup time and hold time and FIG. 5b is a timing diagram illustrating metastability. Whenever a clocked flip-flop synchronizes an asynchronous (unpredictably timed) input signal, there is a probability that metastability will occur. This happens when the input transition violates the setup and hold time specification, that is, when the transition occurs within the time window in which the flip-flop decides on the input signal. Metastability can manifest itself in two ways: the flip-flop can go into a non-binary state, higher than the logic 0 level but lower than the logic 1 level; or the flip-flop can temporarily oscillate and come out of oscillation when circuit noise pushes it to either logic level. The duration of the metastable state is unpredictable but usually bounded.
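The setup and hold window can be expressed as a simple timing check. The sketch below only flags input transitions that fall inside the window; it does not model the metastable behavior itself, and the setup and hold values are assumed for illustration rather than taken from any flip-flop specification.

```python
def violates_setup_hold(data_edge_ps: int, clock_edge_ps: int,
                        setup_ps: int, hold_ps: int) -> bool:
    """True if the data transition falls inside the window that opens
    setup_ps before the clock edge and closes hold_ps after it."""
    return clock_edge_ps - setup_ps < data_edge_ps < clock_edge_ps + hold_ps


SETUP_PS, HOLD_PS = 500, 300   # assumed flip-flop timing parameters
CLOCK_EDGE_PS = 10_000         # assumed capturing clock edge

# Asynchronous input transitions at several assumed instants.
results = {t: violates_setup_hold(t, CLOCK_EDGE_PS, SETUP_PS, HOLD_PS)
           for t in (9_000, 9_800, 10_200, 10_400)}
print(results)
```

Because an asynchronous input can change at any instant, some transitions will inevitably land inside the window, as the second and third transitions do here.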
It is therefore desirable to produce fault tolerant synchronous reset signals from asynchronous reset inputs in a manner capable of accommodating slight timing offsets between local clocks, forgiving of variations in propagation delays, tolerant of latch threshold variations, immune to metastability problems, and capable of dealing with malicious failures.
In general, the prior art addresses synchronous voted reset signals for systems having slower clock frequencies, for example 10 MHz, providing for large time offsets of several nanoseconds, and does not address the problems of metastability and cycle skipping associated with the need to synchronize the reset signals with a system clock. In addition, the prior art is not rigorously fault tolerant: it is generally incapable of accommodating an arbitrary single point failure and can tolerate only simple failures such as stuck at "0" or stuck at "1".
U.S. Pat. No. 4,644,498 discloses a fault tolerant real time clock for a triple modular redundant computer system. The 5 MHz clock comprises three identical channels having separate power supplies, each channel including a majority voted master clock, a counter for producing real time clock pulses, and a power-up time out circuit. The time out circuit is a three input NOR gate that receives three power-up signals from the individual channels. When all the channels are powered up and stabilized, the master clock is gated to the counter and the reset clock output circuit. A reset signal is maintained at the processor for four clock pulses thereby allowing processor startup and countdown to begin in unison for the channels. However, the reset signals in this design are not fault tolerant. For example, due to the NOR gate 34, all three reset channels will be stuck if the power up time-out of any channel is stuck at high. Metastability and cycle skipping can also occur at flip-flop 38 due to asynchronous reset inputs and clock edges.
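The single point of failure introduced by the three input NOR gate can be illustrated with a truth-table check. The polarity used here is an assumption made for the sketch: a time-out signal is taken to be high while its channel is still powering up and low once the channel is stable, so that a high gate output enables the master clock.

```python
from itertools import product


def nor3(a: bool, b: bool, c: bool) -> bool:
    """Three-input NOR: high only when all three inputs are low."""
    return not (a or b or c)


# Healthy case: all three time-out signals have gone low.
healthy = nor3(False, False, False)

# Faulty case: one channel's time-out signal is stuck high; the other two
# channels are tried in every combination.
stuck_high = {(b, c): nor3(True, b, c)
              for b, c in product((False, True), repeat=2)}
print(healthy, stuck_high)
```

Whatever the other two channels do, the gate output never goes high once a single input is stuck high, so one fault holds all three channels in the same state.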
In Soviet Union Inventor's Certificate No. 378830, a redundant computing system provides a reliable synchronization between channels having random inputs. The design, however, does not solve the metastability and cycle skipping that can occur at the latch. In addition, a short at the input of the voter would also fail all three modules.
U.S. Pat. No. 5,117,442 discloses a synchronizing circuit for multiple reset input signals. The circuit consists of three slices, each comprising an initial synchronizer (first stage flip-flop), a local synchronizer (second stage flip-flop), a comparison circuit (majority voter) and a final synchronizer (third stage flip-flop). This design only works for nominally synchronized reset inputs and can only tolerate a benign fault such as an input arriving late, arriving early, being stuck at zero or being stuck at one. This circuit cannot tolerate a random faulty input while the other two inputs are truly asynchronous (i.e., their transitions or displacement window are many clock cycles apart) as shown in FIG. 4 herein. In this case, all voters' outputs will exhibit the faulty behavior of the random input because the voter is a simple majority level-voting circuit that performs an asynchronous logical operation. Although the voter outputs are synchronized to the clock edge by the final stage flip-flop, multiple voted resets will occur as a result. This behavior is evident in all of the illustrated output waveforms of the voters, which show glitches within the small displacement window of the two "good" input signals. With truly asynchronous inputs, the displacement window between the two good inputs is larger and these glitches grow into lengthy random faulty behavior. Even if masked or self-checking logic such as dual-rail logic is used in place of the illustrated logic elements, the design still cannot tolerate a random faulty input while the other two inputs are truly asynchronous. Dual-rail logic only adds an identical logic element for comparison without altering its intended function. In addition, this circuit cannot guarantee against metastability due to the asynchronous nature of the reset input with respect to the clock edge.
The first stage flip-flop can go metastable, and the second stage flip-flops rely only on the probability that the metastability will have subsided before it reaches the second stage. If it does not subside within half a clock cycle, the second stage flip-flops will go metastable and the rest of the circuit will exhibit faulty behavior.