1. Field of the Invention
This invention is generally directed to the analysis of the mean time between failure (MTBF) of digital circuits and more specifically to the problem of predicting MTBF due to metastable behavior in multistable devices such as flip flops.
2. Description of the Relevant Art
A large number of digital circuits can be characterized as having two or more stable states at which the systems preferentially remain and one or more unstable states from which the systems tend to shy away. Stable states may be visualized as low points in an energy plane populated by valleys, plateaus and hills. Unstable states may be seen as hills from which the systems can dynamically roll off to eventually settle in a valley below. There is a certain type of rolling behavior known as metastable behavior which can lead to failure of digital devices. Because this type of behavior is presently understood by only a few in the art, the following detailed explanation is included here. Those who are well versed in metastability theory can skip forward to the last paragraph before the summary of the invention. Reference to papers published for session 16 of the Wescon/87.RTM. Convention held November 17-19, 1987 in San Francisco, California might be helpful. The session record is entitled "Everything You Might Be Afraid To Know About Metastability" and includes discussions by Kim Rubin of Force Computers Inc., Keith Nootbaar, et al. of Applied Micro Circuits Inc., John Birkner of Monolithic Memories and Martin Bolton of Bristol University.
A multistable digital system can be perceived as one that moves from one valley to another as it is switched from one stable state to the next. Energy must be supplied to the system to push the system across the energy plane and help the system overcome barriers posed by high points between valleys. If an insufficient amount of force or energy is supplied for overcoming high points, the outcome of an actuation operation might become indeterminate. The induced actuating force of an input signal may be on the border line of that required for propelling a system out of one valley, over a barrier and into a desired second valley. It may not be known for sure whether the system has actually switched from one stable state to the next as intended or whether the system has rolled to an undesirable state.
The phenomenon is referred to by various names including "the metastability problem", "arbitration conflict" and "synchronization failure." Regardless of the term used, it is known that when such a phenomenon occurs it can cause a logic circuit to produce erroneous outputs and that such erroneous outputs may lead to catastrophic results. Given this possibility, it is desirable to have a quantitative method for studying the phenomenon so that the benefits of a digital system can be intelligently weighed against the risk of an erroneous output.
By way of example, there are situations where a digital computer is designed for placement in remote locations, such as on board a space satellite, and the computer is required to operate over a period of at least ten years without a single error. If a pre-flight evaluation of the computer design were to show, statistically speaking, that there is a mean time between failure (MTBF) of say two years under worst case conditions, then it will be obvious that additional measures need to be taken to extend the MTBF of the computer. Testing the computer in real time over multiple ten year periods in order to detect how many errors can be caused by metastable events is not practical. A method for rapid statistical evaluation of logic circuit components needs to be devised so that MTBF can be predicted on an accelerated basis. The evaluation method should be both accurate and capable of easy repetition (for confirmation purposes) so that risk assessments can be made with valid MTBF values rather than purely speculative data.
Numerous models have been proposed for studying the so-called "metastable" behavior of multistable logic circuits. FIG. 1A shows one circuit model 10 proposed for studying a bistable switch SW that is capable of switching between a logic low state (L) and a logic high state (H). Switching occurs when an input signal S.sub.in of sufficient energy is supplied at an input of the switch SW. A first amplifier A.sub.1 of a first predetermined gain and bandwidth amplifies the input signal S.sub.in by a predetermined factor and supplies actuation energy E.sub.act to a switchable part (e.g. armature) of the switch SW. A second amplifier A.sub.2 of a second predetermined gain and bandwidth is provided, coupled to an output node N.sub.o of the switch, for supplying a holding energy E.sub.hold in a feedback manner to the switchable part so as to maintain the switch SW in a present state. The holding energy E.sub.hold opposes accidental actuation due to noise and also opposes the initial force of any actuation energy E.sub.act supplied by the first amplifier A.sub.1. An inherent resistance R and capacitance C of the model 10 cause an output signal S.sub.out at the output node N.sub.o to behave as a continuous exponential function of the form S.sub.out =f(e.sup.-t/RC).
Referring to FIG. 1B, it can be seen that the continuum of values attainable by the output signal S.sub.out may be partitioned to define a low state (L state), a meta state (M state) and a high state (H state), arranged in the recited order. A system energy curve 12 having two minimum points, 14 and 16, respectively corresponding to the L and H states of the model and a peak, 15, corresponding to the M state, may be drawn to explain the model's behavior in a loose manner. The current state of the model 10 may be thought of as represented by the position of a rollable ball 18 which moves along the curve and eventually comes to rest at a minimum point in one of two energy valleys (stable states) after it is pushed over an energy crest (barrier) defined by the peak 15. The shape of the energy barrier corresponds to forces generated by the holding energy E.sub.hold of the second amplifier A.sub.2. The holding energy E.sub.hold initially opposes the actuation energy E.sub.act but may later switch to aid the actuation energy once the barrier is overcome and the ball 18 begins to roll downhill. The speed at which the ball rolls is determined by the gain bandwidth products of the system amplifiers, A.sub.1 and A.sub.2. Because the gain bandwidth products of real world amplifiers (as opposed to hypothetical models) tend to be functions of environmental factors such as temperature, power supply settings, fabrication process variations and so forth, it is helpful to view the energy curve 12 as one that changes its shape when the environmental factors shift. If these factors shift to produce relatively high gain bandwidth products, the steepness of the downhill slope increases, the ball rolls away faster and is thus, statistically speaking, less likely to be found in the vicinity of the peak 15. 
On the other hand, if the gain bandwidth products become relatively small as a result of environmental changes, the slopes become less steep and the probability that the ball will be found in the vicinity of the peak increases. It will be seen later in the discussion of the invention that environmental factors such as temperature and power supply settings can significantly alter the likelihood of what we will call "metastable behavior." For the present, it is sufficient to understand that environmental factors determine how likely it is that the ball will be found in the vicinity of the peak 15.
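The link between gain bandwidth product and the time spent near the peak can be illustrated with a small numerical sketch. It assumes a simple linearized picture in which a small offset v0 from the peak grows as v0 * exp(t / tau), with the time constant tau taken to shrink as the gain bandwidth product rises. The function name, the exit level and all numeric values are hypothetical and are chosen only for illustration.

```python
import math

def escape_time(v0, v_exit, tau):
    """Time for an offset v0 from the peak to grow to v_exit.

    Assumes exponential growth v(t) = v0 * exp(t / tau) near the peak.
    This is an illustrative linearization, not a measured device model.
    """
    if v0 == 0:
        # A ball placed exactly on the peak never leaves without noise.
        return math.inf
    return tau * math.log(v_exit / abs(v0))

# Hypothetical values: 1 microvolt initial offset, 1 volt exit level.
fast = escape_time(1e-6, 1.0, tau=0.5e-9)  # high gain bandwidth product
slow = escape_time(1e-6, 1.0, tau=2.0e-9)  # low gain bandwidth product
print(f"fast device leaves the peak after {fast * 1e9:.1f} ns")
print(f"slow device leaves the peak after {slow * 1e9:.1f} ns")
```

Under these assumptions the dwell time scales directly with tau, so a fourfold loss of speed keeps the ball near the peak four times as long for the same initial offset.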
If the ball 18 is pushed with just barely enough energy E.sub.act, it is possible that the ball will come to rest precisely at the maximum point of the curve peak 15 where the slope of the curve (dE/dS) is equal to zero. When so positioned, the ball 18 may remain in the M state for an indefinite period of time, or it may randomly roll towards one or the other of the L and H states as a result of random environmental noise. In such a case the final resting state of the model 10 becomes uncertain and the outcome needs to be defined in a statistical manner. This type of indeterminate behavior is referred to as metastable behavior.
Multistable circuits often include one or more positive feedback loops which modify the above picture such that it may be necessary to further consider the curve 12 as being elastically deformable and to view its peak 15 as being able to temporarily change shape or shift in position relative to the L and H states during actuation to create what may appear in effect to be a third valley 15a (as indicated by the dashed portion of FIG. 1B) in the metastable region M. The valley 15a, which is better shown in FIG. 1D as being distributed over time, provides an area in which the ball 18 may oscillate back and forth for an indefinite period of time before moving on to finally rest in one of the stable states, L and H. This oscillatory behavior may be attributed in the particular model 10 of FIG. 1A to positive feedback in the feedback loop formed by the second amplifier A.sub.2 and the RC network. The occurrence of such oscillatory behavior may be rare in the short run (especially when the supplied actuation energy E.sub.act is thought to be relatively large) but there is a real possibility that oscillation will occur (even with large actuation energies) and this possibility can't be ignored when analyzing the long run.
FIG. 1C shows a so called first order model that is often used to study metastability. The model comprises two ideal NAND gates (hypothetical circuits of infinite gain bandwidth product) each having an input coupled to an output of the other through an RC network. When the delay time of each of the RC networks is increased, the input signal at the input of each NAND gate spends more time in the vicinity of the gate threshold level (gate switching point) and the probability of metastable behavior increases (i.e. there is a greater likelihood that random noise might position the system at the zero slope midpoint of its energy curve). This behavior of the model correlates well with various types of metastable behavior observed in real world circuits and hence lends support to the validity of the first order model.
The metastability phenomenon discussed thus far is perhaps better understood by reference to a concrete example. The number of variables that may affect metastable behavior will be appreciated by considering the circuit of FIG. 2. FIG. 2 illustrates the circuit topology of a well known bistable device 20 often referred to as an R-S flip flop. The device 20 is formed by cross coupling a pair of dual input NAND gates, 22 and 24. While the circuit topology of this device 20 is very well known in the art, experience with the metastable behavior of its various embodiments is limited and that behavior is thus not fully understood. This is particularly true of flip flop circuits fabricated with the newer high density VLSI technologies wherein new materials, narrower line widths, and so forth are utilized in place of older designs.
In the design of the bistable device 20, a first, inverted input 22a of NAND gate 22 is connected to receive a first input signal V.sub.in1. A first, inverted input 24a of NAND gate 24 is connected to receive a second input signal V.sub.in2. Second inputs, 22b and 24b, of NAND gates 22 and 24 are respectively coupled to outputs 24c and 22c of the NAND gates to thereby form multiple feedback loops. A number of oscillatory modes become possible because of this type of cross-coupled feedback topology.
Output 22c (node N.sub.o) will remain at a steady state low (L) if a high level (H) is present for a long time at the first input 24a (reset terminal) of NAND gate 24 and a low level (L) is simultaneously present for a long time at the first inverted input 22a (set terminal) of NAND gate 22. If a logic high (H) set pulse 26 of a short effective duration T.sub.act is included in the first input signal V.sub.in1, as shown in FIG. 2, the set pulse 26 will supply a finite amount of actuation energy E.sub.act to the bistable device 20. This energy is supposed to initiate the switching of output 22c from its logic low state (L) toward the logic high state (H). But the amount of actuation energy E.sub.act that is effectively presented by the set pulse 26 will depend on the height of the set pulse 26 above a threshold level V.sub.T1, that threshold level being one associated with the input of NAND gate 22 (i.e. such as the gate turn on threshold of MOSFET devices) and on the duration of the effective time period T.sub.act during which the set pulse 26 exceeds the threshold level V.sub.T1.
In most instances, say 98% for the sake of example here, the output 22c of NAND gate 22 will move from its initial low state (L) through the metastate (M) and come to rest in the high state (H) when the set pulse 26 is presented. (This 98% figure is, as practitioners of digital electronics will of course appreciate, a gross exaggeration. It is used here so simple numbers can be worked with. The probability of the L.fwdarw.M.fwdarw.H transition is usually 99.99% or better in commercial grade products.) There is a finite possibility however, say 1.5% for this example, that the output 22c will move from the L state to the M state and then return back to the L state prior to a preselected strobe time t.sub.s because an insufficient amount of actuation energy was supplied for assuredly switching the state of the output node N.sub.o. This event could happen, for example, when there is a large amount of noise present at the output node N.sub.o and the noise is fed back to the circuit inputs by way of one of the dual feedback paths or if node N.sub.o is heavily loaded. The L.fwdarw.M.fwdarw.L outcome may at times be considered a failure of the bistable device 20 if it is known that a valid logic high (H) was presented at the set input 22a, a valid low (L) was simultaneously present at the reset input 24a and that output node N.sub.o was therefore supposed to switch from the L state to the H state. If a train of set pulses 26 having very short effective durations (e.g. a T.sub.act of a few nanoseconds or less) and reset pulses 27 are repeatedly supplied to the R-S flip flop 20 at a rate of say one every 10 microseconds (a 100 KHz data rate), it can be predicted that the device 20 will produce an incorrect result on the average of approximately once every 667 microseconds in our example. If the probability of the L.fwdarw.M.fwdarw.L sequence were to somehow drop from 1.5% to say 0.5%, the MTBF will increase threefold to 2000 microseconds. 
If the data rate were to be reduced to one pulse every 100 microseconds (a data frequency of 10 KHz) instead of one every 10 microseconds then the MTBF will increase by tenfold to approximately one error every 6,667 microseconds. It can be seen from this that the MTBF is proportional to the inverse of the data rate and the inverse of the failure mechanism probability.
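The arithmetic of this example can be checked with a short sketch. It assumes that each pulse fails independently with a fixed probability, so that the mean time between failures is simply the pulse period divided by that probability. The function name is hypothetical.

```python
def mtbf_us(pulse_period_us, failure_probability):
    """Mean time between failures, in microseconds.

    Assumes each pulse fails independently with the given probability,
    so failures arrive at an average rate of
    failure_probability / pulse_period_us.
    """
    return pulse_period_us / failure_probability

# The three cases worked in the text.
print(round(mtbf_us(10, 0.015)))   # 100 KHz data rate, 1.5% failure probability
print(round(mtbf_us(10, 0.005)))   # 100 KHz data rate, 0.5% failure probability
print(round(mtbf_us(100, 0.015)))  # 10 KHz data rate, 1.5% failure probability
```

The three calls reproduce the approximate figures of 667, 2000 and 6,667 microseconds quoted in the example.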
The above is but one of many environments in which the device 20 may be operating. In some cases, the bistable device 20 (R-S flip flop) is required to supply a valid output level within a predetermined, fixed time interval, t.sub.0 to t.sub.s, i.e. before a reset pulse 27 and/or a new system clock pulse arrives. Its output state after the interval is ignored. There is a finite probability, say 0.5% for the sake of our example, that the output 22c will still be in the M state (e.g. still oscillating) when the predetermined time interval terminates. In such a case, there is a probability that the M state output will be incorrectly interpreted by subsequent circuitry as a logic low L when the result was supposed to be a logic high H or that the M state at the output 22c will trigger further undesirable oscillations when presented to the digital input(s) of the subsequent circuitry (not shown). Such an occurrence constitutes a failure of the bistable device 20 which must be accounted for when studying the reliability of a digital decision making system.
Even when the peak magnitude of set pulse 26 is sufficiently large relative to the input threshold level V.sub.T1 to assuredly switch device 20 under normal timing constraints, the effectiveness of that peak magnitude may be cut short by the counter-action of a second signal at a second input terminal. If the first input 24a (reset terminal) of NAND gate 24 receives a reset pulse 27 at some time t.sub.G2 well after the application of set pulse 26, this will usually not affect the outcome of the above-described switching operation. On the other hand, if the second input signal V.sub.in2 includes a reset pulse 27 whose leading edge (a low to high transition) occurs during the same time when the set pulse 26 is still being presented, the leading edge of the reset pulse 27 may cut down the effective actuation time T.sub.act of the applied set pulse 26. This can increase the probability that set pulse 26 will fail to switch output 22c from the L state to the H state. The probability of failure will approach 100% as the effective introduction time t.sub.G2 of the reset pulse 27 in the second input signal V.sub.in2 (when V.sub.in2 exceeds a second threshold level V.sub.T2) is pushed backwards in time towards the gating time t.sub.G1 at which the first input signal V.sub.in1 first begins to overcome the first threshold level V.sub.T1. This last scenario, in which the reset pulse 27 shifts in time to overlap at least a portion of the set pulse 26 (t.sub.G2 approximately equal to t.sub.G1), can occur with regularity in asynchronous types of logic circuits. It raises the probability of generating a metastable event, and therefrom an erroneous output.
A test environment needs to be devised to take into account the above and possibly other failure mechanisms. The effective duration T.sub.act of set pulse 26 (phase difference between rising edge of set pulse 26 and rising edge of reset pulse 27) might shrink to a point where it has to be measured in terms of picoseconds (10.sup.-12 seconds) or even smaller units (e.g. femtoseconds (10.sup.-15 seconds)). If a digital logic circuit is appropriately designed to avoid signal race problems, the likelihood that a reset pulse 27 (or a flip flop clocking signal) will coincide with a set pulse 26 (flip flop data signal) is relatively small when considered over the short run (e.g. one hour of operation). But given all the unpredictable effects of circuit environment on signal propagation delays, the close coincidence of flip flop set and reset pulses or flip flop clock and data edges is an event which will probably occur sometime during the lifetime (e.g. ten years) of an arbitrary set of digital logic circuits. An accurate method is needed for quickly determining the probability of failure from this and/or others of the above described arbitration conditions.
A test circuit 30, as illustrated in FIG. 3, has been proposed for studying the behavior of flip flops. A test flip flop FF.sub.t having a decision output Q.sub.F is formed on an integrated circuit chip 32 together with a first sampling flip flop FF.sub.s1, a second sampling flip flop FF.sub.s2, and an exclusive OR gate 34. Connection pads 32a and 32b are provided on the IC chip 32 for supplying a DATA signal of a first frequency f.sub.D (e.g. 1.5 MHz) to one input of the test flip flop FF.sub.t and supplying a CLOCK signal of a second frequency f.sub.C (e.g. 2 MHz) to a second input of the test flip flop FF.sub.t. Additional connection pads, 32c and 32e, are provided on the IC chip 32 for supplying first and second strobe signals, STROBE.sub.1 and STROBE.sub.2, to the clock inputs of sampling flip flops FF.sub.s1 and FF.sub.s2. The output of exclusive OR gate 34 is supplied through yet another connection pad 32d to an external error counting device 36. Power supply connections and miscellaneous control lines are not shown.
An error condition or "event" is defined to occur when there is a difference in level between the Q outputs of sampling flip flops FF.sub.s1 and FF.sub.s2 at some time period, t.sub.0 +y, following the time t.sub.0 of a triggering edge of the CLOCK signal. Two variable delay circuits, 38 and 40, are provided to respectively supply the STROBE.sub.1 and STROBE.sub.2 signals to connection pads 32c and 32e of the IC chip. The delay circuits, 38 and 40, are precision types located off of the chip 32 and designed to be adjusted manually to set the arrival times of their respective strobe signals to first and second strobe times, t.sub.0 +x and t.sub.0 +y. The frequency f.sub.D of the DATA signal is set to a value (e.g. 1.5 MHz) that is not a harmonic of the frequency f.sub.C (e.g. 2 MHz) of the CLOCK signal so that coincidence between edges of the DATA and CLOCK signals will occur randomly. Sampling flip flops FF.sub.s1 and FF.sub.s2 are supposed to capture and hold the states of output Q.sub.F of the test flip flop FF.sub.t at different points in time, t=t.sub.0 +x and t=t.sub.0 +y, in response to the STROBE.sub.1 and STROBE.sub.2 signals.
FIG. 3B illustrates some possible waveforms or "paths" that may be taken by the decision output Q.sub.F of a test flip flop FF.sub.t over time. Q.sub.F is assumed to be at logic low (L) prior to the arrival of a triggering edge of the CLOCK signal at time t.sub.0. No change occurs in Q.sub.F during a propagation delay period T.sub.PD of the test flip flop FF.sub.t. At a second time t.sub.1 the output Q.sub.F begins to move out of the L state and into the metastate (M state). A finite gain associated with the test flip flop FF.sub.t inhibits the output Q.sub.F from fully slewing to either a H or L level until a third time, t.sub.2. Due to the metastability phenomenon described above, the output Q.sub.F may stochastically select one of many different paths, including oscillatory ones of indefinite length, to finally arrive at either the H level or the L level. For practical intents, the outcome state of output Q.sub.F is assumed to be 100% resolved by some predetermined fourth time t.sub.3 (say t=t.sub.0 +100 nanoseconds) following time t.sub.2 even though theoretically the final state of Q.sub.F is never resolved 100%. But at some arbitrary strobe time t.sub.s prior to t.sub.3, the output Q.sub.F might be still meandering or oscillating about a nondigital level M.sub.0 which is neither a valid logic high (H) nor a valid logic low (L).
It should be understood that for modern devices, the periods between times t.sub.0 through t.sub.3 will usually be measured in very small units such as 60 nanoseconds or less. It should also be understood that modern circuits usually require a valid output level, H or L but not M.sub.0, in a much shorter period, t.sub.0 to t.sub.s. The period t.sub.0 to t.sub.s is often measured in tens of nanoseconds or less (e.g. from 25 ns down to 1.5 ns or less).
The delay value y corresponding to variable delay circuit 40 can be easily adjusted to exceed time t.sub.3 using standard delay methods. The crucial problem with the circuit 30 is that the delay value x, corresponding to variable delay circuit 38, needs to be finely adjusted to different values between times t.sub.0 and t.sub.3 to a resolution of say, one nanosecond or less. This is not easy. A multitude of factors including subtle temperature changes, power supply changes, the level of background radiation (e.g. cosmic rays), etc. can shift the actual delay of circuit 38 by a few nanoseconds. The resolution of circuit 38 is therefore rarely better than one or two nanoseconds and this creates problems. A shift of just one or two nanoseconds can alter test results substantially.
In use, the test circuit 30 is allowed to run for a predetermined length of time (e.g. five minutes) at a particular setting of the delay circuit 38 (e.g. 5 nanoseconds .+-. a resolution error) and the number of errors accumulated in counter 36 for each assumed setting of the x value is then recorded.
Referring to FIG. 4, a logarithmic plot of the number of errors detected by counter 36 during a predetermined test period is prepared for every assumed value of x corresponding to a particular setting of the variable delay circuit 38. It can be shown mathematically (assuming the first order model of FIG. 1C) that this logarithmic plot 40 should include a linear portion 40a satisfying the formula:
P(error)=A e.sup.-Bx =1/(f.sub.C f.sub.D MTBF)
or written in logarithmic form:
log [P(error)]=log [A]-Bx=log [1/(f.sub.C f.sub.D MTBF)]
where P(error) is the probability of an event (erroneous outcome) during the test period and MTBF is the average or mean time between failure associated with the flip flop FF.sub.t for given data rate f.sub.D and clock rate f.sub.C. Parameters A and B are constants defining the linear portion 40a of the P(error) curve.
By varying the value of x, it can be experimentally shown that the number of failures detected by the test circuit 30 of FIG. 3A will approach 100% while x is smaller than the propagation delay time T.sub.PD of the test flip flop FF.sub.t (FIG. 3B). The number of errors detected by counter 36 will begin to converge on lower values of error probability once x is increased beyond the second time t.sub.1 of the graph shown in FIG. 3B. When x is increased to surpass the fourth time t.sub.3 of FIG. 3B and the test time is limited to some finite length (e.g. five minutes), the number of errors found in counter 36 will flatten out. As such, the test circuit 30 of FIG. 3A should theoretically be capable of producing values of P(error) corresponding to plot points of assumed x values in the narrow time window between t.sub.0 and t.sub.3 (e.g. 0 to 100 nanoseconds). No plotting points will be produced for assumed values of x outside the window of t.sub.3 -t.sub.0, but since the segment 40a is assumed to be linear (for the first order model), the graph can be extended by means of linear extrapolation to predict the probability of error for larger values of x (e.g. on the order of 1 microsecond or greater) exceeding t.sub.3 -t.sub.0. The observed values of constants A and B can then be inserted into the formula:
MTBF=1/[f.sub.C f.sub.D A e.sup.-Bx ]
to solve for MTBF given x values much greater than t.sub.3.
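The fitting and extrapolation procedure described above may be sketched as follows. The sketch assumes natural logarithms, treats A as carrying units of seconds so that MTBF comes out in seconds, and uses synthetic plot points generated from hypothetical values of A and B rather than measured data. The function names are likewise hypothetical.

```python
import math

def fit_log_linear(xs_ns, p_errors):
    """Least-squares fit of ln(P) = ln(A) - B*x. Returns (A, B)."""
    ys = [math.log(p) for p in p_errors]
    n = len(xs_ns)
    mean_x = sum(xs_ns) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs_ns)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs_ns, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def mtbf_seconds(x_ns, a, b, f_clock_hz, f_data_hz):
    """MTBF = 1 / (f_C * f_D * A * exp(-B*x)), with B in 1/ns."""
    return 1.0 / (f_clock_hz * f_data_hz * a * math.exp(-b * x_ns))

# Synthetic plot points on the linear portion (hypothetical A and B).
true_a, true_b = 1e-6, 0.5
xs = [10.0, 15.0, 20.0, 25.0]
ps = [true_a * math.exp(-true_b * x) for x in xs]

a, b = fit_log_linear(xs, ps)
# Extrapolate to x = 100 ns with f_C = 2 MHz and f_D = 1.5 MHz.
print(f"A = {a:.3e}, B = {b:.3f} per ns")
print(f"MTBF at x = 100 ns: {mtbf_seconds(100.0, a, b, 2e6, 1.5e6):.3e} s")
```

Because the synthetic points lie exactly on a line, the fit recovers the assumed constants. With real counter readings the scatter of the points would limit how far the extrapolation can be trusted.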
It was hoped that the circuit 30 of FIG. 3A could be used for accurately calculating the mean time between failure (MTBF) in a repeatable manner for different clock frequencies f.sub.C and different data frequencies f.sub.D. Laboratory experiments failed to uphold this expectation. When measurements were taken, a different plot, i.e. 40b, having different A and B parameters was produced with each test run. The different plots appeared to be shifted in time (x scale) by amounts on the order of perhaps a few nanoseconds. These minor shifts in time were believed to be due to the resolution limit of the variable delay circuit 38. But no way for improving the resolution of delay circuit 38 was apparent. Even minute shifts on the order of a fraction of a nanosecond can shift values along the MTBF scale by as much as an order of magnitude or more. Results obtained with test circuit 30 were not repeatable, and as such, a question arose with respect to how accurately the results can predict the MTBF of critical logic devices and what confidence can be placed in such predictions. A more accurate method for predicting the MTBF of digital decision making circuits is needed.
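The sensitivity of the predicted MTBF to small timing shifts follows from the exponential form MTBF=1/[f.sub.C f.sub.D A e.sup.-Bx ]: a shift of dx in x multiplies the prediction by e.sup.B dx. The sketch below illustrates this with a hypothetical slope B of 5 per nanosecond, a value chosen only for illustration and not taken from any measurement. The function names are also hypothetical.

```python
import math

def mtbf_shift_factor(b_per_ns, dx_ns):
    """Factor by which the predicted MTBF changes when x is off by dx.

    Follows from MTBF being proportional to exp(B * x).
    """
    return math.exp(b_per_ns * dx_ns)

def shift_for_factor(b_per_ns, factor):
    """Timing shift, in ns, that changes the predicted MTBF by `factor`."""
    return math.log(factor) / b_per_ns

b = 5.0  # hypothetical slope of the linear portion, in 1/ns
dx = shift_for_factor(b, 10.0)
print(f"shift giving a tenfold MTBF change: {dx:.3f} ns")
print(f"MTBF factor for a 1 ns shift: {mtbf_shift_factor(b, 1.0):.1f}")
```

With this assumed slope, a shift of less than half a nanosecond moves the prediction by a full order of magnitude, which is consistent with the lack of repeatability described above.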
A test circuit comprising an analog type of voltage controlled variable delay circuit has been proposed by F. Rosenberger and T. J. Chaney in their paper "Flip-Flop Resolving Time Test Circuit", IEEE J. Solid State Circuits, Vol. SC-17, No. 4, August 1982. The analog-based test circuit failed to produce consistent results over multiple test runs and thus little confidence could be placed in MTBF predictions made with such a circuit. A different approach had to be found.