Electronic circuits, in particular digital circuits having for example semiconductor components, are exposed to external influences that can cause undesired changes in their behavior. A correct, fault-free behavior of the circuit can be expected by the user when all operating parameters, such as operating voltage, temperature, mechanical load, etc., are within the specified limits. If one or more parameters are outside these limits, systematic faulty behavior may be observed.
However, faulty behavior can also be triggered by other external influences, such as electromagnetic radiation or high-energy particles from cosmic radiation, radioactive decay products, etc. How often such radiation events occur depends in particular on the location at which the circuit is used (on the surface of the earth, elevation above sea level, proximity to particular sources of radiation) and on the sensitivity of the circuit itself. Here it should be kept in mind that the sensitivity of the circuit generally increases strongly as the feature size of the circuit components decreases.
Occurring faults can be divided into two groups, namely permanent faults, which bring about a lasting change in the circuit and therefore a defect, and transient faults, which cause a temporary change in the state or behavior of the circuit.
Transient faults can in turn be divided into two groups:
Single-event transient (SET): brief disturbing impulse in the voltage level of a line;
Single-event upset (SEU): inversion or change of the state, or of the stored information, in a memory cell.
There are many scientific publications that deal with the fault masking of SEUs, in particular in microprocessors. In these, the term “Architecturally Correct Execution” (ACE) bit is defined. ACE bits are all memory cells in which a fault has an effect on the system output.
Alongside this, all bits that cannot influence the instruction path within the processor are designated “microarchitectural un-ACE” bits. These can occur in idle states, during speculative calculation, and in predictive structures (predictors). Frequently, values calculated there are not used, and therefore also have no effect (un-ACE).
As a third group, “architectural un-ACE” bits are defined, which do have an effect on the result of a single instruction, but have no effect on the system output. These can occur in the case of NOP (no operation) instructions, performance-enhancing hints such as prefetch, instructions with a predicate register, logic-masking effects of the operands, and so-called dynamically dead instructions. Among the latter, a further distinction is made between “first-level dynamically dead instructions” (FDD), e.g., the first of two write accesses to the same address when the first value is not read between the two accesses, and “transitively dynamically dead instructions” (TDD), which produce results that are used only by FDDs or TDDs.
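The FDD/TDD distinction can be illustrated with a minimal sketch. It assumes a simplified, hypothetical trace model that is not taken from the cited publication: each instruction writes at most one register, reads a tuple of source registers, and only instructions flagged `is_output` reach the system output. The names `Instr` and `classify` are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    dest: Optional[str]        # register written, or None
    srcs: tuple = ()           # registers read
    is_output: bool = False    # assumption: only these reach the system output

def classify(trace):
    """Label each instruction of a dynamic trace as 'live', 'FDD' or 'TDD'."""
    consumers = [[] for _ in trace]   # consumers[j]: instructions reading j's result
    last_writer = {}                  # register -> index of most recent writer
    for j, ins in enumerate(trace):
        for src in ins.srcs:
            if src in last_writer:
                consumers[last_writer[src]].append(j)
        if ins.dest is not None:
            last_writer[ins.dest] = j

    labels = [None] * len(trace)
    # Consumers always come later in the trace, so walk backward.
    for j in range(len(trace) - 1, -1, -1):
        if trace[j].is_output:
            labels[j] = "live"
        elif not consumers[j]:
            labels[j] = "FDD"         # result never read before being overwritten
        elif all(labels[k] in ("FDD", "TDD") for k in consumers[j]):
            labels[j] = "TDD"         # result read only by dead instructions
        else:
            labels[j] = "live"
    return labels

trace = [
    Instr("r1"),                             # read only by the FDD below
    Instr("r2", ("r1",)),                    # overwritten before any read
    Instr("r2"),                             # read by the output instruction
    Instr(None, ("r2",), is_output=True),
]
print(classify(trace))
```

In this sketch, a value still held in a register at the end of the trace without any reader is treated as dead; a real analysis would need to know which values remain live beyond the observed window.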
Concerning the above, reference is made to the publication of Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, Todd Austin: “A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor,” IEEE, 2003.
Combinatorial digital circuits are distinguished by their deterministic behavior. This has the consequence that with a given digital logic function and given input values, the output value can be unambiguously determined. If a transient fault occurs in one or more input signals to a logic function with an output (1 bit), a faulty output signal may occur, as a function of the input signals and the logic function. Whether a particular fault causes a deviation from the expected behavior of the circuit at one of the outputs, i.e., the fault becomes visible, is referred to as observability, or fault observability. Here it is to be noted that not every fault becomes visible as a faulty output; this is referred to as masking, or fault masking.
The sensitivity of a specific combination of input signals relative to a specific fault can be determined using the Boolean difference. If the Boolean difference for a function input is equal to 1, a change in this input signal will cause a change in the output signal. In general, one speaks of a sensitive path from an input to an output if a change in this one input signal causes a change in the output signal.
Boolean function: $f(x_1, \ldots, x_n) \in \{0,1\}$, $x_i \in \{0,1\}$
Boolean difference:
$$\frac{df}{dx_i} = f(x_1, \ldots, x_i, \ldots, x_n) \oplus f(x_1, \ldots, \bar{x}_i, \ldots, x_n)$$
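The Boolean difference can be evaluated directly by computing the function twice, once with the input of interest inverted. The following is a minimal sketch; the helper name `boolean_difference` and the example function `and_or` are hypothetical and chosen only for illustration.

```python
from itertools import product

def boolean_difference(f, x, i):
    """Return f(x) XOR f(x with input i inverted); i is a 0-based index."""
    flipped = list(x)
    flipped[i] ^= 1
    return f(x) ^ f(flipped)

def and_or(x):
    """Hypothetical example function: (x1 AND x2) OR x3."""
    return (x[0] & x[1]) | x[2]

# For every input combination, list which inputs lie on a sensitive path.
for x in product((0, 1), repeat=3):
    print(x, [boolean_difference(and_or, x, i) for i in range(3)])
```

A result of 1 for input i means that a change in this input alone changes the output for the given input combination; a result of 0 means that a fault on this input is masked.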
The result of the Boolean difference for each input signal, the fraction of time during which a specific input combination is present, and the probability of a fault on an individual signal together enable the calculation of a fault probability or fault masking probability. In the case of multistage logic, the results of the individual stages must be corrected for the correlation between signals.
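For a single logic stage, this calculation can be sketched as follows. The sketch assumes that at most one input is faulty at a time and does not model the correlation correction needed for multistage logic; the function names, the example function, and the numeric values are illustrative assumptions.

```python
from itertools import product

def visibility(f, n, i, input_prob=None):
    """Probability that a single fault on input i (0-based) changes the output.

    input_prob maps input tuples to the fraction of time they are present;
    if omitted, all 2**n combinations are assumed equally likely.
    """
    total = 0.0
    for x in product((0, 1), repeat=n):
        p_x = input_prob.get(x, 0.0) if input_prob is not None else 1.0 / 2 ** n
        flipped = list(x)
        flipped[i] ^= 1
        total += p_x * (f(x) ^ f(flipped))   # weight of the Boolean difference
    return total

def output_fault_probability(f, n, signal_fault_prob, input_prob=None):
    """First-order estimate: sum of P(fault on input i) * P(visible | fault on i)."""
    return sum(p * visibility(f, n, i, input_prob)
               for i, p in enumerate(signal_fault_prob))

def example(x):
    """Hypothetical logic function: (x1 AND x2) OR x3."""
    return (x[0] & x[1]) | x[2]

# Fault masking probability per input is the complement of the visibility.
masking = [1.0 - visibility(example, 3, i) for i in range(3)]
print(masking)
print(output_fault_probability(example, 3, [1e-6, 1e-6, 1e-6]))
```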
Reference is made here to the publication of Ming Zhang, Naresh R. Shanbhag: A Soft Error Rate Analysis (SERA) Methodology, IEEE, 2004, and to US 2007/0226572 A1.
For sequential circuits (synchronous circuit technology), timing behavior also plays a large role. In every larger circuit there are a large number of nodes that are not important for the functioning of the circuit at every point in time. Therefore, fault masking effects can also be observed over time. The properties of the circuit prevent a portion of the occurring faults from becoming visible at the output. The ratio of visible faults to actually occurring faults is referred to as the derating factor.
In this thematic area, the following terms are used:
Timing Derating (TD):
Timing derating is an effect that arises from the runtime (propagation delay) of a signal from one register or latch to the next, i.e., while passing through one stage of a synchronous circuit design.
Because of the signal runtime through the logic gates and lines (the logic path) between two storage elements (registers or latches), a fault (SEU) that occurs in the storage element at the beginning of this logic path does not always reach the end of the path in time for the sampling instant. In that case, the fault is not propagated into the next stage of the circuit, but is instead masked.
The excess time available for the propagation of a signal within a synchronous circuit stage (clock period tClk minus the signal runtime through the logic path tDelay) is referred to as slack. All SEUs at the storage element at the beginning of the logic path that occur less than tDelay before the sampling time of the storage element at the end of the logic path have no effect on the value of the sampled signal. Therefore, the ratio of the slack to the clock period can be regarded as the timing derating factor.
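As a worked illustration of this ratio, consider assumed example values of tClk = 10 ns and tDelay = 7 ns (chosen for illustration, not taken from any particular circuit):

```latex
\[
t_{\mathrm{Slack}} = t_{\mathrm{Clk}} - t_{\mathrm{Delay}}, \qquad
TD = \frac{t_{\mathrm{Slack}}}{t_{\mathrm{Clk}}}
   = \frac{10\,\mathrm{ns} - 7\,\mathrm{ns}}{10\,\mathrm{ns}} = 0.3
\]
```

Under the additional assumption that SEUs are uniformly distributed over the clock period, about 30% of the SEUs at the first storage element would still be sampled at the end of the logic path in this example, while the remaining 70% would be masked.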
Logic Derating (LD):
So-called logic derating is the reduction of visible faults in relation to the actual number of faults on the basis of the overall logical function of a circuit. Logic derating depends both on how the circuit is used and on the architecture of the circuit itself. Whenever a register content is faulty but its state is not processed further, one speaks of logic derating; in a processor, information from clock gating or from branch prediction can be used to identify such cases. The designations “soft error sensitivity factors” or “vulnerability factors” are also used as alternatives.
Reference is made here to the publication of Hang T. Nguyen, Yoad Yagil, Norbert Seifert, Mike Reitsma: Chip-Level Error Estimation Method, IEEE, 2005.
If all masking effects under consideration are combined in a single factor, one speaks of an Architectural Vulnerability Factor (AVF). The probability that a fault of a particular component will influence the circuit output is calculated here from the base fault rate, which is dependent on the technology, multiplied by the AVF.
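The multiplication of base fault rate and AVF can be sketched as follows. The component names, base rates, and AVF values are hypothetical placeholders and not measured data.

```python
def visible_fault_rate(base_rate, avf):
    """Technology-dependent base fault rate scaled by the AVF (0 <= AVF <= 1)."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF must lie between 0 and 1")
    return base_rate * avf

# Hypothetical components: (base fault rate in arbitrary units, AVF).
components = {
    "register_file": (100.0, 0.20),
    "branch_predictor": (50.0, 0.0),    # assumed to hold only un-ACE bits
    "instruction_queue": (80.0, 0.30),
}

rates = {name: visible_fault_rate(base, avf)
         for name, (base, avf) in components.items()}
for name, rate in rates.items():
    print(name, rate)
```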
Reference is made here to the publication of Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, Todd Austin: “A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor,” IEEE, 2003.
In addition to analytical methods, circuit simulation can be used to determine an overall masking factor by injecting faults into the circuit. For this purpose, the waveforms of all output signals of the circuit are first recorded for a fixed set of input stimuli. This recording is used as the reference for fault-free operation of the circuit.
In fault injection, faulty values are introduced into the circuit in a stochastically distributed manner over the entire circuit and over the entire simulation time period. After a single fault has been injected into the waveform of a signal at a fault location, the simulation is continued normally, and the output vector, namely the totality of all output signals, is observed for a predefined time period. Within this time period, the output vector is compared to the fault-free reference as a target value, and any differences are noted. If there is at least one visible fault, this simulation run is evaluated as faulty. The relationship between fault location and effect at the output is stored.
Fault injection must be carried out in the context of an entire campaign, i.e., many simulation runs using different faults. The results obtained in this way are then combined for each fault location. Per fault location, the number of simulation runs evaluated as faulty is set in relation to the number of injected faults. This ratio is the fault masking factor for a signal.
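A fault injection campaign of this kind can be sketched on a toy circuit. The circuit (out = (a AND b) OR c), the signal names, and the functions `simulate` and `campaign` are assumptions made for illustration; because the toy circuit is purely combinatorial, a fault can only become visible in the time step in which it is injected.

```python
import random
from itertools import product

def simulate(stimuli, fault=None):
    """Simulate out = (a AND b) OR c for each stimulus.

    fault is None or (time_step, signal_name); the named signal is inverted
    in that time step only.
    """
    outputs = []
    for t, (a, b, c) in enumerate(stimuli):
        def value(name, v):
            return v ^ 1 if fault == (t, name) else v
        a, b, c = value("a", a), value("b", b), value("c", c)
        n1 = value("n1", a & b)          # internal node
        outputs.append(n1 | c)
    return outputs

def campaign(stimuli, locations, runs, seed=0):
    """Inject one random fault per run; return visible/injected per location."""
    rng = random.Random(seed)
    reference = simulate(stimuli)        # fault-free reference run
    injected = {loc: 0 for loc in locations}
    visible = {loc: 0 for loc in locations}
    for _ in range(runs):
        loc = rng.choice(locations)
        t = rng.randrange(len(stimuli))
        injected[loc] += 1
        if simulate(stimuli, fault=(t, loc)) != reference:
            visible[loc] += 1            # run evaluated as faulty
    return {loc: visible[loc] / injected[loc]
            for loc in locations if injected[loc]}

stimuli = list(product((0, 1), repeat=3))
print(campaign(stimuli, ["a", "b", "c", "n1"], runs=4000))
```

The ratios fluctuate from campaign to campaign and stabilize only as the number of runs grows, which reflects the dependence of the precision on the number of injected faults.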
The determination of a masking factor through fault injection requires a very high computing expense, because each simulation run permits a conclusion about only one specific fault. The precision of the results obtained by fault injection depends on the number of simulation runs, i.e., on the number of injected faults. A high degree of statistical precision is achieved only with a large number of injected faults.
U.S. Application Publication No. US 2005/0283950 A1 describes a method for reducing false detections of faults in microprocessors by tracking so-called dynamically dead instructions. In this method, it is monitored whether a given instruction is a dynamically dead instruction. In this way, false positives can also be reduced.
In addition to faults that occur during circuit operation, manufacturing faults in the circuits must also be detected. As a rule, circuit faults are detected by a test in the production facility, and possibly at the beginning of or during circuit operation, by applying defined test patterns. However, when these test patterns are generated, it is often not yet known which faults the test pattern set will detect.
The tracing of critical paths (Critical Path Tracing, CPT) in integrated circuits having combinatorial functioning has been carried out for many years to make it possible to determine the test coverage of a test pattern set. In CPT, sensitive paths are calculated using the Boolean difference, beginning from the primary outputs and going back to the primary inputs. Many scientific publications on this method also take into account, in particular, the effects of reconvergent paths. In general, these paths are represented and analyzed by creating a reconvergence graph. By taking into account the specific structure and properties of this graph, the effects of self-masking and multiple-path sensitization can be taken into account. As a result, CPT yields all sensitive paths of a circuit for one circuit state. A sensitive path means that all circuit nodes on this path are observable, i.e., a fault at such a node would become visible in the form of a deviating output signal. From this it can be inferred that the input signals of the circuit state currently being examined are a test vector for stuck-at faults of the opposite (negated) value of the digital signal level currently present at every circuit node on every sensitive path (e.g., a signal level of logical 1 tests for stuck-at 0, and vice versa). CPT can therefore be used for the fast parallel determination of the test coverage (fault grading) of combinatorial circuits.
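The backward tracing of sensitive paths can be sketched for the simplest case, a fanout-free combinatorial circuit, where a gate-local Boolean difference is sufficient. Reconvergent paths, self-masking, and multiple-path effects are deliberately not handled here. The netlist format and the function names are assumptions made for illustration.

```python
GATES = {
    "AND": lambda ins: int(all(ins)),
    "OR": lambda ins: int(any(ins)),
    "NOT": lambda ins: 1 - ins[0],
}

def evaluate(netlist, inputs):
    """Compute the value of every node for the given primary input values."""
    values = dict(inputs)
    def val(node):
        if node not in values:
            kind, fanin = netlist[node]
            values[node] = GATES[kind]([val(n) for n in fanin])
        return values[node]
    for node in netlist:
        val(node)
    return values

def critical_nodes(netlist, values, output):
    """Trace sensitive paths from the primary output back toward the inputs."""
    critical, stack = set(), [output]
    while stack:
        node = stack.pop()
        critical.add(node)
        if node not in netlist:          # primary input reached
            continue
        kind, fanin = netlist[node]
        ins = [values[n] for n in fanin]
        for i, n in enumerate(fanin):
            flipped = ins.copy()
            flipped[i] ^= 1              # gate-local Boolean difference
            if GATES[kind](flipped) != values[node]:
                stack.append(n)
    return critical

def detected_stuck_at(values, critical):
    """A critical node at level v tests for stuck-at (1 - v) at that node."""
    return {(n, 1 - values[n]) for n in critical}

# Hypothetical netlist: n1 = a AND b, out = n1 OR c.
netlist = {"n1": ("AND", ["a", "b"]), "out": ("OR", ["n1", "c"])}
values = evaluate(netlist, {"a": 1, "b": 1, "c": 0})
critical = critical_nodes(netlist, values, "out")
print(sorted(detected_stuck_at(values, critical)))
```

Applying this to every vector of a test pattern set and collecting the detected faults gives a simple estimate of the test coverage for such a circuit.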
Through an extension, CPT can also be used for sequential circuits; here, fault lists of possibly detectable faults are stored in sequential elements and propagated forward. The faults contained in these lists are not detectable until the fault lists reach a primary output. Because many fault lists are discarded on non-sensitive paths, a large amount of unnecessary computing expense is incurred.
In this connection, reference is made to the publication of Lei Wu, D. M. H. Walker: A Fast Algorithm for Critical Path Tracing in VLSI Digital Circuits, IEEE, 2005, and the publication of P. Menon, Y. Levendel, M. Abramovici: SCRIPT: A Critical Path Tracing Algorithm for Synchronous Sequential Circuits, IEEE, 1991.