The present invention relates generally to system reliability, and more specifically, to reliability circuits using threshold based comparison.
Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random access memory (DRAM) to spontaneously flip to the opposite state. The majority of one-off (“soft”) errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read/write them.
Field programmable gate arrays (FPGAs) are general-purpose logic devices comprising a variety of interconnectable logic resources that are configurable by the end-user to perform a wide variety of functions. Typical FPGAs comprise three types of configurable elements: configurable logic blocks (CLBs), input/output blocks, and interconnects. FPGAs that rely on static latches for their programming elements, also known as SRAM FPGAs, are reconfigurable, meaning they can be reprogrammed with the same or different configuration data. Application specific integrated circuits (ASICs) and Anti-fuse FPGAs cannot be reconfigured.
Manufacturers of systems expected to be exposed to significant levels of radiation, including space-bound systems, favor the lower cost, easier and faster system development, and increased performance of commercial off-the-shelf technology such as SRAM FPGAs. Concerns arise, however, with the ability of technology designed for use on earth to perform reliably in a high-radiation environment. Such reliability is measured in terms of susceptibility to long-term absorption of radiation, referred to as total ionizing dose (TID), and effects caused by the interaction of a single energetic particle, referred to as single event effects (SEE). An SEE occurs when a single particle strikes a sensitive point on a susceptible device and deposits sufficient energy to cause either a hard or soft error. A soft error, or single event upset (SEU) occurs when a transient pulse or bit flip in a device causes an error detectable at the device output. SEUs may alter the logic state of any static memory element (latch, flip-flop, or RAM cell). Since the user-programmed functionality of an SRAM FPGA depends on the data stored in millions of configuration latches within the device, an SEU in the configuration memory array may have adverse effects on the expected functionality.
Techniques used for mitigating, detecting and correcting the effects of SEUs in a circuit depend on the criticality, sensitivity, and nature of the system in question. Known mitigation techniques for use in memory and other data-related devices include parity checking and use of a Hamming, Reed-Solomon (RS), or convolutional code schemes. SEU mitigation in control-related devices is somewhat more difficult because they are, by nature, more vulnerable to SEUs and often more critical to spacecraft operation. Common control-related SEU mitigation techniques include redundant systems, watchdog timers, error detection and correction (EDAC), and current limiting. Unfortunately, many of these techniques for mitigating SEU effects in SRAM FPGAs tend to require substantial configurable logic block (CLB) resources, and can disrupt device and user function.
System redundancy involves multiple identical systems operating in lockstep with synchronized clocking. Errors, which might otherwise not be immediately noticeable, are detected when outputs disagree. Two identical systems in lockstep operation provide minimal protection, and, by way of correction, both systems must be reinitialized when an error is detected. Threefold redundancy is used because, based on the assumption that any two of the three devices will always be error free, only the device whose output disagrees with the other two need be reconfigured. Thus, the system is able to continue functioning on two of the devices during the short interval needed to reconfigure the upset device.
A voting scheme makes threefold redundancy possible: a voting circuit chooses the output agreed upon by a majority of the devices and disregards the remaining device if its output disagrees with that of the majority. Such a triple modular redundancy (TMR) voting scheme has been SEU-tested for systems employing FPGAs, but requires over two-thirds of the FPGAs' gates, or multiple runs through one or more modules with each result stored until the repetition has completed. Unfortunately, the voting circuit, if implemented in SRAM cells, is itself susceptible to SEU effects. Additionally, these systems are susceptible to multiple simultaneous errors, resulting in cases where no vote can be made because the three copies each have different values, or worse, the case where an incorrect vote is made because two copies arrive at the same incorrect value. Furthermore, applying TMR techniques to internal flip-flops alone is insufficient by itself because it may very well be the circuit that precedes the flip-flops that fails, thereby causing all three redundant flip-flops to load the same incorrect value.
Design mitigation techniques, such as triple redundancy, can harden functionality against single event upsets. However, mitigation techniques alone do not correct the erroneous results of SEUs and such errors can accumulate over time. Error detection techniques include reading back the entire configuration data memory and performing a bit-for-bit comparison against data known to be correct. Error correction techniques include complete reconfiguration of the entire configuration data memory using data known to be correct. Both techniques are inefficient, can require additional hardware, can require substantial configurable logic block (CLB) resources, and can disrupt device and user function.