Various fields of application of integrated circuits (e.g., applications in the biomedical, automotive and avionics sectors) require the corresponding systems to operate in “zero failure” conditions, i.e., to prove practically immune to disturbances such as, for example, impulsive noise due to external radiation (cosmic rays, microwaves, ultraviolet rays, electromagnetic fields of various nature, etc.) and phenomena of internal migration of energy and of coupling between conductors (said phenomena being particularly important in the case of submicrometric technologies and for nanoscale design). One of the possible manifestations of “fault tolerance” is the capacity of a system to respond “gracefully” to an unexpected failure, whether of a hardware nature or of a software nature.
There exist, of course, various levels of fault tolerance. Among these, the lowest is the capacity to continue functioning in the case of a failure of the supply. Various computer systems with fault-tolerance characteristics operate according to a redundancy scheme: each operation is performed by two or more duplicated systems in such a way that, if one system is affected by a failure, the other system or systems can stand in to ensure continuity of proper operation. Said computers of a fault-tolerant type always present a certain level of duplication of hardware so that, if a component is affected by a failure, at least one duplicated component is able to stand in immediately without having to deactivate the computer. Computers with a high degree of fault tolerance can consequently be rather costly and complex to design.
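By way of purely illustrative example, the redundancy scheme outlined above may be sketched as follows; the Python code, the triple replication and the function names are hypothetical choices made here for illustration and are not taken from any of the cited documents. Three duplicated “systems” perform the same operation and a majority vote allows the result of a single faulty replica to be masked.

```python
def majority_vote(a, b, c):
    """Return the value agreed on by at least two of the three replicas."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no two replicas agree: unrecoverable fault")

# Three duplicated "systems" compute the same operation; the second replica
# is assumed here to be hit by a transient fault and returns a corrupt value.
replicas = [lambda x: x * x, lambda x: x * x + 7, lambda x: x * x]
results = [f(5) for f in replicas]
print(majority_vote(*results))  # prints 25: the faulty replica is out-voted
```

Triple (rather than double) replication is used in the sketch because, with only two replicas, a disagreement can be detected but not arbitrated.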
In applications such as control of nuclear power stations or piloting of aircraft (applications in which safety is crucial and absolute reliability is an indispensable need) the choice of solutions of this nature is in effect imperative. In other applications, it is instead possible to think of intermediate-level solutions, where, in the case of a failure of some component part, the system is able to continue to function, possibly with a reduced level of performance, without completely ceasing to function.

Solutions of this type are used in particular for computer-based systems for which, in the presence of some failure, a continuation of operation is acceptable albeit with a reduced throughput and/or an increase in the response times: in other words, in the presence of hardware and/or software problems, the system is not completely shut down, but a certain level of operation is in any case preserved. It is on this basis that certain systems operate, for example, systems for application in the automotive sector, which are designed to allow a motor vehicle to continue to move, perhaps at a lower speed, if one of the tires is punctured.

In said general framework, the concept of fault tolerance nowadays assumes a particular importance in sectors such as the biomedical sector or the automotive sector, on the basis of a paradigm that fundamentally envisages the presence of a certain degree of redundancy (i.e., the presence of more resources than are strictly necessary) in such a way as to allow a redundant resource to stand in for a resource affected by a failure. In the specific case of electronic circuits such as integrated circuits, fault-tolerance techniques are implemented mainly by replicating the system resources at least at critical nodes of a calculation chain and, at times, by replicating the entire structure, giving rise in practice to a parallel calculation/processing structure.
By way of example it is possible to cite solutions like the one described in US 2001/0034854 A1, which illustrates a processor that is able to execute the same instruction set simultaneously on two separate threads so as to produce an adequate level of fault tolerance. One thread is processed ahead of the other and allocates the readings not subjected to caching in a reading queue. The thread that operates with a delay performs the same readings, and then the two readings not yet cached are compared. If there is coincidence, one of the readings passes to the main memory of the system; otherwise, the presence of a failure is identified and a recovery procedure is started.
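The delayed dual-thread comparison described above may be sketched, in a deliberately simplified form, as follows. The Python code, the class name `LockstepChecker` and the modelling of the two threads as two method calls are assumptions made purely for illustration; the cited document describes a hardware mechanism, not this code.

```python
from collections import deque

class LockstepChecker:
    """Sketch of the delayed dual-thread check: the leading thread queues
    each non-cached reading; the delayed thread repeats the reading, and
    the two values are compared before one is passed to main memory."""

    def __init__(self, read_fn):
        self.read_fn = read_fn      # models an uncached memory read
        self.queue = deque()        # the "reading queue" of the leading thread
        self.main_memory = {}       # destination of the verified readings
        self.faults = []            # addresses for which recovery was started

    def leading_read(self, addr):
        value = self.read_fn(addr)
        self.queue.append((addr, value))
        return value

    def delayed_read(self, addr):
        value = self.read_fn(addr)
        queued_addr, queued_value = self.queue.popleft()
        if queued_addr == addr and queued_value == value:
            self.main_memory[addr] = value   # coincidence: commit the reading
            return value
        self.faults.append(addr)             # mismatch: start recovery
        return None

# Usage: a transient fault corrupting the source between the two readings
# is detected by the comparison.
bus = {16: 42}
checker = LockstepChecker(bus.__getitem__)
checker.leading_read(16)
checker.delayed_read(16)     # coincidence: 42 passes to main memory
checker.leading_read(16)
bus[16] = 7                  # transient fault between the two readings
checker.delayed_read(16)     # mismatch: recovery is started
```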
The solution described in U.S. Publication No. US 2004/0030953 A1 functions instead by storing a minimum code set in a protected memory in such a way that, if the programming process in the framework of the circuit is subject to a fault, the instruction set can be executed again starting from the protected field of the memory. In one embodiment, a series of multiplexers is provided for switching selectively between a normal code sequence and the protected one. A watchdog timer monitors the programming process within the circuit to determine possible faults in the development of processing of the instructions.
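The watchdog-and-protected-set mechanism may be illustrated with the following sketch. The Python code is an assumption made here for illustration only: the watchdog is modelled deterministically as a step counter rather than a timer, and the protected memory is modelled as an immutable tuple; the names `Watchdog`, `PROTECTED_SET` and `run_programming` are hypothetical.

```python
class Watchdog:
    """Deterministic watchdog: counts executed steps instead of real time."""
    def __init__(self, limit):
        self.limit = limit
        self.ticks = 0
    def expired(self):
        self.ticks += 1
        return self.ticks > self.limit

# Minimum code set held in "protected" memory (modelled as an immutable tuple).
PROTECTED_SET = ("init", "load_minimum_set", "run_safe_mode")

def run_programming(sequence, watchdog):
    """Execute the normal code sequence; if the watchdog expires, switch
    (as the multiplexers of the cited embodiment would) to the protected
    code set, from which execution can restart."""
    executed = []
    for step in sequence:
        if watchdog.expired():
            return list(PROTECTED_SET)   # re-execute from protected memory
        executed.append(step)
    return executed
```

With a generous limit the normal sequence completes; with a limit smaller than the sequence length, the protected set is returned instead, modelling the fallback.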
These solutions fit within the concept (which is very costly but, at least in principle, altogether safe) that is at times referred to as Evolvable HardWare (EHW). This is basically a design criterion of digital circuits inspired by concepts drawn from the biological sciences, which envisages a hardware organism with a number of layers or levels in which each cell contains the complete genotype of the circuit (see in this regard the article by M. Hartmann et al. “Evolution of fault-tolerant and noise-robust digital designs”, IEE Proc.-Comput. Digit. Tech., vol. 151, No. 4, July 2004).
In the application to electronic circuits, such as for example integrated circuits, fault-tolerance techniques aim at taking into account the noise of an impulsive type that afflicts said circuits and that, in particular, has the capacity of propagating through the circuit itself. One of the most widely known models of said impulsive noise (in particular, as regards integrated circuits or ICs) is represented by alpha particles. In effect, it has been shown that approximately 85% of the faults that can be found in a system can be caused by transient faults, with alpha particles at the origin of the transients that create the biggest trouble. Transient failures or faults are temporary ones (i.e., non-permanent ones) that are likely to arise in a circuit during its operation on account of the effect of various internal and external sources of noise. These failures or faults are intrinsically different from the failures or faults introduced in the course of production of a circuit (which generally prevent operation of the circuit in a stable way): the transient failures or faults act only for a short interval of time in the framework of a circuit that for the rest would function normally. In particular, in digital systems these failures or faults can be produced by internal sources of noise, such as for example supply transients and phenomena of capacitive and inductive crosstalk, or else by external sources of noise, such as for example the effect of particles or cosmic rays (such as precisely alpha particles).
The study of said phenomena of disturbance entails the use of mathematical models, such as for example a double exponential function. For a treatment in this regard reference may be made, for example, to the article by F. L. Yang et al.: “Simulation and Analysis of Transient Faults in Digital Circuits”, IEEE Journal of Solid-State Circuits, Vol. 27, No. 3, 1992.
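The double-exponential model mentioned above may be sketched as follows; the formulation i(t) = I_peak · (exp(−t/τ_fall) − exp(−t/τ_rise)) is a commonly used form of the model, but the specific parameter values used here are purely illustrative assumptions and are not taken from the cited article.

```python
import math

def transient_current(t, i_peak=1.0, tau_rise=5e-12, tau_fall=50e-12):
    """Double-exponential model of the current pulse injected by a
    particle strike (time constants here are illustrative only):
        i(t) = I_peak * (exp(-t / tau_fall) - exp(-t / tau_rise))
    The pulse rises with time constant tau_rise and decays with tau_fall."""
    return i_peak * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))

# Sampling the pulse: zero at t = 0, rising quickly, then decaying slowly.
for t in (0.0, 20e-12, 100e-12, 500e-12):
    print(f"t = {t:8.1e} s  ->  i(t) = {transient_current(t):.4f}")
```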
Even though an alpha particle generates a pulse of very short duration, the corresponding effect can be to a certain extent amplified by the phenomena of delay of the internal gates of the circuit. Consequently, an effect of noise that initially would not be particularly harmful for operation of a complex digital circuit can become an important source of disturbance after being propagated through the logic gates of the circuit (for example NAND, NOR gates, or logic inverters).
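The widening of a short pulse as it traverses a chain of gates may be illustrated with the following toy sketch. The inertial-delay model used here, and in particular the assumption that the asymmetry between rise and fall delays stretches the pulse at each stage, is a deliberate simplification introduced for illustration and is not taken from the source.

```python
def propagate_glitch(width_ps, gates):
    """Toy inertial-delay model: each gate is a pair (rise_ps, fall_ps).
    A pulse narrower than the gate's shorter delay is absorbed; a wider
    pulse is stretched by the asymmetry between rise and fall delays."""
    for rise_ps, fall_ps in gates:
        if width_ps < min(rise_ps, fall_ps):
            return 0.0                     # glitch filtered by the gate
        width_ps += fall_ps - rise_ps      # delay asymmetry widens the pulse
    return width_ps

# A chain of five inverters with asymmetric delays: a 12 ps glitch that
# would be harmless at the input grows at each stage, while a 5 ps glitch
# is absorbed by the first gate.
chain = [(10.0, 14.0)] * 5
print(propagate_glitch(12.0, chain))  # prints 32.0
print(propagate_glitch(5.0, chain))   # prints 0.0
```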
For completeness of treatment, it may again be mentioned that there are in themselves known circuits that envisage detection of the delays of propagation of signals within integrated circuits, with feedback functions of various nature. In this regard, reference may be made to the following documents:
D. Ernst et al.: “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation”, Proceedings of the 36th International Symposium on Microarchitecture (MICRO-36 '03);
D. Blaauw et al.: “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance”, 2008 IEEE International Solid-State Circuits Conference;
S. Lee et al.: “Reducing Pipeline Energy Demands with Local DVS and Dynamic Retiming”, ISLPED '04, Aug. 9-11, 2004, Newport Beach, Calif., USA.