The general problem is the reliability of multiprocessor systems, in whose logic transient faults can occur and lead to failures. For example, a fault may occur in the logic of one of the processors. Such transient faults may be due to temporary disruptions such as incident neutrons or protons, to radiation such as gamma radiation, or to inductive noise on the power supply. Indeed, current multiprocessor technologies are increasingly sensitive to such disruptions because of the ever higher level of integration, in terms of both the surface density of transistors and the total number of transistors. In order to support critical applications requiring a high level of reliability, it is desirable to guard against these transient faults, which may be propagated to memory.
In an attempt to solve the problems associated with transient faults, techniques based on memory Error Correcting Codes (ECC) have been developed. These techniques are flexible, since the correction power of the code can be adapted to the targeted environmental conditions and the expected level of reliability. In addition, they are easy to implement, since the coder/decoder is shared by all memory locations, which keeps the silicon area overhead devoted to control low. A major drawback of these techniques is that they are applicable only because of the regularity of typical memory structures. Unfortunately, processor logic (as opposed to memory) does not offer such regularity.
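By way of illustration only, the principle of an error correcting code can be sketched with a Hamming(7,4) code, which corrects any single flipped bit in a 7-bit word. This code is chosen here as a familiar textbook example; it is not stated to be the code used in any particular memory, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(word):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit (0 = no error).
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The same encoder and decoder serve every memory word, which is why the approach fits regular memory arrays and does not carry over directly to irregular processor logic.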
Other approaches have been explored in an attempt to enhance the reliability of the logic of multiprocessor systems, notably approaches based on spatial duplication, approaches based on multisampling, and checkpointing-oriented approaches.
Approaches based on spatial duplication exist in several variants, but the common idea is to perform the desired calculation simultaneously on several identical logic circuits and to react if a difference is observed between their outputs. One variant consists in having two instances of the circuit to be protected, together with a detection mechanism on at least one of the instances for determining which of the two has suffered the error. This spatial duplication variant, however, has several drawbacks. First, the logic has to be duplicated, and as soon as a transient error occurs the two instances diverge, which requires adding a system for resynchronizing them. In addition, error detection is on the critical path of the data stream, which is detrimental to performance and requires a very fast detector to be chosen, at the expense of its complexity and its error coverage.
Another variant is to have three instances in parallel and a majority vote at the output. This method avoids placing a detector in one of the instances, as in the two-instance method described above, but it places a majority voter on the critical path of the data stream, which again is detrimental to performance. In addition, tripling the logic is very expensive in silicon area.
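As a minimal software sketch of the three-instance variant, the majority vote can be modeled as a bitwise function over three output words. The names `majority_vote` and `run_tmr`, and the modeling of a transient fault as an XOR mask on one replica, are assumptions made purely for illustration; an actual voter would be a hardware circuit.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three integer words: each output bit takes the
    value held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(compute, x, fault_mask=0):
    """Run the same computation on three replicas and vote on the result.

    fault_mask models a transient fault by flipping bits in the output of
    one replica only (a hypothetical fault-injection hook).
    """
    outputs = [compute(x), compute(x) ^ fault_mask, compute(x)]
    return majority_vote(*outputs)
```

For instance, `run_tmr(lambda v: v + 1, 41, fault_mask=0xFF)` still yields 42, since the two fault-free replicas outvote the corrupted one. Note that every result must pass through `majority_vote` before it can be used, which mirrors the voter sitting on the critical path.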
Approaches based on multisampling consist in replacing all the flip-flops of a circuit with special flip-flops that sample the signal several times. Statistically, in the event of a temporary disruption and if the system is properly dimensioned, i.e. if its operating frequency is not too high, there is little chance that an error will affect all the samples. There are basically two variants of multisampling: pre-sampling and post-sampling. In both cases, these methods are expensive in silicon area, and fault tolerance is only partial and difficult to achieve.
Indeed, a major drawback of pre-sampling is that it limits the operating frequency of the system and hence its performance. On the other hand, in the event of divergence between the samples, the second sample is statistically more likely to be correct, since many transient faults result in increased latency. Pre-sampling is therefore a method of fault detection and probable fault tolerance.
In the case of post-sampling, by contrast, the fault can only be detected, not tolerated. This is one of its major drawbacks.
Finally, checkpointing-oriented approaches consist notably in periodically saving the data of the monitored system in a storage memory, so that they can be reused later if needed to recover the system state. In the rest of the present application, the term “checkpointing approach” or “checkpointing system” will be used to designate a checkpointing-oriented approach or a system implementing such an approach, and all the data stored in a storage step implemented as part of a checkpointing approach will be referred to simply as a “checkpoint”. Checkpointing approaches can be used to return the monitored system to a state prior to the occurrence of the fault and all its consequences. In order to create a system tolerant to transient faults in logic, it is further necessary to combine the checkpointing system with fault or error detectors. The checkpointing approach then assumes that the monitored system has not suffered any fault, and detection is performed in parallel with the function of the monitored block. Detection is then said to be “outside the critical path”, which maximizes performance for as long as it remains possible to cancel actions. If the assumption that the monitored system has operated properly proves correct, the system simply continues its execution. Otherwise, the monitored system stops its operation and is restored to a fault-free state, prior to the fault and all its consequences.
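The optimistic scheme described above can be sketched as follows, under simplifying assumptions: the monitored system is a Python dictionary, the detector is a caller-supplied predicate evaluated only at checkpoint boundaries (standing in for detection outside the critical path), and a transient fault does not recur when the steps are replayed. All names (`run_with_checkpoints`, `make_step`, `fault_detected`) are hypothetical.

```python
import copy


def run_with_checkpoints(state, steps, fault_detected, period):
    """Execute steps optimistically; every `period` steps, either commit a
    new checkpoint or roll back to the last one and replay.

    Returns the final state and the number of rollbacks performed.
    Assumes faults are transient; a permanent fault would loop forever.
    """
    checkpoint = copy.deepcopy(state)  # last state known to be fault-free
    start = 0                          # index of the first step after it
    i = 0
    rollbacks = 0
    while i < len(steps):
        steps[i](state)                # no check on the critical path
        i += 1
        if i % period == 0 or i == len(steps):
            if fault_detected(state):
                state = copy.deepcopy(checkpoint)  # cancel all actions
                i = start
                rollbacks += 1
            else:
                checkpoint = copy.deepcopy(state)  # commit
                start = i
    return state, rollbacks


def make_step(delta, faults):
    """Build a step that adds delta; `faults` holds bit masks, each of which
    corrupts the value once and is then consumed (transient fault)."""
    def step(state):
        state["value"] += delta
        state["shadow"] += delta       # redundant copy used by the detector
        if faults:
            state["value"] ^= faults.pop()
    return step
```

With four steps adding 1, 2, 3 and 4, a single bit flip injected in the second step, a period of 2 and a detector comparing `value` with `shadow`, the run performs one rollback and ends with a value of 10.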
Checkpointing approach variants are distinguished firstly by the extent of their recovery capacity. For example, some checkpointing systems are limited to the scope of a single processor; the term “rollback” is then used. In this case, it is possible to undo incorrect actions within the processor, but actions outside the processor, such as reads and writes to the memory space, cannot be canceled. This checkpointing approach must therefore be combined with fault or error detectors with very low latency, possibly at the expense of detection coverage. Other checkpointing systems cover a wider scope than the single processor. This tolerates a high detection latency and makes it possible to maintain high performance, since detection is performed outside the critical path.
Checkpointing approach variants are also distinguished by their control policy. In a multiprocessor system with several memory modules, each processor and each memory module manages its own control independently, whether for verification or for storage. The global checkpointing policy may then vary from one system to another: it may be coordinated or uncoordinated.
Coordinated approaches create global, coordinated checkpoints for the whole system. Checkpoints are thus consistent by construction and therefore rapidly obsolete, which tends to reduce the number of checkpoints stored simultaneously and thus the volume of storage. However, when a component or application requires a checkpoint, it draws the whole system into this decision. While this behavior is acceptable in simple contexts, e.g. when there are few processors and a few independent applications, it becomes unacceptable when the system increases in complexity, e.g. in multiprocessor and/or multi-application cases. Thus, this coordinated approach easily leads to a situation where the “global worst case” has to be managed, i.e. where the cost of synchronization (in memory and performance) becomes predominant because checkpoints become very frequent, and where at the same time the checkpoints to be stored are very bulky because they are global.
Conversely, an uncoordinated checkpoint policy is possible. In this approach, checkpoints are created at the most appropriate times, in an uncoordinated way, on the various components of the monitored system. If recovery proves necessary, a set of checkpoints must be determined, more specifically one checkpoint per component, that has the property of consistency as described by K. Mani Chandy and Leslie Lamport in “Distributed Snapshots: Determining Global States of Distributed Systems” (ACM Transactions on Computer Systems, Vol. 3, No. 1, February 1985, Pages 63-75). In an extreme case, if no consistent set of checkpoints can be found, the rollback state falls back to the initial state of the system, through what is known as the “domino effect”. The advantages of this uncoordinated approach are that the checkpoints are chosen in a targeted way per component, which generates less overhead in synchronization and local checkpointing. In addition, the storage of checkpoints is globally less bulky. Finally, there is no “global worst case” effect typical of the coordinated approach. On the other hand, checkpoints are not consistent by construction, which makes the obsolescence of checkpoints slow or nonexistent, and in any case difficult to determine. This means that the volume of storage is a priori unbounded, which is problematic, especially in embedded systems. The suitability of this approach is thus closely tied to the application context, which remains a major drawback.
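The search for a consistent set of uncoordinated checkpoints, and the domino effect, can be illustrated with a simplified model that is an assumption of this sketch rather than a method taken from the cited paper. Each component numbers its local events 1, 2, 3, ...; a checkpoint taken at position c covers all local events up to and including c; position 0 denotes the initial state. A set of checkpoints is treated as consistent when it contains no message recorded as received but not recorded as sent. The function name `recovery_line` and the message tuple layout are hypothetical.

```python
def recovery_line(checkpoints, messages):
    """Find the most recent consistent set of checkpoints.

    checkpoints: dict mapping component -> list of local checkpoint positions
    messages: list of (sender, send_pos, receiver, recv_pos) tuples, with
              positions being local event numbers starting at 1
    Returns a dict mapping component -> chosen checkpoint position
    (0 means the component falls back to its initial state).
    """
    # Start from the latest checkpoint of every component.
    line = {comp: max([0] + list(cps)) for comp, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_pos, receiver, recv_pos in messages:
            sent_after = send_pos > line[sender]
            received_before = recv_pos <= line[receiver]
            if sent_after and received_before:
                # Orphan message: the receiver must roll back further, to
                # its latest checkpoint taken before the reception.
                earlier = [c for c in [0] + list(checkpoints[receiver])
                           if c < recv_pos]
                line[receiver] = max(earlier)
                changed = True
    return line
```

For example, with one checkpoint at position 3 on each of two components P and Q, a message sent by P at position 1 and received by Q at position 2, and a message sent by Q at position 4 and received by P at position 2, each rollback invalidates the other component's checkpoint and the result is `{"P": 0, "Q": 0}`, i.e. the initial state.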