1. Field of the Invention
The disclosed technology relates to the field of mitigation techniques in systems prone to soft errors or data corruption of any other nature.
2. Description of the Related Technology
Modern technical systems such as household goods, DVD players, PCs, medical X-ray imaging, printers, advanced cars and airplanes rely increasingly on intelligence realized with software running on embedded computing hardware. Embedded computer programs monitor the whole system and take care that the system accomplishes more than its parts would. In such software-intensive systems reliability is of prime importance. Growing complexity through increased feature integration, shorter product lifetimes and a trend towards increasingly open systems create a need for better development methods to ensure that products remain reliable.
There is continuous pressure to improve the user-perceived functional reliability of such consumer electronic products, especially for streaming-type applications. This should result in minimizing the number of product failures exposed to the user. For most systems that are centralized and not consumer-related, this mitigation is quite feasible, as the tolerated impact on cost and performance (real-time, area and energy related) can be relatively large. In the consumer market, however, the cost-sensitivity and real-time nature prohibit this. Moreover, in portable systems the battery lifetime is also critical, so energy overhead should be reduced wherever possible. That context makes the functional reliability mitigation particularly challenging.
System-level soft errors occur when the data being processed is hit with a noise phenomenon, typically when the data is on a data bus. The system tries to interpret the noise as a data bit, which can cause errors in addressing or processing program code. A soft error will not damage a system's hardware; the only damage is to the data that is being processed. The bad data bit can even be saved in memory and cause problems at a later time. Soft errors are most commonly known in semiconductor storage.
This disclosure, however, does not consider soft errors only. It deals in general with data corruption of any nature affecting reliability. Other possible causes of transient and intermittent errors can be supply noise variation, temperature-induced variations, aging/degradation effects (due to e.g. bias temperature instability (BTI), time-dependent dielectric breakdown (TDDB) both in high-k devices and low-k wires, hot carrier injection (HCI), soft oxide breakdown (SBD), random telegraph noise (RTN), electro-migration in wires, ...), etc.
A field wherein techniques for reducing impact on reliability are particularly relevant is that of real-time multi-media or wireless streaming applications. They form an important target market with a volume of hundreds of millions of chips per year, which can easily motivate a targeted domain-specific approach. The entire considered target application domain is much broader though, encompassing all applications with sufficient data storage (data or loop dominated processing) and exhibiting at least one design constraint (like real-time processing). This includes especially the multimedia, wireless, biomedical and automotive subdomains, but it is not limited thereto. The reliability mitigation is especially crucial when IC feature size is shrunk to levels where such degradations and transient errors have a noticeable effect on the correct functional operation of embedded systems-on-chip (SoCs). That shrinking is essential to continue the necessary cost reduction for future consumer products. Several studies indicate that one is very close to such alarming levels for both aging related faults (e.g. BTI/RTN, HCI and TDDB below 20 nm) and Soft Errors (SE), even for more than single-event upset (SEU) disturbances. The largest impact of these SEs is clearly situated in the on-chip SRAM memories for storage of both data and instructions. Conventionally a fault rate of error occurrence per time unit is then defined. One is mainly interested in the impact on the system level outputs, so the fault rate is defined as the number of faults induced at the system-level outputs for a specified unit of time. That is typically based on a statistical model of the fault induced at the memory cell level. For the current technologies, the impact on individual registers and logic is still relatively low. The main concern for the SRAM storage relates to the functional requirements. Similar observations can be made for aging related effects due to BTI and so on. These, too, mainly affect the on-chip memories.
Because of the area or performance overhead induced, many SoC developers for embedded markets then prefer to risk the impact of faults on their products. This leads however to failures during the life-time usage and to customer complaints. Traditional microprocessors are indeed quite vulnerable to such errors. Ignoring the failures also causes a large indirect cost on the producers. For soft errors that is not a desirable route, e.g. for the analysis of an automotive application. This motivates the search for schemes that are both safe and cost-effective. Several micro-architectural or software schemes with a potentially much smaller hardware overhead have been proposed. None of these existing approaches is really well suited for hard real-time systems where the performance and energy overhead at run-time has to be strictly controlled.
Several main options for mitigation can be identified in the prior art solutions: detection and correction of errors can basically be performed at hardware platform level, at middleware or micro-architectural level or at application level.
(On-Line) Error Detection and Correction at Hardware Platform Level
Most designers of embedded real-time systems traditionally rely on hardware to overcome the SRAM reliability threat. That typically involves error detection and correction codes (ECC). Such solutions meet in principle all constraints, but the hardware cost (area, energy, latency increase) is usually (too) high. That is especially so in case of distributed platforms, which are anyway necessary for energy-sensitive embedded systems. Hence, for energy- and cost-sensitive systems, these hardware schemes are not that attractive. In many cases manufacturers prefer to leave them largely out. In the best case they only protect part of their memory organization with the ECC hardware. Sometimes also hardware schemes are used that do not store the ECC redundancy codes in the same memory as the bits that are to be protected, to avoid modifications in the memory array. For instance, checksums are introduced with storage of the redundant information in separate locations, augmented with a hardware-supported protocol to protect the memories from soft error impact.
Also in the logic and register domain hardware schemes have been proposed to deal with functional failure correction (e.g. for soft errors). They modify the circuit to become more robust or they provide fault tolerance, e.g. by hardware sparing augmented with dynamic voltage scaling (DVS) to reduce the energy overhead, or they modify the scheduling, e.g. where the soft error rate of the registers in the processor is optimized. They are compatible with hard real-time system requirements, but these approaches are not suited to deal with memories. Either they are not applicable for that purpose or too high an overhead would be incurred.
(On-line) Error Detection and Correction at Middleware or Micro-Architectural Level
This is relatively easily feasible when a QoS strategy is allowed. In the literature several schemes of this class have been described with a potentially much smaller hardware overhead. A good example is where a checkpointing mechanism is embedded in the code executed on the microprocessor pipeline. When an error is detected, recovery occurs by “relying on the natural redundancy of instruction-level parallel processors to repair the system so that it can still operate in a degraded performance mode”. Obviously, that is only acceptable if the system is tolerant to such degradation. In “HARE: hardware assisted reverse execution” (I. Doudalis et al., Proc. Int'l. Symp. on High-Perf. Comp. Arch. (HPCA), Bangalore, pp. 107-118, January 2010) the checkpointing is assisted with hardware to reduce the overhead to about 3% on average. Even then much larger peaks can occur. In “Soft error vulnerability aware process variation mitigation” (X. Fu et al., Proc. Intl. Symp. On High-Perf. Comp. Arch. (HPCA), Bangalore, pp. 93-104, January 2010) the focus lies on soft error mitigation for the registers in the processor by micro-architectural changes. When functional and/or timing constraints have to be fully met, however, the incurred overhead will still be too high, because it can come at the “wrong” moment, just before a deadline. In that hard real-time case, it is only feasible when also the application level is exploited.
(On-line) Error Detection and Correction at Application Level
This starts from a deterministic fully defined algorithmic functionality between inputs and outputs. Although many “pure software” approaches have been proposed in the prior art, they all rely either on the possibility to degrade the functionality or on timing constraints not being fully guaranteed. In particular, they use some form of time redundancy or checkpointing to detect failures and a full rollback mechanism to recover. As a result, in prior art techniques the program tasks may be duplicated on an SMT (Simultaneous Multithreading) multi-core platform with some delay buffering, so that functional errors can be detected. If they occur, the software “works around the errors”. In several cases, pure software schemes based on source code transformations are applied. Another approach modifies the allocation and scheduling algorithm to introduce time redundancy in the synthesized or executed code. Changes in the mapping (especially allocation and scheduling) of the algorithm to the multi-core platform have also been proposed in order to make the execution more robust to failures.
It is clear that in such approaches both the detection and the correction at software level can take considerable cycle overhead once activated. That can also happen at the wrong moment in time, just before a hard deadline occurs. So they again assume a best-effort quality-of-service (QoS) approach, which is feasible for most microprocessor contexts.
In order to deal with failures, prior art solutions have relied on the use of several checkpoints and rollback recovery. For example, a delayed commit and rollback mechanism has been proposed to overcome soft errors resulting from different sources such as noise margin violations and voltage emergency occurrence. Data stored in the processor pipeline are divided into two different states, namely noise-speculative and noise-verified. Moreover, the solution relies on a violation detector that has a time lag (D) to detect a margin violation. If a data value is in the noise-speculative state for a time period D and no violation is detected, it is considered noise-verified (correct data). Otherwise, it is considered faulty and a rollback to the last verified checkpoint is performed, flushing all noise-speculative states. This approach has a performance loss that reaches 18%, and the memories are assumed to be fault-tolerant, so this technique cannot mitigate memory-based faults.
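The delayed commit and rollback behaviour described above can be illustrated with a minimal sketch. It is a simplified software model under stated assumptions, not the implementation of the cited mechanism: the class name DelayedCommitBuffer, its methods and the integer time base are hypothetical, and the violation detector is reduced to a boolean flag supplied by the caller.

```python
class DelayedCommitBuffer:
    """Toy model of noise-speculative vs. noise-verified data (hypothetical names)."""

    def __init__(self, lag):
        self.lag = lag            # detector time lag D, in abstract time units
        self.verified = []        # noise-verified values (considered correct)
        self.speculative = []     # (write_time, value) pairs still noise-speculative

    def write(self, now, value):
        # Every newly produced value starts in the noise-speculative state.
        self.speculative.append((now, value))

    def tick(self, now, violation_detected):
        """Advance to time `now`; return the number of flushed speculative values."""
        if violation_detected:
            # Rollback: all noise-speculative state is flushed, verified data is kept.
            flushed = len(self.speculative)
            self.speculative.clear()
            return flushed
        remaining = []
        for written_at, value in self.speculative:
            if now - written_at >= self.lag:
                # Speculative for a full period D without a violation: promote.
                self.verified.append(value)
            else:
                remaining.append((written_at, value))
        self.speculative = remaining
        return 0


buf = DelayedCommitBuffer(lag=2)
buf.write(0, "a")
buf.tick(1, violation_detected=False)   # "a" is still noise-speculative
buf.tick(2, violation_detected=False)   # "a" becomes noise-verified
buf.write(3, "b")
flushed = buf.tick(4, violation_detected=True)  # "b" is flushed, "a" survives
```

The sketch only tracks pipeline data; like the mechanism it models, it says nothing about faults that originate in the memories themselves.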
Some practical examples are now provided to illustrate the problems encountered in the prior art, that certain inventive aspects of this disclosure aim to solve. FIG. 1 represents a task graph of this simple illustrative example for mitigation. One assumes a task with three internal subtasks (blocks) with 400, 600 and 300 cycles, respectively, consuming 1300 cycles in total. The input data contains 16 words, the output data 24 words for each of the blocks in the task. The internally required task data amounts to 200 parameters (e.g. filter coefficients) per set and 520 data words. In the traditional mitigation approaches they both have to be fully protected by hardware ECC, which involves too much area and energy overhead. Alternatively, a software mitigation approach would be needed, which however would not fulfill the stringent timing requirements of a real-time application (see further motivation below).
In the above example the conventional hardware based mitigation reference would store everything in a single layer of two L2 memories, namely one for the parameters (L2P) and one for the data (L2D), e.g. L2P=400 words for two coefficient sets and L2D=520+16+24=560 words. Both of these L2 memories are fully equipped with detection and correction hardware, which leads to a relatively large area, access delay and access energy penalty. This is very hard to motivate in a cost-sensitive consumer application.
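The sizing of this reference organization follows directly from the figures of the example and can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the text; the constant and function names are chosen here for illustration.

```python
# Figures taken from the illustrative example in the text.
SUBTASK_CYCLES = [400, 600, 300]   # cycles of the three internal subtasks
PARAMS_PER_SET = 200               # parameters (e.g. filter coefficients) per set
COEFFICIENT_SETS = 2               # two coefficient sets are stored in L2P
INTERNAL_DATA_WORDS = 520          # internally required task data
INPUT_WORDS = 16                   # input data words
OUTPUT_WORDS = 24                  # output data words


def l2_sizes():
    """Return (L2P, L2D) sizes in words for the single-layer reference."""
    l2p = PARAMS_PER_SET * COEFFICIENT_SETS
    l2d = INTERNAL_DATA_WORDS + INPUT_WORDS + OUTPUT_WORDS
    return l2p, l2d


total_cycles = sum(SUBTASK_CYCLES)
l2p_words, l2d_words = l2_sizes()
print(f"task cycles: {total_cycles}, L2P: {l2p_words} words, L2D: {l2d_words} words")
```

In the hardware-based reference all of these words sit behind detection and correction hardware, which is the source of the area, delay and energy penalty discussed above.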
In order to avoid the costly hardware ECC protection, one can opt for a pure software scheme. In that case one needs checkpointing based on e.g. a duplication of the tasks (time redundancy) and, when an error is detected, an appropriate rollback mechanism is activated. However, this approach incurs a considerable cycle overhead and does not provide any guarantee that all the real-time deadlines will be fully met. The program will also be extended, which costs some space in the external program memory. Depending on how that is implemented, that cost can also become relevant to consider.
A target application model is now described. Target applications split into different tasks Ti (i=1, 2, . . . , k) can be represented by data flow graphs. FIG. 2 shows two examples of application task graphs, namely aperiodic (FIG. 2a) and periodic task graphs (FIG. 2b), where the task execution sequence is repeated every period. If the data generated from task Ti is faulty, the recomputation of the affected data is a promising software-based solution. However, recomputation is a time consuming process for traditional mitigation approaches, implying significant timing and energy overheads. As a result, the quality-of-service (QoS) of the application is degraded significantly, either by violating the deadline to get error-free data or by discarding the erroneous data to maintain the timing constraint.
For example, consider that the application of the periodic task graph in FIG. 2b is running and let Tij denote the execution of task Ti at period j. Assume that task Ti consumes Ni cycles to be completed and that the data (DK) generated in task Tij at cycle K (K∈[1,Ni]) is dependent on the data computed at cycles M, 1≤M<K. If DK is faulty and this is only detected at the end of the task execution (worst-case scenario) at cycle Ni, the whole task must be restarted, and another Ni cycles are needed to recompute this data, which may lead to deadline violation of that task. FIG. 3 shows an example in which task T12 has an error. In this example, the recomputation of the whole task causes a large timing overhead that leads to deadline violation of this task. This example demonstrates that complete software-based mitigation techniques are inadequate for applications that do not have enough slack for a complete task rollback and recomputation.
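The timing argument can be made concrete with a small sketch. It assumes a single error per task execution and a full restart upon detection; the function names and the deadline of 2000 cycles are hypothetical and chosen only for illustration, while the 1300 cycles are the task length of the earlier example.

```python
def cycles_with_full_restart(task_cycles, detect_at=None):
    """Cycles consumed by one task execution with at most one detected error.

    detect_at is the cycle at which the error is detected (None: no error).
    On detection the whole task is restarted and runs for task_cycles again.
    """
    if detect_at is None:
        return task_cycles
    if not 1 <= detect_at <= task_cycles:
        raise ValueError("detect_at must lie in [1, task_cycles]")
    return detect_at + task_cycles


def meets_deadline(task_cycles, deadline, detect_at=None):
    """True if the task, including any full restart, finishes by the deadline."""
    return cycles_with_full_restart(task_cycles, detect_at) <= deadline


N_I = 1300        # task length in cycles, as in the earlier example
DEADLINE = 2000   # hypothetical deadline, assumed for illustration only

error_free = cycles_with_full_restart(N_I)                 # Ni cycles
worst_case = cycles_with_full_restart(N_I, detect_at=N_I)  # 2 * Ni cycles
print(error_free, meets_deadline(N_I, DEADLINE))
print(worst_case, meets_deadline(N_I, DEADLINE, detect_at=N_I))
```

With detection only at cycle Ni the execution time doubles, so any deadline that leaves less than Ni cycles of slack is violated.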
The paper “System-level analysis of soft error rates and mitigation trade-off explorations” (Zhe Ma et al., Proc. Intl Reliability Physics Symp. (IRPS), Anaheim, Calif., pp. 1014-1018, May 2010) presents a system-level analysis of soft error rates based on a Transaction Level Model of a targeted system-on-chip. A transient error (e.g. soft error) analysis approach is proposed which allows accurately evaluating the use of selective protection of system-on-chip SRAM memory.