1. Field of the Invention
The present invention relates to a data processing apparatus and method for analysing transient faults occurring within storage elements of a data processing apparatus.
2. Description of the Prior Art
Many modern data processing systems require mechanisms to be put in place to detect occurrences of faults. Some defects, such as an undesired short or open circuit, will cause a permanent fault within the system. However, not all faults are permanent in nature. In particular, transient faults, also called single event upsets (SEUs), may occur due to electrical noise or external radiation. Radiation can, directly or indirectly, induce localised ionisation events capable of upsetting internal data states. While the upset causes a data error, the circuit itself is undamaged and the system experiences a transient fault. The resulting corruption of data is called a soft error, and detection of soft errors is of significant concern in many applications.
A data processing apparatus will typically include one or more storage structures containing storage elements, and an SEU may cause a flip in the state of a storage element, hence giving rise to a transient fault. Given the sources of radiation that can give rise to SEUs in a typical working environment of a data processing system, the SEUs are random in nature. Further, given the high clock speeds at which data processing systems typically operate, many clock cycles will typically elapse between each transient fault caused by an SEU.
A number of known techniques have been developed for providing resilience to such random transient faults within a data processing apparatus. For example, error detection mechanisms such as parity checking mechanisms can be used to detect whether a data value has changed between the time it was written into a storage element and the time it is read out from that storage element, such a change indicating a transient fault. Once a transient fault has been detected, a suitable corrective action can be taken, for example flushing the relevant parts of the data processing apparatus and re-executing instructions where required.
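By way of illustration only, the parity checking approach described above can be sketched as follows. The function names (`parity_bit`, `write_word`, `read_word`, `flip_bit`) and the use of a single even-parity bit per word are assumptions made for this sketch and are not taken from any particular prior art implementation.

```python
def parity_bit(value: int) -> int:
    """Return the even-parity bit of a non-negative integer (1 if the count of 1 bits is odd)."""
    return bin(value).count("1") % 2


def write_word(value: int) -> tuple:
    """Model a write: store the value together with the parity computed at write time."""
    return (value, parity_bit(value))


def read_word(stored: tuple) -> tuple:
    """Model a read: recompute parity and flag a fault if it differs from the stored parity."""
    value, stored_parity = stored
    fault_detected = parity_bit(value) != stored_parity
    return (value, fault_detected)


def flip_bit(stored: tuple, bit: int) -> tuple:
    """Model an SEU: flip one bit of the stored value, leaving the parity bit untouched."""
    value, stored_parity = stored
    return (value ^ (1 << bit), stored_parity)


stored = write_word(0b10110010)
clean_value, clean_fault = read_word(stored)

upset = flip_bit(stored, 3)             # single bit flip between write and read
_, single_flip_fault = read_word(upset)

double_upset = flip_bit(flip_bit(stored, 0), 1)
_, double_flip_fault = read_word(double_upset)
```

In this sketch a single flipped bit is detected on read, whereas two flipped bits in the same word cancel out and go undetected, which is a known limitation of a single parity bit. Parity only detects the fault; recovery, for example by flushing and re-executing, must be handled separately.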
In addition to simple fault detection mechanisms, a number of known techniques have been developed which not only detect a random transient fault, but also correct that fault in situ so as to provide improved performance. For example, error correction code (ECC) mechanisms can be used to detect and correct such errors, and indeed ECC mechanisms are often used in memory arrays for this purpose. Other known techniques include redundancy techniques such as dual modular redundancy (DMR) or triple modular redundancy (TMR), which are often used in association with storage elements provided within a data path of a data processing system. These redundancy techniques can be implemented as spatial redundancy or time redundancy, or a combination of both.
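By way of illustration only, the difference between DMR and TMR in the spatial redundancy case can be sketched as below. The sketch assumes that each redundant storage element holds an integer word and that TMR correction uses a bitwise majority vote; the names `dmr_check`, `tmr_vote` and `tmr_read` are hypothetical.

```python
def dmr_check(a: int, b: int) -> bool:
    """DMR: two copies can reveal that a fault occurred, but not which copy is correct."""
    return a != b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority vote, each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


def tmr_read(copies: tuple) -> tuple:
    """Return the voted value and a flag indicating whether the copies disagreed."""
    a, b, c = copies
    fault_detected = not (a == b == c)
    return (tmr_vote(a, b, c), fault_detected)


original = 0xA5
# Model an SEU flipping bit 4 in the second of three redundant copies.
copies = (original, original ^ 0x10, original)

dmr_fault = dmr_check(copies[0], copies[1])
voted_value, tmr_fault = tmr_read(copies)
```

Under these assumptions the DMR comparison reports a mismatch without being able to correct it, whereas the TMR vote both reports the disagreement and returns the original value, provided that only one of the three copies has been upset.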
Such fault detection and correction circuits can be embodied in a variety of ways, but in one particular example may take the form of a single event upset (SEU) tolerant flip-flop such as discussed in commonly owned U.S. Pat. No. 7,278,080, the entire contents of which are hereby incorporated by reference, this patent describing a design technique sometimes referred to as “Razor”. A further paper that describes the Razor technique is “Razor II: In-Situ Error Detection and Correction for PVT and SER Tolerance”, IEEE Journal of Solid-State Circuits (JSSC), Volume 44, No. 1, January 2009.
Whilst transient faults occur randomly due to the environment in which the data processing system is operated, it is also known for transient faults to be injected in a more systematic manner in order to seek to compromise the security of a data processing system. For example, the security of processor cores and cryptographic engines can be compromised by attackers deliberately injecting faults. By such an approach, a fault injected into a secure chip can expose security keys and/or secure data to the attacker. Known fault injection attack techniques are described in a number of prior art articles, such techniques including optical fault attacks using a laser gun, flash camera, X-rays and ion beams, as for example discussed in the article by S. P. Skorobogatov and R. J. Anderson, entitled “Optical fault induction attacks”, Proceedings of the 4th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), 2002. Some other articles that provide a useful summary of fault injection attacks are:
R. Leveugle, “Early Analysis of Fault-Based Attack Effects in Secure Circuits”, IEEE Transactions on Computers, Vol. 56, No. 10, October 2007.
Dan Boneh, Richard A. DeMillo, and Richard J. Lipton, “On the Importance of Checking Cryptographic Protocols for Faults”, Advances in Cryptology—EUROCRYPT '97, International Conference on the Theory and Application of Cryptographic Techniques, May 1997.
O. Kömmerling and M. G. Kuhn, “Design principles for Tamper-Resistant Smartcard Processors”, Proceedings of the USENIX Workshop on Smartcard Technology, 1999.
B. M. Gammel and S. J. Rüping, “Smart Cards Inside”, Proceedings of the 31st European Solid-State Circuits Conference (ESSCIRC), 2005.
Agoyan, M.; Dutertre, J.-M.; Mirbaha, A.-P.; Naccache, D.; Ribotta, A.-L.; Tria, “A Single-bit DFA using multiple-byte laser fault injection”, IEEE International Conference on Technologies for Homeland Security (HST), 2010.
Attackers typically use transient fault injection sources in a coordinated way by scanning the surface of a chip in either its de-packaged or packaged state. When a photon or ion from such a source hits a storage cell, it may flip the bit stored in the cell, causing a transient fault.
As pointed out in the article by J. McDermott, A. Kim and J. Froscher, entitled “Merging Paradigms of Survivability and Security: Stochastic Faults and Designed Faults”, Proceedings of the 2003 Workshop on New Security Paradigms, 2003, there have until now been two communities dealing with the issue of faults, namely the reliability and fault tolerance community, and the security community. Whilst both communities consider how to deal with faults, they have traditionally had strikingly different views of the types of fault that exist, the way those faults are modelled, and how they are addressed. The above article proposes that security and fault tolerance should be considered together.
The article by H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan, entitled “The Sorcerer's Apprentice Guide to Fault Attacks”, Proceedings of Workshop on Fault Detection and Tolerance in Cryptography, June 2004, describes the use of spatial hardware and software redundancy techniques from the fault tolerance community, for example DMR and TMR techniques, in order to detect security-related faults. The proposed techniques either replicate hardware blocks (N-modular redundancy) or use the same hardware block but re-execute the operation multiple times. The faults are detected and corrected locally, and the article does not discuss differentiating a security-related transient fault attack from random transient faults occurring naturally.
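By way of illustration only, fault detection by re-execution on the same hardware block can be sketched as follows. This is a minimal sketch rather than the method of the cited article: the helper `run_with_time_redundancy` and the `GlitchyAdder` class, which models an operation that is corrupted on one chosen invocation, are hypothetical.

```python
from typing import Callable, Tuple


def run_with_time_redundancy(operation: Callable[[], int], runs: int = 2) -> Tuple[int, bool]:
    """Execute the same operation several times and flag a fault if any results differ."""
    results = [operation() for _ in range(runs)]
    fault_detected = any(r != results[0] for r in results[1:])
    return (results[0], fault_detected)


class GlitchyAdder:
    """Hypothetical operation: computes 2 + 3, with one bit flipped on a chosen call."""

    def __init__(self, glitch_on_call: int = 0) -> None:
        self.calls = 0
        self.glitch_on_call = glitch_on_call  # 0 means the operation is never corrupted

    def __call__(self) -> int:
        self.calls += 1
        result = 2 + 3
        if self.calls == self.glitch_on_call:
            result ^= 1  # model a transient fault in the result register
        return result


clean_result, clean_fault = run_with_time_redundancy(GlitchyAdder())
_, injected_fault = run_with_time_redundancy(GlitchyAdder(glitch_on_call=2))
```

As in the article, such a check reports only that two executions disagreed. It carries no information about whether the disagreement was caused by a naturally occurring upset or by a deliberately injected fault.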
The article by C. R. Moratelli, É. Cota, M. S. Lubaszewski, entitled “A Cryptography Core Tolerant to DFA Fault Attacks”, Proceedings of the 19th Annual Symposium on Integrated Circuits and Systems Design (SBCCI '06), 2006, describes cryptographic hardware in the form of a DES (Data Encryption Standard) core being assisted by an accompanying hardware block that is a partially duplicated version of the DES core, in order to detect and correct injected transient faults by re-execution. As with the earlier-discussed article “The Sorcerer's Apprentice Guide to Fault Attacks”, this article uses redundancy mechanisms (the partially duplicated core) and re-execution, and the described techniques do not provide for any differentiation between a security-related transient fault attack and random transient faults occurring naturally.
In the unrelated field of distributed sensors within a sensor network, the article by C. Basile, M. Gupta, Z. Kalbarczyk and R. Iyer entitled “An approach for Detecting and Distinguishing Errors versus Attacks in Sensor Networks”, Proceedings of the 2006 International Conference on Dependable Systems and Networks, DSN'06, describes a technique for dealing with corrupted sensor data, using Hidden Markov Models (HMMs) to diagnose and characterise the nature of detected anomalies with the aim of distinguishing errors from attacks. In particular, by use of these powerful mathematical models, the described technique aims to distinguish faulty data from malicious data in a distributed sensor network. However, HMMs require a model of each error type and each attack type in order to properly identify each attack and each error. The error types that can be detected using the techniques described in the article are non-random errors such as “stuck-at-value”, “calibration” and “additive” errors. These errors are by their nature neither transient nor random, and indeed towards the end of section 3 of the article it is noted that it is difficult to correctly classify a random noise error, as no fixed pattern is observed in the observation symbol probability distribution. Accordingly, such an approach based on HMMs would not be useful within the earlier described field of data processing systems in order to seek to distinguish between random transient faults occurring naturally and a coordinated transient fault attack.
Accordingly, it would be desirable to develop a technique which would enable transient faults occurring within storage elements of a data processing apparatus to be analysed in order to seek to distinguish random transient faults from a coordinated transient fault attack.