1. Field of the Invention
The present invention relates to error correction of data values stored in a data storage device.
2. Description of the Prior Art
There are many applications for data processing systems where fault tolerance is an important issue. One such application is in safety critical systems, for example automotive systems that control air bags, braking systems, etc. One particular area of fault tolerance is tolerance to errors that can occur in the data stored within the data processing system. A typical data processing apparatus may include one or more storage devices used to store data values used by the data processing apparatus. As used herein, the term “data value” will be used to refer to both instructions executed by a processing device of the data processing apparatus, and the data created and used during execution of those instructions.
The storage devices within the data processing apparatus are vulnerable to errors. These errors may be soft errors, as for example may be caused by neutron strikes, where the state of data held in the storage device can be changed, but the storage device will still write and read data correctly. Alternatively, the errors may be hard errors, as for example caused by electro-migration, in which the affected memory location(s) within the storage device will always store an incorrect data value, and the error cannot be corrected by re-writing the data value to the storage device location(s). Both soft errors and hard errors can often be corrected using known error correction techniques, so that the correct data value can be provided to the requesting device, for example a processor core. However, for the example of a hard error, if the corrected data value is then written back to the same memory location, it will again be stored incorrectly at that memory location, since the hard error stems from a fault in the storage device itself.
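By way of illustration only, the distinction between soft errors and hard errors described above may be sketched as a toy memory model. The class name `FaultyMemory`, its methods and the stuck-at fault representation are assumptions made for this sketch and are not taken from any particular storage device.

```python
class FaultyMemory:
    """Toy word-addressed memory with optional stuck-at (hard) faults."""

    def __init__(self, size, width=8):
        self.width = width
        self.cells = [0] * size
        # Hypothetical fault map: address -> (mask of stuck bits, stuck values)
        self.stuck = {}

    def add_hard_fault(self, addr, bit, stuck_value):
        """Mark one bit of one location as permanently stuck at 0 or 1."""
        mask, values = self.stuck.get(addr, (0, 0))
        mask |= 1 << bit
        if stuck_value:
            values |= 1 << bit
        else:
            values &= ~(1 << bit)
        self.stuck[addr] = (mask, values)

    def inject_soft_error(self, addr, bit):
        """Flip one stored bit; the cell itself still works, so a later
        write restores the correct state."""
        self.cells[addr] ^= 1 << bit

    def write(self, addr, value):
        self.cells[addr] = value & ((1 << self.width) - 1)

    def read(self, addr):
        mask, values = self.stuck.get(addr, (0, 0))
        # Stuck bits override whatever was last written to the location.
        return (self.cells[addr] & ~mask) | (values & mask)
```

In this sketch, re-writing a location clears a soft error, whereas a location with a stuck bit continues to return the incorrect value however many times it is re-written.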
As process geometries shrink, and accordingly the storage devices become smaller and smaller, those storage devices become increasingly vulnerable to errors, and hence it is becoming increasingly important in fault tolerant systems to provide robust techniques for detecting such errors.
Often, hard error faults occur due to manufacturing defects. Accordingly, it is known to perform certain hard error detection techniques at production time in order to seek to identify such hard errors. As an example, the article “Nonvolatile Repair Caches Repair Embedded SRAM and New Nonvolatile Memories” by J Fong et al, Proceedings of the 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'04) describes a non-volatile repair cache that can be used to repair random defective memory cells in embedded SRAMs and other memory devices. The repair cache takes the form of a direct mapped cache having multiple entries used to identify predetermined repair addresses. When an access request is issued by a processing unit, the memory address specified by that access request is compared with the predetermined repair addresses identified in the various entries of the repair cache, and in the event of a hit the access proceeds with respect to the data held in a register bank of the repair cache, with the main memory's write or read signal being blocked. In the event of a repair cache miss, then the write or read operations will be executed within the main memory bank. In addition to a direct mapped repair cache, an n way set associative repair cache is also discussed. The repair cache is populated at wafer test stage, i.e. during production. Accordingly, whilst the described technique can be used to redirect accesses to addresses where hard errors are detected at production time, the technique does not assist in handling hard errors that occur after production, for example due to process variation and aging, nor is it of any assistance in handling soft errors.
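The redirection behaviour of a direct mapped repair cache of the general kind described above may be sketched as follows. This is a simplified illustration only: the class names, the use of the address modulo the number of entries as the index, and the use of a Python list as the main memory are assumptions made for the sketch and are not details taken from the cited article.

```python
class RepairCache:
    """Direct mapped table of predetermined repair addresses."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.tags = [None] * num_entries  # repair address held by each entry
        self.regs = [0] * num_entries     # register bank replacing bad cells

    def index(self, addr):
        # Assumed mapping: low-order part of the address selects the entry.
        return addr % self.num_entries

    def program(self, addr):
        """Record a defective address (performed once, at production test)."""
        i = self.index(addr)
        if self.tags[i] is not None and self.tags[i] != addr:
            raise ValueError("entry already holds another repair address")
        self.tags[i] = addr

    def hit(self, addr):
        return self.tags[self.index(addr)] == addr


class RepairedMemory:
    """Main memory whose accesses are filtered through a repair cache."""

    def __init__(self, main, cache):
        self.main = main
        self.cache = cache

    def write(self, addr, value):
        if self.cache.hit(addr):
            # Hit: the main memory write is blocked; the register bank is used.
            self.cache.regs[self.cache.index(addr)] = value
        else:
            self.main[addr] = value

    def read(self, addr):
        if self.cache.hit(addr):
            return self.cache.regs[self.cache.index(addr)]
        return self.main[addr]
```

Because `program` is only invoked before the memory is put into service, the sketch shares the limitation noted above: an address that becomes faulty later is never redirected.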
To assist in the detection and handling of errors occurring post production, it is known to store error correction code (ECC) data or the like (generally referred to herein as error data) in association with the data values, for reference when seeking to detect any errors in those stored data values.
One known error correction technique which makes use of such error data applies an error correction operation to data values when they are read out from the storage device, and before the data values are supplied to the requesting device. If an error is detected, the process aims to correct the data value using the associated error data and then supplies the corrected data to the requesting device. However, typically the corrected data is not written back to the storage device itself, nor is any attempt made to determine whether the error was a soft error or a hard error.
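An in-line correction of this general kind may be illustrated with a single-error-correcting Hamming(7,4) code. The choice of code, the function names and the bit layout are assumptions made for this sketch; the technique described above does not prescribe any particular error correction code.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused; codeword positions are 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return sum(code[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_read(word):
    """Return (data, error_detected), correcting at most one flipped bit."""
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid word
    if syndrome:
        word ^= 1 << (syndrome - 1)  # syndrome names the erroneous position
    data = 0
    for i, pos in enumerate((3, 5, 6, 7)):
        data |= ((word >> (pos - 1)) & 1) << i
    return data, syndrome != 0


def inline_read(storage, addr):
    """Supply corrected data to the requester; storage is left unchanged."""
    data, _ = hamming74_read(storage[addr])
    return data
```

Note that `inline_read` neither writes the corrected codeword back to `storage` nor records whether the error was soft or hard, mirroring the behaviour described above.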
Whilst such an “in-line” correction technique can handle both hard and soft errors provided they are correctable (i.e. provided sufficient redundant information is available to be able to calculate what the true data value is), it suffers from a number of disadvantages. Firstly, additional logic is required on the read path, which can adversely affect both the timing of the read operation and power consumption. Such an approach may also require control logic to stall the device performing the read operation (for example a processor pipeline). Additionally, because the data in the storage device is not corrected, there is a possibility that further errors could occur, and that the accumulating errors may change over time from being correctable to uncorrectable, or even undetectable. To seek to address this issue, some data processing systems provide an error “scrubber” mechanism that is used to periodically test and correct the data stored in the storage device. However, this mechanism requires time, and consumes energy.
As an alternative to the in-line mechanism described above, one could detect and correct the data value when it is read, store the corrected data value back to the storage device, and then retry the read operation (referred to herein as a “correct and retry” mechanism). In the case of a soft error, this has the effect of correcting the data in the storage device, and hence when the read operation is retried, the correct data is read. However, if the error is a hard error, then the error will re-occur when the read is retried, and the operation will hence enter a loop where the data value is corrected, but continues to be wrong when re-read from the storage device. In this situation there is the potential for the system to “spin-lock”, trapped in a loop of accessing, attempting correction and retrying, unless mechanisms are in place to spot such a behavior and break out of the loop.
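The behaviour of a “correct and retry” mechanism under soft and hard errors may be sketched as follows. For brevity the sketch substitutes triplicated storage with majority voting for an ECC, models a hard error as bits of one copy stuck at 1, and breaks the loop with a simple retry limit; the names `TripleStore`, `majority`, `correct_and_retry` and `max_retries` are all assumptions made for illustration, and the retry limit is only one possible way of escaping the loop.

```python
class TripleStore:
    """One location held as three copies; copy 0 may have stuck-at-1 bits."""

    def __init__(self, stuck_mask=0):
        self.stuck_mask = stuck_mask  # bits of copy 0 stuck at 1 (hard fault)
        self.copies = [stuck_mask, 0, 0]

    def write(self, value):
        # The stuck bits of copy 0 ignore whatever value is written.
        self.copies = [value | self.stuck_mask, value, value]

    def flip(self, copy, bit):
        """Inject a soft error into one copy."""
        self.copies[copy] ^= 1 << bit

    def read_raw(self):
        return tuple(self.copies)


def majority(a, b, c):
    """Bitwise majority vote across the three copies."""
    return (a & b) | (a & c) | (b & c)


def correct_and_retry(store, max_retries=3):
    """Return (value, attempts, gave_up).

    Without the retry limit, a hard error would keep this loop running
    indefinitely, since every write-back is stored incorrectly again.
    """
    for attempt in range(1, max_retries + 1):
        a, b, c = store.read_raw()
        if a == b == c:
            return a, attempt, False
        store.write(majority(a, b, c))  # correct in place, then retry
    # Error persists after every write-back: suspected hard error.
    a, b, c = store.read_raw()
    return majority(a, b, c), max_retries, True
```

With a soft error, the second read attempt in this sketch succeeds; with a stuck bit that conflicts with the written value, every retry fails and the function gives up after `max_retries` attempts while still supplying the corrected value.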
Three other articles discuss varieties of error correction in the context of caches: “PADded Cache: A New Fault-Tolerance Technique for Cache Memories”, by P Shirvani et al, Center for Reliable Computing, Stanford University, 17th (1999) IEEE VLSI Test Symposium; “Performance of Graceful Degradation for Cache Faults” by H Lee et al, IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07); and “Power4 System Design for High Reliability” by D Bossen et al, IBM, pages 16 to 24, IEEE Micro, March-April 2002.
It is desirable to provide an improved manner of handling errors occurring in data values stored in a data storage device.