1. Field of the Invention
The present invention relates to techniques for handling errors in a data processing apparatus, and more particularly relates to a data processing apparatus and method for handling hard errors that occur in a cache of the data processing apparatus.
2. Description of the Prior Art
There are many applications for data processing systems where fault tolerance is an important issue. One such application is in safety critical systems, for example automotive systems that control air bags, braking systems, etc. One particular area of fault tolerance is tolerance to errors that can occur in the data stored within the data processing system. A typical data processing apparatus may include one or more storage devices used to store data values used by the data processing apparatus. As used herein, the term “data value” will be used to refer to both instructions executed by a processing device of the data processing apparatus, and the data created and used during execution of those instructions.
The storage devices within the data processing apparatus are vulnerable to errors. These errors may be soft errors, caused for example by neutron strikes, where the state of data held in the storage device is changed, but the storage device will still write and read data correctly. Alternatively, the errors may be hard errors, caused for example by electro-migration, in which the affected memory location(s) within the storage device will always store an incorrect data value, and the error cannot be corrected by re-writing the data value to the storage device location(s). Both soft errors and hard errors can often be corrected using known error correction techniques, so that the correct data value can be provided to the requesting device, for example a processor core. However, in the case of a hard error, if the corrected data value is then written back to the same memory location, it will again be stored incorrectly at that memory location, since the hard error stems from a fault in the storage device itself.
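By way of illustration only, the distinction between soft and hard errors may be sketched as follows. The class `FaultyMemory`, its method names and the use of a stuck-at-1 bit mask to represent a hard error are all hypothetical and are assumptions made purely for this example.

```python
class FaultyMemory:
    """Toy word-addressed memory (hypothetical model, not a real device)."""

    def __init__(self, size):
        self.words = [0] * size
        self.stuck_at_one = {}  # address -> mask of bits permanently forced to 1

    def write(self, addr, value):
        self.words[addr] = value

    def read(self, addr):
        # A hard fault corrupts every read, whatever value was written.
        return self.words[addr] | self.stuck_at_one.get(addr, 0)

    def inject_soft_error(self, addr, mask):
        # A soft error flips stored state once; the cell itself still works.
        self.words[addr] ^= mask

    def inject_hard_error(self, addr, mask):
        self.stuck_at_one[addr] = mask


mem = FaultyMemory(4)

# Soft error: re-writing the correct value repairs the location.
mem.write(0, 0b1010)
mem.inject_soft_error(0, 0b0001)
soft_before = mem.read(0)
mem.write(0, 0b1010)
soft_after = mem.read(0)

# Hard error: the same re-write has no lasting effect.
mem.inject_hard_error(1, 0b0001)
mem.write(1, 0b1010)
hard_after = mem.read(1)
```

In this sketch the re-write restores the soft-error location, whereas the hard-error location continues to return an incorrect value on every read.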
As process geometries shrink, and accordingly the storage devices become smaller and smaller, those storage devices become increasingly vulnerable to errors, and hence it is becoming increasingly important in fault tolerant systems to provide robust techniques for detecting such errors.
Often, hard error faults occur due to manufacturing defects. Accordingly, it is known to perform certain hard error detection techniques at production time in order to seek to identify such hard errors. As an example, the article “Nonvolatile Repair Caches Repair Embedded SRAM and New Nonvolatile Memories” by J Fong et al, Proceedings of the 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'04) describes a non-volatile repair cache that can be used to repair random defective memory cells in embedded SRAMs and other memory devices. The repair cache takes the form of a direct mapped cache having multiple entries used to identify predetermined repair addresses. When an access request is issued by a processing unit, the memory address specified by that access request is compared with the predetermined repair addresses identified in the various entries of the repair cache, and in the event of a hit the access proceeds with respect to the data held in a register bank of the repair cache, with the main memory's write or read signal being blocked. In the event of a repair cache miss, the write or read operation is executed within the main memory bank. In addition to a direct mapped repair cache, an n-way set associative repair cache is also discussed. The repair cache is populated at wafer test stage, i.e. during production. Accordingly, whilst the described technique can be used to redirect accesses to addresses where hard errors are detected at production time, the technique does not assist in handling hard errors that occur after production, for example due to process variation and aging, nor is it of any assistance in handling soft errors.
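The redirection behaviour of such a repair cache may be sketched, purely by way of example, as follows. The class names, the modulo indexing used to select an entry and the `main_accesses` counter are assumptions introduced for illustration and are not taken from the cited article.

```python
class RepairCache:
    """Hypothetical direct mapped repair cache with a small register bank."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.repair_addr = [None] * num_entries  # predetermined repair addresses
        self.registers = [0] * num_entries       # replacement storage

    def program(self, addr):
        # Populated once, e.g. at wafer test, with a known defective address.
        self.repair_addr[addr % self.num_entries] = addr

    def hit_index(self, addr):
        idx = addr % self.num_entries
        return idx if self.repair_addr[idx] == addr else None


class RepairedMemory:
    def __init__(self, size, repair_cache):
        self.main = [0] * size
        self.repair = repair_cache
        self.main_accesses = 0  # counts accesses that reach main memory

    def write(self, addr, value):
        idx = self.repair.hit_index(addr)
        if idx is not None:
            self.repair.registers[idx] = value  # main memory write is blocked
        else:
            self.main_accesses += 1
            self.main[addr] = value

    def read(self, addr):
        idx = self.repair.hit_index(addr)
        if idx is not None:
            return self.repair.registers[idx]   # main memory read is blocked
        self.main_accesses += 1
        return self.main[addr]


rc = RepairCache(4)
rc.program(5)                 # address 5 is assumed to be defective
mem = RepairedMemory(16, rc)
mem.write(5, 99)              # redirected to the register bank
mem.write(6, 7)               # ordinary main memory access
```

Because the repair addresses are fixed when the repair cache is programmed, a location that becomes faulty later is not redirected in this sketch, which mirrors the limitation noted above.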
To assist in the detection and handling of errors occurring post production, it is known to store error correction code (ECC) data or the like (generally referred to as error data herein) which can be stored in association with the data values, for reference when seeking to detect any errors in those stored data values.
One known error correction technique which makes use of such error data applies an error correction operation to data values when they are read out from the storage device, and before the data values are supplied to the requesting device. If an error is detected, the process aims to correct the data value using the associated error data and then supplies the corrected data to the requesting device. However, typically the corrected data is not written back to the storage device itself, nor is any attempt made to determine whether the error was a soft error or a hard error.
Whilst such an “in-line” correction technique can handle both hard and soft errors provided they are correctable (i.e. provided sufficient redundant information is available to be able to calculate what the true data value is), it suffers from a number of disadvantages. Firstly, additional logic is required on the read path, and this can adversely affect the timing of the read operation, and also adversely affects power consumption. Such an approach may also require control logic to stall the device performing the read operation (for example a processor pipeline). Additionally, because the data in the storage device is not corrected, there is a possibility that further errors could occur, and that the accumulating errors may change over time from being correctable to uncorrectable, or even undetectable. To seek to address this issue, some data processing systems provide an error “scrubber” mechanism that is used to periodically test and correct the data stored in the storage device. However, this mechanism requires time, and consumes energy.
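An in-line correction of this kind may be sketched as follows, using a Hamming(7,4) single error correcting code as a stand-in for whatever ECC scheme a given system employs. The function names and the dictionary used as storage are hypothetical. The sketch corrects the value on the read path only, leaving the stored codeword untouched, so that a second error accumulating at the same location is no longer correctable.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit codeword (list of bits, positions 1..7)."""
    code = [0] * 8                       # index 0 unused
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (data, error_detected); corrects at most one flipped bit."""
    code = [0] + list(bits)              # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of set bits
    if syndrome:
        code[syndrome] ^= 1              # syndrome names the suspect position
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, syndrome != 0


def inline_read(storage, addr):
    # Correct on the read path only; the stored codeword is not repaired.
    return hamming74_decode(storage[addr])


storage = {0: hamming74_encode(0b1011)}
storage[0][2] ^= 1                       # first error (codeword position 3)
first_value, first_flag = inline_read(storage, 0)
still_corrupt = storage[0] != hamming74_encode(0b1011)

storage[0][4] ^= 1                       # second error (codeword position 5)
second_value, _ = inline_read(storage, 0)
```

After the first error the requester still receives the correct value, but after the second error the decoder miscorrects and returns a wrong value, illustrating why uncorrected storage may call for a scrubber.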
As an alternative to the in-line mechanism described above, it would be possible to detect and correct the data value when it is read, to store the corrected data value back to the memory device, and then to retry the read operation (referred to herein as a correct and retry mechanism). In the case of a soft error, this has the effect of correcting the data in the storage device, and hence when the read operation is retried, the correct data is read. However, if the error is a hard error, then the error will re-occur when the read is retried, and the operation will hence enter a loop where the data value is corrected, but continues to be wrong when re-read from the storage device. In this situation there is the potential for the system to “spin-lock”, trapped in a loop of accessing, attempting correction and retrying, unless mechanisms are in place to spot such behavior and break out of the loop.
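The correct and retry behaviour, and the loop that a hard error provokes, may be sketched as follows. For brevity the sketch uses three redundant copies with a bitwise majority vote in place of a real ECC; that substitution, the class `TripleStore`, the stuck-at-1 mask and the `max_retries` bound are all assumptions made for illustration. The bound stands in for whatever mechanism a real system would use to break out of the loop.

```python
def majority(a, b, c):
    """Bitwise majority vote over three copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStore:
    """One word held as three copies; copy 0 may have stuck-at-1 bits."""

    def __init__(self, stuck_mask=0):
        self.stuck_mask = stuck_mask     # non-zero models a hard error
        self.copies = [0, 0, 0]

    def write(self, value):
        # The faulty cell mis-stores the value on every write.
        self.copies = [value | self.stuck_mask, value, value]


def correct_and_retry(store, max_retries):
    """Return (value, attempts); raise if the error never clears."""
    for attempt in range(1, max_retries + 1):
        a, b, c = store.copies
        if a == b == c:
            return a, attempt            # clean read
        store.write(majority(a, b, c))   # correct in place, then retry
    raise RuntimeError("error persists after retries: probable hard error")


# Soft error: one correction suffices and the retry succeeds.
soft = TripleStore()
soft.write(0b0110)
soft.copies[0] ^= 0b0001
soft_result = correct_and_retry(soft, max_retries=4)

# Hard error: every retry sees the same fault; only the bound ends the loop.
hard = TripleStore(stuck_mask=0b0001)
hard.write(0b0110)
try:
    correct_and_retry(hard, max_retries=4)
    gave_up = False
except RuntimeError:
    gave_up = True
```

Without the retry bound, the second case in this sketch would never terminate, which corresponds to the spin-lock described above.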
Whilst the above issues are generally applicable to any type of storage device provided within the data processing apparatus, further specific issues can arise if the storage device in question is a cache. One or more caches are often provided within a data processing apparatus to temporarily store data values required by a processing unit of the data processing apparatus so as to allow quick access to any such cached data values. As is known in the art, the cache will typically consist of a plurality of cache lines, and for each cache line storing valid data, an address identifier is provided within the cache identifying an address portion which is shared with all of the data values in that cache line. When an access request is issued specifying a memory address associated with a cacheable region of memory, a lookup procedure will be performed in the cache to seek to identify whether a portion of the memory address specified in the access request matches an address identifier in the cache, and if it does the access may proceed directly in the cache without the need to access the memory.
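A lookup procedure of this general kind may be sketched as follows. The source does not specify the cache organisation, so the direct mapped arrangement, the line size `LINE_BYTES`, the line count `NUM_LINES` and all class and function names are assumptions chosen only to make the example concrete.

```python
LINE_BYTES = 32   # assumed cache line size in bytes
NUM_LINES = 64    # assumed number of cache lines (direct mapped)


def split_address(addr):
    """Split a memory address into (tag, index, offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)   # the shared address portion
    return tag, index, offset


class DirectMappedCache:
    def __init__(self):
        self.valid = [False] * NUM_LINES
        self.tags = [None] * NUM_LINES       # address identifier per line

    def allocate(self, addr):
        tag, index, _ = split_address(addr)
        self.tags[index] = tag
        self.valid[index] = True

    def lookup(self, addr):
        """Return True on a hit, i.e. the line is valid and the tag matches."""
        tag, index, _ = split_address(addr)
        return self.valid[index] and self.tags[index] == tag


cache = DirectMappedCache()
cache.allocate(0x1234)
same_line_hit = cache.lookup(0x1230)                            # same line
other_tag_hit = cache.lookup(0x1234 + LINE_BYTES * NUM_LINES)   # same index
```

In this sketch two addresses within the same line share one address identifier and both hit, whereas an address that maps to the same line but carries a different identifier misses.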
If a write through (WT) mode of operation is used for the cache lines, then any write updates made to the cache line contents will be replicated in memory so as to maintain consistency between the cache contents and the memory contents. However, if a write back (WB) mode of operation is employed, then any updates made to the contents of a cache line are not immediately replicated in the corresponding locations in memory. Instead, only when a cache line is later evicted, is the relevant data in memory brought up to date with the contents in the cache line (the need to do this is typically indicated by a dirty bit value, which is set if the cache line contents are written to whilst stored in the cache).
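The difference between the two modes may be sketched as follows. The classes `Line` and `SimpleCache`, the one-word cache lines keyed directly by address and the dictionary standing in for memory are simplifying assumptions made for illustration.

```python
class Line:
    def __init__(self, data):
        self.data = data
        self.dirty = False   # set when the line is written in write back mode


class SimpleCache:
    """Hypothetical cache of one-word lines in front of a dict 'memory'."""

    def __init__(self, memory, write_back):
        self.memory = memory
        self.write_back = write_back
        self.lines = {}

    def write(self, addr, value):
        if addr not in self.lines:
            self.lines[addr] = Line(self.memory.get(addr, 0))
        line = self.lines[addr]
        line.data = value
        if self.write_back:
            line.dirty = True            # memory is updated only on eviction
        else:
            self.memory[addr] = value    # write through: replicate at once

    def evict(self, addr):
        line = self.lines.pop(addr)
        if line.dirty:
            self.memory[addr] = line.data


wt_memory = {8: 1}
wt = SimpleCache(wt_memory, write_back=False)
wt.write(8, 5)
wt_after_write = wt_memory[8]

wb_memory = {8: 1}
wb = SimpleCache(wb_memory, write_back=True)
wb.write(8, 5)
wb_after_write = wb_memory[8]            # still the old value
wb.evict(8)
wb_after_evict = wb_memory[8]
```

The write back case shows why a dirty cache line may hold the only up-to-date copy of a data value until it is evicted.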
Considering the issue of hard or soft errors occurring in a cache, then as with other storage devices error correction code data can be stored in association with the cache contents with the aim of enabling errors to be detected.
If the cache can be arranged as a write through cache, then there are two possible approaches that can be taken on detection of an error in a particular cache line. In accordance with a first technique (which will be referred to herein as an “assume miss and invalidate” approach), the access can simply be considered to have missed in the cache. The data will then be retrieved from a lower level in the memory hierarchy. At the same time, in order to prevent errors accumulating in the cache, the cache line is invalidated. The data retrieved may typically be streamed into the device requesting the data, for example the processor core, but often will be reallocated into the cache. If the original error occurred as the result of a hard error, and the refetched data from memory is allocated into the same cache line, then the next time the data is accessed in the cache, the same error is likely to be detected again. This will potentially cause significant performance degradation.
In accordance with a second, alternative, technique for a write through cache (referred to as an “invalidate and retry” mechanism), on detection of an error in a particular cache line, that cache line can merely be invalidated and the access retried without the need to seek to perform any correction on the data held in the cache line. When the access is retried, a miss will occur in the cache, and the data will be retrieved from a lower level in the memory hierarchy. As with the first technique, this retrieved data may typically be streamed into the device requesting the data, for example the processor core, but often will be reallocated into the cache, so that a cache hit will occur on the next access. If the original error occurred as the result of a hard error then, when the access is retried, the same error is likely to be detected again. The processor will get stuck in a spinlock, continually retrying the access and detecting the error.
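The “invalidate and retry” behaviour on a write through cache may be sketched as follows. The single-line cache, the use of one parity bit for error detection, the stuck-at-1 mask modelling a hard error and the `max_retries` bound are all assumptions introduced for this example. The sketch also assumes that each retried access reads the data from the newly reallocated line.

```python
def parity(value):
    """Even-parity bit of a non-negative integer."""
    return bin(value).count("1") & 1


class WriteThroughCache:
    """Hypothetical one-line write through cache with parity detection."""

    def __init__(self, memory, stuck_mask=0):
        self.memory = memory
        self.stuck_mask = stuck_mask   # non-zero models a hard error in the line
        self.valid = False
        self.tag = None
        self.data = 0
        self.par = 0
        self.refills = 0

    def _fill(self, addr):
        value = self.memory[addr]
        self.tag, self.valid = addr, True
        self.data = value | self.stuck_mask   # faulty cells mis-store the data
        self.par = parity(value)
        self.refills += 1

    def read(self, addr, max_retries):
        for _ in range(max_retries):
            if not (self.valid and self.tag == addr):
                self._fill(addr)       # miss: refetch and reallocate
            if parity(self.data) == self.par:
                return self.data
            self.valid = False         # error detected: invalidate and retry
        raise RuntimeError("retry limit reached: probable hard error")


# Soft error: one invalidate and refetch clears it.
soft = WriteThroughCache({0x40: 0b0110})
soft.read(0x40, max_retries=4)
soft.data ^= 0b0001
soft_value = soft.read(0x40, max_retries=4)

# Hard error: every reallocation into the faulty line fails again.
hard = WriteThroughCache({0x40: 0b0110}, stuck_mask=0b0001)
try:
    hard.read(0x40, max_retries=4)
    gave_up = False
except RuntimeError:
    gave_up = True
```

In the hard error case each retry costs a further refetch from the lower level of the memory hierarchy, and without the bound the loop in this sketch would not terminate.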
The problems become even more complex if the cache is at least partially a write back cache, since if an error is detected in a cache line using such a write back mechanism, then it is not sufficient merely to invalidate the cache line; instead, the cache line contents must first be corrected and then evicted to memory. Accordingly the “assume miss and invalidate” approach that can be applied to a write through cache cannot be used for a write back cache, because the cache line with the error in it may be valid and dirty, and hence if the first technique were used the dirty data would be lost. The “invalidate and retry” approach can be used, but as part of the invalidate operation the cache line will need to be corrected (i.e. a correct and retry style operation is needed). This applies not only to the data values in the cache line itself, but also to the associated address identifier, and associated control data such as the valid bit indicating if the cache line is valid and the dirty bit indicating if the cache line is dirty, since all of these contents may potentially be subject to errors. Hence, by way of example, if the valid bit is itself corrupted by an error, a cache line that holds valid data may appear from the associated valid bit not to hold valid data. Accordingly, when adopting a write back mode of operation in a cache, it may be necessary to perform error detection and correction even on cache lines that at face value appear to be invalid.
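The loss of dirty data under the “assume miss and invalidate” approach may be sketched as follows. Three redundant copies with a majority vote again stand in for a real ECC, and the class `DirtyLine`, both function names and the dictionary standing in for memory are hypothetical. The sketch covers errors in the data values only, not in the address identifier or the valid and dirty bits.

```python
def majority(a, b, c):
    """Bitwise majority vote over three copies."""
    return (a & b) | (a & c) | (b & c)


class DirtyLine:
    """Hypothetical valid, dirty cache line protected by redundancy."""

    def __init__(self, addr, value):
        self.addr = addr
        self.copies = [value, value, value]   # stand-in for data plus ECC
        self.valid = True
        self.dirty = True


def assume_miss_and_invalidate(line, memory):
    # Treat the access as a miss: drop the line and use the memory copy.
    line.valid = False
    return memory[line.addr]                  # stale if the line was dirty


def correct_and_evict(line, memory):
    # Correct the line contents first, then write dirty data back to memory.
    value = majority(*line.copies)
    if line.dirty:
        memory[line.addr] = value
    line.valid = False
    line.dirty = False
    return memory[line.addr]


memory_a = {0x40: 1}                          # memory holds an old value
line_a = DirtyLine(0x40, 9)                   # cache holds the newer value 9
line_a.copies[0] ^= 0b0100                    # error in the cached data
lost = assume_miss_and_invalidate(line_a, memory_a)

memory_b = {0x40: 1}
line_b = DirtyLine(0x40, 9)
line_b.copies[0] ^= 0b0100
kept = correct_and_evict(line_b, memory_b)
```

In the first case the requester receives the stale value from memory and the newer value is discarded, whereas in the second case the corrected value reaches memory before the line is invalidated.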
A number of papers have been published concerning the detection and handling of errors occurring in caches. For example, the article “PADded Cache: A New Fault-Tolerance Technique for Cache Memories”, by P Shirvani et al, Center for Reliable Computing, Stanford University, 17th (1999) IEEE VLSI Test Symposium, describes a technique that uses a special programmable address decoder (PAD) to disable faulty blocks in a cache and to re-map their references to healthy blocks. In particular, a decoder used in a cache is modified to make it programmable so that it can implement different mapping functions. A group of flip-flops within the decoder are connected as a shift register and loaded using special instructions. Accordingly, it will be appreciated that the approach described therein is one that would be employed as part of a Built-In Self Test (BIST) procedure, and hence requires the faulty blocks in the cache to be identified, and the programmable address decoder programmed, prior to normal operation of the data processing apparatus. The technique hence cannot be used to handle errors that only manifest themselves during normal operation.
The article “Performance of Graceful Degradation for Cache Faults” by H Lee et al, IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07) examines several strategies for masking faults, by disabling faulty resources such as lines, sets, ways, ports or even the whole cache. A cache set remapping scheme is also discussed for recovering lost performance due to failed sets. As explained in Section 5.2, it is assumed that the faults in the cache memory are detected and necessary cache reconfiguration is done before program execution. Hence, as with the earlier-mentioned article, the techniques described therein cannot be used to handle errors that manifest themselves during normal operation, for example soft errors, or hard errors that occur for example through aging.
The article “Power4 System Design for High Reliability” by D Bossen et al, IBM, pages 16 to 24, IEEE Micro, March-April 2002, provides a general discussion of fault tolerance, and describes some specific schemes employed in association with a cache. A level 1 data cache is identified which is arranged as a store-through design (equivalent to the write through design mentioned earlier), so as to allow error recovery by flushing the affected cache line and refetching the data from a level 2 cache. The paper also discusses use of hardware and firmware to track whether the particular ECC mechanism corrects permanent errors beyond a certain threshold, and after exceeding this threshold the system creates a deferred repair error log entry. Using these error log entries, mechanisms such as a cache line delete mechanism can be used to remove a faulty cache line from service. A BIST-based mechanism is also described where programmable steering logic permits access to cache arrays to replace faulty bits. Hence, it can be seen that the techniques described in this paper involve either arranging the cache as a simple write through cache, or alternatively require complex techniques to maintain logs of errors and make decisions based on the log entries, such techniques consuming significant power and taking up significant area within the data processing apparatus. There are many applications where such power and area hungry mechanisms will not be acceptable. Further, there is no discussion of the earlier-mentioned problems that can occur particularly in write back caches, and in particular no discussion as to how hard errors in such write back caches could be handled.
Accordingly, it would be desirable to provide a simple and effective mechanism for handling errors occurring within a cache of a data processing apparatus, which can yield improved performance relative to the earlier-mentioned “in-line” correction mechanisms, and which can be used not only in association with write through caches but also write back caches.