1. Field of the Invention
The present invention relates to techniques for handling errors in a data processing apparatus, and more particularly relates to a data processing apparatus and method for automatically handling hard errors that occur in a cache of the data processing apparatus.
2. Description of the Prior Art
There are many applications for data processing systems where fault tolerance is an important issue. One such application is in safety critical systems, for example automotive systems that control air bags, braking systems, etc. One particular area of fault tolerance is tolerance to errors that can occur in the data stored within the data processing system. A typical data processing apparatus may include one or more storage devices used to store data values used by the data processing apparatus. As used herein, the term “data value” will be used to refer to both instructions executed by a processing device of the data processing apparatus, and the data created and used during execution of those instructions.
The storage devices within the data processing apparatus are vulnerable to errors. These errors may be soft errors, as for example may be caused by neutron strikes, where the state of data held in the storage device can be changed, but the storage device will still write and read data correctly. Such soft errors are also referred to as transient faults. Alternatively, the errors may be hard errors, as for example caused by electro-migration, in which the affected memory location(s) within the storage device will always store an incorrect data value, and the error cannot be corrected by re-writing the data value to the storage device location(s). Such hard errors are also referred to as permanent faults. Both soft errors and hard errors can often be corrected using known error correction techniques, so that the correct data value can be provided to the requesting device, for example a processor core. However, for the example of a hard error, if the corrected data value is then written back to the same memory location, it will again be stored incorrectly at that memory location, since the hard error stems from a fault in the storage device itself.
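By way of illustration only, the distinction between a soft error and a hard error can be sketched with a small software model. The `MemoryCell` class, its `stuck_mask` and `stuck_value` parameters, and the `classify_by_rewrite` helper are hypothetical names introduced for this sketch. They model a stuck-at fault under simplifying assumptions and do not describe any particular storage device.

```python
class MemoryCell:
    """One word of storage with an optional stuck-at (hard) fault. Hypothetical model."""

    def __init__(self, width=8, stuck_mask=0, stuck_value=0):
        self.width = width
        self.stuck_mask = stuck_mask    # bits that are physically stuck
        self.stuck_value = stuck_value  # the values those bits are stuck at
        self._bits = 0

    def write(self, value):
        self._bits = value & ((1 << self.width) - 1)

    def read(self):
        # Stuck bits override whatever was written to them.
        return (self._bits & ~self.stuck_mask) | (self.stuck_value & self.stuck_mask)

    def upset(self, bit):
        """Model a soft error: flip one stored bit (for example a particle strike)."""
        self._bits ^= 1 << bit


def classify_by_rewrite(cell, correct_value):
    """Rewrite the correct value and re-read it.

    A soft error disappears after the rewrite; a hard error persists.
    """
    cell.write(correct_value)
    return "soft" if cell.read() == correct_value else "hard"
```

In this model a stuck bit whose stuck value happens to match the written data produces no observable error, so a hard fault may remain latent until different data is stored at that location.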
As well as permanent faults and transient faults, another type of error which can occur is an intermittent fault, such a fault for example being caused by certain environmental conditions in which the storage device operates. Whilst those fault triggering environmental conditions are present, the intermittent fault appears as a hard error, but the fault disappears when the environmental conditions change to be more favourable.
As process geometries shrink, and accordingly the storage devices become smaller and smaller, those storage devices become increasingly vulnerable to errors, and hence it is becoming increasingly important in fault tolerant systems to provide robust techniques for detecting such errors. For example, the articles “Impact of Deep Submicron Technology on Dependability of VLSI Circuits” by C Constantinescu, 0-7695-1597-5/02/$17.00 © 2002 IEEE, and “Reliability Challenges for 45 nm and Beyond” by J W McPherson, DAC 2006, Jul. 24-28, 2006, San Francisco, Calif., USA, identify that reduced process geometries give rise to higher occurrences of faults, especially transient and intermittent faults.
Often, hard error faults occur due to manufacturing defects. Accordingly, it is known to perform certain hard error detection techniques at production time in order to seek to identify such hard errors. As an example, the article “Nonvolatile Repair Caches Repair Embedded SRAM and New Nonvolatile Memories” by J Fong et al, Proceedings of the 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT'04) describes a non-volatile repair cache that can be used to repair random defective memory cells in embedded SRAMs and other memory devices. The repair cache takes the form of a direct mapped cache having multiple entries used to identify predetermined repair addresses. When an access request is issued by a processing unit, the memory address specified by that access request is compared with the predetermined repair addresses identified in the various entries of the repair cache, and in the event of a hit the access proceeds with respect to the data held in a register bank of the repair cache, with the main memory's write or read signal being blocked. In the event of a repair cache miss, then the write or read operations will be executed within the main memory bank. In addition to a direct mapped repair cache, an n way set associative repair cache is also discussed. The repair cache is populated at wafer test stage, i.e. during production. Accordingly, whilst the described technique can be used to redirect accesses to addresses where hard errors are detected at production time, the technique does not assist in handling hard errors that occur after production, for example due to process variation and aging, nor is it of any assistance in handling soft errors.
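The redirection behaviour of such a direct mapped repair cache can be sketched as follows. This is a minimal sketch under stated assumptions, not the design of the cited article: the `RepairCache` and `RepairedMemory` classes, the modulo indexing, and the refusal to program two repair addresses into the same entry are all hypothetical choices made for illustration.

```python
class RepairCache:
    """Direct mapped table of repair addresses, each with a replacement register."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.tags = [None] * num_entries  # predetermined repair address per entry
        self.regs = [0] * num_entries     # register bank holding the redirected data

    def program(self, repair_address):
        """Populate one entry, as would be done at the wafer test stage."""
        index = repair_address % self.num_entries
        if self.tags[index] not in (None, repair_address):
            raise ValueError("entry already holds a different repair address")
        self.tags[index] = repair_address

    def hit(self, address):
        return self.tags[address % self.num_entries] == address


class RepairedMemory:
    """Main memory whose accesses are redirected on a repair cache hit."""

    def __init__(self, size, repair_cache):
        self.main = [0] * size
        self.repair = repair_cache

    def write(self, address, value):
        if self.repair.hit(address):
            # Hit: the main memory write is blocked and the register is used.
            self.repair.regs[address % self.repair.num_entries] = value
        else:
            self.main[address] = value

    def read(self, address):
        if self.repair.hit(address):
            return self.repair.regs[address % self.repair.num_entries]
        return self.main[address]
```

Because `program` is only called before normal operation in this sketch, a location that fails later is never redirected, which mirrors the limitation noted above.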
To assist in the detection and handling of errors occurring post production, it is known to store error correction code (ECC) data or the like (generally referred to as error data herein) in association with the data values, for reference when seeking to detect any errors in those stored data values.
One known error correction technique which makes use of such error data applies an error correction operation to data values when they are read out from the storage device, and before the data values are supplied to the requesting device. If an error is detected, the process aims to correct the data value using the associated error data and then supplies the corrected data to the requesting device. However, typically the corrected data is not written back to the storage device itself, nor is any attempt made to determine whether the error was a soft error or a hard error.
Whilst such an “in-line” correction technique can handle both hard and soft errors provided they are correctable (i.e. provided sufficient redundant information is available to be able to calculate what the true data value is), it suffers from a number of disadvantages. Firstly, additional logic is required on the read path, and this can adversely affect the timing of the read operation, and also adversely affects power consumption. Such an approach may also require control logic to stall the device performing the read operation (for example a processor pipeline). Additionally, because the data in the storage device is not corrected, there is a possibility that further errors could occur, and that the accumulating errors may change over time from being correctable to uncorrectable, or even undetectable. To seek to address this issue, some data processing systems provide an error “scrubber” mechanism that is used to periodically test and correct the data stored in the storage device. However, this mechanism requires time, and consumes energy.
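As a concrete illustration of in-line correction, the sketch below uses a Hamming(7,4) single error correcting code. The choice of code is an assumption made purely for illustration, since no particular ECC scheme is specified above, and the function names are hypothetical. The `inline_read` function supplies the corrected value to the requester but leaves the stored word untouched, which is the property that allows errors to accumulate.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value as 7 bits (positions 1..7, parity at 1, 2 and 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused so that indices match bit positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (corrected 4-bit value, syndrome). Syndrome 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        code[syndrome] ^= 1  # a single-bit error sits at position `syndrome`
    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return value, syndrome


def inline_read(storage, index):
    """In-line correction: fix the value on the read path only.

    The stored codeword is deliberately not rewritten.
    """
    value, _syndrome = hamming74_decode(storage[index])
    return value
```

A code of this kind corrects only one flipped bit per word, so a second error arriving in the same uncorrected word would be miscorrected rather than repaired.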
As an alternative to the in-line mechanism described above, one could detect and correct the data value when it is read, store the corrected data value back to the memory device, and then retry the read operation (referred to herein as a correct and retry mechanism). In the case of a soft error, this has the effect of correcting the data in the storage device, and hence when the read operation is retried, the correct data is read. However, if the error is a hard error, then the error will re-occur when the read is retried, and the operation will hence enter a loop where the data value is corrected, but continues to be wrong when re-read from the storage device. In this situation there is the potential for the system to “spin-lock”, trapped in a loop of accessing, attempting correction and retrying, unless mechanisms are in place to spot such a behaviour and break out of the loop.
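The looping behaviour of a correct and retry mechanism, and one simple way of breaking out of it, can be sketched as follows. The `StuckStore` class, the `expected` argument (which stands in for the value an ECC decoder would reconstruct) and the `max_retries` bound are all assumptions introduced for this sketch. A bounded retry count is only one possible escape mechanism.

```python
class StuckStore:
    """Hypothetical word store. `stuck_at_one` maps an address to bits forced to 1."""

    def __init__(self, size, stuck_at_one=None):
        self.words = [0] * size
        self.stuck = stuck_at_one or {}

    def write(self, addr, value):
        self.words[addr] = value

    def read(self, addr):
        return self.words[addr] | self.stuck.get(addr, 0)


def read_with_correct_and_retry(store, addr, expected, max_retries=3):
    """Correct, write back and retry, giving up after `max_retries` retries.

    `expected` stands in for the corrected value produced by error
    correction logic (an assumption of this sketch).
    Returns (value, number of retries that were needed).
    """
    for attempt in range(max_retries + 1):
        value = store.read(addr)
        if value == expected:
            return value, attempt
        store.write(addr, expected)  # write the corrected value back, then retry
    # Without the bound above, a hard error would loop here indefinitely.
    raise RuntimeError("error persists after retries: possible hard error")
```

With a soft error the first write back repairs the stored word and the retry succeeds. With a stuck bit every retry observes the same error, and only the retry bound prevents the loop from running forever.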
Whilst the above issues are generally applicable to any type of storage device provided within the data processing apparatus, further specific issues can arise if the storage device in question is a cache. One or more caches are often provided within a data processing apparatus to temporarily store data values required by a processing unit of the data processing apparatus so as to allow quick access to any such cached data values. As is known in the art, the cache will typically consist of a plurality of cache lines, and for each cache line storing valid data, an address identifier is provided within the cache identifying an address portion which is shared with all of the data values in that cache line. When an access request is issued specifying a memory address associated with a cacheable region of memory, a lookup procedure will be performed in the cache to seek to identify whether a portion of the memory address specified in the access request matches an address identifier in the cache, and if it does the access may proceed directly in the cache without the need to access the memory.
If a write through (WT) mode of operation is used for the cache lines, then any write updates made to the cache line contents will be replicated in memory so as to maintain consistency between the cache contents and the memory contents. However, if a write back (WB) mode of operation is employed, then any updates made to the contents of a cache line are not immediately replicated in the corresponding locations in memory. Instead, only when a cache line is later evicted, is the relevant data in memory brought up to date with the contents in the cache line (the need to do this is typically indicated by a dirty bit value, which is set if the cache line contents are written to whilst stored in the cache).
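The difference between the two modes can be sketched with a small direct mapped cache model holding one word per line. The `SimpleCache` class and its fields are hypothetical names introduced for illustration, and the model omits everything not needed to show when memory is updated.

```python
class SimpleCache:
    """Direct mapped cache, one word per line, in write through or write back mode."""

    def __init__(self, num_lines, memory, write_back):
        self.num_lines = num_lines
        self.memory = memory          # a list acting as the backing memory
        self.write_back = write_back
        self.lines = [{"valid": False, "dirty": False, "tag": None, "data": None}
                      for _ in range(num_lines)]

    def _evict(self, index):
        line = self.lines[index]
        if line["valid"] and line["dirty"]:
            # Only now is memory brought up to date with the cache line.
            self.memory[line["tag"] * self.num_lines + index] = line["data"]
        line.update(valid=False, dirty=False, tag=None, data=None)

    def _allocate(self, address):
        index, tag = address % self.num_lines, address // self.num_lines
        line = self.lines[index]
        if not (line["valid"] and line["tag"] == tag):
            self._evict(index)
            line.update(valid=True, dirty=False, tag=tag, data=self.memory[address])
        return line

    def write(self, address, value):
        line = self._allocate(address)
        line["data"] = value
        if self.write_back:
            line["dirty"] = True          # memory is updated later, on eviction
        else:
            self.memory[address] = value  # write through: update memory immediately

    def read(self, address):
        return self._allocate(address)["data"]
```

In write back mode the only up-to-date copy of a dirty line lives in the cache until eviction, which is why an error in such a line cannot be handled by simply discarding it.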
As with other storage devices, error correction code data can be stored in association with the cache contents, with the aim of enabling hard or soft errors occurring in the cache to be detected.
If the cache can be arranged as a write through cache, then there are two possible approaches that can be taken on detection of an error in a particular cache line. In accordance with a first technique (which will be referred to herein as an “assume miss and invalidate” approach), the access can simply be considered to have missed in the cache. The data will then be retrieved from a lower level in the memory hierarchy. At the same time, in order to prevent errors accumulating in the cache, the cache line is invalidated. The data retrieved may typically be streamed into the device requesting the data, for example the processor core, but often will be reallocated into the cache. If the original error occurred as the result of a hard error, and the refetched data from memory is allocated into the same cache line, then the next time the data is accessed in the cache, the same error is likely to be detected again. This will potentially cause significant performance degradation.
In accordance with a second, alternative, technique for a write through cache (referred to as an “invalidate and retry” mechanism), on detection of an error in a particular cache line, that cache line can merely be invalidated and the access retried without the need to seek to perform any correction on the data held in the cache line. When the access is retried, a miss will occur in the cache, and the data will be retrieved from a lower level in the memory hierarchy. As with the first technique, this retrieved data may typically be streamed into the device requesting the data, for example the processor core, but often will be reallocated into the cache, so that a cache hit will occur on the next access. If the original error occurred as the result of a hard error then, when the access is retried, the same error is likely to be detected again. The processor will get stuck in a spinlock, continually retrying the access and detecting the error.
The problems become even more complex if the cache is at least partially a write back cache, since if an error is detected in a cache line using such a write back mechanism, then it is not merely sufficient to invalidate the cache line, but instead the cache line contents must first be corrected and then evicted to memory. Accordingly the “assume miss and invalidate” approach that can be applied to a write through cache cannot be used for a write back cache, because the cache line with the error in it may be valid and dirty, and hence if the first technique were used the dirty data would be lost. The “invalidate and retry” approach can be used, but as part of the invalidate operation the cache line will need to be corrected (i.e. a correct and retry style operation is needed). This applies not only to the data values in the cache line itself, but also to the associated address identifier, and associated control data such as the valid bit indicating if the cache line is valid and the dirty bit indicating if the cache line is dirty, since all of these contents may potentially be subject to errors. Hence, by way of example, if the valid bit is itself corrupted by an error, the cache line that holds valid data may appear from the associated valid bit to not hold valid data. Accordingly, when adopting a write back mode of operation in a cache, it may be necessary to perform error detection and correction even on cache lines that on face value appear to be invalid.
A number of papers have been published concerning the detection and handling of errors occurring in caches. For example, the article “PADded Cache: A New Fault-Tolerance Technique for Cache Memories”, by P Shirvani et al, Center for Reliable Computing, Stanford University, 17th (1999) IEEE VLSI Test Symposium, describes a technique that uses a special programmable address decoder (PAD) to disable faulty blocks in a cache and to re-map their references to healthy blocks. In particular, a decoder used in a cache is modified to make it programmable so that it can implement different mapping functions. A group of flip-flops within the decoder are connected as a shift register and loaded using special instructions. Accordingly, it will be appreciated that the approach described therein is one that would be employed as part of a Built-In Self Test (BIST) procedure, and hence requires the faulty blocks in the cache to be identified, and the programmable address decoder programmed, prior to normal operation of the data processing apparatus. The technique can hence not be used to handle errors that only manifest themselves during normal operation.
The article “Performance of Graceful Degradation for Cache Faults” by H Lee et al, IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07) examines several strategies for masking faults, by disabling faulty resources such as lines, sets, ways, ports or even the whole cache. A cache set remapping scheme is also discussed for recovering lost performance due to failed sets. As explained in Section 5.2, it is assumed that the faults in the cache memory are detected and necessary cache reconfiguration is done before program execution. Hence, as with the earlier-mentioned article, the techniques described therein cannot be used to handle errors that manifest themselves during normal operation, for example soft errors, or hard errors that occur for example through aging.
The article “Power4 System Design for High Reliability” by D Bossen et al, IBM, pages 16 to 24, IEEE Micro, March-April 2002, provides a general discussion of fault tolerance, and describes some specific schemes employed in association with a cache. A level 1 data cache is identified which is arranged as a store-through design (equivalent to the write through design mentioned earlier), so as to allow error recovery by flushing the affected cache line and refetching the data from a level 2 cache. The paper also discusses use of hardware and firmware to track whether the particular ECC mechanism corrects permanent errors beyond a certain threshold, and after exceeding this threshold the system creates a deferred repair error log entry. Using these error log entries, mechanisms such as a cache line delete mechanism can be used to remove a faulty cache line from service. A BIST-based mechanism is also described where programmable steering logic permits access to cache arrays to replace faulty bits. Hence, it can be seen that the techniques described in this paper involve either arranging the cache as a simple write through cache, or alternatively require complex techniques to maintain logs of errors and make decisions based on the log entries, such techniques consuming significant power and taking up significant area within the data processing apparatus. Moreover, the paper implies that the development and configuration of dedicated firmware is the appropriate way to handle faults. There are many applications where such power and area hungry mechanisms will not be acceptable. Further, there is no discussion of the earlier-mentioned problems that can occur particularly in write back caches, and in particular no discussion as to how hard errors in such write back caches could be handled.
With the above issues in mind, commonly owned co-pending U.S. patent application Ser. No. 12/004,476 (the entire contents of which are hereby incorporated by reference) describes a mechanism for handling errors occurring within a cache of a data processing apparatus, which can yield improved performance relative to the earlier-mentioned “in-line” correction mechanisms, and which can be used not only in association with write through caches but also write back caches. In accordance with the technique described therein, a cache location avoid storage having at least one record is provided within the data processing apparatus, with the cache location avoid storage being populated during normal use of the data processing apparatus. If an error condition is detected when accessing a cache line of the cache, then a record in the cache location avoid storage is allocated to store the cache line identifier for that cache line in which the error condition was detected. Further, a clean and invalidate operation is performed in respect of that cache line and the access is then re-performed. When performing lookup operations in the cache, the cache access circuitry excludes from that lookup procedure any cache line identified in the cache location avoid storage.
Through use of the technique described in U.S. application Ser. No. 12/004,476, it can be ensured that errors occurring in the cache storage do not cause incorrect operation when accesses are performed in respect of the cache storage, whilst allowing the advantages of an invalidate and retry/correct and retry mechanism to be retained, such as the fact that the error detection mechanism can be provided on a separate path to the normal data retrieval path (providing both power and timing benefits). However, entries are made in the cache location avoid storage irrespective of whether the errors causing those entries to be made are soft errors or hard errors, and whilst such an approach ensures that if the error detected was in fact a hard error it cannot cause operability problems in the operation of the cache storage, it also results in cache lines being excluded unnecessarily, if the error in that cache line was in fact a soft error.
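The general shape of a cache location avoid storage can be sketched as follows. This is a simplified software model under stated assumptions and is not the implementation of the referenced application: the `AvoidStorageCache` class, the use of a (set, way) pair as the cache line identifier, the replacement of the oldest record when the avoid storage is full, the exclusion of avoided lines from allocation as well as lookup, and the `corrected_data` argument standing in for the output of error correction logic are all hypothetical.

```python
class AvoidStorageCache:
    """Set associative cache sketch whose lookups skip lines listed in `avoid`."""

    def __init__(self, num_sets, num_ways, memory, avoid_capacity=2):
        self.num_sets, self.num_ways = num_sets, num_ways
        self.memory = memory
        self.avoid_capacity = avoid_capacity
        self.avoid = []  # records holding (set, way) cache line identifiers
        # Each line is None (invalid) or a dict with tag, data and dirty fields.
        self.lines = [[None] * num_ways for _ in range(num_sets)]

    def _usable_ways(self, s):
        return [w for w in range(self.num_ways) if (s, w) not in self.avoid]

    def lookup(self, address):
        """Return (set, way) on a hit, or None. Avoided lines are never examined."""
        s, tag = address % self.num_sets, address // self.num_sets
        for w in self._usable_ways(s):
            line = self.lines[s][w]
            if line is not None and line["tag"] == tag:
                return s, w
        return None

    def fill(self, address):
        """Allocate a line for `address`, skipping avoided lines (an assumption)."""
        s, tag = address % self.num_sets, address // self.num_sets
        ways = self._usable_ways(s)
        if not ways:
            return None  # every way in the set is avoided: bypass the cache
        free = [w for w in ways if self.lines[s][w] is None]
        w = free[0] if free else ways[0]
        victim = self.lines[s][w]
        if victim is not None and victim["dirty"]:
            self.memory[victim["tag"] * self.num_sets + s] = victim["data"]
        self.lines[s][w] = {"tag": tag, "data": self.memory[address], "dirty": False}
        return s, w

    def handle_error(self, s, way, corrected_data):
        """Record the line in the avoid storage, then clean and invalidate it."""
        line = self.lines[s][way]
        if line is not None and line["dirty"]:
            self.memory[line["tag"] * self.num_sets + s] = corrected_data  # clean
        self.lines[s][way] = None                                          # invalidate
        if len(self.avoid) >= self.avoid_capacity:
            self.avoid.pop(0)  # assumption: the oldest record is replaced
        self.avoid.append((s, way))
```

Note that `handle_error` records the line whether the underlying error was soft or hard, so in this sketch a line hit by a single transient upset stays out of service until its record is replaced.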
U.S. Pat. No. 4,506,362 describes a systematic data memory error detection and correction apparatus that periodically reads data from each addressable memory location, determines the presence or absence of an error in the addressed data memory location and, if an error is detected, corrects the error and writes the corrected data back into the addressed memory location. The apparatus may include circuitry for logging those areas of the data memory where errors have been detected, such logging showing either the address location where an error is detected or alternatively indicating the repetitiveness of an error at any particular addressed memory location. Such data logging can facilitate the determination of hard errors rather than soft errors. In one example use case, it is indicated that upon detection of a hard error, it will be possible to re-map the address space at the chip level, with an entire chip select signal being switched to another chip location where some redundant storage is provided.
However, performing error detection and correction on a periodic basis can have a significant impact on the performance of the memory, for example in situations where errors are only occurring relatively infrequently and hence much of the detection and correction processing produces no useful result. Further, no error containment mechanism is provided based on the logging information, other than discussing the possibility of switching to a completely separate chip location in the event that a hard error is deduced from the logging information. However, such an approach would require the provision of a significant amount of redundant storage, which would be unduly expensive for many implementations. Furthermore, the approach described cannot guarantee that an error will not be present at the time any particular data is used by associated processing circuitry (for example because an error occurs between two successive runs of the periodic error detection and correction process, and in that interim period the data is accessed by the processing circuitry). In such a situation, there will be an error in the data as accessed by the processing circuitry, and that error will not be detected at the time of use of that data by the processing circuitry.
The article “Discriminating Fault Rate and Persistency to Improve Fault Treatment” by A Bondavalli et al, Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97), describes a mechanism designed to discriminate intermittent and permanent faults against low rate, low persistency transient faults, with the aim of improving fault treatment, and so the overall system performance. The mechanism employs a count and threshold approach, coupled with a simple decay algorithm to take into account the timing traits of intermittent faults. The system described includes a number of redundant components, and the aim of the described mechanism is to notify all components affected by permanent or intermittent faults (referred to as faulty units) as quickly as possible, whilst avoiding notifying any units other than faulty units (such units being referred to as healthy units). Healthy units are those that are only affected by temporary transient faults.
Hence, in accordance with the mechanism described in the above paper, a count is incremented each time a permanent or intermittent fault is detected for a component, and when that count reaches a threshold value, the corresponding unit is identified as a faulty unit. However, the unit will need to be reutilised several times in order for the count in respect of that unit to reach the threshold value, and it may hence take a significant period of time before any particular unit is identified as a faulty unit. The probability of re-triggering the fault is very dependent on the workload of the unit and cannot be enforced by the system. Furthermore, the technique would not appear to be applicable to cache type structures where the occurrence of an error does not result from the use of the cache per se, but depends on which parts of the cache are accessed.
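A count and threshold scheme with decay can be sketched as follows. The `FaultCounter` class, the multiplicative form of the decay, and the particular `threshold` and `decay` values are assumptions made for illustration, since only a "simple decay algorithm" is referred to above. The sketch is not presented as the algorithm of the cited article.

```python
class FaultCounter:
    """Per-unit score that rises on each observed error and decays otherwise."""

    def __init__(self, threshold=3.0, decay=0.5):
        self.threshold = threshold
        self.decay = decay  # assumed multiplicative decay, 0 <= decay <= 1
        self.score = 0.0

    def record(self, error_observed):
        """Update the score for one use of the unit.

        Returns True once the unit should be treated as faulty.
        """
        if error_observed:
            self.score += 1.0
        else:
            self.score *= self.decay
        return self.score >= self.threshold
```

With these example values three consecutive errors flag the unit, whereas isolated errors separated by error-free uses decay away before the threshold is reached. The sketch also shows the drawback noted above: the score only moves when the unit is actually used, so the time taken to flag a faulty unit depends on the workload.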
It would be desirable to provide an improved technique for identifying hard errors in the various cache records of a cache storage.