1. Field of the Invention
The present invention relates to techniques for handling errors in a data processing apparatus having a cache storage and a replicated address storage.
2. Description of the Prior Art
There are many applications for data processing systems where fault tolerance is an important issue. One such application is in safety critical systems, for example automotive systems that control air bags, braking systems, etc. One particular area of fault tolerance is tolerance to errors that can occur in the data stored within the data processing system. A typical data processing apparatus may include one or more storage devices used to store data values used by the data processing apparatus. As used herein, the term “data value” will be used to refer to both instructions executed by a processing device of the data processing apparatus, and the data created and used during execution of those instructions.
The storage devices within the data processing apparatus are vulnerable to errors. These errors may be soft errors, as for example may be caused by neutron strikes, where the state of data held in the storage device can be changed, but the storage device will still write and read data correctly. Such soft errors are also referred to as transient faults. Alternatively, the errors may be hard errors, as for example caused by electro-migration, in which the affected memory location(s) within the storage device will always store an incorrect data value, and the error cannot be corrected by re-writing the data value to the storage device location(s). Such hard errors are also referred to as permanent faults. Both soft errors and hard errors can often be corrected using known error correction techniques, so that the correct data value can be provided to the requesting device, for example a processor core. However, for the example of a hard error, if the corrected data value is then written back to the same memory location, it will again be stored incorrectly at that memory location, since the hard error stems from a fault in the storage device itself.
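By way of illustration only, the difference between a soft error and a hard error can be sketched with a simple model of one storage location. The class name `FaultyCell`, the stuck-at masks and the `upset` method are hypothetical names introduced for this sketch and do not appear in the original text; a stuck-at bit is only one possible model of a hard error.

```python
class FaultyCell:
    """One stored word with optional stuck-at bits (an assumed hard error model)."""

    def __init__(self, stuck_mask=0, stuck_value=0):
        self.stuck_mask = stuck_mask    # bits that a write cannot change
        self.stuck_value = stuck_value  # the values those bits are stuck at
        self._bits = 0
        self.write(0)

    def write(self, value):
        # Stuck bits ignore the written value; all other bits store it.
        self._bits = (value & ~self.stuck_mask) | (self.stuck_value & self.stuck_mask)

    def read(self):
        return self._bits

    def upset(self, bit):
        """Model a soft error: flip one stored bit (stuck bits stay stuck)."""
        self._bits ^= (1 << bit) & ~self.stuck_mask


# Soft error: the stored state changes, but the cell still writes correctly.
soft = FaultyCell()
soft.write(0b1010)
soft.upset(0)
soft_before = soft.read()   # corrupted value
soft.write(0b1010)          # re-writing repairs a soft error
soft_after = soft.read()

# Hard error: bit 1 is stuck at 0, so re-writing cannot repair the location.
hard = FaultyCell(stuck_mask=0b0010, stuck_value=0b0000)
hard.write(0b1010)
hard_after = hard.read()
```

In this model, re-writing the correct value restores the location after a soft error, whereas the location with the stuck bit returns an incorrect value however often it is re-written.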
As well as permanent faults and transient faults, another type of error which can occur is an intermittent fault, such a fault for example being caused by certain environmental conditions in which the storage device operates. Whilst those fault triggering environmental conditions are present, the intermittent fault appears as a hard error, but the fault disappears when the environmental conditions change to be more favourable.
As process geometries shrink, and the storage devices accordingly become smaller, those storage devices become increasingly vulnerable to errors, and hence it is becoming increasingly important in fault tolerant systems to provide robust techniques for detecting such errors. For example, the articles “Impact of Deep Submicron Technology on Dependability of VLSI Circuits” by C Constantinescu, 0-7695-1597-5/02 $17.00 (C) 2002 IEEE, and “Reliability Challenges for 45 nm and Beyond” by J W McPherson, DAC 2006, Jul. 24-28, 2006, San Francisco, Calif., USA, identify that reduced process geometries give rise to higher occurrences of faults, especially transient and intermittent faults.
Often, hard error faults occur due to manufacturing defects. Accordingly, it is known to perform certain hard error detection techniques at production time in order to seek to identify such hard errors. Further, to assist in the detection and handling of errors occurring post production, it is known to store error correction code (ECC) data or the like (generally referred to as error data herein) in association with the data values, for reference when seeking to detect any errors in those stored data values.
One known error correction technique which makes use of such error data applies an error correction operation to data values when they are read out from the storage device, and before the data values are supplied to the requesting device. If an error is detected, the process aims to correct the data value using the associated error data and then supplies the corrected data to the requesting device. However, typically the corrected data is not written back to the storage device itself, nor is any attempt made to determine whether the error was a soft error or a hard error.
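By way of illustration only, the in-line technique can be sketched using a Hamming(7,4) code as the error data. The choice of code and the function names `hamming74_encode` and `hamming74_decode` are assumptions made for this sketch; the original text does not specify a particular error correction code.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as 7 bits; list index 0 holds codeword position 1."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # positions 1..7 are used
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits):
    """Return (corrected nibble, syndrome) without modifying `bits`."""
    code = [0] + list(bits)             # work on a copy, as on a read path
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # non-zero syndrome = position in error
    if syndrome:
        code[syndrome] ^= 1             # correct a single-bit error
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


stored = hamming74_encode(0b1011)
stored[4] ^= 1                          # single-bit error at codeword position 5
value, syndrome = hamming74_decode(stored)   # requester receives corrected data
still_corrupt = stored != hamming74_encode(0b1011)
```

The requesting device receives the corrected value, but the stored codeword is left in its erroneous state, which is the behaviour described above for the in-line approach.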
Whilst such an “in-line” correction technique can handle both hard and soft errors provided they are correctable (i.e. provided sufficient redundant information is available to be able to calculate what the true data value is), it suffers from a number of disadvantages. Firstly, additional logic is required on the read path, and this can adversely affect the timing of the read operation, and also adversely affects power consumption. Such an approach may also require control logic to stall the device performing the read operation (for example a processor pipeline). Additionally, because the data in the storage device is not corrected, there is a possibility that further errors could occur, and that the accumulating errors may change over time from being correctable to uncorrectable, or even undetectable. To seek to address this issue, some data processing systems provide an error “scrubber” mechanism that is used to periodically test and correct the data stored in the storage device. However, this mechanism requires time, and consumes energy.
As an alternative to the in-line mechanism described above, one could attempt to detect and correct the data value when it is read, to store the corrected data value back to the storage device, and then to retry the read operation (referred to herein as a correct and retry mechanism). In the case of a soft error, this has the effect of correcting the data in the storage device, and hence when the read operation is retried, the correct data is read. However, if the error is a hard error, then the error will re-occur when the read is retried, and the operation will hence enter a loop where the data value is corrected, but continues to be wrong when re-read from the storage device. In this situation there is the potential for the system to “spin-lock”, trapped in a loop of accessing, attempting correction and retrying, unless mechanisms are in place to spot such behaviour and break out of the loop.
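By way of illustration only, the following sketch shows a correct and retry read together with one possible way of breaking out of the loop, namely a bound on the number of retries. The names `Location` and `read_with_correct_and_retry`, the retry bound, and the use of a fault-free reference copy as a stand-in for the error data are all assumptions made for this sketch.

```python
class Location:
    """A storage location whose bits in `stuck_mask` are stuck at 0 (assumed model)."""

    def __init__(self, stuck_mask=0):
        self.stuck_mask = stuck_mask
        self.data = 0
        self.check = 0   # stand-in for ECC: an assumed fault-free reference copy

    def write(self, value):
        self.data = value & ~self.stuck_mask
        self.check = value


def read_with_correct_and_retry(loc, max_retries=3):
    """Return (value, retries used); raise if the error persists."""
    for retries in range(max_retries + 1):
        if loc.data == loc.check:       # no error detected on this read
            return loc.data, retries
        loc.write(loc.check)            # correct in place, then retry the read
    raise RuntimeError("error persists after retries: likely a hard error")


soft = Location()
soft.write(0x5A)
soft.data ^= 0x01                       # soft error: one stored bit flips
soft_result = read_with_correct_and_retry(soft)

hard = Location(stuck_mask=0x02)        # hard error: bit 1 is stuck at 0
hard.write(0x5A)
try:
    read_with_correct_and_retry(hard)
    hard_outcome = "read completed"
except RuntimeError:
    hard_outcome = "gave up"
```

Without the retry bound, the read of the location with the stuck bit would never complete, which corresponds to the spin-lock described above.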
Whilst the above issues are generally applicable to any type of storage device provided within the data processing apparatus, further specific issues can arise if the storage device in question is a cache. One or more caches are often provided within a data processing apparatus to temporarily store data values required by a processing unit of the data processing apparatus so as to allow quick access to any such cached data values. As is known in the art, the cache will typically consist of a plurality of cache lines, and for each cache line storing valid data, an address identifier is provided within the cache identifying an address portion which is shared with all of the data values in that cache line. When an access request is issued specifying a memory address associated with a cacheable region of memory, a lookup procedure will be performed in the cache to seek to identify whether a portion of the memory address specified in the access request matches an address identifier in the cache, and if it does the access may proceed directly in the cache without the need to access the memory.
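By way of illustration only, the lookup procedure can be sketched for a direct-mapped cache. The line size, the number of lines and the names `split_address`, `Line` and `lookup` are assumptions made for this sketch; the original text does not fix any particular cache organisation.

```python
LINE_BYTES = 32    # assumed cache line size in bytes
NUM_LINES = 128    # assumed number of cache lines


def split_address(addr):
    """Split a memory address into (tag, line index, byte offset)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    return tag, index, offset


class Line:
    def __init__(self):
        self.valid = False
        self.tag = None                  # address identifier shared by the whole line
        self.data = bytearray(LINE_BYTES)


def lookup(lines, addr):
    """Return the cached byte on a hit, or None on a miss."""
    tag, index, offset = split_address(addr)
    line = lines[index]
    if line.valid and line.tag == tag:
        return line.data[offset]
    return None


lines = [Line() for _ in range(NUM_LINES)]
tag, index, offset = split_address(0x12345)
lines[index].valid = True
lines[index].tag = tag
lines[index].data[offset] = 0x7F

hit = lookup(lines, 0x12345)             # tag matches: access proceeds in the cache
miss = lookup(lines, 0x12345 + LINE_BYTES * NUM_LINES)   # same index, different tag
```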
If a write through (WT) mode of operation is used for the cache lines, then any write updates made to the cache line contents will be replicated in memory so as to maintain consistency between the cache contents and the memory contents. However, if a write back (WB) mode of operation is employed, then any updates made to the contents of a cache line are not immediately replicated in the corresponding locations in memory. Instead, the relevant data in memory is brought up to date with the contents of the cache line only when that cache line is later evicted (the need to do this is typically indicated by a dirty bit value, which is set if the cache line contents are written to whilst stored in the cache).
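By way of illustration only, the two modes of operation can be contrasted in a short sketch. The names `CacheLine`, `write_line` and `evict_line`, and the use of a dictionary to stand for memory, are assumptions made for this sketch.

```python
class CacheLine:
    def __init__(self, addr, value, write_back):
        self.addr = addr
        self.value = value
        self.write_back = write_back
        self.valid = True
        self.dirty = False


def write_line(line, value, memory):
    line.value = value
    if line.write_back:
        line.dirty = True                # memory is updated later, on eviction
    else:
        memory[line.addr] = value        # write through: memory updated immediately


def evict_line(line, memory):
    if line.dirty:
        memory[line.addr] = line.value   # bring memory up to date
        line.dirty = False
    line.valid = False


memory = {0x100: 1, 0x200: 1}
wt = CacheLine(0x100, 1, write_back=False)
wb = CacheLine(0x200, 1, write_back=True)

write_line(wt, 2, memory)
write_line(wb, 2, memory)
after_write = dict(memory)               # only the write through address is updated

evict_line(wb, memory)
after_evict = dict(memory)               # eviction updates the write back address
```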
As with other storage devices, error correction code data can be stored in association with the cache contents so that hard or soft errors occurring in the cache can be detected.
If the cache can be arranged as a write through cache, then there are two possible approaches that can be taken on detection of an error in a particular cache line. In accordance with a first technique (which will be referred to herein as an “assume miss and invalidate” approach), the access can simply be considered to have missed in the cache. The data will then be retrieved from a lower level in the memory hierarchy. At the same time, in order to prevent errors accumulating in the cache, the cache line is invalidated. The data retrieved may typically be streamed into the device requesting the data, for example the processor core, but often will be reallocated into the cache. If the original error occurred as the result of a hard error, and the refetched data from memory is allocated into the same cache line, then the next time the data is accessed in the cache, the same error is likely to be detected again. This will potentially cause significant performance degradation.
In accordance with a second, alternative, technique for a write through cache (referred to as an “invalidate and retry” mechanism), on detection of an error in a particular cache line, that cache line can merely be invalidated and the access retried without the need to seek to perform any correction on the data held in the cache line. When the access is retried, a miss will occur in the cache, and the data will be retrieved from a lower level in the memory hierarchy. As with the first technique, this retrieved data may typically be streamed into the device requesting the data, for example the processor core, but often will be reallocated into the cache, so that a cache hit will occur on the next access. If the original error occurred as the result of a hard error then, when the access is retried, the same error is likely to be detected again. The processor will get stuck in a spinlock, continually retrying the access and detecting the error.
The problems become even more complex if the cache is at least partially a write back cache, since if an error is detected in a cache line using such a write back mechanism, then it is not sufficient merely to invalidate the cache line, but instead the cache line contents must first be corrected and then evicted to memory. Accordingly the “assume miss and invalidate” approach that can be applied to a write through cache cannot be used for a write back cache, because the cache line with the error in it may be valid and dirty, and hence if the first technique were used the dirty data would be lost. The “invalidate and retry” approach can be used, but as part of the invalidate operation the cache line will need to be corrected (i.e. a correct and retry style operation is needed). This applies not only to the data values in the cache line itself, but also to the associated address identifier, and associated control data such as the valid bit indicating if the cache line is valid and the dirty bit indicating if the cache line is dirty, since all of these contents may potentially be subject to errors. Hence, by way of example, if the valid bit is itself corrupted by an error, the cache line that holds valid data may appear from the associated valid bit to not hold valid data. Accordingly, when adopting a write back mode of operation in a cache, it may be necessary to perform error detection and correction even on cache lines that at face value appear to be invalid.
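By way of illustration only, the following sketch shows why a dirty write back line must be corrected and cleaned to memory before it is invalidated. The function name `handle_line_error` and the dictionary layout are hypothetical, and the sketch assumes that the corrected data value has already been obtained from the error data and that the valid and dirty bits have themselves already been checked and corrected.

```python
def handle_line_error(line, memory, corrected_value):
    """Handle an error detected in `line`; return True if data was written back."""
    wrote_back = False
    if line["write_back"] and line["dirty"]:
        # The line holds the only up-to-date copy: correct it, then clean it to memory.
        memory[line["addr"]] = corrected_value
        wrote_back = True
    # In either mode the line is then invalidated so that the access can be retried.
    line["valid"] = False
    line["dirty"] = False
    return wrote_back


memory = {0x40: 1, 0x80: 5}              # 0x40 is stale: the cache holds a newer value

dirty_line = {"addr": 0x40, "value": 7 ^ 0b100, "write_back": True,
              "valid": True, "dirty": True}      # true value 7, one bit in error
wb_result = handle_line_error(dirty_line, memory, corrected_value=7)

wt_line = {"addr": 0x80, "value": 5 ^ 0b001, "write_back": False,
           "valid": True, "dirty": False}        # memory already holds the true value
wt_result = handle_line_error(wt_line, memory, corrected_value=5)
```

Had the dirty line simply been invalidated, memory would have retained the stale value 1 at address 0x40 and the updated value would have been lost.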
With the above issues in mind, commonly owned co-pending U.S. patent application Ser. No. 12/004,476 (the entire contents of which are hereby incorporated by reference) describes a mechanism for handling errors occurring within a cache of a data processing apparatus, which can yield improved performance relative to the earlier-mentioned “in-line” correction mechanisms, and which can be used not only in association with write through caches but also write back caches. In accordance with the technique described therein, a cache location avoid storage having at least one record is provided within the data processing apparatus, with the cache location avoid storage being populated during normal use of the data processing apparatus. If an error condition is detected when accessing a cache line of the cache, then a record in the cache location avoid storage is allocated to store the cache line identifier for that cache line in which the error condition was detected. Further, a clean and invalidate operation is performed in respect of that cache line and the access is then re-performed. When performing lookup operations in the cache, the cache access circuitry excludes from that lookup procedure any cache line identified in the cache location avoid storage.
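By way of illustration only, the role of a cache location avoid storage in the lookup procedure can be sketched as follows for one set of a set associative cache. The names `AvoidStorage`, `lookup` and `pick_victim`, the use of a (set, way) pair as the cache line identifier, the oldest-first reuse of records, and the exclusion of avoided lines from allocation are assumptions made for this sketch and are not taken from the referenced application.

```python
class AvoidStorage:
    """Holds identifiers of cache lines in which an error condition was detected."""

    def __init__(self, num_records=1):
        self.num_records = num_records
        self.records = []

    def allocate(self, line_id):
        if line_id in self.records:
            return
        if len(self.records) == self.num_records:
            self.records.pop(0)          # assumed policy: reuse the oldest record
        self.records.append(line_id)

    def __contains__(self, line_id):
        return line_id in self.records


def lookup(ways, tag, set_index, avoid):
    """Return the hitting way, skipping any line held in the avoid storage."""
    for way, line in enumerate(ways):
        if (set_index, way) in avoid:
            continue
        if line["valid"] and line["tag"] == tag:
            return way
    return None


def pick_victim(ways, set_index, avoid):
    """Pick a way for allocation that is not being avoided (assumed behaviour)."""
    for way in range(len(ways)):
        if (set_index, way) not in avoid:
            return way
    return None


ways = [{"valid": True, "tag": 0xAB}, {"valid": False, "tag": None}]
avoid = AvoidStorage(num_records=1)

hit_before = lookup(ways, 0xAB, 3, avoid)    # normal hit in way 0

# Error detected in set 3, way 0: record it, clean and invalidate, then retry.
avoid.allocate((3, 0))
ways[0]["valid"] = False
hit_after = lookup(ways, 0xAB, 3, avoid)     # the retried access now misses
victim = pick_victim(ways, 3, avoid)         # refill goes to a different way
```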
Through use of the technique described in U.S. application Ser. No. 12/004,476, it can be ensured that errors occurring in the cache storage do not cause incorrect operation when accesses are performed in respect of the cache storage, whilst allowing the advantages of an invalidate and retry/correct and retry mechanism to be retained, such as the fact that the error detection mechanism can be provided on a separate path to the normal data retrieval path (providing both power and timing benefits).
Another issue that can arise in systems employing caches is coherency. In particular it is known to provide a multi-processing system in which two or more processing devices, for example processor cores, share access to shared memory. Such systems are typically used to gain higher performance by arranging the different processor cores to execute respective data processing operations in parallel. To further improve speed of access to data within such a multi-processing system, it is known to provide one or more of the processing devices with its own local cache in which to store a subset of the data held in the shared memory. Whilst this can improve speed of access to data, it complicates the issue of data coherency. In particular, it will be appreciated that if a particular processor performs a write operation with regard to a data value held in its local cache, that data value will be updated locally within the cache, but may not necessarily also be updated at the same time in the shared memory. This is for example the case if the data value in question relates to a write back region of memory, in which case the updated data value in the cache will only be stored back to the shared memory when that data value is subsequently evicted from the cache.
Since the data may be shared with other processing devices, it is important to ensure that those processing devices will access the up-to-date data when seeking to access the associated address in shared memory. To ensure that this happens, it is known to employ a cache coherency protocol within the multi-processing system to ensure that if a particular processor updates a data value held in its local cache, that up-to-date data will be made available to any other processing device subsequently requesting access to that data.
In accordance with a typical cache coherency protocol, certain accesses performed by a processing device will require a coherency operation to be performed. The coherency operation will cause a notification to be sent to one or more of the other processing devices that have their own local caches, identifying the type of access taking place and the address being accessed. This will cause those other processing devices to perform certain actions defined by the cache coherency protocol. One such action is the invalidation of a cached data value, indicating that this data value has become out-of-date due to the actions of the other processing devices and should not be used. Such a cache coherency protocol may be administered by the provision of a snoop control unit (SCU) which monitors memory access requests issued by each of the processing devices and causes required actions to be taken by the processing devices.
To assist the SCU in determining which processing devices need be subjected to a coherency operation, it is known to provide a replicated address storage for access by the SCU. A replicated address storage is provided in association with each local cache of a processing device, and each entry of the replicated address storage has a predetermined associated cache record within the local cache and is arranged to replicate the address indication stored in that associated cache record. By reference to the relevant replicated address storage, the SCU can ascertain whether a processing device's local cache may be storing a copy of a data value which is the subject of an access by another processing device, and hence can determine whether a coherency operation needs to be invoked in relation to that local cache.
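By way of illustration only, the use of replicated address storage by an SCU can be sketched as follows. The names `ReplicatedAddressStorage` and `caches_to_snoop`, and the one-entry-per-cache-record layout indexed by line number, are assumptions made for this sketch.

```python
class ReplicatedAddressStorage:
    """Mirror of the address indications held in one local cache (assumed layout)."""

    def __init__(self, num_records):
        self.tags = [None] * num_records   # one entry per associated cache record

    def update(self, index, tag):
        # Kept in step with the local cache whenever a record is allocated.
        self.tags[index] = tag

    def invalidate(self, index):
        self.tags[index] = None

    def may_hold(self, index, tag):
        return self.tags[index] == tag


def caches_to_snoop(replicas, requester, index, tag):
    """Return the processing devices whose local cache needs a coherency operation."""
    return [cpu for cpu, replica in enumerate(replicas)
            if cpu != requester and replica.may_hold(index, tag)]


replicas = [ReplicatedAddressStorage(8) for _ in range(3)]
replicas[1].update(4, 0x9)                 # device 1 caches the line (index 4, tag 0x9)

targets = caches_to_snoop(replicas, requester=0, index=4, tag=0x9)
no_targets = caches_to_snoop(replicas, requester=1, index=4, tag=0x9)
```

Only the devices returned need be notified, so the local caches of the remaining devices are not disturbed by the coherency operation.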
However, considering the earlier discussion of fault tolerance, and the handling of errors, it will be appreciated that in such systems there are now two copies of the address information of the various cache records of each local cache, and errors can occur in either or both of these copies independently. In such systems, it is hence important that an error in one of the two copies is carefully handled so that the location giving rise to the error is not used again (which would expose the system to potential multiple errors), and so that the SCU and a processing unit having a local cache always have the same coherent view of that cache.
U.S. Pat. No. 6,014,756 describes a high availability shared cache memory in a tightly coupled multi-processor system which provides an error self-recovery mechanism for errors in the shared cache. In particular, a shared cache is disclosed which is managed by means of a set associative cache directory. In case of an error in the shared cache, the relevant entry as well as its congruence class is invalidated in the cache directory and the requested data is fetched from an upper memory before being reallocated in the local cache and the shared cache. Hard error support is provided by using periodic accesses to the cache error logging structure to check if errors occur often. In this case, a delete bit is set to prevent the corresponding entry from being used further.
The approach described in U.S. Pat. No. 6,014,756 is arranged for use in an inclusive cache system (where the contents of the level 1 cache are always stored in the level 2 cache), and if the errors are hard errors the approach described will not prevent those hard errors re-occurring until the delete bit mechanism is used.
It would be desirable to provide an efficient mechanism for handling errors occurring in a data processing apparatus having both a cache storage and a replicated address storage.