The present invention relates generally to error detection, and more specifically, to dynamic cache row fail accumulation due to catastrophic failure.
A cache memory is a component that transparently retains data elements (or simply data) so that future requests for any retained data can be served faster. A data element that is stored within a cache is associated with a pre-defined storage location within a computer system. Such a data element might be a value that has recently been computed or a duplicate copy of data that is also stored elsewhere. If requested data is contained in the cache, this is a cache hit, and the request can be served by reading the cache, which is comparatively faster than reading from the storage location since the cache is usually built close to its requester. Otherwise, if the data is not contained in the cache, this is a cache miss, and the data has to be fetched from a storage medium that is not necessarily close to the requester, which is comparatively slower.
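The hit and miss paths described above can be illustrated with a minimal sketch. The class name `SimpleCache`, the dictionary-based backing store, and the hit/miss counters are assumptions made for illustration only; they do not represent the cache organization of any particular system.

```python
class SimpleCache:
    """Illustrative cache in front of a slower backing store (hypothetical)."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # maps storage location -> data
        self.lines = {}                     # data currently retained in the cache
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # Cache hit: the data is retained, so serve it directly from the cache.
        if address in self.lines:
            self.hits += 1
            return self.lines[address]
        # Cache miss: fetch from the slower storage, then retain a copy.
        self.misses += 1
        value = self.backing_store[address]
        self.lines[address] = value
        return value


cache = SimpleCache({0x10: 42})
first = cache.read(0x10)   # miss: fetched from the backing store
second = cache.read(0x10)  # hit: served from the cache
```

In this sketch the first read of a location is a miss and every later read of the same location is a hit, since nothing is ever evicted.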
In a cache, electrical or magnetic interference inside a computer system can cause a single bit of embedded dynamic random access memory (eDRAM) to spontaneously flip to the opposite state. This can change the content of one or more memory cells or interfere with the circuitry used to read/write them. Also, the circuitry of the cache may fail, and this can change the content of one or more memory cells.
To ensure the integrity of data stored in a data processing system and transmitted between various parts of the system, various error detection and correction schemes have been employed. An error can be a correctable error (CE) or an uncorrectable error (UE). Schemes such as the Hamming code can allow for single error correction and double error detection. Typically, before a data word is stored in memory, check bits are generated over the data bits and stored with the data word. When the data word is retrieved from memory, a check is made over the data and the check bits to detect and, if necessary, to correct identifiable bit errors. In checking the data word and check bits received from memory, a syndrome is generated for each parity group of a multiple byte data word. A matrix, referred to as an H-matrix, may be generated which defines all of the syndromes for which a single error is correctable and which identifies each bit position of the data word which is correctable. When a syndrome is generated which matches the data in one of the columns of the matrix, the bit to be corrected is identified from the matrix and the polarity of the identified bit is changed to correct the data error. Additional tests need to be made to determine whether there are uncorrectable errors. When dealing with 64-bit data words, the H-matrix has 64 columns, plus columns for check bits. The number of syndromes which may be generated and which do not fall within the matrix is considerably larger than the number of correctable-error syndromes included in the matrix. A typical error correction scheme using 8-bit syndromes for 64 bits of data, and requiring single error correction and double error detection, will have 256 possible syndromes and 72 syndromes associated with correctable errors. The detection of the presence of a correctable error and the presence of uncorrectable errors requires large amounts of detection circuitry.
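As a hedged illustration of single error correction and double error detection, the following sketch uses a textbook extended Hamming code: 7 positional check bits plus one overall parity bit protecting 64 data bits, for 72 stored bits in total. This is an assumed construction chosen for clarity and is not necessarily the H-matrix of any particular design. The function names `encode` and `decode` and the bit-list representation are likewise hypothetical.

```python
def encode(data_bits):
    """Return a code word: index 0 is overall parity, 1..n is a Hamming code."""
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:      # number of positional check bits needed
        r += 1
    n = k + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: a data position
            code[pos] = next(it)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):               # set check bits so the syndrome becomes 0
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code[1:]) % 2      # overall parity enables double detection
    return code


def decode(code):
    """Return (status, data_bits); status is 'ok', 'CE', or 'UE'."""
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    word = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        status = "CE"                # odd overall parity: treat as a single-bit error
        word[syndrome] ^= 1          # syndrome 0 means the overall parity bit
    else:
        return "UE", None            # even parity with nonzero syndrome, or bad position
    data = [word[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data


data = [(i * 5 + i // 3) % 2 for i in range(64)]
stored = encode(data)

single = list(stored)
single[10] ^= 1                      # one flipped bit: correctable
double = list(stored)
double[10] ^= 1
double[20] ^= 1                      # two flipped bits: detected, not corrected
```

With 72 stored bits there are 72 distinct single-bit error patterns, each of which this sketch corrects. Any double-bit error leaves the overall parity even while producing a nonzero syndrome, so it is reported as uncorrectable.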