1. Field of the Disclosure
The present disclosure relates generally to integrated circuits and, more particularly, to memory arrays in integrated circuits.
2. Description of the Related Art
In most static random access memory (SRAM) architectures, all of the SRAM cells corresponding to a selected row are written or read out together. Some specific SRAM implementations may selectively write to and read from a subset of the cells in the selected row. Errors occur when values of bits read from one or more of the SRAM cells do not correspond to the values that were intended to be stored in the SRAM cells. For example, an error occurs when a value of “1” is written to an SRAM cell, but a value of “0” is returned when the SRAM cell is subsequently read. This type of error is referred to as a “stuck-at-0” error if the error persists for an extended period of time. For another example, an error occurs when a value of “0” is written to an SRAM cell, but a value of “1” is returned when the SRAM cell is subsequently read. This type of error is referred to as a “stuck-at-1” error if the error persists for an extended period of time.
The errors may be characterized as soft errors or hard errors. Soft errors are intermittent errors that can be corrected by re-writing the faulty SRAM cell. Hard errors persist even after the faulty SRAM cell has been re-written. Hard errors therefore also are referred to as persistent errors or permanent errors. SRAM arrays are susceptible to hard errors that are produced during manufacturing or arise during the life cycle of the product. Some of the hard errors produced during manufacturing may be detected during a memory built-in self-test (MBIST), but other manufacturing errors, as well as errors that arise during the life cycle of the product, may only be manifested at run time. To illustrate, if an SRAM implemented on a system-on-a-chip (SOC) is operated below its minimum voltage for reliable operation, or if random or environmental conditions affect the state of an SRAM cell, some of the cells in the SRAM may stop functioning correctly and may therefore produce errors even after the faulty SRAM cells have been re-written. Such errors may persist for an extended period of time. Hard errors of this sort can only be detected at run time because they depend on the particular environmental conditions present when the SRAM array is being read or written.
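The distinction between soft and hard errors described above can be illustrated with a minimal sketch. The `Cell` class, its `upset` method, and the `classify` function are hypothetical names introduced only for illustration; the sketch assumes a toy one-bit cell model and is not a description of any particular SRAM implementation.

```python
class Cell:
    """Toy one-bit SRAM cell (hypothetical model for illustration only)."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at  # None for a healthy cell, or 0 / 1 for a permanent fault
        self.value = 0

    def write(self, bit):
        self.value = bit

    def upset(self):
        # Models a transient disturbance that flips the stored bit once.
        self.value ^= 1

    def read(self):
        # A stuck-at cell returns its stuck value regardless of what was written.
        return self.value if self.stuck_at is None else self.stuck_at


def classify(cell, expected):
    """Re-write the expected bit once and re-read to separate soft from hard errors."""
    if cell.read() == expected:
        return "no error"
    cell.write(expected)
    if cell.read() == expected:
        return "soft error"  # corrected by the re-write
    return f"hard error (stuck-at-{cell.read()})"  # persists after the re-write
```

In this sketch, a stuck-at-0 cell that happens to hold an intended value of "0" is reported as "no error", which reflects why some hard errors are only manifested when particular data is written and read back.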
Hard or soft errors can be detected using parity bits or error correction codes (ECCs) that are stored when information such as a word (e.g., four bytes of data) or a group of bits is written to the SRAM array. For example, a parity bit may be stored along with the data bits in the SRAM array or stored outside the array in association with the SRAM array. When the word is subsequently read from the SRAM array, a parity value is computed from the read-out data bits and compared to the stored parity bit. The same type of parity (either odd or even parity) is used both to generate the stored parity bit and to compute the parity value from the read-out data bits. An error is detected when the stored parity value read out from the array does not match the parity value computed from the data bits read out of the SRAM cells. Other techniques for detecting errors in the SRAM array include scrubbing and duplicating the SRAM array so that the duplicate can be compared to the original SRAM array on each access.
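The parity comparison described above can be sketched as follows. The function names `parity_bit`, `store_word`, and `error_detected` are hypothetical, and the sketch assumes a single parity bit per 32-bit word; it is one possible illustration rather than a definitive implementation.

```python
WORD_MASK = 0xFFFFFFFF  # assumes a four-byte word, as in the example above


def parity_bit(word, even=True):
    """Return the parity bit for a word under even or odd parity."""
    ones = bin(word & WORD_MASK).count("1")
    bit = ones & 1  # for even parity, the bit makes the total count of ones even
    return bit if even else bit ^ 1


def store_word(word, even=True):
    """Return the (data, parity) pair that would be stored on a write."""
    return word & WORD_MASK, parity_bit(word, even)


def error_detected(word_read, stored_parity, even=True):
    """Recompute parity from the read-out data bits and compare to the stored bit."""
    return parity_bit(word_read, even) != stored_parity
```

For example, a word of `0b1011` has three ones, so its even parity bit is 1. If one data bit flips before the word is read back, the recomputed parity no longer matches and the error is detected. A single parity bit does not detect an even number of flipped bits in the same word, and it identifies neither the faulty bit nor whether the error is soft or hard.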
Data may be re-written to the faulty SRAM cells in response to detecting an error, which may correct soft errors. However, hard errors cannot be corrected by re-writing the faulty SRAM cells. Instead, conventional techniques for detecting and correcting soft errors in SRAM cells may cause the processing device to continuously re-write the faulty SRAM cells without ever correcting the hard error and may even deleteriously affect the functionality of the processing device. To recover from a hard error, the processing device using the SRAM must be flushed and restarted after the faulty SRAM cells have been replaced. For example, if an MBIST detects a hard error in an SRAM, the row or column that includes the faulty SRAM cell may be replaced using redundant rows or columns. For another example, sub-blocks of SRAM cells that include the faulty SRAM cells may be replaced by mapping the indices of the faulty sub-blocks to spare sub-blocks in the SRAM.
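The replacement of faulty sub-blocks by index mapping, as described above, can be sketched as follows. The `RemappedSram` class and its methods are hypothetical and assume a simple model in which spare sub-blocks are appended after the regular sub-blocks; actual designs may organize spares differently.

```python
class RemappedSram:
    """Toy SRAM with spare sub-blocks (hypothetical model for illustration only)."""

    def __init__(self, num_blocks, num_spares):
        # Regular sub-blocks occupy indices [0, num_blocks); spares follow them.
        self.blocks = [0] * (num_blocks + num_spares)
        self.free_spares = list(range(num_blocks, num_blocks + num_spares))
        self.remap = {}  # faulty sub-block index -> spare sub-block index

    def _physical(self, index):
        # Accesses to a faulty sub-block are redirected to its spare.
        return self.remap.get(index, index)

    def mark_faulty(self, index):
        if not self.free_spares:
            raise RuntimeError("no spare sub-blocks left")
        self.remap[index] = self.free_spares.pop(0)

    def write(self, index, data):
        self.blocks[self._physical(index)] = data

    def read(self, index):
        return self.blocks[self._physical(index)]
```

The spare entries in this sketch hold no ordinary data until a fault is mapped to them, which illustrates the capacity cost of setting aside portions of the SRAM for replacement.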
Conventional approaches to error detection and correction in SRAM arrays have a number of drawbacks, particularly when implemented in high-performance computing systems that may need to run continuously for long periods of time without interruption. For example, scientific computing projects such as DNA sequencing or climate studies may require continuously operating a processing device for months at a time or even longer. Flushing the state of the processing device to physically replace or repair the SRAM in response to detecting a hard error may cause a significant amount of work to be lost, potentially costing the user a significant amount of time and money. One alternative is to add redundant rows or columns to correct hard errors, but this approach may consume a large amount of area on the processing device. Another alternative is to replace a faulty sub-block by mapping it to another sub-block in the SRAM. However, setting aside portions of the SRAM to replace faulty sub-blocks may degrade the performance of the SRAM, e.g., by reducing the amount of memory available in the SRAM.
Furthermore, conventional approaches do not distinguish between activated and de-activated errors. An activated error is an error that can have a functional, power, or performance impact on the processing device. If an activated error is not detected, it can cause functional damage to the processes being performed by the processing device. A de-activated error is an error that may not have a significant functional, power, or performance impact on the processing device. For example, errors in predictor structures may be classified as de-activated errors because undetected single-bit errors may decrease the accuracy of the prediction but are unlikely to cause a functional impact on the corresponding process. Conventional replacement techniques assume that all detected errors are permanent activated errors and consequently do not allow for reuse or de-allocation of the resources reserved for replacement of faulty portions of the SRAM. For example, a faulty row or column is typically replaced with a redundant row or column by blowing an appropriate set of fuses so that the replacement cannot be undone.