Flash memories contain one or more arrays of memory cells, which are organized into blocks. Each block is subdivided into pages. Data is written to flash memory in pages, i.e., the page is the smallest unit of memory that may be written at one time. Because the performance of flash memory degrades over time with repeated use, which may result in failure of a memory block and loss of the data stored there, it is desirable to identify blocks that are susceptible to future failure before a failure and consequent loss of data occurs. One mode of failure of flash memory cells is the failure of a cell to be fully programmed during a write, i.e., the floating gate of the cell does not hold the correct amount of charge for the cell to be recognized as having a valid programmed value. This failure manifests itself as a cell having a programmed voltage that is outside the expected range, as illustrated in FIGS. 1 and 2.
FIG. 1 is a graph comparing the programmed values of good cells versus the programmed values of failing cells. The top graph is a histogram showing the measured VT values of a group of good cells (e.g., a page) that are programmed to be either a logic L, which is centered around voltage VL, or a logic H, which is centered around voltage VH. Curves 100 and 102 show the distribution of measured values for VL and VH, respectively. The middle graph is the same histogram, but showing the distribution of measured values for a group of failing cells A. In the middle graph, the failing cells' programmed values for VH, shown as curve 102′, are significantly shifted from the expected distribution of values shown by curve 102. As a result, some percentage of bits, indicated by hashed area 104, will fail to be programmed with a high enough voltage to be recognized as a logic H, e.g., these cells could not be programmed to a voltage higher than VT. The bottom graph shows a different kind of VT distribution change. In the bottom graph, curve 102″ shows that the distribution of programmed values for VH spans a wider range. This may indicate that not all cells in the group are aging in the same manner or at the same rate, e.g., some cells are still able to be programmed to a full VH value, while other cells are failing rapidly.
FIG. 2 is a graph comparing the programmed values of a group of good cells versus the programmed values of a group of failing cells, but for multi-level programmed cells. Unlike single-level cells, which have only two target program values, VH and VL, the multi-level cells shown in FIG. 2 may be programmed to one of four distinct program values, shown as V00, V01, V10, and V11. Curves 200, 202, 204, and 206 show expected distributions of programmed voltages for a group of normal cells. In this example, measurement thresholds VT0 through VT3 are used to check for a valid programmed value for levels V00 through V11, respectively. Curves 202′, 204′, and 206′ illustrate example distributions of actual programmed values for a group of failing multi-level cells. As the actual programmed values shift, more and more bits register as program failures, as indicated by the hashed portions of the curves in FIG. 2. For example, curve 206′ shows an example distribution of actual programmed values for a group of cells that failed to be programmed to the intended voltage value of V11 and instead were programmed to a lower value. The hashed portion of curve 206′ shows that some of the cells' programmed voltages were less than the threshold value VT3. These cells will be read as having been programmed to V10 instead of the intended value of V11, i.e., a bit write failure.
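The verify-and-read scheme described above can be sketched as follows. The threshold voltages, level names, and function names are illustrative assumptions, not values or interfaces from any actual device.

```python
# Illustrative sketch of multi-level-cell program verify and read using
# fixed thresholds VT0..VT3 as described above. All voltage values (in
# volts) and all names here are hypothetical assumptions.

THRESHOLDS = [0.5, 1.5, 2.5, 3.5]       # VT0, VT1, VT2, VT3 (assumed)
LEVELS = ["V00", "V01", "V10", "V11"]   # four distinct program values

def verify_cell(target_level: int, measured_v: float) -> bool:
    """Program verify: a cell targeted to level k is valid only if its
    measured voltage reaches the corresponding threshold VTk."""
    return measured_v >= THRESHOLDS[target_level]

def read_level(measured_v: float) -> str:
    """Read: report the highest level whose threshold the voltage
    clears, so an under-programmed V11 cell (below VT3) reads as V10."""
    level = 0
    for k, vt in enumerate(THRESHOLDS):
        if measured_v >= vt:
            level = k
    return LEVELS[level]
```

Under these assumed thresholds, a cell intended for V11 whose voltage only reached 3.2 V fails verify against VT3 and reads back as V10, matching the bit write failure described above.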
One approach to dealing with failures of flash cells over time has been to perform erratic programming detection (EPD), in which, after programming (i.e., writing data to) a page of flash memory, data is read from the just-written page and compared bit by bit (or cell by cell, in the case of multi-level cells, where each cell stores more than one bit of data) to the data that is still present in the memory's write buffer. If the data read from the page matches the data that was written to the page, the EPD test returns a pass. For simplicity, the terms “failing bit” and “failing cell” will be used interchangeably, out of recognition that, even for multi-level cells, a programming failure will result in an incorrect bit value, e.g., a failing cell programmed to V11 may return V10, a failed bit, or V00, two failed bits.
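The bit-by-bit comparison at the heart of EPD can be sketched as follows. The function name and the bytes-based page model are assumptions for illustration, not part of any real NAND interface.

```python
# Hedged sketch of the EPD comparison: after programming, the page is
# read back and compared bit by bit against the data still held in the
# write buffer; mismatched bits are counted as failing bits.

def count_failing_bits(write_buffer: bytes, read_back: bytes) -> int:
    """XOR corresponding bytes and count the set bits, i.e., the
    number of bits that failed to program correctly."""
    if len(write_buffer) != len(read_back):
        raise ValueError("page size mismatch")
    return sum(bin(w ^ r).count("1")
               for w, r in zip(write_buffer, read_back))
```

A mismatch count of zero means the EPD test passes outright; a nonzero count must still be judged against the error correction capability of the page.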
A nonzero number of mismatched bits is not necessarily a fail, however. Since the failure of at least some of the bits of a large page of flash memory is inevitable, each page typically contains an error correction code (ECC) value stored along with the data. An ECC is designed to correct only a certain number of bit failures, typically far fewer than the number of bits of data being stored in the page. For example, where an ECC can correct up to 100 failing bits and the EPD test detects that 50 bits of the page were not programmed correctly, the EPD test may return a pass, since the failing bits can be recovered via error correction. If the EPD test detects that more bits were programmed incorrectly than can be corrected by the ECC, the EPD test may return a fail.
Conventional EPD uses a value called the bit scan program failure (BSPF) value as the threshold that in essence determines how many bits of the page may fail to be programmed correctly before the EPD test returns a fail. For example, a BSPF value of 50 means that the EPD will return a fail if more than 50 bits were not programmed correctly. When the system detects an EPD failure, this triggers a recovery process. When EPD reports that a write to a page failed, the system may decide to treat the block containing the page as unreliable, and may need to determine whether to replace the failing block or not.
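The static pass/fail rule described above can be sketched as follows, using the BSPF value of 50 from the example in the text; the names are illustrative.

```python
# Sketch of the conventional (static) EPD decision: the write passes
# only if the failing-bit count does not exceed the fixed BSPF value.

BSPF = 50  # static bit scan program failure threshold (per the example)

def epd_result(failing_bits: int, bspf: int = BSPF) -> str:
    """Return "fail" only if more than bspf bits failed to program."""
    return "fail" if failing_bits > bspf else "pass"
```

A "fail" here is what triggers the recovery process, e.g., treating the block containing the page as unreliable and deciding whether to replace it.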
There are problems with the conventional EPD method, however. One problem is that the pass/fail threshold value used by conventional EPD methods (i.e., the BSPF value) is static. For example, a conventional implementation of EPD may use a dedicated register to store the BSPF value to be used by the EPD process, but the value stored in the dedicated register is set during production testing and never changed thereafter. Conventional EPD implementations have no mechanism to dynamically or automatically adjust the BSPF value over the lifetime of the memory device.
One problem with a static measurement threshold value, such as the BSPF value in the example above, is that if the value is too high, sensitivity is too low (e.g., the system allows too many failures before triggering). As a result, by the time EPD starts reporting that a page is failing, the block containing that page may have already degraded to the point that data written to other pages within that block has already failed, resulting, for example, in unrecoverable data loss. If the BSPF value is too low, sensitivity is too high (e.g., the system reports an error even though the number of bits that failed to be programmed correctly is relatively low). This may generate a large number of false positives, which in turn may cause otherwise healthy sections of the flash die to be marked as unusable. Thus, setting the BSPF value too low will slow the system down due to frequent interruptions by the EPD fail status, which requires recovery and special treatment.
Another disadvantage of conventional systems is that the threshold values used to determine whether or not a cell has been successfully programmed are also static. For example, the thresholds VT in FIG. 1 and thresholds VT0 through VT3 in FIG. 2 are typically fixed and not adjusted dynamically. As a result, there is no mechanism to distinguish between failure profiles, such as the two different failure distributions shown in FIG. 1. Likewise, there is no mechanism to accommodate cells that are aging, but in a slow and predictable manner. For example, slight adjustments to the values of VT0 through VT3 in FIG. 2 could potentially extend the life of a slowly degrading flash memory.
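As a purely hypothetical sketch of the kind of adjustment the text notes conventional systems lack, slowly and predictably aging cells could be accommodated by nudging the read thresholds VT0 through VT3 downward by a small step; the step size, floor, and function name are all assumptions.

```python
# Hypothetical threshold relaxation: lower each measurement threshold
# by a small fixed step (floored at 0 V) to accommodate cells whose
# programmed voltages drift slowly downward with age.

def relax_thresholds(thresholds: list[float],
                     step: float = 0.05) -> list[float]:
    """Return a copy of VT0..VT3 each lowered by `step` volts."""
    return [max(0.0, vt - step) for vt in thresholds]
```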
Another consequence of the static thresholds used by conventional systems, whether the threshold is a failing-bit count or a voltage threshold, is that without the ability to dynamically adjust these thresholds, it is difficult, if not impossible, to collect information about how the rate of failures is changing over time, such as the speed at which distribution 102′ is moving away from expected distribution 102, or the speed at which distribution 102″ is spreading with respect to expected distribution 102. In some cases, how the distributions are trending over time is a better indicator of potential failure than a single value, or even a single snapshot of values.
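One way such trend information could be captured, sketched here purely as an illustration and not as a method from the text: record the failing-bit count after each program cycle and estimate its slope with a simple least-squares fit, where a positive slope indicates accelerating failures.

```python
# Hypothetical trend tracking: estimate how fast a page's failing-bit
# count is growing across successive program cycles.

def failure_trend(history: list[int]) -> float:
    """Least-squares slope of failing-bit counts versus cycle index."""
    n = len(history)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(history))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

A steadily rising slope could flag a block for replacement well before any single failing-bit count crosses a fixed BSPF-style threshold.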
Regardless of how program failures are detected or measured, different NAND die, different usage models, and different wear-out levels will all exhibit different failure rates. Using a static algorithm to address these different behaviors may negatively affect system performance (triggering too often may interfere with system operation without providing a reliable prediction mechanism) or system reliability (an inadequate block-replacement algorithm may replace too many blocks, until the storage device or system becomes “read only”).
Accordingly, in light of these disadvantages associated with conventional, static algorithms for triggering EPD, there exists a need for early detection of potential flash failures using an adaptive system level algorithm based on flash program verify.