The present invention generally relates to memory devices for use with computers and other processing apparatuses. More particularly, this invention relates to a non-volatile or permanent memory-based mass storage device using flash memory devices or any similar non-volatile memory devices for permanent storage of data.
Mass storage devices such as advanced technology attachment (ATA) or small computer system interface (SCSI) drives are rapidly adopting non-volatile solid-state memory technology such as flash memory or other emerging solid-state memory technology, including phase change memory (PCM), resistive random access memory (RRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), organic memories, or nanotechnology-based storage media such as carbon nanofiber/nanotube-based substrates. Currently the most common technology uses NAND flash memory as inexpensive storage memory.
Despite all their advantages with respect to speed and price, flash memory-based mass storage devices have the drawback of limited endurance and data retention, caused by the physical properties of the floating gate within each memory cell, the charge of which defines the bit contents of the cell. Typical endurance for multilevel cell (MLC) NAND flash is currently on the order of 10,000 write cycles at 50 nm process technology and approximately 3,000 write cycles at 4x nm process technology, and endurance is decreasing with every process node. Given the constant changes in process technology and process geometry and, further, the inherent design differences from one manufacturer to another, it is very difficult to predict failures even under the constant environmental conditions that exist in the lab. In the field, temperature fluctuations add another layer of variables to the difficulty of predicting data loss.
Write endurance problems are typically detected while data are being written to a block: if the programming of the block fails, the controller can issue a re-write to a different location on the array and flag the block as non-functional. Additional complications come into play in this case, for example the “erratic behavior of write endurance fails,” meaning that a block often fails after a given number of writes, for example after 5,000 cycles, but then recovers full functionality for another 5,000 cycles without additional failures.
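The fail-and-remap behavior described above can be illustrated with a minimal sketch. The class `ToyFlashArray`, the function `write_with_remap`, and all of their parameters are hypothetical names introduced only for this illustration; they do not represent any particular controller's firmware, and the sketch assumes a block flagged as non-functional is never retried, which ignores the erratic recovery behavior noted above.

```python
class ToyFlashArray:
    """Hypothetical in-memory stand-in for a flash array (not a real API)."""

    def __init__(self, num_blocks, failing_blocks=()):
        self.data = [None] * num_blocks
        self.failing = set(failing_blocks)  # blocks whose programming fails
        self.bad = set()                    # blocks flagged as non-functional

    def program(self, block, payload):
        """Return True if programming succeeds, False if it fails."""
        if block in self.failing:
            return False
        self.data[block] = payload
        return True


def write_with_remap(array, payload, candidates):
    """Try candidate blocks in order; flag failures and re-write elsewhere."""
    for block in candidates:
        if block in array.bad:
            continue  # assumption: flagged blocks are never retried
        if array.program(block, payload):
            return block
        array.bad.add(block)  # programming failed: flag block as non-functional
    raise RuntimeError("no functional block available")


# Block 0 fails to program, so the data are re-written to block 1.
array = ToyFlashArray(num_blocks=4, failing_blocks={0})
written_to = write_with_remap(array, b"user data", candidates=[0, 1, 2, 3])
```

In this toy run the write lands in block 1 and block 0 ends up in the `bad` set. A block that later recovers, as in the erratic case, would remain excluded under this simple policy.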
From a data management standpoint, more problematic is the question of data retention. Even though flash memory is considered non-volatile, the memory cells do not have unlimited data retention since the data are stored in the form of a charge on the floating gate. Over time, these charges will dissipate regardless of how good the insulation through the tunnel oxide layer is. The leakage current responsible for the loss of data depends on several factors, primarily temperature and time. In this context the general term temperature encompasses absolute temperature, temperature changes both with respect to values and time, as well as peak and mean temperature parameters. Each design and process technology will react somewhat differently to exposure to these parameters, which increases the difficulty of assessing current leakage and, by extension, estimating the progression in loss of data. Additional contributing factors include near-field effects such as write disturbance to adjacent cells or read access to the same or different cells, generally referred to as read disturbances.
In view of the above, it should be apparent that there are no simple methods for modeling the behavior of any given cell within an array of NAND flash memory based on assumed environmental and usage patterns. On the system level, more complex algorithms might be able to approximate reliable failure prediction. However, because of the mismatch between the data written from the host to the device and the data written from the device controller to the non-volatile memory array, commonly referred to as write amplification, only the drive itself has reliable information about the number of program and erase cycles; this information is not accessible by the system. Because of these issues, sudden failures in the form of data loss can occur. In the simplest case, these failures are single-bit or multiple-bit errors that are correctable through error correction code (ECC) algorithms such as Reed-Solomon (RS) or Bose-Chaudhuri-Hocquenghem (BCH) error correction. However, a more severe problem is the “sudden death” of a drive, which can occur if critical data are lost, for example in the file system, or if the number of bit errors exceeds the number of errors the ECC can correct. In either case, these failures are not correctable through ECC algorithms.
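As a rough illustration of the two points above, the following sketch computes a write amplification factor, shows how a host-side estimate of program and erase cycles can diverge from the count the drive sees, and checks whether a codeword remains within the correction capability of an ECC. All function names, the byte counts, and the correction capability `t` are assumptions chosen for the example, and the cycle estimate assumes perfectly even wear across the array.

```python
def write_amplification(host_bytes, nand_bytes):
    """Ratio of bytes written to the memory array to bytes written by the host."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return nand_bytes / host_bytes


def average_pe_cycles(bytes_written, capacity_bytes):
    """Average program/erase cycles per block, assuming perfectly even wear."""
    return bytes_written / capacity_bytes


def is_correctable(bit_errors, t):
    """A codeword is correctable only if its bit errors do not exceed t."""
    return bit_errors <= t


# Assumed figures: the host wrote 1000 units to a drive of capacity 100,
# while the controller actually wrote 2500 units to the array.
waf = write_amplification(host_bytes=1000, nand_bytes=2500)
host_estimate = average_pe_cycles(1000, 100)  # what the host could infer
drive_actual = average_pe_cycles(2500, 100)   # what only the drive knows

# Assumed ECC able to correct up to 8 bit errors per codeword.
ok = is_correctable(bit_errors=5, t=8)
lost = is_correctable(bit_errors=9, t=8)
```

With these assumed figures the host would underestimate wear by the write amplification factor, and the codeword with nine bit errors falls outside what the assumed ECC can correct.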