The present invention generally relates to memory devices for use with computers and other processing apparatuses. More particularly, this invention relates to nonvolatile-based (permanent memory-based) mass storage devices that use flash memory devices or any similar nonvolatile memory devices for permanent storage of data. The mass storage devices are characterized by allocating memory blocks for use to anticipate device failure before a critical threshold of endurance limitation is reached.
Mass storage devices such as advanced technology attachment (ATA) and small computer system interface (SCSI) drives, as well as USB 2.0, USB 3.0, or Gigabit Ethernet-based solid-state drives (SSDs), are rapidly adopting nonvolatile memory technology such as flash memory or other emerging solid-state memory technologies, including phase change memory (PCM), resistive random access memory (RRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), organic memories, and nanotechnology-based storage media such as carbon nanofiber/nanotube-based substrates. Currently, the most common technology uses NAND flash memory as inexpensive storage memory.
Endurance and data retention limitations that are inherent to the design and function of NAND flash technology are becoming increasingly problematic for use of this technology in solid-state drives. Briefly, flash memory components store information in an array of floating-gate transistors, referred to as cells. NAND flash cells are organized in what are commonly referred to as pages, which in turn are organized in predetermined sections of the component referred to as memory blocks (or sectors). Each cell of a NAND flash memory component has a top gate (TG) and a floating gate (FG), the latter being sandwiched between the top gate and the channel of the cell. The floating gate is separated from the channel by a layer of tunnel oxide. Data are stored in a NAND flash cell in the form of a charge on the floating gate which, in turn, defines the channel properties of the NAND flash cell by either augmenting or opposing the charge of the top gate. This charge on the floating gate is achieved by applying a programming voltage to the top gate. The process of programming (writing 0's to) a NAND cell requires injection of electrons into the floating gate by quantum mechanical tunneling, whereas the process of erasing (writing 1's to) a NAND cell requires applying an erase voltage to the device substrate, which then pulls electrons from the floating gate. Programming and erasing NAND flash cells is an extremely harsh process that uses electric fields in excess of 10 million V/cm to move electrons through the tunnel oxide layer.
The brute force approach used to program and erase NAND flash results in wear and fatigue of the cells by causing atomic bond sites in the tunnel oxide layer to break. The broken-bond sites then become traps for electrons that mimic charges in the floating gate, which can cause false data to be read from the NAND flash cells or prevent correct erasing of the cells. In the case of single-level cells (SLC), where only one bit is stored per cell, the trapping of electrons is a relatively minor issue that gradually increases to a critical threshold over tens of thousands of program and erase (P/E) cycles. However, in the case of multilevel cells (MLC) that use, for example, four different levels to encode two bits per cell, the “drift” in charge caused by a steady build-up of electrons in the tunnel oxide layer and at the borders between the layers constitutes the predominant limitation of write endurance (which as used herein refers to the number of P/E cycles beyond which a solid-state memory device may become unreliable). Using 50 nm process technology as an example, MLC NAND flash memory is expected to sustain approximately 10,000 P/E cycles per cell before reaching the endurance limitation caused by degradation of the tunnel oxide layer. Data retention declines dramatically with every reduction in process geometry because of proximity effects, in particular, stress-induced leakage current (SILC), which refers to the release of electrons from the floating gate caused by erasure of a nearby block. Write endurance declines as well: for a 3x nm (30 nm-class) process, typical write endurance is on the order of about 3000 to 5000 P/E cycles per cell, and for a 22 nm process, write endurance estimates decrease toward about 900 to about 1200 P/E cycles per cell.
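The effect of charge drift on a two-bit MLC cell can be illustrated with a minimal sketch. The read thresholds, the Gray-coded bit assignment, and the function names below are hypothetical values chosen for illustration only and do not describe any particular NAND component. The sketch assumes that trapped electrons shift the sensed voltage upward, so that a cell programmed near a level boundary is read back as the neighboring level.

```python
from bisect import bisect_right

# Hypothetical read thresholds (in volts) separating four charge levels.
READ_THRESHOLDS = [1.0, 2.0, 3.0]

# Assumed Gray-coded mapping; level 0 is the erased state (all 1's).
LEVEL_TO_BITS = ["11", "10", "00", "01"]


def read_cell(sensed_voltage: float) -> str:
    """Return the two-bit value for a sensed cell voltage."""
    level = bisect_right(READ_THRESHOLDS, sensed_voltage)
    return LEVEL_TO_BITS[level]


def read_with_drift(programmed_voltage: float, trapped_charge_shift: float) -> str:
    """Read a cell whose sensed voltage is offset by charge trapped in the oxide."""
    return read_cell(programmed_voltage + trapped_charge_shift)


# A fresh cell programmed to 1.8 V reads as level 1.
fresh = read_with_drift(1.8, 0.0)
# After wear, an assumed 0.3 V shift pushes the same cell across the 2.0 V boundary.
worn = read_with_drift(1.8, 0.3)
```

With only two levels (SLC), the single read threshold leaves a much wider margin, which is consistent with trapping being a comparatively minor issue for single-level cells.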
The life cycle of any solid-state drive is determined by its weakest component. Once individual blocks start to develop unrecoverable bit error rates (UBER) leading to data loss, the entire SSD has reached its end of life. In this context, one must consider that NAND flash is a form of memory and favors similar grouping of coherent data, known in DRAM and SRAM technology as locality of data. Consequently, in the absence of any additional management, flash memory would develop a few “high traffic” islands while the rest of the array would be underutilized. Both functional scenarios are far from optimal since high traffic areas are exposed to excessive wear and will, therefore, reach their endurance limitation ahead of the rest of the drive, whereas some very low traffic areas will never see a data update and, therefore, develop leakage current-based data retention loss.
In order to avoid the locality effect and the resulting excessive wear of a small number of flash blocks, a technology called wear leveling has been implemented. Early generations of NAND flash-based solid-state drives used relatively primitive wear-leveling mechanisms based on regional schemes. As a result, a spread of usage of up to 20× between low-usage and high-usage blocks was common. Modern controllers use more sophisticated wear-leveling algorithms, with the result that differences between the highest and lowest usage of blocks are often less than 0.5%. This number is expected to decline further with future generations of SSD controllers.
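The basic idea of wear leveling can be sketched as follows. This is not the algorithm of any particular controller; the class name `WearLeveler` and its methods are hypothetical, and the sketch assumes the simplest possible policy, namely always allocating the free block with the lowest erase count.

```python
class WearLeveler:
    """Toy wear leveler: always hand out the least-erased free block."""

    def __init__(self, num_blocks: int) -> None:
        self.erase_counts = [0] * num_blocks
        self.free_blocks = set(range(num_blocks))

    def allocate(self) -> int:
        """Return the free block with the fewest erases (lowest index on ties)."""
        if not self.free_blocks:
            raise RuntimeError("no free blocks available")
        block = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(block)
        return block

    def erase_and_release(self, block: int) -> None:
        """Erase a block, count the P/E cycle, and return it to the free pool."""
        self.erase_counts[block] += 1
        self.free_blocks.add(block)

    def usage_spread(self) -> int:
        """Difference between the most-worn and least-worn block."""
        return max(self.erase_counts) - min(self.erase_counts)


leveler = WearLeveler(num_blocks=4)
for _ in range(40):
    blk = leveler.allocate()
    leveler.erase_and_release(blk)
```

Under this policy, 40 write cycles are distributed evenly over the four blocks. A real controller must additionally relocate rarely updated data, since blocks holding such data never enter the free pool on their own.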
Modern SSDs also use a technique called “Over-Provisioning” (OP), in which the amount of memory made accessible by the controller is less than the physical amount of flash memory present in the array. For example, an SSD with 64 GB of physical memory can be over-provisioned to allow only 80% of its memory space to be used by the system and therefore appear as an approximately 51 GB SSD. The remaining approximately 13 GB of over-provisioned memory is treated as a reserve and is not used for data storage. However, the blocks can be used for temporary storage and shuffled in and out of the OP pool on demand, as long as they are replaced immediately by empty blocks.
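The arithmetic of the over-provisioning example above can be reproduced with a short sketch. The function name `over_provision` and its parameters are hypothetical and serve only to make the calculation explicit.

```python
def over_provision(physical_gb: float, usable_fraction: float) -> tuple[float, float]:
    """Split physical capacity into system-visible capacity and OP reserve."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_gb = physical_gb * usable_fraction
    reserve_gb = physical_gb - usable_gb
    return usable_gb, reserve_gb


usable, reserve = over_provision(64, 0.80)
print(f"{usable:.1f} GB visible, {reserve:.1f} GB reserve")
# prints "51.2 GB visible, 12.8 GB reserve"
```

The values of 51 GB and 13 GB given in the text are these results rounded to whole gigabytes.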
Another media management technique is bad-block management. Unlike SDRAM, flash memory is not a “perfect” storage medium but has bad blocks in every chip. Bad blocks are typically recognized by error checking and correction (ECC) mechanisms and flagged to be excluded from use for data storage. Another mechanism for integrity checking is signature comparison. Bad-block management can also cover blocks that fail spontaneously as a function of wear.
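A bad-block table of the kind described above might be sketched as follows. The class `BadBlockTable`, its methods, and the notion of a per-read `correctable_limit` are assumptions made for illustration; actual controllers use vendor-specific bad-block markers and ECC schemes.

```python
class BadBlockTable:
    """Toy bad-block table: tracks factory-marked and wear-induced bad blocks."""

    def __init__(self, num_blocks: int, factory_bad=()) -> None:
        self.num_blocks = num_blocks
        self.bad = set(factory_bad)  # blocks already bad when the chip shipped

    def report_read(self, block: int, bit_errors: int, correctable_limit: int) -> bool:
        """Flag the block if ECC cannot correct the errors; return True if usable."""
        if bit_errors > correctable_limit:
            self.bad.add(block)
        return block not in self.bad

    def usable_blocks(self) -> list[int]:
        """Blocks still eligible for data storage."""
        return [b for b in range(self.num_blocks) if b not in self.bad]


table = BadBlockTable(num_blocks=4, factory_bad=[2])
still_good = table.report_read(block=1, bit_errors=3, correctable_limit=8)
now_bad = table.report_read(block=1, bit_errors=9, correctable_limit=8)
```

In this sketch, block 2 is excluded from the outset, while block 1 is flagged only once its error count exceeds what the assumed ECC can correct, corresponding to a block that fails as a function of wear.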
One of the biggest challenges with SSDs is the prediction of device failure. In conventional hard disk drives, failure rates are additive, that is, there is a linear relation between the number of accesses and the number of failures. As drives age, failures and bad blocks increase accordingly. In the case of SSDs, the situation is different in that SSDs function without failures up to a certain threshold, followed by an exponential increase in failures over a relatively small increase in usage load. However, environmental factors, including usage patterns and temperature variations, also change the behavior of SSDs. Because of changes in chip design and even minor variability in quality, as well as the aforementioned environmental factors contributing to the aging process of NAND flash, it is extremely difficult to predict the onset of the exponential increase in UBER and, by extension, to predict the sudden death of an SSD.
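The contrast between the two failure behaviors can be made concrete with a toy model. The text states only that hard disk drive failures grow linearly and that SSD failures grow exponentially past a threshold; the functional forms, the function names, and all numeric parameters below are assumptions chosen purely for illustration.

```python
import math


def hdd_failures(accesses: float, rate: float) -> float:
    """Linear model: failures accumulate in proportion to accesses."""
    return rate * accesses


def ssd_failures(pe_cycles: float, threshold: float, growth: float) -> float:
    """Threshold model: no failures up to the threshold, exponential growth after."""
    if pe_cycles <= threshold:
        return 0.0
    return math.expm1((pe_cycles - threshold) / growth)


# Hypothetical parameters: threshold at 3000 P/E cycles, growth scale of 100 cycles.
before = ssd_failures(2000, threshold=3000, growth=100)
shortly_after = ssd_failures(3100, threshold=3000, growth=100)
later = ssd_failures(3500, threshold=3000, growth=100)
```

In this model, the SSD shows no failures at 2000 cycles, while the failure count at 3500 cycles is many times that at 3100 cycles. Because the threshold itself varies with chip design, quality, and environment, the onset of the steep rise cannot be read from the cycle count alone.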