1. Field of the Invention
The present invention relates to systems and methods for managing defects in a digital data storage system. More particularly, the invention relates to systems and methods for early failure detection on memory devices such as Flash EEPROM devices.
2. Description of the Related Art
Computer systems typically include magnetic disk drives for the mass storage of data. Although magnetic disk drives are relatively inexpensive, they are bulky and contain high-precision mechanical parts. As a consequence, magnetic disk drives are prone to reliability problems and must be handled with a high level of care. In addition, magnetic disk drives consume significant quantities of power. These disadvantages limit the size, portability, and overall durability of computer systems that use magnetic disks.
As demand has grown for computer devices that provide large amounts of storage capacity along with durability, reliability, and easy portability, attention has turned to solid-state memory as an alternative or supplement to magnetic disk drives. Solid-state storage devices, such as those employing Dynamic Random Access Memory (DRAM) and Static Random Access Memory (SRAM), require less power and are more durable than magnetic disk drives, but are also more expensive and are volatile, requiring constant power to retain their contents. As a result, DRAM and SRAM devices are typically utilized in computer systems as temporary storage in addition to magnetic disk drives.
Another type of solid-state storage device is a Flash EEPROM device (hereinafter referred to as flash memory). Flash memory exhibits the advantages of DRAM and SRAM, while also providing the benefit of being non-volatile, which is to say that a flash memory device retains the data stored in its memory even in the absence of a power source. For this reason, for many applications, it is desirable to replace conventional magnetic disk drives in computer systems with flash memory devices.
One characteristic of some forms of non-volatile solid-state memory is that storage locations that already hold data are typically erased before they are re-written. Thus, a write operation to such a memory location is in fact an erase/write operation, also known as an erase/write cycle. This characteristic stands in contrast to magnetic storage media, in which the act of re-writing to a location automatically writes over whatever data was originally stored in the location, with no need for an explicit erase operation.
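The erase/write behavior described above can be illustrated with a minimal sketch. The class and method names below are hypothetical and appear nowhere in this specification. The sketch also assumes, as is common for flash memory but not stated above, that an erased location reads as all ones and that programming can only clear bits.

```python
class FlashLocation:
    """Toy model of one storage location; all names are hypothetical."""

    ERASED = 0xFF  # assumption: an erased byte reads as all ones

    def __init__(self):
        self.value = self.ERASED
        self.erase_write_cycles = 0

    def erase(self):
        # Return every bit of the location to the erased state.
        self.value = self.ERASED

    def program(self, data):
        # Assumption: programming can only clear bits (1 -> 0), so writing
        # over old data without erasing yields the bitwise AND of old and new.
        self.value &= data

    def rewrite(self, data):
        # A re-write is an erase followed by a program: one erase/write cycle.
        self.erase()
        self.program(data)
        self.erase_write_cycles += 1
```

In this model, calling `program` on a location that already holds data generally does not store the intended value, whereas `rewrite` does, at the cost of one additional erase/write cycle.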
Another characteristic of some forms of non-volatile solid-state memory is that repeated erase/write operations can cause the physical medium of the memory to deteriorate, for example, due to Time-Dependent Dielectric Breakdown (TDDB). Because of this characteristic deterioration, non-volatile solid-state storage systems can typically execute only a finite number of erase/write operations in a given storage location before developing a defect in the storage location. One method for managing operation of a data storage system in the face of these defects is the practice of setting aside a quantity of alternate storage locations to replace storage locations that become defective. Such alternate storage locations are known as spare storage locations or “spares” locations. Thus, when a storage location defect is detected during a write operation, the data that was intended for storage in the now-defective location can be written instead to a “spares” location, and future operations intended for the now-defective location can be re-directed to the new spares location. With this method of defect recovery, as long as a sufficient number of spares locations have been set aside to accommodate the defects that occur, the system may continue to operate without interruption in spite of the occurrence of defects.
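The spares-based defect recovery described above can be sketched as follows. This is an illustrative model only, not the implementation of any particular storage system. All names are hypothetical, and a defect is simulated by assuming that each location fails once a fixed cycle limit (`endurance`) is exceeded.

```python
class SparesExhausted(Exception):
    """Raised when a defect occurs and no free spares location remains."""


class SparedStore:
    """Toy store with spares locations; all names are hypothetical."""

    def __init__(self, num_locations, num_spares, endurance):
        total = num_locations + num_spares
        self.cells = [None] * total
        self.cycles = [0] * total      # erase/write cycles per physical location
        self.endurance = endurance     # assumed fixed per-location cycle limit
        self.remap = {}                # logical location -> spares location
        self.free_spares = list(range(num_locations, total))

    def _physical(self, logical):
        # Follow the re-direction if this location was replaced by a spare.
        return self.remap.get(logical, logical)

    def _try_write(self, physical, data):
        # Simulated defect: the write fails once the cycle limit is exceeded.
        self.cycles[physical] += 1
        if self.cycles[physical] > self.endurance:
            return False
        self.cells[physical] = data
        return True

    def write(self, logical, data):
        physical = self._physical(logical)
        while not self._try_write(physical, data):
            if not self.free_spares:
                raise SparesExhausted(f"no spare left for location {logical}")
            # Write to a spare instead and re-direct future operations to it.
            physical = self.free_spares.pop(0)
            self.remap[logical] = physical

    def read(self, logical):
        return self.cells[self._physical(logical)]
```

For example, with `SparedStore(num_locations=2, num_spares=1, endurance=3)`, the fourth write to location 0 is transparently re-directed to the single spare, while a later defect in location 1 raises `SparesExhausted` because no free spares location remains.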
When a defect occurs and no free spares locations remain to serve as alternate data storage locations, the storage system can fail. Endurance is a term used to denote the cumulative number of erase/write cycles that a device can sustain before it fails. Reprogrammable non-volatile memories, such as flash memory, have a failure rate associated with endurance that is best represented by a classical “bathtub curve.” In other words, if the failure rate is drawn as a curve that changes over the lifetime of a memory device, the curve will resemble a bathtub shape. The bathtub curve can be broken down into three segments: a short, initially high, but steeply decreasing segment, sometimes called the “infant mortality phase,” during which failures caused by manufacturing defects appear early in the life of a device and quickly decrease in frequency; a long, flat, low segment that represents the normal operating life of a memory device with few failures; and a short, steeply increasing segment, sometimes called the “wear-out phase,” when stress caused by cumulative erase/write cycles increasingly causes failures to occur. Thus, towards the end of a device's life span, deterioration can occur rapidly.
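The three segments of the bathtub curve can be illustrated numerically. The following is a minimal sketch under a modeling assumption that is not drawn from this specification: the failure rate is taken to be the sum of a decreasing Weibull hazard (infant mortality), a constant hazard (normal operating life), and an increasing Weibull hazard (wear-out). The function names and all parameter values are hypothetical and chosen only to produce the characteristic shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_failure_rate(cycles):
    """Illustrative failure rate versus cumulative erase/write cycles (cycles > 0)."""
    # shape < 1: rate falls steeply early in life (infant mortality phase).
    infant = weibull_hazard(cycles, shape=0.5, scale=1e3)
    # Constant floor during the normal operating life (hypothetical value).
    normal = 1e-7
    # shape > 1: rate climbs steeply late in life (wear-out phase).
    wear_out = weibull_hazard(cycles, shape=6.0, scale=1e5)
    return infant + normal + wear_out


if __name__ == "__main__":
    for n in (10, 10_000, 200_000):
        print(f"{n:>7} cycles: failure rate {bathtub_failure_rate(n):.2e}")
```

With these illustrative parameters, the computed rate is high at 10 cycles, lower at 10,000 cycles, and high again at 200,000 cycles, mirroring the early, middle, and late segments described above.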
Often, when a storage system fails, the data contained in the storage system is partially or completely lost. In applications where a high value is placed on continued data integrity, storage systems prone to such data loss may not be acceptable, in spite of any other advantages that they may offer. For instance, a high degree of data integrity is desirable in a data storage system that is used in a router to hold copies of the router's configuration table, which can grow to massive size for a very large router. A high degree of data integrity is also desirable in data storage systems used to hold temporary copies of the data being transferred through a router. In this instance, ensuring a high level of data integrity is complicated by the fact that a very high number of erase/write operations are executed during the operation of such an application.
A challenge faced by reliability engineers is how to monitor a device's ability to cope with defects and to predict a device's failure so that data loss due to unanticipated system failures does not occur.