Solid State Drives (SSDs) are increasingly used as storage devices in storage systems because of the advantages they offer in performance, size and power characteristics. However, they suffer from a limited lifetime because only a limited number of write cycles are possible before block failures start to occur. This limit to the lifetime is more apparent than with traditional hard disk drives. In response, some SSD manufacturers guarantee their drives only for a certain number of writes, and some even slow I/O performance in order to achieve a specified lifetime within the limit of writes that the hardware can support.
This can lead to a new problem when this technology is used. If a number of SSDs are installed at the same time, then the more these SSDs are run in a balanced way for optimal performance, the more likely they all are to reach the end of their limited lifetime at around the same time.
FIG. 1 shows a graph of an example percentage of blocks failing in an SSD plotted against the number of write (or Program/Erase) cycles, which shows empirically the limited lifetime. Until around 100,000 Program/Erase cycles have been reached, there is a steady, but very low, percentage of blocks failing. At around 100,000 Program/Erase cycles, the wear-out mechanism starts to become apparent and the percentage of blocks failing starts to increase rapidly. After perhaps another 100,000 Program/Erase cycles, a substantial percentage of blocks are failing. Note that the horizontal scale of FIG. 1 is logarithmic.
This limited lifetime leads to at least two potential problems:
1) If a large number of SSDs are installed at the same time, then a large number of SSD replacements may potentially be required over an unusually short time period in order to maintain the appropriate level of data protection. In a large data centre this may result in a lot of expense within a short period of time and a lot of work within that period for administrators who physically have to replace the drives.
2) Multiple SSDs reaching the end of their limited lifetime at the same time in one array can result in data loss. The example failure profile of an SSD shown in FIG. 1 increases the probability of concurrent failures when groups of storage devices are run in the ‘traditional’ balanced way used for hard disk drives.
U.S. Pat. No. 8,214,580 discloses a method for adjusting a drive life and capacity of an SSD by allocating a portion of the device as available memory and a portion as spare memory based on a desired drive life and a utilization. Increased drive life is achieved at the expense of reduced capacity.
U.S. Pat. No. 8,151,137 discloses a storage device having a plurality of memory blocks, an unreliable block identification circuit and a partial failure indication circuit. Each of the memory blocks includes a plurality of memory cells that decrease in reliability over time as they are accessed. The unreliable block identification circuit is operable to determine that one or more of the memory blocks is unreliable, and the partial failure indication circuit is operable to disallow write access to the memory blocks upon determination that an insufficient number of the memory blocks remain reliable. Write access is removed from blocks of memory in order to allow continued read access to the data.
U.S. Pat. No. 8,010,738 discloses a technique for processing requests for a device. The technique receives a first value indicating an expected usage of the device prior to failure of the device and a second value indicating a specified lifetime of the device, and determines a target rate of usage for the device. It then determines a current rate of usage for the device, determines whether the current rate of usage is greater than the target rate of usage and, if so, performs an action to reduce the current rate of usage for the device. If the device is part of a data storage system, then upon determining that the current rate of usage is greater than the target rate of usage, an amount of a resource of the data storage system allocated for use in connection with write requests for the device is modified.
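By way of illustration only, the rate comparison described for U.S. Pat. No. 8,010,738 might be sketched as follows. The summary above does not state how the target rate is derived, so this sketch assumes it is simply the expected usage divided by the specified lifetime. The names UsageBudget, target_rate and exceeds_target_rate, and the choice of writes and days as units, are hypothetical and are not taken from the cited patent.

```python
from dataclasses import dataclass


@dataclass
class UsageBudget:
    """Hypothetical container for the two values the technique receives."""
    expected_writes: float  # first value: expected usage before failure
    lifetime_days: float    # second value: specified lifetime of the device

    def target_rate(self) -> float:
        # Assumption: target rate = expected usage / specified lifetime.
        return self.expected_writes / self.lifetime_days


def exceeds_target_rate(budget: UsageBudget,
                        writes_so_far: float,
                        days_elapsed: float) -> bool:
    """Return True when the current rate of usage is above the target rate,
    i.e. when an action to reduce the rate of usage would be performed."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    current_rate = writes_so_far / days_elapsed
    return current_rate > budget.target_rate()


# Example: a device expected to sustain 100,000 writes over 1,000 days
# has an assumed target rate of 100 writes per day.
budget = UsageBudget(expected_writes=100_000, lifetime_days=1_000)
over = exceeds_target_rate(budget, writes_so_far=1_500, days_elapsed=10)
under = exceeds_target_rate(budget, writes_so_far=500, days_elapsed=10)
```

In this sketch, over is true (150 writes per day against a target of 100) and under is false (50 writes per day). What action is taken when the target is exceeded, such as reducing a resource allocated to write requests, is outside the scope of the sketch.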