The disclosed embodiments relate in general to data protection schemes in storage systems. More specifically, the disclosed embodiments relate to optimizing the selection and application of appropriate levels of data protection in an electrically rewritable nonvolatile storage system by dynamically adjusting a data scrub frequency and/or an error correcting code (ECC) scheme applied to the storage system.
Flash memory, which has no moving parts, is a type of nonvolatile storage that can be electrically erased and reprogrammed. Data is typically programmed in units called pages and erased in larger units called blocks. Flash memory gets its name because an entire section of memory can be erased in a single action, or “flash.” Solid state drives (SSDs) built from arrays of flash memory can transfer data much faster than electromechanical disk drives.
Because it lacks moving parts, flash memory technology is well suited for embarked systems, such as aircraft applications. In such an environment, however, a device might remain powered off for an extended period of time. This can become a problem for flash memory, whose ability to retain data degrades at high temperatures and over long unpowered periods. In general, the rate at which bit errors accumulate within flash memory cells is a function of the system temperature and of the age of the flash block, measured in program/erase (P/E) cycles.
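One way to make the dependence on temperature and P/E cycles concrete is to combine an Arrhenius acceleration factor for retention loss with a wear term. The Python sketch below is illustrative only and assumes several things not stated above: the 1.1 eV activation energy is an assumed value commonly used for flash retention, the 40 °C reference temperature is arbitrary, and `estimated_rber` with its `base_rber`, `pe_scale`, and `hour_scale` coefficients is a hypothetical placeholder model, not measured data or the disclosed method.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K
EA_EV = 1.1  # assumed activation energy for charge loss (illustrative)


def acceleration_factor(temp_c, ref_temp_c=40.0, ea_ev=EA_EV):
    """Arrhenius factor: how much faster retention loss proceeds at temp_c than at ref_temp_c."""
    t_k, ref_k = temp_c + 273.15, ref_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / ref_k - 1.0 / t_k))


def equivalent_hours(exposure):
    """Convert (temperature_c, hours) pairs into equivalent hours at the reference temperature."""
    return sum(hours * acceleration_factor(temp_c) for temp_c, hours in exposure)


def estimated_rber(pe_cycles, equiv_hours, base_rber=1e-7, pe_scale=3000.0, hour_scale=1000.0):
    """Hypothetical raw bit error rate model; coefficients are placeholders, not measured data."""
    wear = 1.0 + pe_cycles / pe_scale            # more P/E cycles -> weaker cells
    retention = 1.0 + equiv_hours / hour_scale   # longer or hotter storage -> more charge loss
    return base_rber * wear ** 2 * retention


# A block at 2000 P/E cycles shelved for 30 days (720 h) at 70 °C versus at 25 °C.
hot = estimated_rber(2000, equivalent_hours([(70.0, 720.0)]))
cool = estimated_rber(2000, equivalent_hours([(25.0, 720.0)]))
```

Under these assumptions, a month unpowered at 70 °C ages the cells far more than the same month at 25 °C, so a hot, idle device can drift toward uncorrectable errors without any host activity.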
Known approaches to addressing bit errors within flash memory cells include having the flash storage controller periodically issue a memory “scrub” read to the flash cells that are estimated to be at risk. A relatively simple background task may run on the flash controller's CPU and iterate over all valid RAID (redundant array of independent disks) stripes, issuing a flash read to each page. RAID is a way of storing the same data in different places (thus, redundantly) on multiple hard disks. By placing data on multiple disks, I/O (input/output) operations can overlap in a balanced way, improving performance. Using the aforementioned background task, the flash controller periodically scrubs and checks all user data for bit errors regardless of how frequently the host application accesses the data. However, if a system is powered off and shelved for an extended period of time, the flash controller itself has no way of knowing how urgently a scrub of all physical data is needed.
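The background scrub task described above can be sketched as a single pass over the stripes. This is a minimal Python sketch under stated assumptions: `Stripe`, `read_page` (assumed to return the number of bits ECC corrected on that page), `relocate_stripe`, and `corrected_bit_threshold` are hypothetical names, and rewriting a stripe once its worst page crosses the threshold is an assumed policy rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Stripe:
    """Hypothetical RAID stripe: an ID, its physical page addresses, and a validity flag."""
    stripe_id: int
    pages: List[int]
    valid: bool = True


def scrub_pass(
    stripes: Iterable[Stripe],
    read_page: Callable[[int], int],
    relocate_stripe: Callable[[Stripe], None],
    corrected_bit_threshold: int,
) -> List[int]:
    """Read every page of every valid stripe once; return IDs of stripes that were relocated."""
    relocated = []
    for stripe in stripes:
        if not stripe.valid:
            continue  # invalid stripes hold no user data worth scrubbing
        # Issue a scrub read to each page and keep the worst corrected-bit count seen.
        worst = max((read_page(page) for page in stripe.pages), default=0)
        if worst >= corrected_bit_threshold:
            # Assumed policy: rewrite the data to fresh blocks before errors become uncorrectable.
            relocate_stripe(stripe)
            relocated.append(stripe.stripe_id)
    return relocated


# Usage with stubbed hardware: page address -> bits corrected on read.
corrected = {0: 1, 1: 2, 2: 9, 3: 0, 4: 12}
moved: List[Stripe] = []
stripes = [Stripe(10, [0, 1]), Stripe(11, [2, 3]), Stripe(12, [4], valid=False)]
result = scrub_pass(stripes, corrected.__getitem__, moved.append, corrected_bit_threshold=8)
```

Nothing in the pass knows how long the device sat unpowered or how hot it got. If the pass is triggered on a fixed interval, a shelved system receives no more urgent a scrub than one in continuous use.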
ECC is another known approach to addressing bit errors. However, the effectiveness of an ECC scheme depends heavily on how much correction data is stored. Moreover, more robust ECC schemes can reduce overall system performance. Therefore, the appropriate ECC scheme or level for a particular flash-based storage subsystem depends heavily on the type of flash and the needs of the application. In the case of an embarked system designed to sit idle for a long period of time in a hot environment (e.g., in a desert), performance must be sacrificed to allow for a more robust ECC scheme. Hence, known ECC algorithms require a design tradeoff between recovery capacity and overall I/O performance.
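The storage side of this tradeoff can be quantified for a binary BCH code, a common choice for flash ECC. A primitive binary BCH code over GF(2^m) has length 2^m - 1 and corrects t errors with at most m·t parity bits, so each additional bit of correction capability costs roughly m more stored bits. The Python sketch below pairs that bound with the probability that a codeword sees more than t raw bit errors, under the simplifying assumption that bit errors are independent with a fixed raw bit error rate (RBER). The 4 KiB payload and the RBER of 1e-4 are assumed example values, and `bch_parity_bits` and `prob_uncorrectable` are hypothetical helpers. Stronger codes also cost decode latency and controller logic, which this sketch does not model.

```python
import math


def bch_parity_bits(data_bits, t):
    """Upper bound m*t on parity bits for a (shortened) binary BCH code correcting t errors."""
    m = 1
    # Smallest field size whose code length 2^m - 1 holds the data plus the parity.
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return m * t


def prob_uncorrectable(codeword_bits, t, rber):
    """P(more than t raw bit errors in a codeword), assuming independent bit errors."""
    log_p, log_q = math.log(rber), math.log1p(-rber)
    log_n_fact = math.lgamma(codeword_bits + 1)
    mean = codeword_bits * rber
    total = 0.0
    for k in range(t + 1, codeword_bits + 1):
        # Binomial probability of exactly k errors, computed in log space to avoid overflow.
        log_pmf = (log_n_fact - math.lgamma(k + 1) - math.lgamma(codeword_bits - k + 1)
                   + k * log_p + (codeword_bits - k) * log_q)
        term = math.exp(log_pmf)
        total += term
        # Beyond the mode the terms only shrink, so stop once they no longer change the sum.
        if k > mean + 1 and term <= total * 1e-16:
            break
    return total


DATA_BITS = 4096 * 8  # assumed 4 KiB payload per codeword
RBER = 1e-4           # assumed raw bit error rate for an aged, hot-stored block

for t in (8, 16, 24, 40):
    parity = bch_parity_bits(DATA_BITS, t)
    p_fail = prob_uncorrectable(DATA_BITS + parity, t, RBER)
    print(f"t={t:2d} parity={parity:4d} bits ({parity / DATA_BITS:.2%}) "
          f"P(uncorrectable)={p_fail:.2e}")
```

Each step up in t buys orders of magnitude in reliability for a modest increase in stored parity, which is why a system expected to sit unpowered in the heat may justify the stronger, slower code.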