Computer clusters, or groups of linked computers, have been widely used to improve performance over that provided by a single computer, especially in extended computations such as simulations of complex physical phenomena. Conventionally, in a computer cluster, computer nodes (also referred to herein as client nodes, or data generating entities) are linked by a high speed network which permits the sharing of computer resources and memory.
Data transfers to or from the computer nodes are performed through the high speed network and are managed by additional computer devices, also referred to as File Servers. The File Servers file data from multiple computer nodes and assign a unique location for each fragment of a file in the overall File System. Typically, the data migrates from the File Servers to be stored on rotating media such as, for example, common disk drives arranged in storage disk arrays for storage and retrieval of large amounts of data. Arrays of solid-state storage devices, such as Flash Memory, Phase Change Memory, Memristors, or other Non-Volatile Memory (NVM) storage units, are also broadly used in data storage systems.
Failures of computer components in data storage systems usually follow the pattern depicted in FIG. 1 as a curve known as the “bathtub curve”, which represents the life (or the lifespan) of a device. The first phase of the life, identified in FIG. 1 as the “Early Failures” region of the “bathtub” curve, generally has a high failure rate due to what is known as infant mortality of the components. These failures generally occur due to manufacturing defects that appear early in the life of the device when in use.
The middle phase, identified in FIG. 1 as a “Random Failures” region of the “bathtub” curve, represents the useful life of the device, and generally has a low failure rate, which typically occurs due to random defects.
The final phase of life, identified in FIG. 1 as a “Wear-Out Failures” region of the “bathtub” curve, shows higher failure rates as the components fail due to wear-out.
Current Non-Volatile Memory (NVM) devices, such as Flash, Memristor and Phase-Change Memory, have a limited life span and wear out with “write” usage. For example, typical Multi-Level Flash cells withstand approximately 7K-10K program/erase cycles before the cell ceases to accurately retain data. Memristors and Phase-Change Memory devices withstand a higher number of program/erase cycles, are word addressable, and are typically much faster, but they are still prone to wear-out with “write” usage.
In order to prevent data loss, manufacturers of NVM devices implement various wear leveling algorithms inside the devices to ensure that the cells in an NVM device wear evenly across the entire NVM device. The known program/erase limit of the cells in the NVM devices in conjunction with the wear leveling algorithms serves as a tool permitting the lifespan of the NVM devices to be accurately predicted. The predicted lifespan of the NVM devices is reflected by a spike at the end of the failure rate curve for NVM devices shown in FIG. 2. While other components of data storage systems show a slowly increasing failure rate as the device begins to wear out, NVM devices fail much more abruptly after a predetermined amount of usage.
The lifespan of an NVM device is measured in program/erase cycles. The total number of program/erase cycles a device is capable of withstanding is the typical program/erase limit for a cell in the NVM device multiplied by the total number of cells in the device. FIG. 3 is representative of a relationship between a parameter known as the “remaining health” of an NVM device and the “age” of the device, i.e., the number of program/erase cycles the device has gone through. The “remaining health” is expressed as an integer percentage ranging from 0% to 100%, where “0%” corresponds to a state of a device having no remaining health (close to or at the failure state), and “100%” corresponds to a state of a fully healthy device. As can be seen in FIG. 3, the remaining health of an NVM device is high when the device is new, and decreases from 100% (for a new NVM device) to 0% (at the NVM device's End-of-Life stage).
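The relationship described above may be illustrated by a minimal sketch. The function names are hypothetical, and the linear decline of remaining health with consumed program/erase cycles is an assumption made for illustration only; the actual shape of the curve in FIG. 3 and the health reporting of any particular device may differ.

```python
def total_pe_budget(pe_limit_per_cell: int, cell_count: int) -> int:
    """Total program/erase cycles a device can withstand:
    the per-cell program/erase limit multiplied by the number of cells."""
    return pe_limit_per_cell * cell_count


def remaining_health(pe_cycles_used: int, pe_budget: int) -> int:
    """Remaining health as an integer percentage in the range 0..100.

    Assumption (not taken from the source): health declines linearly
    with the number of program/erase cycles consumed.
    """
    if pe_budget <= 0:
        raise ValueError("pe_budget must be positive")
    # Clamp the usage so the result never leaves the 0..100 range.
    used = min(max(pe_cycles_used, 0), pe_budget)
    return (pe_budget - used) * 100 // pe_budget


# Example: 1,000 cells, each rated for 10,000 program/erase cycles.
budget = total_pe_budget(10_000, 1_000)
print(remaining_health(0, budget))          # new device
print(remaining_health(2_500_000, budget))  # a quarter of the budget consumed
print(remaining_health(budget, budget))     # End-of-Life
```

Under these assumptions, a new device reports 100%, and the value reaches 0% once the whole program/erase budget has been consumed.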
The lifespan of a device is related to the number of bytes written to the device. One properly bounded “write” of an entire cell is equivalent to a single program/erase cycle. Non-bounded “writes” usually cause multiple program/erase cycles. This effect is known as write amplification, an undesirable phenomenon associated with flash and solid-state drives. In write amplification, the actual amount of physical information written to the drive is a multiple of the logical amount intended to be written.
Wear leveling inside the NVM device causes write amplification because the NVM device transports data between cells causing additional program/erase cycles which shorten the life of the NVM device.
The multiplying effect increases the number of “writes” over the lifespan of the NVM devices which shortens the time the NVM device can reliably operate.
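The effect of write amplification on the usable life of a device may be sketched as follows. The function and parameter names, including `cell_bytes` and the write amplification factor itself, are hypothetical values chosen for illustration and are not taken from any particular device.

```python
def physical_bytes_written(logical_bytes: int, write_amplification: float) -> float:
    """Physical bytes written to the media for a given logical write volume."""
    if write_amplification < 1.0:
        raise ValueError("write amplification factor is assumed to be >= 1")
    return logical_bytes * write_amplification


def logical_write_capacity(pe_limit_per_cell: int, cell_bytes: int,
                           cell_count: int, write_amplification: float) -> float:
    """Logical bytes a host can write before the program/erase budget is spent.

    Assumption: each program/erase cycle writes exactly one whole cell of
    `cell_bytes` bytes, and wear is spread evenly over all cells.
    """
    if write_amplification < 1.0:
        raise ValueError("write amplification factor is assumed to be >= 1")
    raw_capacity = pe_limit_per_cell * cell_bytes * cell_count
    return raw_capacity / write_amplification


# Example: doubling the write amplification halves the logical write capacity.
ideal = logical_write_capacity(10_000, 4096, 1_000, 1.0)
amplified = logical_write_capacity(10_000, 4096, 1_000, 2.0)
print(ideal, amplified)
```

In this sketch, a write amplification factor of 2 means that the device reaches its program/erase limit after the host has written only half as many logical bytes.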
Knowing the lifespan of an NVM device makes the End-of-Life (EOL) of the NVM device easily predictable. Manufacturers of NVM devices use the EOL predictions to warn a user ahead of time that an NVM device of interest is approaching the End-of-Life stage.
As shown in FIG. 4, the End-of-the-Warranty precedes the real End-of-Life of a device. The manufacturers usually report the End-of-the-Warranty of the device as the End-of-Life. This is done in order to warn a user beforehand, and to allow the user sufficient time to replace the device in question before data loss occurs. Manufacturers often report the End-of-Life of a device when less than 50% of the total allowed number of program/erase cycles has been reached.
The probability of data loss in NVM devices increases with the number of program/erase cycles performed on a cell (media) in the NVM device. Thus, by limiting the EOL to the End-of-the-Warranty, manufacturers of NVM devices protect the integrity of the data at the cost of a reduced lifespan of NVM devices.
The probability of data loss may be reduced by using RAID algorithms to distribute data with parity across multiple devices. The main concept of the RAID is the ability to virtualize multiple drives (or other storage devices) in a single drive representation. A number of RAID schemes have evolved, each designed on the principles of aggregated storage space and data redundancy.
Most of the RAID schemes employ an error protection scheme called “parity”, which is a widely used method in information technology to provide fault tolerance in a given set of data.
RAID algorithms typically stripe the data across N+P devices, where N is the number of chunks of data in the RAID stripe, and P corresponds to the parity data computed for the data channels in the RAID stripe.
For example, in the RAID-5 data structure, data is striped across the drives, with one parity block for each stripe. The parity block is computed as the XOR of the blocks of data in the stripe. Parity is responsible for data fault tolerance. In operation, if one disk fails, a new drive can be put in its place, and the RAID controller can rebuild the data automatically using the parity data.
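The XOR parity computation and the rebuild of a single lost block may be illustrated by the following minimal sketch. The function names are hypothetical, and the sketch operates on small in-memory byte strings rather than on real drives or a real RAID controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a list of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks in a stripe must have the same size")
    # zip(*blocks) yields one tuple of bytes per byte position in the stripe.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing_block(surviving_blocks):
    """Recover the single missing block of a stripe.

    `surviving_blocks` holds every remaining data block plus the parity
    block; XOR-ing them together yields the lost block.
    """
    return xor_blocks(surviving_blocks)


# A stripe with three data blocks and one parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Simulate the loss of the second data block and rebuild it.
recovered = rebuild_missing_block([data[0], data[2], parity])
assert recovered == data[1]
```

The rebuild works because XOR-ing a value with itself cancels it out, so XOR-ing the parity with all surviving data blocks leaves only the missing block.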
Alternatively to the RAID-5 data structure, the RAID-6 scheme uses block-level striping with double distributed parity P1+P2, and thus provides fault tolerance from two drive failures. A system configured with the RAID-6 protection may continue to operate with up to two failed drives. This allows larger RAID groups to be more practical, especially for high availability systems.
With the introduction of de-clustered RAID organizations, virtual disks are dynamically created out of a large pool of available drives (members), where the number of members is larger than N+P, with the intent that a RAID rebuild will involve a large number of cooperating drives, and thus reduce the window of vulnerability for data loss. An added benefit is that random READ/WRITE I/O (Input/Output) performance is also improved.
A stripe of RAID data may be created by selecting chunks of free space on separate devices included in a pool. Members of the RAID stripe may be selected algorithmically or randomly depending on the sophistication of the de-clustered RAID algorithm. The goal of using a de-clustered RAID algorithm is to distribute the data across the devices evenly for maximum performance and minimum latency.
Parity de-clustering for continuous operation in redundant disk arrays has advanced the operation of data storage systems. The principles of parity de-clustering are known to those skilled in the art and presented, for example, in Edward K. Lee, et al., “Petal: Distributed Virtual Disks”, published in the Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, 1996; and Mark Holland, et al., “Parity De-clustering for Continuous Operation in Redundant Disk Arrays”, published in Proceedings of the Fifth Conference on Architectural Support for Programming Languages and Operating Systems, 1992.
In data storage systems, NVM devices have a very predictable lifespan, and it has been found that multiple devices are likely to fail after a similar number of writes. Referring to FIG. 5, when all NVM devices in a data storage system have similar I/O workloads during their useful lives, all NVM devices are prone to failure within substantially the same time span, thus making the system vulnerable to data loss.
In the known de-clustered RAID protected data storage systems, members of a parity data stripe are chosen at random when a large chunk of data is requested to be written, i.e., these systems randomly select M+P number of data storage devices out of N data storage devices in the pool, where M corresponds to the number of data units, P corresponds to the number of parity units, and N>M+P.
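The random member selection described above may be sketched as follows. The function names, the pool dimensions, and the use of one “write” per stripe member are hypothetical simplifications introduced for illustration; the sketch does not represent any particular de-clustered RAID implementation.

```python
import random
from collections import Counter


def pick_stripe_members(pool_size, data_units, parity_units, rng):
    """Randomly select M+P distinct devices out of a pool of N devices."""
    width = data_units + parity_units
    if pool_size <= width:
        raise ValueError("de-clustered RAID assumes N > M + P")
    return rng.sample(range(pool_size), width)


def simulate_writes(pool_size, data_units, parity_units, stripes, seed=0):
    """Count how many stripe units each device receives.

    Assumption: every stripe unit costs its device exactly one write.
    """
    rng = random.Random(seed)
    writes = Counter({device: 0 for device in range(pool_size)})
    for _ in range(stripes):
        for device in pick_stripe_members(pool_size, data_units, parity_units, rng):
            writes[device] += 1
    return writes


# Example: N = 20 devices, M = 8 data units, P = 2 parity units.
writes = simulate_writes(pool_size=20, data_units=8, parity_units=2, stripes=10_000)
print(min(writes.values()), max(writes.values()))
```

Under these assumptions, the per-device write counts stay close to one another over many stripes, so the devices in the pool age at nearly the same rate.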
If the RAID algorithm distributes the data evenly to all the memory devices randomly selected for the parity stripe, there is a high probability of multiple NVM device failures in the parity stripe within the same time period, and thus the likelihood of potential data loss is very high.
It would be highly desirable to provide a data migration system protected with the de-clustered RAID algorithm where the simultaneous (or close to simultaneous) failures of multiple NVM devices are prevented.