The present invention relates generally to the electrical, electronic and computer arts, and, more particularly, to computerized methods and related systems for increasing data storage reliability in storage systems.
Modern data storage systems are extremely large and consist of several tens or hundreds of storage nodes. In such systems, node failures occur regularly (e.g., one or more times daily), and safeguarding data from such failures poses a serious design challenge.
Data redundancy, for example in the form of replication or advanced erasure codes, is often used to protect data from node failures. Because redundant data is stored across several nodes, the redundant data on surviving nodes can be used to rebuild the data lost on the failed nodes.
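By way of illustration only, the rebuild principle can be sketched with single XOR parity, which is among the simplest erasure codes. The block contents, the number of nodes, and the helper name `xor_blocks` are hypothetical and chosen solely for this example; practical systems generally use more advanced codes.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data nodes, each holding one block, plus one parity node.
data_blocks = [b"node", b"fail", b"safe"]
parity = xor_blocks(data_blocks)

# Simulate the failure of node 1 and rebuild its block from the surviving
# data blocks together with the parity block.
lost_index = 1
survivors = [blk for i, blk in enumerate(data_blocks) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
```

In this sketch, any single lost block can be recovered because XOR-ing the parity with the surviving blocks cancels their contributions. A scheme of this kind tolerates only one failure per redundancy group.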
Modern data storage systems include heterogeneous storage systems, which comprise storage devices of different types. As an example, a hybrid storage system is a system that utilizes both solid-state drives (SSDs) and hard disk drives (HDDs) as storage media for persistent storage. That is, the solid-state drive is not used as a cache; rather, it is used at the same level of the memory hierarchy as the HDDs. Typically, arrays of multiple SSDs and arrays of multiple HDDs are used to form redundancy groups to achieve higher performance and reliability, using a redundant array of independent disks/drives (RAID) scheme or any other scheme.
Various metrics are known for characterizing the impact of hardware failures. For example, the mean time to data loss (MTTDL) metric gives the average time before a loss of data occurs in a given array, where an array, e.g., a RAID array, combines two or more hard disks into a single logical disk. As another example, the expected annual fraction of data loss (EAFDL) is also used. Such metrics pertain to one type of device.
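As a rough illustration of how such a metric may be evaluated, the following sketch uses the commonly cited first-order approximation of the MTTDL for a single-parity array, MTTF^2 / (N (N - 1) MTTR), which assumes independent, exponentially distributed device failures and a repair time much shorter than the device lifetime. The function name and the numerical values are assumptions made for this example only and do not describe any particular system.

```python
def mttdl_single_parity(n_devices, mttf_hours, mttr_hours):
    """Approximate MTTDL (in hours) of an array tolerating one device failure.

    Data is lost when a second device fails while the first is being rebuilt:
    MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR).
    """
    return mttf_hours ** 2 / (n_devices * (n_devices - 1) * mttr_hours)


# Hypothetical array: 8 devices, 1,000,000 h device MTTF, 24 h rebuild time.
mttdl_hours = mttdl_single_parity(n_devices=8, mttf_hours=1_000_000, mttr_hours=24)
mttdl_years = mttdl_hours / (24 * 365)
```

Under this approximation, halving the rebuild time doubles the estimated MTTDL, whereas adding devices to the array lowers it. The sketch is parameterized by a single device MTTF, consistent with such a metric pertaining to one type of device.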
Efforts are usually made to improve the data storage reliability in each storage device or for each type of storage device.