The present invention relates to recovering failed devices in distributed data centers, and more particularly, this invention relates to recovering failed devices in cloud storage systems and networks.
Reliable delivery of data is an essential aspect of data storage systems. Error detection and correction schemes are used to detect errors in data delivery and reconstruct data when an error is detected. Error detection and correction schemes are especially important for delivery of data over unreliable communication channels and/or channels which are subject to noise. Data redundancy schemes, such as parity computation, enable reliable delivery of data using error detection and correction techniques by adding redundancy and/or extra data to a message. Redundancy and/or extra data may be used to check the message for consistency with the original message.
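By way of illustration only, the parity computation mentioned above may be sketched as a byte-wise XOR over equal-length data blocks. This is a minimal sketch of single-parity redundancy in general, not the scheme of any particular embodiment. The function names `xor_parity`, `is_consistent`, and `reconstruct_missing` and the sample blocks are hypothetical and chosen solely for this example.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the i-th byte of every block into one column.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def is_consistent(blocks, parity):
    """Error detection: recompute the parity and compare it to the stored one."""
    return xor_parity(blocks) == parity


def reconstruct_missing(surviving_blocks, parity):
    """Erasure correction: XOR of the survivors and the parity yields the one lost block."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical sample data: three data blocks plus one parity block.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(data)
```

In this sketch, a single parity block tolerates the loss of any one data block whose position is known. A parity mismatch indicates that an error occurred but does not identify which block is in error.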
Erasure codes with high efficiency and high loss tolerance are important for data storage systems where there are a variety of loss mechanisms. In a storage cluster comprising multiple availability zones (AZs), it is important to protect against the loss of an AZ. In large-scale data storage (e.g., in cloud storage), there is a high probability that, at any point in time, some component losses are under repair.
Multiple devices may fail concurrently in distributed data centers, where each device is subject to one or more unique failure mode scenarios. A variety of failures may occur, including loss of a complete data center, loss of a box of individual storage devices in a data center, loss of individual storage devices within a box, and/or loss of sectors within a storage device. Existing solutions do not account for all of the various failure modes and/or require a low rate of data devices with respect to the total number of devices in the system.