This invention relates generally to data storage technologies and, in particular, to distributed fault-tolerant storage in independent storage locations.
Modern Internet-scale applications, such as social networking websites, receive and generate vast quantities of data continuously. This data includes user information, images, videos, text posts, emails, performance logs, search indices, metadata, etc. This data must be stored securely and reliably, and it must be accessible despite data disruption events such as natural disasters, power failures, disk failures, server failures, etc. In the past, reliability and accessibility of data in Internet applications were provided by storing many copies of the same data in geographically separate data centers. By having distinct, separate copies of the same data in multiple locations, a system could ensure that at least one copy of the data was accessible at any time, despite the occurrence of data disruption events.
But data mirroring has a cost associated with it. Each copy of data requires additional storage resources, and if multiple copies of the same data are maintained, the storage overhead becomes prohibitive for large data sets. One solution to this problem is to not maintain full redundant copies of the data, but rather to compute smaller recovery codes from the data, where the recovery codes allow a lost piece of the data to be recovered using the remaining data. In the simplest case, a recovery code can be generated by splitting the data into N pieces and computing an XOR across the pieces. The N pieces of data can then be distributed to N separate data storage locations. If any one of the N pieces is lost, the lost piece can be reconstructed by XORing the recovery code against the remaining pieces. In this simple case, the storage scheme requires 1/N of the data as additional storage overhead to maintain the recovery codes, but this is still an improvement over the complete data duplication required in data mirroring. The simple scheme guards against the loss of only a single one of the N pieces of data; however, other methods of generating recovery codes allow for greater redundancy, but may require additional storage overhead as a tradeoff. The data storage locations can be established in geographically separate sites so that the probability of a single data disruption event affecting all locations is minimized.
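The simple XOR scheme described above can be illustrated with a minimal Python sketch. The function names (`split_into_pieces`, `xor_pieces`, `reconstruct`) and the zero-padding of the final piece are assumptions made for illustration only; they are not taken from any particular storage system.

```python
from functools import reduce
from typing import List, Optional


def split_into_pieces(data: bytes, n: int) -> List[bytes]:
    """Split data into n equal-length pieces, zero-padding the tail (assumed policy)."""
    piece_len = -(-len(data) // n)  # ceiling division
    padded = data.ljust(piece_len * n, b"\x00")
    return [padded[i * piece_len:(i + 1) * piece_len] for i in range(n)]


def xor_pieces(pieces: List[bytes]) -> bytes:
    """Byte-wise XOR across equal-length pieces; used both to encode and to recover."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pieces))


def reconstruct(pieces: List[Optional[bytes]], recovery_code: bytes) -> bytes:
    """Rebuild the single missing piece (marked None) from the survivors and the code."""
    survivors = [p for p in pieces if p is not None]
    if len(survivors) != len(pieces) - 1:
        raise ValueError("the simple XOR scheme recovers exactly one lost piece")
    return xor_pieces(survivors + [recovery_code])


# Usage: split into N = 4 pieces, lose one, and recover it.
data = b"user posts, images, logs"
pieces = split_into_pieces(data, 4)
recovery_code = xor_pieces(pieces)

damaged: List[Optional[bytes]] = list(pieces)
damaged[2] = None  # simulate the loss of one storage location
assert reconstruct(damaged, recovery_code) == pieces[2]

# The recovery code is the size of one piece, i.e. 1/N of the (padded) data.
assert len(recovery_code) * 4 == len(b"".join(pieces))
```

Because XOR is its own inverse, the same `xor_pieces` routine serves for both generating the recovery code and rebuilding a lost piece. Losing two pieces at once leaves the sketch unable to recover either, which matches the single-loss limitation noted above.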
Systems that provide redundant storage as described above are sometimes called Reliable Arrays of Independent Nodes (RAIN). RAIN systems are often efficient in terms of the storage overhead that they require to provide data redundancy, but they are inefficient in terms of network usage. When a piece of data is lost at one of the nodes of a RAIN system due to a data disruption event (e.g., hard disk failure), the information needed to reconstruct that lost data must be fetched from other nodes, since the recovery codes and the other data pieces are not all stored locally. The RAIN system cannot keep all the recovery codes and other data pieces locally because doing so would adversely affect the fault-tolerance characteristics of the system: the failure of a single machine or location could cause the system to lose access to all the locally stored data. Therefore, when data recovery is necessary, both the recovery codes and data pieces necessary for data reconstruction must be sent over the network to the location where the lost data is being reconstructed.
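The network cost of reconstruction can be made concrete with a rough back-of-the-envelope sketch. It assumes the simple XOR scheme with N equal-size pieces, and it assumes that none of the inputs needed for reconstruction is already present at the rebuilding location. The function names and the example sizes are hypothetical.

```python
def reconstruction_traffic(piece_size: int, n: int) -> int:
    """Bytes fetched to rebuild one lost piece under the simple XOR scheme.

    The rebuilding location needs the n - 1 surviving data pieces plus the
    recovery code, each of which is piece_size bytes.
    """
    return ((n - 1) + 1) * piece_size


def mirror_repair_traffic(piece_size: int) -> int:
    """Bytes fetched to restore the same piece from a full mirror copy."""
    return piece_size


# Hypothetical example: 64 MiB pieces, N = 10.
MIB = 1024 * 1024
rain_bytes = reconstruction_traffic(64 * MIB, 10)
mirror_bytes = mirror_repair_traffic(64 * MIB)

assert rain_bytes == 640 * MIB
assert rain_bytes // mirror_bytes == 10  # N times the traffic of a mirror repair
```

Under these assumptions, the scheme trades storage overhead for repair traffic: the recovery code adds only 1/N of the data in storage, but rebuilding a single piece moves N pieces' worth of data across the network.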
Depending on the frequency and severity of data disruption events, the network traffic initiated by data reconstruction processes may cause network congestion and other issues. For extremely large data sets, such as those generated by Internet-scale applications (e.g., social networking systems, search engines, web services providers, etc.), handling the traffic between data storage locations may be very expensive.