Reliable backup of data is an important aspect of any computing system where loss of, or lack of access to, data would be detrimental to the system. For a backup system to be effective, at least one replica of the data should survive a failure, or data-destroying event, so that the data can be recovered and/or readily accessed. Such failures may happen as a result of catastrophic events (e.g., terrorist attacks and military actions), natural disasters (e.g., hurricanes and earthquakes), large-scale correlated network failures (e.g., routing protocol failures, DOS attacks causing congestion, and worms), viruses, power blackouts, power surges, and other similar events. To survive such events, data should be replicated on nodes that are unlikely to be affected by concurrent failures (i.e., failures affecting multiple system nodes simultaneously).
Adding to the problem is the fact that information technology systems today are much more interconnected and interdependent and, as a result, may more frequently be impacted simultaneously by the same failures. At the same time, the number of types of failures that can impact system data availability has also increased. In assessing overall system and data availability, it is advantageous to be able to quantify the impact of multiple simultaneous failures, especially those that can be traced to common events, i.e., those that are correlated. In order to minimize the impact of failures on data availability, several protection mechanisms, or combinations thereof, can be employed, including data replication, erasure codes, etc. Deployment and operation of these protection mechanisms incur additional costs, such as software licensing, storage and networking hardware, communication bandwidth, additional computation cost, etc.
Currently employed solutions replicate data either on nodes that are geographically close to the source of the data (for example, within the same LAN, data center, or building site) or on remote, geographically diverse sites. The use of replicas in close proximity to the data source results in low replication communication cost but does not provide the geographic diversity required to survive catastrophic failures that may affect a larger geographic area. Conversely, while replication on remote sites may provide higher resiliency to catastrophes, large distances between data storage locations result in high costs (such as for equipment, infrastructure, and communication).
The term “distance,” with reference to node relationships, can refer to a conventional geographic separation between nodes, or to a more general definition of the relationship between nodes. This relationship encompasses factors such as compatibility and similarity between software, operating systems, networks, and more. Specifically, dissimilar operating systems are said to have a greater distance than similar operating systems. For instance, two nodes that both operate under a Windows operating system are more likely to suffer from the same system failure than are a node operating under Windows and a second node operating under Linux, with all other factors being equal.
Several theoretical solutions for increasing system availability, e.g., in the context of survivable storage systems, have been proposed. These include threshold schemes, such as Information Dispersal, Secret Sharing [A. Shamir, “How to share a secret”, Comm. ACM, Vol. 22, pp. 612-613, November 1979], Reed-Solomon Codes [J. S. Plank, “A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems”, Software Practice and Experience, Vol. 27, Issue 9, pp. 995-1012, 1997], and Tornado codes [John W. Byers, Michael Luby and Michael Mitzenmacher, “Accessing Multiple Mirror Sites in Parallel: Using Tornado Codes to Speed up Downloads”, In Proceedings of IEEE INFOCOM 1999, New York, N.Y.]. A common approach of these systems is to segment data into n pieces, of which any m can recover the data. By distributing the n pieces on different nodes, the system is able to survive failures of up to (n−m) nodes. Often the motivation for these systems is to survive denial of service (DOS) attacks, or intruders compromising individual systems. Typically these systems are designed assuming that each node fails independently; this assumption underestimates the probability that multiple nodes will fail together and thus result in loss of data. Other known methods for providing failure resiliency also assume independent failures, or rely on ad-hoc schemes for mitigating the impact of both independent and correlated failures. Among them are peer-to-peer systems [I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, “Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications”, In Proceedings of SIGCOMM 2001, San Diego, Calif. and S. Iyer, A. Rowstron and P. Druschel, “SQUIRREL: A Decentralized, Peer-to-Peer Web Cache”, PODC 2002] that replicate content across multiple (peer) nodes.
However, the peer selection in these systems is essentially randomized, without any consideration of properties such as geographic distance, communication cost, or delay between different nodes. The nodes where data replication is performed could be located very far away (e.g., in different countries or on different continents). So, while replicating data on a random set of nodes selected by these methods could preserve the data in the event of a catastrophe, it is likely to incur very high communication costs and delays, and thus is not a dependably efficient method of replicating data.
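By way of illustration only, the effect of the independence assumption on an m-of-n threshold scheme can be sketched numerically. The following Python sketch is not drawn from any of the cited systems. The function names, the per-node failure probability p, and the single "common shock" of probability q that disables all n nodes at once are hypothetical modeling assumptions chosen to keep the example small.

```python
from math import comb


def loss_probability_independent(n: int, m: int, p: float) -> float:
    """Probability of data loss for an m-of-n scheme when each of the n
    nodes fails independently with probability p (assumed model).

    Data is lost when fewer than m pieces survive, i.e. when more than
    (n - m) nodes fail.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n - m + 1, n + 1)
    )


def loss_probability_common_shock(n: int, m: int, p: float, q: float) -> float:
    """Same scheme, plus a hypothetical correlated event of probability q
    that takes down all n nodes together (assumed model).

    With probability q everything is lost; otherwise the nodes fail
    independently as above.
    """
    return q + (1 - q) * loss_probability_independent(n, m, p)


if __name__ == "__main__":
    n, m, p, q = 4, 2, 0.1, 0.01
    print("independent only :", loss_probability_independent(n, m, p))
    print("with common shock:", loss_probability_common_shock(n, m, p, q))
```

Under these assumed values (n = 4, m = 2, p = 0.1), the independent model gives a loss probability of about 0.0037, whereas adding a correlated event with q = 0.01 raises it to roughly 0.0137. Even a rare common event can therefore dominate the loss probability that an independence-based design would predict.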
Existing solutions for achieving data availability do not jointly consider resiliency and replication cost. What is needed is a solution that achieves desired levels of data availability in failure recovery while considering jointly the resiliency requirements and replication costs.