1. Field of the Invention
This invention relates in general to information replication, and more particularly to selection of information replication locations based on a low probability of catastrophic concurrent failures and communication costs.
2. Description of the Related Art
Reliable backup of data is an important aspect of any computing system where loss of data would be detrimental to the system. For a backup system to be effective, at least one replica of the data should survive a failure, or data-destroying event, so that the data can be recovered. Such failures may happen as a result of catastrophic events (such as terrorist attacks), extreme weather phenomena, large-scale network failures, power blackouts, and other similar events. To survive such events, data should be replicated on nodes that are unlikely to be affected by concurrent failures (i.e., failures affecting multiple system nodes simultaneously).
Currently employed solutions replicate data either on nodes that are close to the data source (for example, within the same LAN or building site) or on remote, geographically diverse sites. The use of replicas in close proximity to the data source results in low replication cost but does not provide the geographic diversity required to survive catastrophic failures that may affect an entire geographic area. Conversely, while replication on remote sites may provide higher resiliency to catastrophes, large distances between data storage locations result in high costs (such as equipment, infrastructure, and communication costs).
Recently, methods that replicate content across multiple nodes have been proposed, particularly in the context of peer-to-peer networks. A common characteristic in peer-to-peer based solutions is that they select a random set of nodes (peers) where the content is placed, without any consideration for the geographic distance, communication cost, or delay between these nodes. The nodes where data replication is performed could be located very far away (e.g., across countries or continents). So, while selection of a random set of nodes to replicate data using these methods could be used to survive catastrophic events, it is likely to incur very high communication costs and delays, and thus is not a dependably efficient method of replicating data.
Existing solutions for achieving data availability do not jointly consider resiliency and communication cost. Furthermore, none of these solutions considers the impact of multiple, concurrent failures that may be caused by catastrophic events. A new solution that addresses both of these issues is therefore required.
What is needed is a solution that achieves desired levels of data availability in disaster recovery while jointly considering the resiliency requirements and replication costs. Furthermore, a solution is needed that factors in the impact and probability of multiple, concurrent failures that may be caused by catastrophic events.