There are many factors to consider when designing a distributed data storage system. Modern distributed storage systems provide varying levels of data replication and can provide a high degree of protection against data loss in the event of hardware failure. However, increasing the replication factor (RF) and reliability of a system can greatly increase its hardware and operating costs. Various mechanisms have been developed to reduce these costs while limiting the impact on competing requirements such as reliability and performance.
Traditionally, a data distribution algorithm is used to map data to one or more nodes of the storage system. This mapping can result in a single node sharing data with any number of other nodes, which can have a significant impact on the probability of data loss in the event of a hardware or software failure. Should a failure occur, the storage system may be required to restore and re-allocate at least some of the stored data from other nodes, and rebalancing the data among the remaining nodes can have a significant performance impact.
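To make the effect of the mapping concrete, the following is a minimal sketch of one possible data distribution algorithm and of how widely it causes nodes to share data. It is an illustration only, not the algorithm of any particular system: the choice of rendezvous (highest-random-weight) hashing, the function names place_replicas and sharing_peers, and the values of NUM_NODES, RF, and NUM_ITEMS are all assumptions introduced here.

```python
import hashlib
from itertools import combinations

# Hypothetical parameters chosen for illustration only.
NUM_NODES = 12    # nodes in the storage system
RF = 3            # replication factor
NUM_ITEMS = 2000  # data items to place


def place_replicas(key: str, num_nodes: int, rf: int) -> tuple:
    """Map a data item to `rf` distinct nodes.

    Assumed technique: rendezvous hashing. Each node gets a per-key hash
    score, and the `rf` highest-scoring nodes store the item.
    """
    def score(node: int) -> bytes:
        return hashlib.sha256(f"{key}:{node}".encode()).digest()

    chosen = sorted(range(num_nodes), key=score, reverse=True)[:rf]
    return tuple(sorted(chosen))


def sharing_peers(placements, num_nodes: int) -> dict:
    """For each node, collect the other nodes it shares at least one item with."""
    peers = {node: set() for node in range(num_nodes)}
    for replica_set in placements:
        for a, b in combinations(replica_set, 2):
            peers[a].add(b)
            peers[b].add(a)
    return peers


placements = [place_replicas(f"item-{i}", NUM_NODES, RF) for i in range(NUM_ITEMS)]
peers = sharing_peers(placements, NUM_NODES)
distinct_sets = set(placements)

print("distinct replica sets in use:", len(distinct_sets))
for node in range(NUM_NODES):
    print(f"node {node} shares data with {len(peers[node])} other nodes")
```

Under these assumptions, each distinct replica set is a combination of RF nodes whose simultaneous failure would lose at least one item, so a mapping that spreads data over many node combinations exposes the system to more data-losing failure patterns. The same spread also determines how many nodes take part in restoring data after a single node fails.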
Designers aim to meet performance, reliability, and cost constraints under normal operating conditions as well as under high-load and failure scenarios.