Distributed storage systems built from commodity computer systems can deliver high performance, availability, and scalability for new data-intensive applications at a fraction of the cost of monolithic disk arrays. Data is replicated across multiple instances of the distributed storage system at different geographical locations, thereby increasing availability and reducing network distance from clients.
Because of the overhead of managing many individual files for small objects, some systems aggregate multiple objects in a single aggregated store. When files are aggregated, however, compaction has to be performed periodically in order to reclaim the space occupied by deleted objects. Compactions are costly because they involve rewriting each of the remaining objects stored in an aggregated store. Due to these costs, compactions are typically performed in the background, preferably at times when resource usage is low. As a consequence, it is preferable for each storage cluster to schedule its own compactions independently.
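The cost of compaction can be illustrated with a minimal sketch. It assumes a hypothetical in-memory aggregated store (the class `AggregatedStore` and its methods are illustrative names, not part of any described system) in which deletion only removes an index entry, so space is reclaimed only when every live object is rewritten.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedStore:
    """Hypothetical aggregated store: many small objects packed into one blob."""

    data: bytearray = field(default_factory=bytearray)
    # Maps object id -> (offset, length) for live objects only.
    index: dict = field(default_factory=dict)

    def put(self, object_id: str, payload: bytes) -> None:
        # Append the payload and record where it lives in the blob.
        self.index[object_id] = (len(self.data), len(payload))
        self.data.extend(payload)

    def delete(self, object_id: str) -> None:
        # Deletion drops the index entry; the bytes remain in the blob.
        self.index.pop(object_id, None)

    def get(self, object_id: str) -> bytes:
        offset, length = self.index[object_id]
        return bytes(self.data[offset:offset + length])

    def compact(self) -> int:
        """Rewrite every live object into a fresh blob; return bytes reclaimed."""
        new_data = bytearray()
        new_index = {}
        for object_id, (offset, length) in self.index.items():
            new_index[object_id] = (len(new_data), length)
            new_data.extend(self.data[offset:offset + length])
        reclaimed = len(self.data) - len(new_data)
        self.data, self.index = new_data, new_index
        return reclaimed
```

In this sketch the work done by `compact` is proportional to the total size of the live objects, regardless of how little space is actually reclaimed, which is why compaction is best deferred to periods of low resource usage.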
On the other hand, high availability and low-latency object access require that aggregated stores be replicated, typically across two or more distinct geographic locations. Different replicas of the same store are normally compacted at different times, and therefore the replicas end up in different states after compaction. As a result, the replicas are no longer bitwise identical, which increases the complexity and cost of cross-replica integrity checks.
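The effect on integrity checks can be sketched as follows. The example assumes a simplified model in which a replica is a list of (object id, payload) records; the function names are hypothetical. A checksum over the raw bytes is cheap but stops matching once only one replica has been compacted, whereas a checksum over the live objects still matches but requires knowing which objects are deleted and parsing each record.

```python
import hashlib


def compact(records, deleted):
    """Return a copy of the replica with deleted objects removed."""
    return [(oid, payload) for oid, payload in records if oid not in deleted]


def physical_checksum(records):
    """Checksum over the raw stored bytes, as a bitwise comparison would see them."""
    raw = b"".join(payload for _, payload in records)
    return hashlib.sha256(raw).hexdigest()


def logical_checksum(records, deleted):
    """Checksum over live objects only, in a canonical (sorted) order."""
    digest = hashlib.sha256()
    for oid, payload in sorted(records):
        if oid in deleted:
            continue
        digest.update(oid.encode("utf-8"))
        digest.update(len(payload).to_bytes(8, "big"))
        digest.update(payload)
    return digest.hexdigest()


records = [("a", b"one"), ("b", b"two"), ("c", b"three")]
deleted = {"b"}

replica_a = list(records)              # not yet compacted
replica_b = compact(records, deleted)  # already compacted

same_bytes = physical_checksum(replica_a) == physical_checksum(replica_b)
same_content = logical_checksum(replica_a, deleted) == logical_checksum(replica_b, deleted)
```

Here `same_bytes` is false while `same_content` is true: the two replicas hold the same live objects but can no longer be compared byte for byte.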
Finally, the process of identifying deleted objects (i.e., garbage collection) is challenging. In some implementations, garbage collection joins the list of known live objects with the directories of all data stores in order to find stored objects that are no longer live. This requires opening all of the stores, which is costly. For the majority of the data stores, the storage savings that could be realized by compaction are outweighed by the compaction costs. In fact, deleted objects are typically distributed non-uniformly across the data stores, so only a minority of the stores hold enough deleted data to justify compaction.
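The join described above can be sketched in a few lines. The example assumes each store directory is available as a mapping from object id to object size in bytes; the function names and the data layout are hypothetical and chosen only for illustration.

```python
def find_garbage(live_objects, store_directories):
    """Join the live-object list against every store directory.

    store_directories maps store id -> {object id: size in bytes}.
    Returns store id -> sorted list of stored objects that are no longer live.
    Note that every directory must be read, even for stores with no garbage.
    """
    garbage = {}
    for store_id, directory in store_directories.items():
        dead = sorted(oid for oid in directory if oid not in live_objects)
        if dead:
            garbage[store_id] = dead
    return garbage


def reclaimable_fraction(live_objects, directory):
    """Fraction of a store's bytes occupied by objects that are no longer live."""
    total = sum(directory.values())
    if total == 0:
        return 0.0
    dead = sum(size for oid, size in directory.items() if oid not in live_objects)
    return dead / total


live = {"a", "c", "d"}
directories = {
    "store-1": {"a": 10, "b": 30},  # "b" has been deleted
    "store-2": {"c": 5, "d": 5},    # nothing deleted
}

garbage = find_garbage(live, directories)
fractions = {sid: reclaimable_fraction(live, d) for sid, d in directories.items()}
```

With this input only `store-1` contains garbage, yet both directories had to be opened to establish that. A per-store reclaimable fraction such as the one computed here is one way to express whether compacting a given store is worth its cost.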