Data de-duplication is increasingly being adopted to reduce the data footprint of backup and archival storage, and more recently has become available for near-line primary storage controllers. Scale-out file systems are breaking down the silos between primary and archival storage by applying de-duplication to unified petabyte-scale data repositories spanning heterogeneous storage hardware. Cloud providers are also actively evaluating de-duplication for their heterogeneous commodity storage infrastructures and ever-changing customer workloads.
While the cost of data de-duplication in terms of time spent on de-duplicating and reconstructing data is reasonably well understood, the impact of de-duplication on data reliability may not be as well known, especially in large-scale storage systems with heterogeneous hardware. Since traditional de-duplication keeps only a single instance of redundant data, it magnifies the negative impact of losing a data chunk in chunk-based de-duplication, which divides a file into multiple chunks, or of losing a file in de-duplication using delta encoding, which stores the differences among files. Administrators and system architects have found understanding the data reliability of systems under de-duplication to be important but challenging.
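To make the amplification effect concrete, the following is a minimal sketch of chunk-based de-duplication, not a description of any particular system. It assumes fixed-size chunking and SHA-256 fingerprints; the names `CHUNK_SIZE`, `dedup_store`, and `files_hit_by_loss`, as well as the sample files, are hypothetical and chosen only for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, unrealistically small chunk size for illustration


def dedup_store(files):
    """Split each file into fixed-size chunks and keep one copy per unique chunk.

    Returns (store, recipes): store maps fingerprint -> chunk bytes, and
    recipes maps file name -> ordered list of chunk fingerprints.
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            store.setdefault(fp, chunk)  # only the first instance is stored
            recipe.append(fp)
        recipes[name] = recipe
    return store, recipes


def files_hit_by_loss(recipes, lost_fp):
    """Return the names of all files that reference the lost chunk."""
    return sorted(name for name, recipe in recipes.items() if lost_fp in recipe)


# Three sample files that share a common 4-byte header chunk.
files = {
    "a.txt": b"HEADbody-one",
    "b.txt": b"HEADbody-two",
    "c.txt": b"HEADtail",
}
store, recipes = dedup_store(files)

# Simulate losing the single stored copy of the shared chunk.
shared_fp = hashlib.sha256(b"HEAD").hexdigest()
affected = files_hit_by_loss(recipes, shared_fp)
print(f"{len(store)} unique chunks stored; losing one corrupts {len(affected)} files")
```

In this toy setting, eight chunk references collapse into five stored chunks, and the loss of the one shared chunk makes every file that references it unrecoverable. Without de-duplication, the same loss would damage only one file.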