Currently available enterprise storage systems are proprietary storage appliances that integrate the storage controller functions and the storage media into the same physical unit. This centralized model makes it harder to scale the capacity, performance, and cost of the storage system independently. Users can become tied to one expensive appliance without the flexibility to adapt it to different application requirements that may change over time. For small and medium-scale enterprises, this may require a large upfront capital cost. For larger enterprise datacenters, new storage appliances are added as the storage capacity and performance requirements increase. These appliances operate in silos and impose significant management overhead.
These enterprise storage systems may support data deduplication, or simply deduplication, which refers to removing duplicate data blocks from the storage system. Deduplication reduces space usage and hence the cost of the system. There are many approaches to deduplication: at the file level or at the block level, inline or offline, etc. Single-node deduplication systems are relatively easy to build because the metadata associated with deduplication is located in one place. Distributed deduplication across multiple storage system nodes is harder because the metadata may not be local. Deleting a data block requires coordination with multiple nodes to make sure that there are no local or remote references to the data block being deleted. Current deduplication implementations build a complete deduplication index. That is, a fingerprint (FP) of each block is generated and indexed. These indexes are often very large. Consequently, such an index either incurs an input/output (I/O) penalty for reading and verifying duplicates or requires a large amount of memory to verify that the key of a block exists in the storage system.
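For illustration only, the complete-index approach described above might be sketched as follows for a single node. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the class name `DedupStore`, the 4096-byte block size, the use of SHA-256 as the fingerprint function, and the in-memory dictionaries are all hypothetical choices made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


class DedupStore:
    """Hypothetical single-node store with a complete in-memory fingerprint index."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block bytes (each unique block stored once)
        self.refcount = {}  # fingerprint -> number of references to that block

    def write_block(self, data):
        # The fingerprint (FP) is assumed here to be the SHA-256 digest of the block.
        fp = hashlib.sha256(data).hexdigest()
        if fp in self.blocks:
            # Duplicate block: record another reference instead of storing it again.
            self.refcount[fp] += 1
        else:
            self.blocks[fp] = data
            self.refcount[fp] = 1
        return fp

    def write_file(self, data):
        # Split the data into fixed-size blocks and deduplicate each one.
        return [
            self.write_block(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)
        ]

    def delete_block(self, fp):
        # A block is reclaimed only when no references to it remain.
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            del self.refcount[fp]
            del self.blocks[fp]
```

In this sketch the index holds one entry per unique block, so its memory footprint grows with the amount of unique data stored. All references are also local to one process; in a distributed system the reference counts for a block could reside on other nodes, which is why deletion would require coordination.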