Data deduplication, also known as data optimization, is the act of reducing the physical amount of bytes of data which need to be stored on disk or transmitted across a network without compromising the fidelity or integrity of the original data. Data deduplication reduces the storage capacity needed to store data, and may therefore lead to savings in terms of storage hardware costs and data management costs. Data deduplication provides a solution to handling the rapid growth of digitally stored data.
Data deduplication may be performed according to one or more techniques to eliminate redundancy within and between persistently stored files. For instance, according to one technique, unique regions of data that appear multiple times in one or more files may be identified, and a single copy of those identified unique regions of data may be physically stored. References to those identified unique regions of data (also referred to as data “chunks”) may be stored that indicate the files, and the locations in the files, that include them. This technique is commonly referred to as single instancing. Compression of data may be performed in addition to single instancing. Other data reduction techniques may also be implemented as part of a data deduplication solution.
Difficulties exist in managing data stored according to data de-duplication techniques. For example, due the data fragmentation imposed by data de-duplication, latency may exist in accessing files stored according to de-duplication. This latency limits the adoption of data deduplication solutions, especially on primary storage data, where users expect seamless, fast access to files. Furthermore, data deduplication algorithms may run on a dedicated appliance or on the device that stores and serves data (e.g., a file server). In the case of a file server, data deduplication may not be the primary function of the device, and thus data deduplication techniques may need to be efficient so as not to over consume device resources (e.g., memory, input/output (I/O) mechanisms, central processing unit (CPU) capacity, etc.). Still further, because the quantity of digital data is growing at a very high rate, the size of storage devices (e.g., storage disks) and the total storage capacity associated with computing devices has to grow, causing difficulties with data deduplication techniques that do not scale well with increasing amounts of storage.