Data deduplication, also known as data optimization, is the act of reducing the number of bytes of data that need to be stored on disk or transmitted across a network without compromising the fidelity or integrity of the original data. Data deduplication reduces the storage capacity needed to store data, and may therefore lead to savings in terms of storage hardware costs and data management costs. Data deduplication provides a solution for handling the rapid growth of digitally stored data.
Data deduplication may be performed according to one or more techniques to eliminate redundancy within and between persistently stored files. For instance, according to one technique, unique regions of data that appear multiple times in one or more files may be identified, and a single copy of those identified unique regions of data may be physically stored. References to those identified unique regions of data (also referred to as data “chunks”) may be stored to indicate the files, and the locations in the files, that include them. This technique is commonly referred to as single instancing. Compression of data may be performed in addition to single instancing. Other data reduction techniques may also be implemented as part of a data deduplication solution.
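For illustration only, single instancing combined with compression may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation described in this specification: the fixed 4096-byte chunk size, the use of SHA-256 digests to identify chunks, the use of zlib for compression, and the names ChunkStore, add_file, and stored_bytes are all hypothetical choices made for the example.

```python
import hashlib
import zlib

# Assumed fixed chunk size; a real system may use variable-size chunking.
CHUNK_SIZE = 4096


class ChunkStore:
    """Hypothetical store that keeps one compressed copy of each distinct chunk."""

    def __init__(self):
        self.chunks = {}  # chunk digest -> compressed chunk bytes
        self.files = {}   # file name -> ordered list of chunk digests (references)

    def add_file(self, name, data):
        refs = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Single instancing: physically store a chunk only the first time it is seen.
            if digest not in self.chunks:
                self.chunks[digest] = zlib.compress(chunk)
            # The position of the reference in the list records where the chunk
            # occurs in the file.
            refs.append(digest)
        self.files[name] = refs

    def stored_bytes(self):
        """Total bytes physically held for chunk data."""
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore()
store.add_file("a", b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE)
store.add_file("b", b"B" * CHUNK_SIZE + b"C" * CHUNK_SIZE)
```

In this sketch the two files together contain four chunks, but the chunk of repeated "B" bytes is shared, so only three chunks are physically stored and both files hold a reference to the shared one.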
Optimized data in this specification refers to data that has been optimized, or deduplicated, by one or more data deduplication techniques, such as single instancing of chunks and compression. Optimized streams refer to streams that have been deduplicated; in other words, streams whose data has been optimized using data deduplication techniques.
Once data is optimized, the data may be accessed by reversing the effects of the optimization (i.e., de-optimizing the optimized data), for example by performing an inverse deduplication operation and/or a decompression operation with respect to the optimized data. However, de-optimization delays access to the data, and the greater the amount of data, the longer the latency. Moreover, such latency may occur each time the data is accessed unless a de-optimized version of the data is stored for access on a storage device. Furthermore, de-optimization often consumes substantial resources (e.g., memory, central processing unit (CPU), disk I/O, etc.) of a device, which may negatively affect a main workload that is running on the device. Accordingly, frequent de-optimization may result in relatively inefficient utilization of the device's resources.
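The cost of repeated de-optimization, and the effect of keeping a de-optimized version available, may be illustrated with the following minimal sketch. It is an assumption-laden example rather than a described implementation: the chunking scheme (fixed 4096-byte chunks identified by SHA-256 digests and compressed with zlib) and the names optimize, deoptimize, and CachingReader are hypothetical, and the in-memory cache merely stands in for a de-optimized copy stored on a storage device.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size for this example


def optimize(data, store):
    """Deduplicate and compress data into store; return the ordered chunk references."""
    refs = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:
            store[key] = zlib.compress(chunk)
        refs.append(key)
    return refs


def deoptimize(refs, store):
    """Reverse the optimization: decompress each referenced chunk and reassemble."""
    return b"".join(zlib.decompress(store[key]) for key in refs)


class CachingReader:
    """Hypothetical reader that de-optimizes a stream once and reuses the result."""

    def __init__(self, store):
        self.store = store
        self.cache = {}       # stream name -> de-optimized bytes
        self.deopt_calls = 0  # number of times de-optimization actually ran

    def read(self, name, refs):
        if name not in self.cache:
            self.deopt_calls += 1
            self.cache[name] = deoptimize(refs, self.store)
        return self.cache[name]


store = {}
original = b"header" + b"X" * 10000 + b"trailer"
refs = optimize(original, store)
reader = CachingReader(store)
first = reader.read("stream", refs)
second = reader.read("stream", refs)
```

Here the second read is served from the cached copy, so the decompression and reassembly work is performed only once. The trade-off is that the cached copy consumes the storage that optimization was meant to save.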
Storage virtualization is the act of abstracting logical storage from physical storage, such that data that is stored on the physical storage may be accessed via the logical storage without regard to the structure of the physical storage. For example, a host device may execute one or more virtual machine instances. In accordance with this example, the host device may present emulated disks to the virtual machine instance(s). The emulated disks are stored as files on the physical storage of the host device. In another example, a host device may execute a driver that presents a virtual disk to the operating system of the host device. In accordance with this example, the virtual disk is presented by mounting a file that is stored on the physical storage of the host device. Files that are mounted to emulate disks (i.e., to present virtual disks) are referred to as virtualized storage files.
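As a rough illustration of how a file may back an emulated disk, the following minimal sketch maps logical sector numbers onto byte offsets in an ordinary file. It is a simplification made under stated assumptions: the 512-byte sector size, the flat (header-less) file layout, and the names VirtualDisk, read_sector, and write_sector are hypothetical, and real virtualized storage file formats are typically more elaborate.

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed sector size for this example


class VirtualDisk:
    """Hypothetical emulated disk whose sectors live in a single backing file."""

    def __init__(self, path, num_sectors):
        self.path = path
        self.num_sectors = num_sectors
        # Create the backing file at its full logical size.
        with open(path, "wb") as f:
            f.truncate(num_sectors * SECTOR_SIZE)

    def _check(self, lba):
        if not 0 <= lba < self.num_sectors:
            raise IndexError("sector out of range")

    def write_sector(self, lba, data):
        self._check(lba)
        if len(data) != SECTOR_SIZE:
            raise ValueError("data must be exactly one sector long")
        with open(self.path, "r+b") as f:
            # Logical sector number -> byte offset in the backing file.
            f.seek(lba * SECTOR_SIZE)
            f.write(data)

    def read_sector(self, lba):
        self._check(lba)
        with open(self.path, "rb") as f:
            f.seek(lba * SECTOR_SIZE)
            return f.read(SECTOR_SIZE)


backing_dir = tempfile.mkdtemp()
disk = VirtualDisk(os.path.join(backing_dir, "disk.img"), num_sectors=8)
disk.write_sector(3, b"\x01" * SECTOR_SIZE)
```

A consumer of the emulated disk addresses sectors only; it does not need to know that the sectors reside in a file on the physical storage of the host device. Conversely, the host device sees only the backing file, not any files that a guest file system may have placed inside it.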
Systems that utilize data optimization may be characterized by relatively low hardware (e.g., storage) costs and/or relatively low data management (e.g., backup) costs. However, when data optimization is performed in a virtualized environment, issues commonly arise. For example, if the data in a virtualized storage file is fully optimized, the latency that is associated with accessing the data may unduly degrade the performance of a host device, especially if regions of the virtualized storage file are frequently accessed. In another example, it may not be desirable to optimize some files that are stored in a virtual disk due to the sensitivity of the files and/or the type of data that they include. In a non-virtualized environment, this issue may be resolved by applying a policy on a per-file basis. However, in a virtualized environment, the files are located inside a virtualized storage file and are therefore not visible to the host device. A host device typically has no way to know which regions of a virtualized storage file should not be optimized.