Information technology (IT) organizations, both in the cloud and in enterprises, have to deal with an astonishing growth of data volume driven mostly by new-generation applications and big-data use cases. Such growth pushes the scalability limits, in terms of both capacity and performance, of the most sophisticated storage platforms available. As such, enterprise storage systems use a number of technologies to reduce the footprint that data has on storage devices.
Data deduplication and cloning are two classes of technologies used to reduce the physical footprint of data. Data deduplication is a technique for eliminating duplicate copies of repeating data in order to improve storage utilization. In data deduplication, unique chunks of data are identified and stored. Incoming chunks of data to be stored may be compared to stored chunks of data, and if a match occurs, the incoming chunk is replaced with a small reference that points to the stored chunk (deduplicated data). Given that the same chunk of data may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored can be greatly reduced.
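The chunk-matching step described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and uses a SHA-256 digest as the chunk fingerprint. These choices, along with the names `DedupStore`, `write_stream`, and `read_stream`, are hypothetical and serve only for illustration; they are not drawn from any particular storage platform.

```python
import hashlib


class DedupStore:
    """Illustrative in-memory chunk store (hypothetical, not a real product API)."""

    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes, stored once
        self.refcount = {}   # fingerprint -> number of references to the chunk

    def put(self, chunk: bytes) -> str:
        # Assumption: a SHA-256 digest is used as the chunk fingerprint.
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in self.chunks:
            # First occurrence: store the unique chunk.
            self.chunks[fp] = chunk
        # Every occurrence, new or duplicate, is recorded as a reference.
        self.refcount[fp] = self.refcount.get(fp, 0) + 1
        return fp

    def get(self, fp: str) -> bytes:
        return self.chunks[fp]


def write_stream(store: DedupStore, data: bytes, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks and return one reference per chunk."""
    return [store.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_stream(store: DedupStore, refs: list) -> bytes:
    """Rebuild the original data by following the references."""
    return b"".join(store.get(fp) for fp in refs)
```

In this sketch, writing three identical 4096-byte chunks followed by one different chunk yields four references but only two stored chunks, which is the space saving that deduplication targets.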
Data deduplication and cloning allow a block of data on a physical device to be shared by more than one logical storage entity, such as a file or a volume. Despite their similarities, they are usually treated as two separate approaches and are often designed and offered as completely separate features, even on the same storage platform, and they often do not work well together. For online storage, deduplication is performed in a way that is transparent to the end user, typically as a best-effort background task. It is considered an approach appropriate for “cold” data. Data cloning, on the other hand, works well for “hot” data. However, it requires explicit management by the user, and the effectiveness of sharing is reduced over time.
Another challenge with storage platforms relates to scalability and performance. A new generation of object-based storage systems aims at addressing this challenge, both for online use cases and for archival purposes. Each data object typically includes the data itself, a variable amount of metadata (attributes), and a globally unique identifier. Such a system offers a simple read-write interface for data and metadata. In principle, these systems can offer unlimited scalability because clients can access any number of data objects in parallel without having to go through a single data path funnel, as is the case with traditional network file systems.
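The data object described above, with its data, metadata, and globally unique identifier, can be sketched as follows. This is a minimal in-memory illustration only; the names `DataObject` and `ObjectStore`, the method names, and the use of a random UUID as the identifier are all assumptions made for the example rather than features of any specific system.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """Hypothetical data object: payload, attributes, and a unique identifier."""
    data: bytes = b""
    metadata: dict = field(default_factory=dict)
    # Assumption: a random UUID stands in for the globally unique identifier.
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ObjectStore:
    """Illustrative store exposing a simple read-write interface."""

    def __init__(self):
        self._objects = {}  # object_id -> DataObject

    def create(self, data: bytes = b"", **metadata) -> str:
        obj = DataObject(data=data, metadata=dict(metadata))
        self._objects[obj.object_id] = obj
        return obj.object_id

    def read(self, object_id: str) -> bytes:
        return self._objects[object_id].data

    def write(self, object_id: str, data: bytes) -> None:
        self._objects[object_id].data = data

    def get_attr(self, object_id: str, key: str):
        return self._objects[object_id].metadata[key]

    def set_attr(self, object_id: str, key: str, value) -> None:
        self._objects[object_id].metadata[key] = value
```

Because each object is addressed directly by its identifier, a client in this sketch never traverses a shared directory hierarchy, which loosely mirrors why such systems avoid a single data path funnel.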
However, conventional architectures make deduplication and cloning challenging. Existing systems that provide data space efficiency are either centralized or perform deduplication only within individual devices or groups of devices. Such localized deduplication results in much lower space efficiency. In addition, conventional architectures do not efficiently integrate deduplication and cloning.