Information technology is changing rapidly and now forms an invisible layer that touches nearly every aspect of business and social life. An emerging computing model known as cloud computing addresses the explosive growth of Internet-connected devices and complements the increasing presence of technology in today's world. Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service.
Cloud computing is massively scalable, provides a superior user experience, and is characterized by new, Internet-driven economics. From one perspective, cloud computing involves storing and processing business data inside a cloud, which is a mesh of interconnected data centers, computing units, and storage systems spread across geographies.
A public storage cloud that stores customer data and spans multiple geographies commonly holds a large number of redundant files across storage hubs in different locations and countries. To improve the efficiency of a cloud storage business, a vendor typically applies data deduplication to address this redundancy.
Data deduplication is a storage concept in which redundant data is eliminated to significantly shrink storage requirements and improve bandwidth efficiency. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. This single copy is called a master copy, and each deleted copy (referred to as a secondary copy) is replaced with a reference pointer that points to the master copy. Some data deduplication techniques deduplicate the data in a cloud spread across many storage hubs located in different data centers across heterogeneous storage devices.
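The master copy and reference pointer relationship described above can be sketched as follows. This is a minimal illustration, not a description of any particular product: the function names `deduplicate` and `read`, the use of a SHA-256 digest to detect identical content, and the in-memory dictionaries are all assumptions made for the example.

```python
import hashlib


def deduplicate(files):
    """Deduplicate a mapping of path -> bytes (hypothetical in-memory model).

    Returns (store, pointers):
      store    maps each master copy's path to its data
      pointers maps each secondary copy's path to its master copy's path
    """
    master_by_digest = {}  # content digest -> path of the master copy
    store = {}
    pointers = {}
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        master = master_by_digest.get(digest)
        if master is None:
            # First time this content is seen: it becomes the master copy.
            master_by_digest[digest] = path
            store[path] = data
        else:
            # Duplicate content: keep only a pointer to the master copy.
            pointers[path] = master
    return store, pointers


def read(path, store, pointers):
    """Read a file, following its reference pointer if it is a secondary copy."""
    return store[pointers.get(path, path)]
```

For instance, given three files of which two hold identical bytes, `deduplicate` would keep two entries in `store` and one entry in `pointers`, and `read` on the secondary path would return the master copy's data.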
Deduplication can be accomplished using post deduplication and/or inline deduplication. In the case of post deduplication, there is no overhead of deduplication on in-band traffic. Data is stored on devices as it arrives, without any concern for deduplication during this initial storing. A post deduplication daemon runs some time after the initial storing, scans devices for duplicate copies, and attempts to remove redundant copies. In the case of inline deduplication, deduplication is done on in-band traffic, that is, essentially in real time during the initial storing of the data. For example, for an incoming write request, a search is performed to determine whether the given data item already exists in the system. If an already-existing copy (e.g., a duplicate file) is found in the system, the write operation for the incoming write request is avoided and, instead, a data item pointer is created to point to the existing copy.
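The difference between the two approaches can be sketched with a small in-memory model. The class `Store` and its methods `write_inline`, `write_raw`, and `post_dedup_scan` are hypothetical names chosen for this illustration, and content matching by SHA-256 digest is an assumption rather than something the text specifies.

```python
import hashlib


class Store:
    """Hypothetical in-memory store illustrating inline vs. post deduplication."""

    def __init__(self):
        self.blocks = {}    # name -> data physically stored
        self.pointers = {}  # name -> name of the existing copy it points to
        self.index = {}     # content digest -> name of the stored copy

    def write_inline(self, name, data):
        """Inline path: search for an existing copy before writing.

        Returns True if data was physically written, False if the write
        was avoided and a pointer was created instead.
        """
        digest = hashlib.sha256(data).hexdigest()
        existing = self.index.get(digest)
        if existing is not None:
            self.pointers[name] = existing
            return False
        self.blocks[name] = data
        self.index[digest] = name
        return True

    def write_raw(self, name, data):
        """Post path, step 1: store data as it arrives, with no dedup work."""
        self.blocks[name] = data

    def post_dedup_scan(self):
        """Post path, step 2: scan later and replace duplicates with pointers.

        Returns the number of redundant copies removed.
        """
        removed = 0
        for name in list(self.blocks):
            digest = hashlib.sha256(self.blocks[name]).hexdigest()
            existing = self.index.get(digest)
            if existing is None or existing == name:
                self.index[digest] = name
            else:
                del self.blocks[name]
                self.pointers[name] = existing
                removed += 1
        return removed
```

In this sketch the inline path pays the lookup cost on every write, while the post path defers that cost to a later scan, which mirrors the trade-off described above.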
Deduplication can be performed at different levels of granularity within a computing environment, such as at the device level, storage pool level, and storage system level. At the device level, the scope of duplicate copy identification is limited to a single device. Storage pool level deduplication is applied to a collection of devices of the same type, which can be in a single storage pool or in storage pools of a homogeneous type. Storage system level deduplication applies to multiple storage device pools with devices of similar or heterogeneous type, with the scope of duplicate copy identification being the overall system.
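One way to picture the three levels is as progressively wider scopes within which two copies are compared. The sketch below is illustrative only: the `Copy` record, the `scope_key` helper, and the level names `"device"`, `"pool"`, and `"system"` are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    digest: str  # content fingerprint (hypothetical)
    device: str  # device holding this copy
    pool: str    # storage pool the device belongs to


def scope_key(copy, level):
    """Return the scope within which duplicates are identified at a level."""
    if level == "device":
        return (copy.pool, copy.device)
    if level == "pool":
        return (copy.pool,)
    if level == "system":
        return ()
    raise ValueError(f"unknown level: {level}")


def count_redundant(copies, level):
    """Count copies that duplicate an earlier copy within the same scope."""
    seen = set()
    redundant = 0
    for c in copies:
        key = (scope_key(c, level), c.digest)
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant
```

With four identical copies spread over three devices in two pools, this model finds more redundant copies as the scope widens from device to pool to system.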
Data deduplication techniques that address the redundant data issue by keeping a single master copy and deleting other redundant copies are not designed to intelligently select a storage drive on which to keep the master copy. Instead, such deduplication systems simply retain the master copy at the physical storage location where the first occurrence of one of the duplicate files was detected. If this location happens to be on relatively unreliable storage, then the master copy may later become unavailable due to hardware failure or other factors, disrupting data availability in the storage cloud.
For example, in device level deduplication, two copies of the same data may be stored respectively at two different sectors of a disk, e.g., Copy1 stored at an inner disk sector and Copy2 stored at an outer disk sector. In the case where the deduplication mechanism identifies Copy1 first, it will delete Copy2 and replace Copy2 with a pointer to Copy1. However, disk operation performance is usually higher on outer sectors of a disk compared to inner sectors. By saving the master copy (e.g., Copy1) on an inner sector, a user accessing Copy2 may suffer degradation in performance since they are actually accessing a file stored at an inner sector rather than a file stored at an outer sector.
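The first-occurrence behavior in the Copy1/Copy2 example can be sketched as below. The function names `first_occurrence_master` and `effective_location` and the `"inner"`/`"outer"` labels are hypothetical; the point of the sketch is only that the master's location follows detection order rather than any property of the storage location.

```python
def first_occurrence_master(copies):
    """Pick the master among identical copies by detection order alone.

    copies: list of (name, location) tuples in the order they were detected.
    Returns (master, pointers), where pointers maps each secondary copy's
    name to the master copy's name.
    """
    master = copies[0]
    pointers = {name: master[0] for name, _ in copies[1:]}
    return master, pointers


def effective_location(name, master, pointers):
    """Location actually read when a user accesses the given copy.

    Every copy in the group resolves to the master after deduplication,
    so the master's location is returned for any known name.
    """
    if name != master[0] and name not in pointers:
        raise KeyError(name)
    return master[1]
```

If Copy1 on an inner sector is detected first, a user accessing Copy2 is served from the inner sector; reversing the detection order would place the master on the outer sector instead.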
As another example, in storage pool level deduplication, the deduplication mechanism does not consider the distribution of master copies across storage devices. By chance, one storage device can end up storing a disproportionately large number of master copies and become overloaded compared to other storage devices in the pool. Moreover, the deduplication mechanism does not consider the current health of the various available storage devices in the pool. As such, a master copy may be stored on a device that is in relatively poor health and likely to fail.
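The load imbalance described above can be made concrete by counting where master copies land when the first detected copy in each duplicate group is retained. The function name `masters_per_device` and the device names in the usage note are hypothetical.

```python
from collections import Counter


def masters_per_device(duplicate_groups):
    """Count master copies per device under first-occurrence retention.

    duplicate_groups: list of lists of device names; each inner list gives
    the devices holding identical copies, in detection order. The first
    entry of each group is where the master copy stays.
    """
    return Counter(group[0] for group in duplicate_groups)
```

For example, if a device named `devA` happens to be scanned first in three of four duplicate groups, it ends up holding three master copies while other devices in the pool hold one or none, even though nothing about `devA` makes it a better home for them.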
Storage system level deduplication can magnify the above problems associated with performance, load distribution, and health. Moreover, storage system level deduplication can suffer quality of service (QoS) issues. For example, a storage system may include a relatively low reliability first storage (e.g., a JBOD (Just a Bunch Of Disks) controller) and a relatively high reliability second storage (e.g., a RAID (Redundant Array of Independent Disks) controller). QoS requirements may mandate storage in a RAID controller. However, a deduplication mechanism that does not differentiate between the JBOD and RAID storage may save the master copy at the JBOD storage instead of the RAID storage. In such a case, the storage provider may not meet desired QoS levels, and/or clients accessing a copy designated on the RAID controller may suffer degraded performance.
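The QoS problem can be sketched as a check over duplicate groups: a group is flagged when some copy was designated on the required tier but the retained master, chosen by detection order, sits on a different tier. The function name `qos_violations`, the tier labels `"RAID"` and `"JBOD"`, and the tuple layout are assumptions for this illustration.

```python
def qos_violations(duplicate_groups, required_tier):
    """Find groups whose first-occurrence master misses a required tier.

    duplicate_groups: list of lists of (name, tier) tuples in detection
    order; the first tuple of each group is the retained master copy.
    Returns the names of master copies that violate the requirement.
    """
    violations = []
    for group in duplicate_groups:
        master_name, master_tier = group[0]
        designated_on_required = any(tier == required_tier for _, tier in group)
        if designated_on_required and master_tier != required_tier:
            violations.append(master_name)
    return violations
```

In this model, a group detected in the order JBOD copy, then RAID copy is reported as a violation, while the same group detected in the opposite order is not, which shows that compliance depends on scan order rather than on the QoS requirement.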