Information technology is changing rapidly and now forms an invisible layer that increasingly touches nearly every aspect of business and social life. An emerging computer model known as cloud computing addresses the explosive growth of Internet-connected devices, and complements the increasing presence of technology in today's world. Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service.
Cloud computing is massively scalable, provides a superior user experience, and is characterized by new, Internet-driven economics. From one perspective, cloud computing involves storage and execution of business data inside a cloud, which is a mesh of interconnected data centers, computing units, and storage systems spread across geographies.
With the advent of cloud computing, concepts such as storage clouds have emerged. A storage cloud is a large network of storage that can be shared by customers without the need for each customer to manage the storage infrastructure. The storage cloud provider usually has a single large storage space and keeps data from all of its customers in the same place, which leads to the concept of multi-tenancy and a multi-tenant environment. Usually this storage space is shared by the entire customer base on that cloud.
Data deduplication comprises processes to eliminate redundant data. In a deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. In certain situations, an index of all data is still retained should that data ever be required. Deduplication reduces the required storage capacity because only unique data is stored. Data deduplication can generally operate at the file level or the data block level. File-level deduplication eliminates duplicate files, but it is not a very efficient means of deduplication because files that differ by even a few bytes are treated as distinct. Block-level deduplication looks within a file and saves only the unique instances of each block. Each chunk of data is processed using a hash algorithm such as MD5 (Message-Digest Algorithm 5) or SHA-1 (Secure Hash Algorithm 1). This process generates an effectively unique number for each chunk, which is then stored in an index. When a file is updated, only the changed data is saved. That is, when only a few bytes of a document or presentation are changed, only the changed blocks or bytes are saved, and the changes do not constitute an entirely new file. Therefore, block-level deduplication saves more storage space than file-level deduplication.
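The block-level process described above can be illustrated with a minimal sketch. It assumes fixed-size blocks and an in-memory index; the `BlockStore` class, its method names, and the 4-byte `BLOCK_SIZE` are hypothetical choices made only to keep the example readable, and SHA-1 is used simply because it is one of the hash algorithms named above.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, deliberately tiny so the demo is easy to follow


class BlockStore:
    """Toy block-level deduplicating store (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # index: digest -> block bytes, one entry per unique block
        self.files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        """Store data under name and return how many new blocks were written."""
        digests = []
        new_blocks = 0
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha1(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks not already in the index consume new space.
                self.blocks[digest] = block
                new_blocks += 1
            digests.append(digest)
        self.files[name] = digests
        return new_blocks

    def get(self, name):
        """Reassemble a file from its block references."""
        return b"".join(self.blocks[d] for d in self.files[name])


store = BlockStore()
first = store.put("report_v1", b"AAAABBBBCCCCDDDD")   # four distinct blocks
second = store.put("report_v2", b"AAAABBBBXXXXDDDD")  # one block changed
print(first, second, len(store.blocks))
```

In this sketch the second version of the file adds only the one changed block to the index, whereas file-level deduplication would have stored both versions in full because their contents differ.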
Many file systems and storage solutions provide the facility to mark documents and files as immutable, i.e., the content of the files and/or the files themselves cannot be deleted or modified for a given amount of time or until some other criterion is met. Typically, such requirements come from compliance-governed agencies and industries, such as government agencies and the health care sector. Such agencies and industries commonly rely on the telecom industry to help ensure compliance with regulations like the Sarbanes-Oxley Act (SOX), the Health Insurance Portability and Accountability Act (HIPAA), the Federal Financial Institutions Examination Council (FFIEC), etc., which mandate immutable persistence of a given set of files.
For example, under HIPAA's Security Rule (e.g., the Technical Safeguard section), the security logs recording incidents are supposed to be preserved for six years in an immutable fashion. This indicates that any file marked immutable is of high importance or of critical value (at least for the given period of time), and hence it is vital to preserve it reliably. The telecom industry has to ensure compliance with these regulations by following the rules for maintaining communication records, such as the voice calls made and text messages sent. The telecom industry in turn exploits the immutable file feature of its infrastructure to deal with the record immutability requirements of these regulations. This feature is also supported in the IBM® General Parallel File System™ (GPFS™), which is a strategic clustered file system used in many storage offerings and solutions. (IBM, General Parallel File System, and GPFS are trademarks of International Business Machines Corporation registered in many jurisdictions worldwide.)
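The time-based immutability described above can be sketched as a retention check that rejects modification and deletion until a retention date has passed. This is only an illustration under stated assumptions: the `RetainedFile` class, the `ImmutableFileError` exception, and the dates are hypothetical, and the sketch does not represent the interface of GPFS or of any other file system.

```python
from datetime import date


class ImmutableFileError(Exception):
    """Raised when a retained file would be modified or deleted too early."""


class RetainedFile:
    """Toy file object with a retention date (illustrative only)."""

    def __init__(self, name, content, retain_until):
        self.name = name
        self.content = content
        self.retain_until = retain_until  # first day on which changes are allowed
        self.deleted = False

    def is_immutable(self, today):
        return today < self.retain_until

    def write(self, content, today):
        if self.is_immutable(today):
            raise ImmutableFileError(f"{self.name} is immutable until {self.retain_until}")
        self.content = content

    def delete(self, today):
        if self.is_immutable(today):
            raise ImmutableFileError(f"{self.name} is immutable until {self.retain_until}")
        self.deleted = True


# A security log created on 2020-01-01 and retained for six years.
log = RetainedFile("security.log", b"incident data", retain_until=date(2026, 1, 1))

try:
    log.delete(today=date(2023, 6, 1))  # still inside the retention period
    blocked = False
except ImmutableFileError:
    blocked = True

log.delete(today=date(2026, 1, 1))      # retention period has ended
print(blocked, log.deleted)
```

Passing `today` explicitly, rather than reading the system clock, is a design choice made here only so that the behavior of the sketch is reproducible.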
When data deduplication is done at the file level, the duplicate copies of a file are deleted, only a single copy is maintained, and all other references point to this single copy. However, this conflicts with the basic premise of immutability, namely that the files cannot be edited or deleted. Thus, the deduplication process is not able to delete redundant copies of immutable files. For example, when there is a need to maintain immutable records of calls and text messages, which typically involve more than one party, the telecom industry typically maintains plural copies of the same file in order to comply with the immutability requirements, even though this consumes extra data storage space and increases the management and data protection overhead. In a particular example of a conference call among ten participants in which the call record has a storage size of 1 GB, the telecom provider stores the same record for each participant and maintains immutability over all of the records, thereby consuming a total of 10 GB of space. If the files were not immutable, the deduplication process could delete nine copies of the file and maintain a single copy occupying just 1 GB of space, reducing the effective used storage space by 9 GB. However, in some situations, the immutable property of the files prevents such deduplication.
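The storage arithmetic of the conference call example can be reproduced with a short sketch of file-level deduplication that is not permitted to delete immutable copies. The `FileRecord` type and the `space_used_gb` function are hypothetical names introduced only for this illustration, and sizes are tracked as whole gigabytes rather than measured from real data.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FileRecord:
    owner: str
    content: bytes
    size_gb: int
    immutable: bool


def space_used_gb(records):
    """Space used after file-level deduplication that never deletes immutable copies."""
    kept = set()  # digests of content that already has a stored copy
    used = 0
    # Visit immutable records first so mutable duplicates can reference them.
    for rec in sorted(records, key=lambda r: not r.immutable):
        digest = hashlib.sha1(rec.content).hexdigest()
        if rec.immutable or digest not in kept:
            used += rec.size_gb  # this copy must physically remain on disk
        kept.add(digest)
    return used


call = b"conference call record"
immutable_copies = [FileRecord(f"participant{i}", call, 1, True) for i in range(10)]
mutable_copies = [FileRecord(f"participant{i}", call, 1, False) for i in range(10)]

print(space_used_gb(immutable_copies))  # every immutable copy is retained
print(space_used_gb(mutable_copies))    # duplicates collapse to a single copy
```

With ten immutable copies of a 1 GB record the sketch reports 10 GB in use, and with ten mutable copies it reports 1 GB, which matches the 9 GB difference described above.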