The present invention relates generally to the field of computer data storage, and more particularly to deduplication in primary storage.
Data deduplication refers to detecting and eliminating redundant data in storage systems without affecting the accuracy and integrity of the original data. Deduplication reduces the physical storage requirements of a system and can reduce the amount of data transmitted across a network.
Deduplication can result in significant hardware savings by avoiding the cost of additional storage capacity, reducing or eliminating power consumption by additional storage devices, and removing the cost of additional data management. Deduplication exploits redundancy in the data to be written to storage, and some types of data, such as backup files and email, can exhibit very high ratios of duplication.
Deduplication can be performed in two modes, inline and off-line. In inline deduplication, the data is deduplicated as it is received into primary memory and before it is written to disk. In off-line (also called out-of-line or post-process) deduplication, the data is written to disk first and deduplicated afterward.
Deduplication segments larger blocks of data to be written to storage into smaller units of data referred to as chunks. A “continuous write” can be a familiar unit of data such as a file, an image, a database table or an email, and comprises multiple data chunks. Chunks usually range in size from 4 KB to 512 KB, and each chunk corresponds to a logical block address (LBA), which identifies a location in primary storage. A “write” is an operation of storing data to an address within primary or secondary storage. A write to an LBA at a specific time (T1) assigns a chunk of data to the corresponding address of the LBA in primary memory. Similarly, a write to secondary or physical storage stores the chunk of data to a corresponding address in a secondary storage device.
Therefore, each continuous write can correspond to multiple LBAs, and each LBA also corresponds to a physical block address (PBA) by use of an LBA-to-PBA (L2P) mapping index. The PBA identifies a location in secondary storage where the data for the LBA is written to a memory storage device, such as a disk drive of a computing device. A hashing function is performed on the content of the chunk to produce a near-unique fingerprint, which is compared to an index of previously stored fingerprint-to-PBA (F2P) mappings to determine whether the chunk written to the LBA is a duplicate. The steps of hashing and lookup require significant primary memory and central processing unit (CPU) cycles, which can result in unacceptable CPU performance degradation.
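By way of illustration only, the relationship between chunks, LBAs, PBAs, the L2P index and the F2P index can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular implementation: the fixed 4 KB chunk size, the use of SHA-256 as the hashing function, the in-memory list standing in for secondary storage, and the names `DedupStore` and `write_continuous` are all hypothetical. Reference counting and reclamation of overwritten blocks are omitted.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # assumed fixed chunk size of 4 KB


class DedupStore:
    """Hypothetical store that deduplicates each chunk as it is written."""

    def __init__(self):
        self.l2p = {}     # L2P index: LBA -> PBA
        self.f2p = {}     # F2P index: fingerprint -> PBA
        self.blocks = []  # simulated secondary storage; list index is the PBA

    def write(self, lba, chunk):
        # Hash the chunk content to obtain its fingerprint.
        fingerprint = hashlib.sha256(chunk).digest()
        # Look the fingerprint up among previously stored chunks.
        pba = self.f2p.get(fingerprint)
        if pba is None:
            # Not a duplicate: store the chunk at a new PBA.
            pba = len(self.blocks)
            self.blocks.append(chunk)
            self.f2p[fingerprint] = pba
        # Duplicate or not, the LBA now maps to the PBA holding the content.
        self.l2p[lba] = pba
        return pba

    def read(self, lba):
        return self.blocks[self.l2p[lba]]


def write_continuous(store, start_lba, data):
    """Segment a continuous write into chunks, one chunk per LBA."""
    for offset in range(0, len(data), CHUNK_SIZE):
        lba = start_lba + offset // CHUNK_SIZE
        store.write(lba, data[offset:offset + CHUNK_SIZE])
```

In this sketch, a continuous write of three chunks in which the first and third chunks have identical content occupies three LBAs but only two PBAs, because both identical chunks map to the same PBA through the L2P index. Every write incurs one hash computation and one F2P lookup, which is the source of the memory and CPU cost noted above.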
Inline deduplication avoids the need to retain a large storage capacity prior to deduplication. However, it places significant demands on primary memory for duplicate lookups, and overall computing performance can be significantly affected by calculating fingerprints to identify duplications for chunks of data awaiting a write operation. Reducing storage requirements by inline deduplication on all data writes comes at a cost, and as a result, many implementations perform deduplication “off-line”.
In off-line deduplication, the data is first written to a disk storage area, and deduplication is performed in a batch mode during time periods when CPU demands are low, avoiding unscheduled performance issues. However, the storage-reduction benefits of deduplication are not fully realized, because large storage areas are still required to hold the written data until deduplication is completed, and many high-utilization systems lack off-line time when deduplication can be performed without impact.
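For purposes of illustration, an off-line (post-process) pass can be sketched as a batch operation over data that has already been written. This is a minimal sketch under stated assumptions: the dictionary representations of the L2P index and of secondary storage, the use of SHA-256 as the hashing function, and the name `offline_dedup` are hypothetical and are not drawn from any particular implementation.

```python
import hashlib


def offline_dedup(l2p, blocks):
    """Batch-deduplicate chunks that were already written to storage.

    l2p    -- dict mapping LBA -> PBA (modified in place)
    blocks -- dict mapping PBA -> chunk bytes (modified in place)
    Returns the number of PBAs freed.
    """
    f2p = {}    # fingerprint -> PBA of the copy that is kept
    remap = {}  # duplicate PBA -> PBA of the kept copy

    # First pass: fingerprint every stored chunk and record duplicates.
    for pba in sorted(blocks):
        fingerprint = hashlib.sha256(blocks[pba]).digest()
        kept_pba = f2p.setdefault(fingerprint, pba)
        if kept_pba != pba:
            remap[pba] = kept_pba

    # Second pass: point each affected LBA at the kept copy.
    for lba in list(l2p):
        if l2p[lba] in remap:
            l2p[lba] = remap[l2p[lba]]

    # Finally, release the storage held by the duplicate copies.
    for pba in remap:
        del blocks[pba]
    return len(remap)
```

In this sketch, every chunk occupies its own PBA until the batch pass runs, so the peak storage requirement is that of the undeduplicated data. This corresponds to the drawback described above: the space is reclaimed only after the pass completes.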
Deduplication policies, defined at the system level, set conditions and priorities for deduplication benefits to be realized. Policies are set based on characteristics of the data, but generally trade off reduction of storage requirements for improved performance.