1. Field of Invention
The techniques described herein are directed generally to the field of computer storage, and more particularly to techniques for performing data de-duplication in a computer storage environment.
2. Description of the Related Art
Backup systems exist that access data from one or more data sources in a computer system and write the data to a backup storage environment wherein the data is stored on one or more backup storage media. In this manner, if it is desired to retrieve any of the data that existed on the computer system at the time the backup was made (e.g., to restore the computer system data to a known good state, in response to a crash of the computer system that results in a loss of data, etc.), the data can be retrieved from the backup storage system.
It has been recognized that in a backup system, there often is redundancy (also referred to as duplication) between data that is being backed up to the backup system and other data that was previously backed up and already is stored on the backup system. For example, depending upon how the backup system is configured, weekly full backups of a computer system may be performed, and from one week to the next, only a small percentage (e.g., 5%) of the data stored in the computer system may be changed, with a large percentage (e.g., 95%) remaining unchanged. Thus, if two full backup operations are performed in back-to-back weeks, a large percentage (e.g., 95%) of the data stored to the backup storage environment during the second backup operation may be redundant, as the data is already securely stored on the backup storage environment. This redundancy may be compounded each time a new backup operation is performed on the data set. For instance, with weekly backups, fifty-two copies of some data may be stored to the backup storage system over the course of a year.
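As a rough illustration of how this redundancy compounds, the following sketch estimates the storage consumed by a year of weekly full backups and how much of it is unique data. The function name, the 1,000 GB data set size, and the assumption that each week's changed 5% is entirely new data are illustrative assumptions, not properties of any particular backup system.

```python
def backup_storage_estimate(dataset_gb, weeks, change_rate):
    """Estimate storage for repeated full backups (illustrative model only).

    Assumes every weekly backup is a full copy and that the fraction changed
    each week is entirely new data, never seen in an earlier backup.
    """
    # Without de-duplication, every full backup stores the whole data set.
    raw_gb = dataset_gb * weeks
    # Ideally only the first backup plus each later week's changes are unique.
    unique_gb = dataset_gb + dataset_gb * change_rate * (weeks - 1)
    return raw_gb, unique_gb


# Hypothetical figures: a 1,000 GB data set, 52 weekly backups, 5% weekly change.
raw, unique = backup_storage_estimate(dataset_gb=1000, weeks=52, change_rate=0.05)
print(f"stored without de-duplication: {raw:.0f} GB")
print(f"unique data: {unique:.0f} GB")
print(f"redundant fraction: {1 - unique / raw:.1%}")
```

Under these assumptions, roughly 3,550 GB of the 52,000 GB written over the year is unique, so about 93% of the stored data is redundant.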
Redundancy in data stored to a backup storage system can also arise in other ways. For example, if an e-mail system is being backed up and there are numerous e-mails with the same attachment, backing up all of the e-mails may result in backing up the attachment multiple times. As another example, even when a logical object of data (e.g., a file) is modified in the period of time between two different backup operations, it may often be the case that only a small portion of the data in the logical object is modified. For example, for a relatively large file, if only a small number of bytes are modified, the majority of the bytes in the file may remain unchanged, such that there is redundancy and duplication for the unchanged bytes if they are backed up multiple times.
In view of the foregoing, data de-duplication processes have been developed for backup storage systems. The purpose of a conventional de-duplication process is to identify when a backup process is seeking to back up data that has already been stored on the backup system, and to refrain from backing up the data again to avoid duplication of the data stored on the backup system. This reduces the storage resources used by the backup storage system and results in a cost saving.
A conventional de-duplication system 1000 is illustrated in FIG. 1. A backup application 1001 provides to a parsing unit 1003 a backup data stream 1002 of data to be backed up. The parsing unit 1003 removes and stores (to backup storage 1009) metadata in the backup data stream 1002 that the backup application 1001 inserts along with the data being backed up to enable the backup data to be stored and retrieved by the backup application 1001. The parsing unit 1003 provides a raw data stream of backup data 1004 (absent the metadata) to be backed up to a chunking unit 1005. The purpose of the chunking unit 1005 is to divide the raw stream of backup data 1004 into a number of discrete chunks (also referred to as blocks) of data. The sizes of the chunks or blocks may vary, but many chunking units 1005 produce blocks of data that are smaller than the size of a conventional logical object (e.g., a file) being backed up so that redundancy in the data in sub-portions of the logical object can be detected.
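The chunking step can be sketched as follows. This is a minimal illustration that assumes fixed-size chunks for simplicity, even though chunk sizes may vary in practice; the function name chunk_stream, the 4,096-byte chunk size, and the all-zero stand-in data are hypothetical choices made only for this example.

```python
def chunk_stream(raw_data: bytes, chunk_size: int = 4096):
    """Yield consecutive fixed-size chunks of raw_data; the last may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(raw_data), chunk_size):
        yield raw_data[offset:offset + chunk_size]


# Stand-in for the raw stream of backup data with the metadata already removed.
raw_backup_data = bytes(10_000)
chunks = list(chunk_stream(raw_backup_data))
print([len(c) for c in chunks])  # prints [4096, 4096, 1808]
```

Because each chunk is much smaller than a typical file, a later stage can detect that most chunks of a lightly modified file are unchanged.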
The data blocks or chunks output from the chunking unit 1005 are provided to a hashing unit 1007. The hashing unit 1007 performs a number of functions as shown in blocks 1007a-d. Initially, in block 1007a, the hashing unit selects an individual chunk to be operated upon, and passes the selected chunk to a hashing function 1007b which performs a hash operation on the chunk to generate an object identifier (also referred to as a content address) for the chunk. The hashing function 1007b applies a hashing algorithm that seeks to generate distinct identifiers for chunks of data that differ in any respect, but generates the same identifier for chunks of data that are identical. Once a hash for a chunk is generated, a determination is made, as shown at block 1007c, of whether the chunk is unique. This determination typically is made by accessing a lookup table that is maintained by the hashing unit 1007 and includes the content addresses for all of the chunks of data previously stored on the backup storage system 1009. If the content address for the chunk of data is already stored in the lookup table, it signifies that the chunk is already stored on the backup storage environment and therefore is not unique. In that circumstance, the data chunk need not be stored to the backup storage environment again, so the hashing unit 1007 merely stores a pointer to where the chunk of data is stored, and then returns to block 1007a wherein the next chunk is selected for processing. Conversely, when it is determined by the hashing unit at block 1007c that the chunk is unique, a write is issued at block 1007d to the backup storage system 1009 to store the new chunk of data thereon.
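The flow through blocks 1007a-d can be sketched in a few lines. This is a simplified in-memory illustration under stated assumptions, not the implementation of any particular system: the class name DedupStore, its attributes, and the use of SHA-256 as the hashing algorithm are choices made for the example, and a Python list stands in for the backup storage system. The sketch also records a pointer for every chunk, including unique ones, so that the original stream can be rebuilt.

```python
import hashlib


class DedupStore:
    """Toy in-memory stand-in for the hashing unit and backup storage."""

    def __init__(self):
        self.lookup_table = {}  # content address -> index into self.storage
        self.storage = []       # stand-in for the backup storage system
        self.pointers = []      # one content address per chunk received, in order

    def put(self, chunk: bytes) -> bool:
        """Process one selected chunk; return True if it was unique and written."""
        # Hash the chunk to obtain its content address (SHA-256 is an assumption).
        address = hashlib.sha256(chunk).hexdigest()
        # Uniqueness check: has this content address been seen before?
        is_unique = address not in self.lookup_table
        if is_unique:
            # Write the new chunk to storage and record where it lives.
            self.lookup_table[address] = len(self.storage)
            self.storage.append(chunk)
        # Keep a pointer either way so the stream can be reconstructed.
        self.pointers.append(address)
        return is_unique

    def restore(self) -> bytes:
        """Rebuild the original stream by following the stored pointers."""
        return b"".join(self.storage[self.lookup_table[a]] for a in self.pointers)


store = DedupStore()
stream = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
written = [store.put(chunk) for chunk in stream]
print(written)             # prints [True, True, False, True, False]
print(len(store.storage))  # prints 3
```

Five chunks arrive but only three are written; the two repeats are represented solely by pointers, which is the storage saving that de-duplication aims for.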