1. Technical Field
The present invention relates in general to the field of computer file backup, and more particularly to a method of and system for deduplicating data backed up in a client-server environment.
2. Description of the Related Art
Computer data is a vital asset to most businesses and other organizations. Data may be lost through equipment failures, human errors, or many other causes. Loss of data can severely impact the operations and success of an organization.
Most organizations attempt to reduce the likelihood of losing data by periodically backing up their data, that is, by transferring copies of the data to another system. The backup system may be at a different physical location from that of the original data. Many backup facilities receive data from several physically separate systems over networks. If an organization loses data locally, it can have the lost data restored from the backup system.
Backup systems typically hold a tremendous amount of data, and it is desirable to reduce the amount of space required to store that data. One method of reducing required storage space is data deduplication. In data deduplication, a data object, which may be a file, a data stream, or some other form of data, is broken down into one or more chunks using a chunking method. A hash is calculated for each chunk using any of several known hashing techniques. The hashes of all chunks are compared for duplicates. Duplicate hashes mean either that the data chunks are identical or that there has been a hash collision. A hash collision occurs when different chunks produce the same hash. To guard against a hash collision being mistaken for a true duplicate, other techniques such as bit-by-bit comparison of the chunks may be performed. After the hashes have been compared and the unique chunks identified, the unique chunks are stored. Chunks that are duplicates of already stored chunks are not stored; rather, such chunks are referenced by pointers to the already stored chunks.
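The deduplication steps described above may be illustrated by the following minimal sketch. It is offered only as an example under stated assumptions and is not the method of any particular backup system: fixed-size chunking and SHA-256 are assumed choices, and the names `chunk_fixed`, `deduplicate`, and `restore` are hypothetical.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 4) -> list[bytes]:
    """Assumed chunking method: split the data object into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, store: dict[str, bytes], size: int = 4) -> list[str]:
    """Store only unique chunks in `store`; return one pointer (hash) per chunk."""
    pointers = []
    for chunk in chunk_fixed(data, size):
        digest = hashlib.sha256(chunk).hexdigest()
        existing = store.get(digest)
        if existing is None:
            # Hash not seen before: the chunk is unique, so store it.
            store[digest] = chunk
        elif existing != chunk:
            # Same hash, different bytes: a hash collision, caught by
            # comparing the chunk contents directly.
            raise ValueError("hash collision detected")
        # Duplicate chunks are not stored again; only a pointer is kept.
        pointers.append(digest)
    return pointers


def restore(pointers: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the original data object by following the pointers."""
    return b"".join(store[p] for p in pointers)


store: dict[str, bytes] = {}
pointers = deduplicate(b"ABCDABCDEFGH", store)
print(len(pointers), len(store))  # three chunks referenced, two chunks stored
```

In this sketch the second occurrence of the chunk `ABCD` is represented only by a pointer, and the original data object can still be rebuilt with `restore(pointers, store)`.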
Data deduplication can yield storage space reductions of 20:1 or more. However, the deduplication ratio is highly dependent upon the method used to chunk the data. Several chunking techniques have been developed. Each chunking method is thought to be optimum for a set of file types. However, a particular chunking method may not in fact be optimum for a particular file type. Data deduplication consumes a fair amount of processing power and time. Since backup systems have so much data to deduplicate, it is important to deduplicate the data as efficiently as possible.
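The dependence of the deduplication ratio on the chunking method may be illustrated by the following sketch, which is an example only. It assumes fixed-size chunking and SHA-256 hashing, takes the ratio to be the original size divided by the total size of the unique chunks stored, and uses a contrived, highly repetitive input; the name `dedup_ratio` is hypothetical.

```python
import hashlib


def dedup_ratio(data: bytes, chunk_size: int) -> float:
    """Original size divided by the size of the unique chunks (assumed metric)."""
    if not data:
        raise ValueError("data must be non-empty")
    unique: dict[bytes, bytes] = {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Keep the first chunk seen for each hash; later duplicates are skipped.
        unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    stored = sum(len(c) for c in unique.values())
    return len(data) / stored


data = b"ABCD" * 6  # 24 bytes with a repeating 4-byte pattern
for size in (4, 3, 5):
    print(size, dedup_ratio(data, size))
# 4 6.0
# 3 2.0
# 5 1.0
```

For this input, a chunk size aligned with the repeating pattern yields a 6:1 reduction, whereas a chunk size of five bytes yields no reduction at all, although the data is the same in each case.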