Enterprises and individuals alike are becoming increasingly dependent on computers. As more and more data are generated, the need grows for efficient and reliable data storage systems and for larger amounts of available storage. A variety of systems exist today that use local storage, network storage, or both for processes such as primary and backup data storage. For example, during a backup operation, the data from the data sources are copied to the backup devices. Commonly, a substantial amount of the data from each data source remains the same between two consecutive backups, and sometimes there are several copies of the same data. Similarly, in primary storage systems, different applications, or users of the same application, might store the same file, such as an email message or an operating system related file. These and other types of systems can have more available space and be more efficient if unchanged data, or data for which a duplicate already exists, is not written, or rewritten, to the storage device.
Various approaches have been used to reduce the amount of information that is stored (referred to herein as “data reduction”, “space saving”, and/or “space reclamation”), including techniques such as data compression, file deduplication, block deduplication, and delta block optimization (also referred to as delta-based deduplication). Data compression techniques are essentially transparent to applications and storage hardware and use well-known algorithms to reduce the size of files by eliminating redundant bits. Compression is capable of providing, depending on data type, significant reductions in data, as is well understood in the art. There can still be multiple copies of a given file, however, even if all of the copies are compressed, because compression cannot detect or reduce multiple copies of the same file.
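The limitation of compression noted above can be illustrated with a minimal Python sketch. The payload and variable names are hypothetical and chosen only for illustration; the sketch assumes each copy of a file is compressed independently, as described above.

```python
import zlib

# Hypothetical payload: repetitive text, so it compresses well.
payload = b"quarterly report: revenue, costs, margin\n" * 500

# Two users store the same file; each copy is compressed on its own.
copy_a = zlib.compress(payload)
copy_b = zlib.compress(payload)

# Each copy shrinks, but nothing links the two copies together,
# so the device still holds the compressed file twice.
stored_bytes = len(copy_a) + len(copy_b)
```

Each compressed copy is smaller than the original, yet the total stored is still twice the size of one compressed copy, because compression operates within a file and not across files.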
Delta block optimization reduces the amount of data backed up and stored from a source by attempting to determine which blocks are new or changed since the previous backup and then writing only those blocks to the backup. This technique does not, however, reduce multiple copies of the same file that can be written to storage by different users.
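As a rough sketch of the idea, the following Python compares two versions of a source block by block and keeps only the blocks that differ. The 4-byte block size and the function names are assumptions made for readability, not part of any particular product, and the sketch does not handle a source that shrinks between backups.

```python
BLOCK_SIZE = 4  # hypothetical; real systems use far larger blocks


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def changed_blocks(previous, current, size=BLOCK_SIZE):
    """Return {block_index: block} for blocks of `current` that are new or changed."""
    old = split_blocks(previous, size)
    new = split_blocks(current, size)
    delta = {}
    for index, block in enumerate(new):
        if index >= len(old) or old[index] != block:
            delta[index] = block
    return delta


previous_backup = b"AAAABBBBCCCC"
current_source = b"AAAAXXXXCCCCDDDD"
delta = changed_blocks(previous_backup, current_source)
```

Here only block 1 (changed) and block 3 (new) would be written to the backup. If a second user backed up an identical file, the same blocks would be written again, which is the limitation described above.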
File-level deduplication removes multiple copies of the same file by replacing duplicate files with a pointer to the unique version of the file. Typically, file-level deduplication is not effective for files that have minor changes compared with previous versions of the file, because a file that differs even slightly is treated as a new, unique file.
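A minimal Python sketch of file-level deduplication follows. It assumes files are identified by a SHA-256 digest of their full contents; the `FileStore` class and the file names are hypothetical and used only for illustration.

```python
import hashlib


class FileStore:
    """Toy file-level deduplicating store (illustrative only)."""

    def __init__(self):
        self.contents = {}  # digest -> file bytes (one unique copy)
        self.names = {}     # file name -> digest (acts as the pointer)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        self.contents.setdefault(digest, data)  # store only if unseen
        self.names[name] = digest

    def get(self, name):
        return self.contents[self.names[name]]


store = FileStore()
message = b"Subject: meeting\n\nSee you at 10:00.\n"
store.put("alice/inbox/1", message)
store.put("bob/inbox/7", message)                   # exact duplicate
store.put("carol/inbox/3", message + b"(edited)")   # minor change
unique_copies = len(store.contents)
```

The two identical messages share one stored copy, but the slightly edited message is stored in full as a second copy, even though almost all of its bytes are the same.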
Block deduplication (whether fixed or variable block size) eliminates redundant or duplicate data by retaining just one unique instance or copy of blocks or chunks of data. Redundant data is replaced with a pointer to the block of a unique data copy. The amount of aggregate data deduplication is at least partially dependent on data type. Block deduplication can become even more efficient as identical data is stored multiple times on the storage device.
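The fixed-block variant can be sketched in Python as follows. The 4-byte block size, the use of SHA-256 digests as block identifiers, and the `BlockStore` class are assumptions for illustration; variable-size chunking is not shown.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; chosen small so the example is easy to follow


class BlockStore:
    """Toy fixed-size block deduplicating store (illustrative only)."""

    def __init__(self):
        self.blocks = {}  # digest -> the single retained copy of a block

    def write(self, data):
        """Store data and return its recipe: an ordered list of block pointers."""
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # keep one unique instance
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes by following each pointer in order."""
        return b"".join(self.blocks[digest] for digest in recipe)


store = BlockStore()
recipe_1 = store.write(b"AAAABBBBAAAA")
recipe_2 = store.write(b"AAAAXXXXBBBB")
unique_blocks = len(store.blocks)
```

Six blocks are written in total, but only three unique blocks are retained. Unlike the file-level approach, the second file shares its unchanged blocks with the first even though the files differ.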
A downside to both file and block-level deduplication methods is their effect on read and write performance, especially when used in connection with primary data storage; block-level deduplication has less of an effect on backup data storage, because backup data storage generally is not read as often as is primary data storage. For example, streams of bytes that have not been subject to space saving techniques like deduplication or data compression generally are stored sequentially in chunks, making it relatively efficient and straightforward to read or write to any given address anywhere within the stored stream of bytes. In contrast, with space saving techniques such as deduplication, the stream of bytes might not be in the same order, but may be stored in several different places such that a computer system may have to traverse several different locations before it actually can read data.
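The contrast between the two layouts can be sketched in Python. The block map, the physical layout, and the jump counter below are hypothetical constructs for illustration; the counter simply tallies how often consecutive logical blocks are not physically adjacent.

```python
BLOCK_SIZE = 4  # hypothetical block size


def read_contiguous(stream, offset, length):
    """Sequentially stored data: any address is reached by simple arithmetic."""
    return stream[offset:offset + length]


def read_mapped(physical, block_map, offset, length):
    """Deduplicated data: follow a logical-to-physical block map.

    Returns the bytes read and the number of non-sequential jumps.
    Assumes length > 0.
    """
    out = bytearray()
    jumps = 0
    last_phys = None
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for logical in range(first, last + 1):
        phys = block_map[logical]
        if last_phys is not None and phys != last_phys + 1:
            jumps += 1  # next block is somewhere else on the device
        out += physical[phys * BLOCK_SIZE:(phys + 1) * BLOCK_SIZE]
        last_phys = phys
    start = offset - first * BLOCK_SIZE
    return bytes(out[start:start + length]), jumps


stream = b"AAAABBBBAAAACCCC"   # logical byte stream
physical = b"CCCCAAAABBBB"     # unique blocks, stored in a different order
block_map = [1, 2, 1, 0]       # logical block -> physical block

plain = read_contiguous(stream, 2, 12)
mapped, jumps = read_mapped(physical, block_map, 2, 12)
```

Both reads return the same bytes, but the mapped read consults the block map for every block and, in this example, leaves the sequential path twice.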
For example, with conventional deduplication methods (whether in-line or post-process), an application write to a deduplicated logical device requires finding unallocated space to write to, updating the map of the location where the data is stored, and possibly updating counts (e.g., the stored metadata pointers) on a shared block, any of which can add appreciable latency and overhead to the write process. Similarly, with conventional deduplication methods (whether in-line or post-process), an application read from a deduplicated logical device requires that the location of the block(s) being read be looked up in metadata. Also, for sequential reads, the data blocks that have been deduplicated likely are no longer stored in sequence; that is, to reconstitute the file in the form expected by the application, the read must reassemble it from blocks that may be located in many different places, which also adds appreciable latency and overhead.
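The write and read steps listed above can be made concrete with a small Python sketch. The `DedupDevice` class, its free list, and its per-block reference counts are assumptions for illustration and do not describe any specific system; error handling (for example, a full device) is omitted.

```python
import hashlib


class DedupDevice:
    """Toy deduplicated logical device (illustrative only)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # unallocated physical blocks
        self.physical = {}    # physical block -> data
        self.block_map = {}   # logical block -> physical block
        self.refcount = {}    # physical block -> number of logical references
        self.by_digest = {}   # content digest -> physical block

    def write(self, logical, data):
        digest = hashlib.sha256(data).digest()
        phys = self.by_digest.get(digest)
        if phys is None:
            phys = self.free.pop(0)           # find unallocated space
            self.physical[phys] = data
            self.by_digest[digest] = phys
            self.refcount[phys] = 0
        old = self.block_map.get(logical)
        if old == phys:
            return                            # same content rewritten in place
        self.block_map[logical] = phys        # update the location map
        self.refcount[phys] += 1              # update the shared-block count
        if old is not None:
            self._release(old)

    def _release(self, phys):
        self.refcount[phys] -= 1
        if self.refcount[phys] == 0:          # no logical block points here
            data = self.physical.pop(phys)
            del self.by_digest[hashlib.sha256(data).digest()]
            del self.refcount[phys]
            self.free.append(phys)

    def read(self, logical):
        # Every read first looks up the block's location in metadata.
        return self.physical[self.block_map[logical]]


dev = DedupDevice(num_blocks=4)
dev.write(0, b"AAAA")
dev.write(1, b"AAAA")   # duplicate: shares the block written for logical 0
dev.write(2, b"BBBB")
dev.write(1, b"BBBB")   # overwrite: remap and adjust both counts
dev.write(0, b"CCCC")   # last reference to b"AAAA" goes away; block is freed
```

Even in this simplified form, a single logical write can touch the free list, the digest index, the block map, and one or two reference counts, and every read requires a map lookup before any data is returned.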