Vast amounts of electronic information are stored, communicated, and manipulated by modern computer systems. Much of this information is duplicated. For example, duplicate or near-duplicate copies of data may be stored on one or more hard drives, communicated across a communication channel, or processed using a computer or other electronic device. This duplicated data might be used in many different applications and on many different electronic systems. Accordingly, data de-duplication technology may impact a broad range of applications.
Data de-duplication is a method of reducing or eliminating redundant files, blocks of data, etc. In this way, a data de-duplication system attempts to ensure that only unique data is stored, transmitted, processed, etc. Data de-duplication is also sometimes referred to as capacity-optimized protection. Additionally, data de-duplication may address rapidly growing capacity needs by reducing the storage capacity, transmission capacity, processor capacity, etc. required to handle electronic information.
In one example of how duplicate data might exist on a computer network, an employee may email a Word® attachment to 25 co-workers. On some systems, a copy is saved for every employee the file was sent to, increasing the capacity requirement of the file by a factor of 25. In some cases data de-duplication technology may eliminate the redundant files, replacing them with “pointers” to the original data after it has been confirmed that all copies are identical. This example illustrates data de-duplication at the file level. Data de-duplication may also be implemented based on variable size blocks of data. In other words, redundant variable sized blocks of data may be eliminated by replacing these blocks with a pointer to another instance of a matching block of data.
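The file-level case in the emailed-attachment example can be sketched as follows. This is a minimal illustration, not a description of any particular product: the function name `dedupe_files`, the use of SHA-256 digests as "pointers," and the in-memory dictionaries are all assumptions made for the example. A byte-for-byte comparison stands in for the step of confirming that the copies are identical.

```python
import hashlib

def dedupe_files(mailboxes):
    """Store one copy of each distinct file and a pointer per recipient.

    mailboxes: dict mapping a recipient name to the file content (bytes).
    Returns (store, pointers); both structures are hypothetical.
    """
    store = {}     # digest -> the single stored copy
    pointers = {}  # recipient -> digest acting as a pointer into store
    for recipient, content in mailboxes.items():
        digest = hashlib.sha256(content).hexdigest()
        existing = store.get(digest)
        if existing is None:
            store[digest] = content          # first time this content is seen
        elif existing != content:
            # Confirm the copies really are identical before sharing them.
            raise ValueError("digest matched but contents differ")
        pointers[recipient] = digest
    return store, pointers

# One attachment emailed to 25 co-workers.
attachment = b"quarterly report" * 1000
mailboxes = {f"coworker{i}": attachment for i in range(25)}

store, pointers = dedupe_files(mailboxes)
naive_bytes = sum(len(c) for c in mailboxes.values())
stored_bytes = sum(len(c) for c in store.values())
print(len(store), len(pointers), naive_bytes // stored_bytes)
```

In this sketch the 25 mailbox entries share a single stored copy, so the stored payload is one twenty-fifth of what saving every copy would require (ignoring the small cost of the pointers themselves).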
In some cases, data duplication might occur in a data storage system. For example, archived electronic information such as electronic documents, files, programs, etc. exists on backup tapes, backup hard drives, and other media. In many cases a computer may store a large number of files, which in some cases may be duplicates of the same file or document, slightly differing versions of the same document, etc. Accordingly, duplicates or near duplicates might exist for many different types of files, including documents, graphic files, and just about any other type of computer file.
Additionally, duplication might occur when data is communicated. In computer-based systems it is common for a computer to transmit one or more files over a computer network or other communication system to, for example, other computers in the computer network. This network may be wired, wireless, or some combination of the two. Additionally, the network may use just about any computer data communication system to transmit the data.
Different types of duplication might exist. In one type, a file or files may be repeatedly transmitted by a computer. For example, it is common for data transmitted during a backup operation to be almost identical to the data transmitted during the previous backup operation. Accordingly, a computer, computer networks, etc. might also repeatedly communicate the same or similar data.
In another type of duplication, a duplicate or near-duplicate file or files, such as duplicate or near-duplicate documents, graphic files, etc., might be stored on a computer system. In other words, multiple copies of a file might exist, as in the emailed document example. Accordingly, different types of file de-duplication systems and methods might address various types of duplication. Some types of data de-duplication systems and methods might relate to file duplication or near duplication that involves multiple copies of the same or similar files sent during the same transmission. Other types of data de-duplication systems and methods may relate to file duplication that involves the same or similar files sent during a series of transmissions. Yet other types of data de-duplication might relate to both types of file duplication or near duplication.
Data de-duplication might include both transmission for backup and the backup itself. For example, some data de-duplication systems may transmit only data that has changed since a previous backup. This data might be stored on a daily basis or perhaps a weekly basis. In some systems these changes in the data might be what is saved, for example, on a backup drive, disc, tape, etc. For example, a backup system might initially transmit a “full backup,” for example, all files in a directory or series of directories, all files on a disc or on a computer, all files on all disks on an entire network, etc. The full backup might simply be all files that a particular user selects for backup. The data for the full backup may be transmitted and stored using various communication and storage systems. After the full backup, subsequent backups might be based only on files that have changed. These might be the only files subsequently transmitted, stored, or both. Of course, a user might also elect to do a full backup from time to time after the initial full backup.
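The full-then-incremental pattern described above might be sketched as follows. The names `fingerprint` and `plan_backup`, the use of SHA-256 to detect change, and the manifest dictionary are assumptions for illustration only; a real backup system might instead rely on timestamps, archive bits, or block-level comparison.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Hypothetical change detector: a SHA-256 digest of the file content."""
    return hashlib.sha256(data).hexdigest()

def plan_backup(current_files, previous_manifest):
    """Return (paths_to_send, new_manifest).

    current_files: dict mapping path -> content (bytes).
    previous_manifest: dict mapping path -> fingerprint from the last backup.
    An empty previous manifest yields a full backup.
    """
    new_manifest = {path: fingerprint(data) for path, data in current_files.items()}
    changed = sorted(
        path for path, fp in new_manifest.items()
        if previous_manifest.get(path) != fp   # new file or modified content
    )
    return changed, new_manifest

files = {"report.doc": b"draft 1", "logo.png": b"\x89PNG...", "notes.txt": b"todo"}

# Initial full backup: nothing has been backed up before.
full, manifest = plan_backup(files, {})

# One file changes before the next backup runs.
files["report.doc"] = b"draft 2"
incremental, manifest = plan_backup(files, manifest)

print(full)         # all three paths
print(incremental)  # only the changed path
```

Passing an empty manifest again at any later time would correspond to the user choosing to repeat a full backup.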
Systems that only make full backups might be required to store a large amount of data. This may increase the expenses associated with these types of systems due to, for example, the cost of additional hard drives, tape media, data CDs or DVDs, wear on disc drives, CD or DVD drives, tape drives, etc. Accordingly, incremental systems might be more efficient in terms of data storage, mechanical wear on system components, etc.
There are two main types of de-duplication: inline and offline. Inline de-duplication is performed by a device in the data path. This may reduce the disk capacity required to store electronic data, thereby increasing cost savings. A disadvantage of inline de-duplication is that the data is processed while it is being transmitted for backup, which may slow down the backup process.
In contrast, offline data de-duplication does not perform the data de-duplication in the data path, but instead performs the process at the backup system. This may require more data storage capacity, such as, for example, disk capacity. Performance may, however, be improved by having the process reside outside of the data path, after the backup job is complete. In other words, because the data is processed after being transmitted for backup, the processing generally will not slow the transmission.
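The capacity trade-off between the two approaches can be illustrated with a small sketch. Everything here is an assumption made for the example: the function names, the use of a SHA-256 digest as a block key, and the measurement of "peak" storage as a count of blocks held at once. The sketch does not model transfer speed; it only shows that the offline path must first land every block before reducing it.

```python
import hashlib

def _key(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def inline_backup(stream):
    """De-duplicate each block in the data path, before it is stored."""
    store, recipe = {}, []
    for block in stream:
        k = _key(block)              # work done while data is in flight
        store.setdefault(k, block)   # only unique blocks ever reach storage
        recipe.append(k)
    peak_blocks = len(store)
    return store, recipe, peak_blocks

def offline_backup(stream):
    """Land all blocks first, then de-duplicate after the job completes."""
    landed = list(stream)            # every block is written as received
    peak_blocks = len(landed)        # capacity needed before post-processing
    store, recipe = {}, []
    for block in landed:
        k = _key(block)
        store.setdefault(k, block)
        recipe.append(k)
    return store, recipe, peak_blocks

blocks = [b"a", b"b", b"a", b"a"]
in_store, in_recipe, in_peak = inline_backup(blocks)
off_store, off_recipe, off_peak = offline_backup(blocks)
print(in_peak, off_peak)  # 2 4
```

Both paths end with the same de-duplicated store; they differ in when the work is done and in how much capacity is needed in the meantime.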
Some de-duplication systems use a process that detects whether a block of data has already been saved and replaces duplicate blocks with pointers to the originally saved block. Generally, over time, as more and more data is placed into the de-duplicated storage, new blocks in files point to previously stored blocks. In some cases, the locations of the original blocks will tend to migrate toward a random distribution given a sufficient amount of time and enough data. The randomness of the locations is commonly referred to as fragmentation, which simply means that at least some sequential blocks do not reside in sequential locations on the storage medium. Accordingly, an application reading sequentially from the beginning of a file that has a high degree of de-duplication is likely to suffer degraded performance due to the amount of positioning required to retrieve the data in sequential order.
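A minimal sketch of such a process is shown below, assuming fixed block boundaries, SHA-256 digests for duplicate detection, and storage locations numbered from 1 in the order blocks are first written. The class `DedupStore` and its methods are hypothetical names introduced only for this illustration.

```python
import hashlib

class DedupStore:
    """Toy block store that replaces duplicate blocks with pointers."""

    def __init__(self):
        self.blocks = []   # blocks[i] is held at storage location i + 1
        self.index = {}    # digest -> location of the originally saved block

    def write_file(self, blocks):
        """Store a file's blocks; return the list of locations (pointers)."""
        pointers = []
        for block in blocks:
            key = hashlib.sha256(block).digest()
            location = self.index.get(key)
            if location is None:
                # Unseen block: append it at the next sequential location.
                self.blocks.append(block)
                location = len(self.blocks)
                self.index[key] = location
            pointers.append(location)
        return pointers

    def read_file(self, pointers):
        """Follow the pointers to reassemble the file's blocks in order."""
        return [self.blocks[loc - 1] for loc in pointers]

store = DedupStore()
first = store.write_file([b"alpha", b"beta", b"gamma"])
second = store.write_file([b"delta", b"beta", b"alpha"])
print(first)   # [1, 2, 3]
print(second)  # [4, 2, 1]
```

The first file's pointers are sequential, while the second file, which reuses two earlier blocks, already points to out-of-order locations. This is the fragmentation effect in miniature.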
FIG. 1 is a diagram illustrating a data de-duplication system. The problem discussed above may be further illustrated with reference to the data de-duplication system of FIG. 1. In the example of FIG. 1, file A is de-duplicated and stored in the storage pool 54, which in this example is a disk. The challenges, however, are not limited to a disk alone. Tape drives and other storage devices might also suffer performance degradation because of fragmentation. The blocks are stored sequentially and the pointers to the data for the file point to locations 1-6, 56. When file B is de-duplicated and stored in the storage pool 54, only two unique blocks are detected, so the unique data is stored and the pointers to blocks originally stored in file A are used. The pointers to the block locations that make up file B are 7, 4, 8, 3, 2, 4, and 1, 56. (Note that block B, which is stored at disk location 4, occurs twice in file B.) As is illustrated in FIG. 1, reading sequentially through file B will require a positioning operation between each block read, thus degrading performance.
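The pointer lists given for FIG. 1 can be checked with a short calculation. The sketch below assumes, purely for illustration, that a read is sequential only when the next block sits at the location immediately following the previous one, and that any other transition costs one positioning operation. The function name `count_repositions` is hypothetical.

```python
def count_repositions(locations):
    """Count transitions where the next block is not at the next location."""
    return sum(
        1 for prev, cur in zip(locations, locations[1:])
        if cur != prev + 1
    )

file_a = [1, 2, 3, 4, 5, 6]       # pointers for file A
file_b = [7, 4, 8, 3, 2, 4, 1]    # pointers for file B

new_in_b = sorted(set(file_b) - set(file_a))

print(count_repositions(file_a))  # 0
print(count_repositions(file_b))  # 6
print(new_in_b)                   # [7, 8]
```

Under that assumption, file A is read with no repositioning, while file B needs one between every pair of its seven block reads. Locations 7 and 8 are the only ones not already used by file A, matching the two unique blocks noted in the text.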