Vast amounts of electronic information are stored, communicated, and manipulated by modern computer systems. Much of this information is duplicated. For example, duplicate or near duplicate copies of data may be stored on a hard drive or hard drives, communicated across a communication channel, or processed using a computer or other electronic device. This duplicated data might be used in many different applications and on many different electronic systems. Accordingly, data de-duplication technology may impact a broad range of applications.
Data de-duplication is a method of reducing or eliminating redundant files, blocks of data, etc. In this way, a data de-duplication system attempts to ensure that only unique data is stored, transmitted, processed, etc. Data de-duplication is also sometimes referred to as capacity optimized protection. Additionally, data de-duplication may address rapidly growing capacity needs by reducing the storage capacity, transmission capacity, processor capacity, etc. required for electronic information.
In one example of how duplicate data might exist on a computer network, an employee may email a Word® attachment to 25 co-workers. On some systems, a copy is saved for every employee the file was sent to, increasing the capacity requirement of the file by a factor of 25. In some cases data de-duplication technology may eliminate the redundant files, replacing them with “pointers” to the original data after it has been confirmed that all copies are identical. This example illustrates data de-duplication at the file level. Data de-duplication may also be implemented based on variable-size blocks of data. In other words, redundant variable-size blocks of data may be eliminated by replacing these blocks with a pointer to another instance of a matching block of data.
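The emailed attachment example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `deduplicate_files`, the use of SHA-256 digests as “pointers,” and the in-memory dictionaries are all hypothetical choices and are not taken from any particular de-duplication system.

```python
import hashlib


def deduplicate_files(files):
    """Return (store, pointers) for a mapping of file name -> bytes.

    store maps a content digest to the single retained copy of the bytes;
    pointers maps every file name to the digest of its content.
    """
    store = {}
    pointers = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        kept = store.setdefault(digest, content)
        # Confirm the copies are identical before pointing at the original.
        if kept != content:
            raise ValueError("digest collision for " + name)
        pointers[name] = digest
    return store, pointers


# One attachment delivered to 25 mailboxes (hypothetical names and content).
attachment = b"quarterly report" * 1000
mailboxes = {f"user{i}/report.doc": attachment for i in range(25)}
store, pointers = deduplicate_files(mailboxes)
```

In this sketch all 25 mailbox entries point at one stored copy, so the retained data is the size of a single attachment rather than 25 of them.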
In some cases, data duplication might occur in a data storage system. For example, archived electronic information such as electronic documents, files, programs, etc. may exist on backup tapes, backup hard drives, and other media. In many cases a computer may store a large number of files, which in some cases may be duplicates of the same file or document, slightly differing versions of the same document, etc. Accordingly, duplicates or near duplicates might exist for many different types of files, including documents, graphic files, and just about any other type of computer file.
Additionally, duplication might occur when data is communicated. In computer-based systems it is common for a computer to transmit one or more files over a computer network or other communication system to, for example, other computers in the computer network. This network may be wired, wireless, or some combination of the two. Additionally, the network may use just about any computer data communication system to transmit the data.
Different types of duplication might exist. In one type, a file or files may be repeatedly transmitted by a computer. For example, it is common for data transmitted during a backup operation to be almost identical to the data transmitted during the previous backup operation. Accordingly, a computer, computer network, etc. might also repeatedly communicate the same or similar data.
In another type of duplication, a duplicate or near duplicate file or files, such as duplicate or near duplicate documents, graphic files, etc., might be stored on a computer system. In other words, multiple copies of a file might exist, as in the emailed document example. Accordingly, different types of file de-duplication systems and methods might address various types of duplication. Some types of data de-duplication systems and methods might relate to file duplication or near duplication that involves multiple copies of the same or similar files sent during the same transmission. Other types of data de-duplication systems and methods may relate to file duplication that involves the same or similar files sent during a series of transmissions. Yet other types of data de-duplication might relate to both types of file duplication or near duplication.
Data de-duplication might include both transmission for backup and the backup itself. For example, some data de-duplication systems may transmit only data that has changed since a previous backup. This data might be stored on a daily basis or perhaps a weekly basis. In some systems these changes in the data might be what is saved, for example, on a backup drive, disk, tape, etc. For example, a backup system might initially transmit a “full backup,” for example, all files in a directory or series of directories, all files on a disk or on a computer, all files on all disks on an entire network, etc. The full backup might simply be any files that a particular user selects for backup. The data for the full backup may be transmitted and stored using various communication and storage systems. After the full backup, subsequent backups might be based on only files that have changed. These might be the only files subsequently transmitted, stored, or both. Of course, a user might also choose to perform a full backup from time to time after the initial full backup.
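The full-then-incremental pattern described above might be sketched as follows. The helper names `snapshot` and `select_for_backup`, the use of content digests to detect changes, and the in-memory file mapping are assumptions made for illustration only; a real backup system may detect changes differently.

```python
import hashlib


def snapshot(files):
    """Map each file name to a digest of its current content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def select_for_backup(files, previous_snapshot=None):
    """Return (names_to_transmit, new_snapshot).

    With no previous snapshot this is a full backup: every file is selected.
    Otherwise only new or changed files are selected.
    """
    current = snapshot(files)
    if previous_snapshot is None:
        return sorted(current), current
    changed = sorted(
        name for name, digest in current.items()
        if previous_snapshot.get(name) != digest
    )
    return changed, current


# Hypothetical file set: a full backup followed by one incremental backup.
files = {"a.txt": b"one", "b.txt": b"two"}
full, snap = select_for_backup(files)

files["b.txt"] = b"two, edited"   # an existing file changes
files["c.txt"] = b"three"         # a new file appears
incremental, snap = select_for_backup(files, snap)
```

Here the first call selects both files, while the second selects only `b.txt` and `c.txt`, since `a.txt` is unchanged since the previous backup.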
Systems that only make full backups might be required to store a large amount of data. This may increase the expenses associated with these types of systems due to, for example, the cost of additional hard drives, tape media, data CDs or DVDs, wear on disk drives, CD or DVD drives, tape drives, etc. Accordingly, incremental systems might be more efficient in terms of data storage, mechanical wear on system components, etc.
There are two main types of de-duplication: inline and offline. Inline de-duplication is performed by a device in the data path. This may reduce the disk capacity required to store electronic data, thereby increasing cost savings. A disadvantage of inline de-duplication is that the data is processed while it is being transmitted for backup, which may slow down the backup process.
In contrast, offline data de-duplication does not perform the data de-duplication in the data path, but instead performs the process at the backup system. This may require more data storage capacity, such as, for example, disk capacity. Performance may, however, be improved by having the process reside outside of the data path, after the backup job is complete. In other words, because the data is processed after being transmitted for backup, it generally will not slow down the transmission of data.
When data de-duplication technology is used to eliminate redundant sets of data, a data de-duplication storage system typically stores a single copy of data or portions of data, and then creates references to these objects as the same data is encountered again. By using references to previously stored data, systems built upon these object stores can de-duplicate new data as it arrives to be stored. For example, a file system can present original files, but only retain the unique portions of data used to compose those files, with references substituted for duplicate occurrences.
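A minimal sketch of such an object store is shown below. It assumes fixed-size chunks and SHA-256 digests as references purely to keep the example short; the class `ObjectStore` and the helpers `write_file` and `read_file` are hypothetical names, and a real system may use variable-size blocks and a different reference scheme.

```python
import hashlib


class ObjectStore:
    """Keeps one copy of each unique chunk, keyed by its digest."""

    def __init__(self):
        self.objects = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        # Store the chunk only if it has not been seen before.
        self.objects.setdefault(key, data)
        return key

    def get(self, key):
        return self.objects[key]


def write_file(store, data, chunk_size=4):
    """Split data into chunks and return the list of references."""
    return [
        store.put(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def read_file(store, refs):
    """Present the original file by following its references."""
    return b"".join(store.get(ref) for ref in refs)


store = ObjectStore()
refs_a = write_file(store, b"AAAABBBBAAAA")
refs_b = write_file(store, b"BBBBCCCC")
```

The two files above contain five chunks in total, but the store retains only three unique objects, and each file can still be reconstructed in full from its list of references.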
Over time, however, the object store may come to contain unreferenced objects. In other words, every reference to a stored data object might eventually be deleted, such that the stored data object is no longer needed. For example, when all references to an object have been deleted there is no longer a need for a valid representation of the stored data object. The data comprising the object may continue to be stored, however. Unreferenced objects, that is, data objects that continue to be stored when all references to the object have been deleted, still occupy storage space, and it is desirable to reclaim that space for use by new data objects.
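To make the notion of an unreferenced object concrete, the sketch below scans a store for objects that no remaining file refers to and reclaims their space. It is one generic way of finding such objects, shown only for illustration; the function `sweep_unreferenced` and the data layout are hypothetical and are not presented as the approach of any particular system.

```python
def sweep_unreferenced(objects, live_files):
    """Delete every stored object that no file references.

    objects maps a key to stored bytes; live_files maps a file name to the
    list of object keys it references. Returns the number of bytes reclaimed.
    """
    referenced = {key for refs in live_files.values() for key in refs}
    reclaimed = 0
    for key in list(objects):
        if key not in referenced:
            reclaimed += len(objects.pop(key))
    return reclaimed


# Hypothetical store: "k2" is shared by both files, "k3" is used only by b.txt.
objects = {"k1": b"AAAA", "k2": b"BBBB", "k3": b"CCCC"}
live_files = {"a.txt": ["k1", "k2"], "b.txt": ["k2", "k3"]}

del live_files["b.txt"]  # the last reference to "k3" goes away
reclaimed = sweep_unreferenced(objects, live_files)
```

After `b.txt` is deleted, only `k3` has lost all of its references, so only its four bytes are reclaimed; `k2` remains because `a.txt` still refers to it.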
It is generally undesirable to simply delete a data object whenever a reference to that object is deleted. The reference currently being deleted might not be the only reference to the object. Accordingly, the data object may still be needed, for example, if these other references are accessed. In some systems, deleting a data object whenever something that points to it is deleted may be prohibited because there may still be other active references to the object. These additional references may not be known by the process currently deleting its own use of the object, which further complicates the problem.
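The hazard described above can be illustrated with a simple per-object reference count, under which an object is reclaimed only when its last reference is dropped. Reference counting is used here only as a familiar illustration of the problem, not as the method of any particular system; the class `RefCountedStore` and its method names are assumptions.

```python
class RefCountedStore:
    """Tracks how many references point at each stored object."""

    def __init__(self):
        self.data = {}
        self.counts = {}

    def add_reference(self, key, data):
        # Store the object on first use, then count the new reference.
        if key not in self.data:
            self.data[key] = data
            self.counts[key] = 0
        self.counts[key] += 1

    def drop_reference(self, key):
        """Remove one reference; return True if the object was reclaimed."""
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.data[key]
            del self.counts[key]
            return True
        return False


store = RefCountedStore()
store.add_reference("block-1", b"shared data")  # first file uses the block
store.add_reference("block-1", b"shared data")  # second file uses it too

first = store.drop_reference("block-1")   # first file is deleted
still_present = "block-1" in store.data   # block must survive for the second
second = store.drop_reference("block-1")  # last reference is deleted
```

In this sketch the process that drops the first reference does not need to know who else uses the block; the object is removed only after the second, final reference is dropped.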
In a system that never or rarely deletes objects with multiple references, the storage device(s) may become so full with unreferenced data objects that large portions of data storage are unnecessarily consumed by unused data objects. Further, a system that does not delete any of the objects as the references change or are removed until a user removes the objects may become unusable until the manual intervention occurs. Accordingly, a system that lets the object store become populated and is manually emptied may become so full of unused data objects that it is not usable. While a storage device may from time to time become full, if some or all of this data is data that a user wishes to store, then the system is generally performing its function. If, on the other hand, most or a large percentage of the storage device is filled with unreferenced data objects, then the user will generally not have access to a large percentage of the storage space purchased.