Vast amounts of electronic information are stored, communicated, and manipulated by modern computer systems. Much of this electronic information is duplicated. For example, duplicate or near duplicate copies of data may be stored on a hard drive or hard drives, communicated across a communication channel, or processed using a computer or other electronic device. This duplicated data might be used in many different applications and on many different electronic systems. Accordingly, data de-duplication technology may impact a broad range of applications.
Data de-duplication is a method of reducing or eliminating redundant files, blocks of data, etc. In this way, a data de-duplication system attempts to ensure that only unique data is stored, transmitted, processed, etc. Data de-duplication is also sometimes referred to as capacity optimized protection. Additionally, data de-duplication may address rapidly growing capacity needs by reducing electronic information storage capacity required, transmission capacity, processor capacity, etc.
In one example of how duplicate data might exist on a computer network, an employee may email a Word® attachment to 25 co-workers. On some systems, a copy is saved for every employee the file was sent to, increasing the capacity requirement of the file by a factor of 25. In some cases data de-duplication technology may eliminate the redundant files, replacing them with “pointers” to the original data after it has been confirmed that all copies are identical. This example illustrates data de-duplication at the file level. Data de-duplication may also be implemented based on variable size blocks of data. In other words, redundant variable sized blocks of data may be eliminated by replacing these blocks with a pointer to another instance of a matching block of data.
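The file-level case in the emailed attachment example can be sketched as follows. This is a minimal illustration, not a description of any particular product: the function name `deduplicate_files`, the in-memory dictionaries, and the choice of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib

def deduplicate_files(files):
    """Store each distinct file content once; map every file name to a pointer.

    `files` is a dict of file name -> bytes (a stand-in for real mailboxes).
    """
    store = {}      # digest -> the single stored copy of the content
    pointers = {}   # file name -> digest acting as a "pointer" to that copy
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in store:
            # Confirm the copies are identical before replacing with a pointer.
            if store[digest] != content:
                raise ValueError("digest collision: contents differ")
        else:
            store[digest] = content
        pointers[name] = digest
    return store, pointers

# One attachment delivered to 25 mailboxes (hypothetical paths).
attachment = b"quarterly report" * 1000
mailboxes = {f"user{i}/report.doc": attachment for i in range(25)}

store, pointers = deduplicate_files(mailboxes)
original_bytes = sum(len(c) for c in mailboxes.values())
stored_bytes = sum(len(c) for c in store.values())
```

In this sketch the 25 mailbox entries all point at a single stored copy, so the content is held once rather than 25 times.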
In some cases, data duplication might occur in a data storage system. For example, archived electronic information such as electronic documents, files, programs, etc. may exist on backup tapes, backup hard drives, and other media. In many cases a computer may store a large number of files, which in some cases may be duplicates of the same file or document, slightly differing versions of the same document, etc. Accordingly, duplicates or near duplicates might exist for many different types of files, including documents, graphic files, and just about any other type of computer file.
Additionally, duplication might occur when data is communicated. In computer-based systems it is common for a computer to transmit one or more files over a computer network or other communication system to, for example, other computers in the computer network. This network may be wired, wireless, or some combination of the two. Additionally, the network may use just about any computer data communication system to transmit the data.
Different types of duplication might exist. In one type, a file or files may be repeatedly transmitted by a computer. For example, it is common for data transmitted during a backup operation to be almost identical to the data transmitted during the previous backup operation. Accordingly, a computer, computer network, etc. might also repeatedly communicate the same or similar data.
In another type of duplication, a duplicate or near duplicate file or files, such as duplicate or near duplicate documents, graphic files, etc. might be stored on a computer system. In other words, multiple copies of a file might exist, as in the emailed document example. Accordingly, different types of file de-duplication systems and methods might address various types of duplication. Some types of data de-duplication systems and methods might relate to file duplication or near duplication that involves multiple copies of the same or similar files sent during the same transmission. Other types of data de-duplication systems and methods may relate to file duplication that involves the same or similar files sent during a series of transmissions. Yet other types of data de-duplication might relate to both types of file duplication or near duplication.
Data de-duplication might include both transmission for backup and the backup itself. For example, some data de-duplication systems may transmit only data that has changed since a previous backup. This data might be stored on a daily basis or perhaps a weekly basis. In some systems these changes in the data might be what is saved, for example, on a backup drive, disc, tape, etc. For example, a backup system might initially transmit a “full backup,” for example, all files in a directory or series of directories, all files on a disc or on a computer, all files on all disks on an entire network, etc. The full backup might simply be any and all files that a particular user selects for backup. The data for the full backup may be transmitted and stored using various communication and storage systems. After the full backup, subsequent backups might be based on only files that have changed. These might be the only files subsequently transmitted, stored, or both. Of course, a user might also select to do a full backup from time to time after the initial full backup.
Systems that only make full backups might be required to store a large amount of data. This may increase the expenses associated with these types of systems due to, for example, the cost of additional hard drives, tape media, data CDs or DVDs, wear on disc drives, CD or DVD drives, tape drives, etc. Accordingly, incremental systems might be more efficient in terms of data storage, mechanical wear on system components, etc.
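The full-then-incremental pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: files are modeled as an in-memory dict of name to bytes, change detection uses a SHA-256 digest per file, and the function name `select_changed` is hypothetical.

```python
import hashlib

def select_changed(files, previous_digests):
    """Return the files that are new or changed since the previous backup.

    `files` maps file name -> bytes; `previous_digests` maps file name -> digest
    recorded by the previous backup (empty for the initial full backup).
    """
    changed = {}   # files that need to be transmitted and stored this time
    current = {}   # digests to remember for the next backup
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        current[name] = digest
        if previous_digests.get(name) != digest:
            changed[name] = content
    return changed, current

# Initial backup: nothing is known yet, so every file is selected (a full backup).
day1 = {"a.txt": b"alpha", "b.txt": b"beta"}
full, digests = select_changed(day1, {})

# Next backup: only the modified file and the new file are selected.
day2 = {"a.txt": b"alpha", "b.txt": b"beta v2", "c.txt": b"gamma"}
incremental, digests = select_changed(day2, digests)
```

Passing an empty digest table at any later time forces another full backup, which mirrors a user choosing to run a full backup from time to time.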
In some cases, duplicate data might also be processed in other ways by a computer system, a network of computers, etc. For example, the systems and methods described herein might not only be applied to data storage devices, but to data transmission devices or any other data processing devices that deal with blocks of data that might be redundant. For example, in data mining and information filtering applications, duplicate or near duplicate files might be processed by the data mining or information filtering applications. In another example, an enterprise software application might receive data from a wide variety of sources. These sources might vary widely in terms of formatting, quality control, or other factors that may impact the consistency or reliability of the data. As a result, the application's database may contain duplicative or erroneous data. In many cases this data may need to be “cleaned.”
“Data cleaning,” or “data clean-up,” generally refers to the handling of missing data or identifying data integrity violations. “Dirty data” generally refers to input data records, or to particular data fields in a string of data comprising a full data record, that contain such anomalies. For example, as discussed above, anomalies may exist because data might not conform in terms of content, format, or some other standard established for the database. This dirty data may need to be analyzed.
One example where dirty data may need to be analyzed involves credit card transaction processing. Transactions may contain electronic information that includes data in predetermined fields. These predetermined fields might contain specific information, such as, for example, transaction amount, credit card number, identification information, merchant information, date, time, etc. Various types of data errors may be introduced in each of the millions of credit card transactions that are recorded each day. For example, the merchant identifying data field for a transaction record might be tainted with information specific to the individual transaction. As an example, consider a data set of transactions where the merchant name field indicates the merchant name and additional merchant information. This information might be added by the merchants and may include a store number or other merchant specific information that might not be needed by the clearinghouse to authorize or settle the transaction. In some cases it might be important to clean this data to conform to a format that specifies the merchant name without any of the additional information. In other cases, data storage space might be saved by using one of various data de-duplication systems and methods. For example, a name used in many transactions might be saved in one data storage location and a pointer might be saved in other data storage locations.
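As a rough illustration of the merchant name example, the sketch below first cleans the field and then stores each distinct name once, with every transaction holding only an index that serves as the pointer. The trailing store-number formats it strips ("#0412", "STORE 7") are hypothetical, as are the function names; real transaction records would need rules matched to the actual data.

```python
import re

def clean_merchant(raw):
    """Strip a trailing store number and normalize case and spacing.

    Assumed (hypothetical) dirty formats: "Name #0412" and "Name STORE 7".
    """
    name = re.sub(r"\s*(#|STORE\s+)\d+\s*$", "", raw.strip(), flags=re.IGNORECASE)
    return " ".join(name.upper().split())

def intern_names(transactions):
    """Store each cleaned merchant name once; records keep an index into `names`."""
    names = []     # one storage location per distinct merchant name
    index = {}     # cleaned name -> position in `names`
    records = []   # (pointer to name, amount) for each transaction
    for raw, amount in transactions:
        name = clean_merchant(raw)
        if name not in index:
            index[name] = len(names)
            names.append(name)
        records.append((index[name], amount))
    return names, records

transactions = [
    ("Acme Hardware #0412", 19.99),
    ("acme  hardware store 7", 5.00),
    ("Corner Cafe", 3.25),
]
names, records = intern_names(transactions)
```

Here the two differently formatted Acme entries resolve to the same cleaned name, so that name is stored once and referenced twice.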
There are two main types of de-duplication: inline and offline. Inline de-duplication is performed by a device in the data path. Because redundant data is removed before it is stored, this may reduce the disk capacity required to store electronic data, thereby increasing cost savings. A disadvantage of inline de-duplication is that the data is processed while it is being transmitted for backup, which may slow down the backup process.
In contrast, offline data de-duplication does not perform the data de-duplication in the data path, but instead performs the process at the backup system. This may require more data storage capacity, such as, for example, disk capacity. Performance may, however, be improved by having the process reside outside of the data path, after the backup job is complete. In other words, because the data is processed after being transmitted for backup it generally will not slow the transmission of data down.
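The capacity trade-off between the two approaches can be illustrated with a toy model. The sketch assumes the backup arrives as a list of byte blocks and that duplicates are detected with SHA-256; the function names and the byte counters are illustrative only, and the model ignores timing entirely.

```python
import hashlib

def inline_backup(blocks):
    """De-duplicate each block as it arrives, before anything is stored."""
    store = {}
    peak_bytes = 0
    for block in blocks:
        store.setdefault(hashlib.sha256(block).digest(), block)
        peak_bytes = max(peak_bytes, sum(len(b) for b in store.values()))
    return store, peak_bytes

def offline_backup(blocks):
    """Land all raw blocks first, then de-duplicate after the job completes."""
    staging = list(blocks)                       # raw copy held at the backup system
    staged_bytes = sum(len(b) for b in staging)  # capacity needed before clean-up
    store = {}
    for block in staging:
        store.setdefault(hashlib.sha256(block).digest(), block)
    return store, staged_bytes

blocks = [b"A" * 100, b"B" * 100, b"A" * 100, b"A" * 100]
inline_store, inline_peak = inline_backup(blocks)
offline_store, offline_staged = offline_backup(blocks)
```

Both paths end with the same de-duplicated store, but the offline path must first hold all 400 raw bytes, while the inline path never holds more than the 200 unique bytes. The cost of the inline path, the hashing work done while data is in flight, is not modeled here.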
In some systems, data de-duplication technology uses dictionary-based hashing to eliminate redundant sets of variable size blocks within the data stream. The dictionary lookup method is very effective in reducing the data; however, this approach requires extensive processing power and fast storage devices to reduce the data. This can mean that many dictionary-based de-duplication approaches are not suitable for tape backup and may require high disk bandwidth in Virtual Tape Library systems. Accordingly, in some cases it may be advantageous to use de-duplication technology that does not use dictionary-based de-duplication approaches.
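For orientation, a dictionary-based approach over variable size blocks might look roughly like the sketch below. It is a simplified illustration, not any specific system's method: block boundaries are chosen with a toy rolling sum over the last `WINDOW` bytes, and the constants, function names, and use of SHA-256 as the dictionary key are all assumptions made for the example.

```python
import hashlib

WINDOW = 16      # bytes covered by the rolling sum (illustrative value)
DIVISOR = 64     # controls how often a boundary is declared (illustrative value)
MIN_BLOCK = 8    # avoid very small blocks (illustrative value)

def split_blocks(data):
    """Split bytes into variable size blocks at content-defined boundaries."""
    blocks, start, window_sum = [], 0, 0
    for i, byte in enumerate(data):
        window_sum += byte
        if i >= WINDOW:
            window_sum -= data[i - WINDOW]   # slide the window forward
        if i + 1 - start >= MIN_BLOCK and window_sum % DIVISOR == DIVISOR - 1:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])          # trailing partial block
    return blocks

def deduplicate(data, dictionary):
    """Add unseen blocks to the dictionary; return the list of block digests."""
    recipe = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).digest()
        dictionary.setdefault(digest, block)  # one lookup per block
        recipe.append(digest)
    return recipe

def restore(recipe, dictionary):
    """Rebuild the original bytes by following each digest into the dictionary."""
    return b"".join(dictionary[d] for d in recipe)

dictionary = {}
data = bytes(range(256)) * 20
first = deduplicate(data, dictionary)
blocks_after_first = len(dictionary)
second = deduplicate(data, dictionary)   # same data again, e.g. the next backup
```

Every block requires a hash computation and a dictionary lookup, and restoring data requires random access into the dictionary, which is consistent with the processing power and fast storage demands noted above.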