In computing generally, data de-duplication is a form of data compression used to eliminate redundant data and increase storage utilization. In the de-duplication process, duplicate data is deleted, leaving only a single unique copy of the data to be stored, along with references to that copy. De-duplication can therefore reduce the required storage capacity, since only the unique data is stored.
Depending on the type of de-duplication being implemented, the number of redundant data files may be reduced, or redundant portions of data files or other similar data may be removed. Different applications and data types have different levels of data redundancy. Backup applications generally benefit the most from de-duplication, because repeated full backups of an existing file system contain largely the same data each time.
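The effect of repeated full backups can be illustrated with a small calculation. The following is a minimal sketch only: the function name `dedup_savings`, the assumption that a fixed fraction of the file system changes between backups, and the example figures are all hypothetical and are not taken from any particular system.

```python
from typing import Tuple


def dedup_savings(full_size_gb: float, num_backups: int,
                  change_fraction: float) -> Tuple[float, float, float]:
    """Estimate storage for repeated full backups with and without de-duplication.

    Assumes (hypothetically) that each backup after the first differs from
    the previous one by a fixed fraction of the file system, and that only
    the changed data is new to the de-duplicated store.
    """
    # Without de-duplication every full backup is stored in its entirety.
    logical_gb = full_size_gb * num_backups
    # With de-duplication the first backup is stored once, plus the
    # changed data contributed by each later backup.
    stored_gb = full_size_gb + (num_backups - 1) * change_fraction * full_size_gb
    savings = 1.0 - stored_gb / logical_gb
    return logical_gb, stored_gb, savings


# Illustrative figures only: 30 full backups of a 1000 GB file system
# with 2 percent of the data changing between backups.
logical, stored, savings = dedup_savings(1000.0, 30, 0.02)
print(f"logical={logical:.0f} GB stored={stored:.0f} GB savings={savings:.1%}")
```

Under these assumed figures the de-duplicated store holds roughly 1,580 GB instead of 30,000 GB. Real savings depend on the actual change rate and on how finely the data is compared.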
In operation, de-duplication identifies identical sections of data and replaces them with references to a single copy of the data. Data de-duplication can increase the speed of service and reduce costs, including overall data protection costs, and can also improve overall data integrity. In some cases, data de-duplication allows users to reduce the amount of disk space they need for data backup by 90 percent or more. It also reduces the amount of data that must be sent across a WAN for remote backups, replication, and disaster recovery.
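The replacement of identical sections by references can be sketched as follows. This is an illustrative example only, assuming fixed-size sections ("chunks") identified by their SHA-256 digest; the class name `DedupStore`, its methods, and the 4096-byte chunk size are hypothetical choices, and practical systems may use other section boundaries and identifiers.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy in-memory store keeping one copy of each distinct chunk."""

    def __init__(self, chunk_size: int = 4096) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}  # digest -> the single stored copy

    def write(self, data: bytes) -> List[str]:
        """Store data and return the list of references that describes it."""
        references = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Keep the chunk only if this digest has not been seen before.
            self.chunks.setdefault(digest, chunk)
            references.append(digest)
        return references

    def read(self, references: List[str]) -> bytes:
        """Rebuild the original data by following the references."""
        return b"".join(self.chunks[digest] for digest in references)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore()
data = b"A" * 4096 * 3 + b"B" * 4096   # three identical chunks plus one distinct
refs = store.write(data)
print(len(data), store.stored_bytes(), store.read(refs) == data)
```

In this example 16,384 bytes of input are described by four references but occupy only 8,192 bytes of chunk storage, because the three identical sections share a single stored copy.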
Data de-duplication is particularly effective when used with virtual servers, where it can de-duplicate the virtual system state files used when deploying virtual servers. In many cases, virtual servers contain duplicate copies of operating system files and other system files. Additionally, backups and duplicate copies of virtual environments contain a high degree of duplicate data. Data de-duplication can provide considerable capacity and cost savings compared to conventional disk backup technologies.
However, when data is transformed, de-duplicated, and/or accessed, concerns arise about potential loss of data or of data integrity (e.g., through unauthorized access). By definition, data de-duplication systems store data differently from how it was originally written, which raises concerns about the integrity of the data. Ultimately, the integrity of the data depends upon the design of the de-duplicating system and the quality of the implementation of its algorithms. One method for de-duplicating data relies on cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends upon the hash function used, and although the probability is small, it is always non-zero. As a result, there is a concern that data corruption can occur if a hash collision occurs, because one segment could be mistaken for a different segment that has the same hash value.
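The dependence of the collision probability on the hash function can be estimated with the standard birthday approximation, p ≈ 1 − exp(−n(n−1) / (2·2^b)), for n segments and a b-bit hash. The sketch below is illustrative only: the function name `collision_probability` and the segment counts are hypothetical, and the formula assumes hash outputs that are uniformly distributed.

```python
import math


def collision_probability(num_segments: int, hash_bits: int) -> float:
    """Approximate probability that at least two of num_segments distinct
    segments share a hash value, for an ideal hash with hash_bits output bits.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**b)).
    """
    exponent = -num_segments * (num_segments - 1) / (2.0 * 2.0 ** hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Hypothetical workloads: about 10**12 segments under a 256-bit hash,
# and 100,000 segments under a 32-bit hash for contrast.
print(collision_probability(2 ** 40, 256))
print(collision_probability(100_000, 32))
```

Under these assumptions a 256-bit hash yields a vanishingly small but non-zero probability, whereas a 32-bit hash makes a collision more likely than not for only 100,000 segments, which illustrates why the choice of hash function matters.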
In addition to concerns about data integrity, malicious attacks are another major concern any time user data is accessed and/or modified. When user data is processed for de-duplication, the resulting backup data may become more vulnerable to unauthorized access. Providing data security during the data backup, data de-duplication, and data restore stages of data processing is therefore important.