Businesses as well as individuals are becoming increasingly dependent on computers, networks and electronic data storage. Electronic data are usually stored in local storage systems and/or network-based cloud storage systems. As more and more data are generated, the need for efficient and reliable data backup storage systems and methods is also increasing. The rapid growth of data storage requirements, as well as the increasing need for data to be distributed over networks spanning the globe, has led people to seek ways to reduce the amount of data being stored and distributed, without reducing the information or utility of that data. Therefore, the use of data deduplication technology for managing capacity and bandwidth is rapidly emerging as a standard practice.
In the data storage industry, deduplication refers to a process which searches for regions within a file system or disk which contain duplicate data, stores that data in some form of database, and then replaces the regions with references to the database. In a simple file system implementation, for example, multiple copies of the same file would be replaced by links to a central repository, while a more sophisticated implementation might look inside the files for shared segments. Disk systems, also called block-based systems, lack insight into file system structure, and will typically base their comparisons on the raw blocks of the disk.
Data deduplication technology breaks an incoming data stream into a series of data segments and tests the system for the presence of each data segment before storing it, in order to avoid storing it multiple times. Data deduplication technology also identifies and removes those portions of data which are redundant, thus allowing systems to store and transmit only small references to much larger data segments. Some storage systems that utilize data deduplication technology can achieve high data compression factors of 10 to 50 or more.
The basic approach to deduplication on storage systems includes the following steps. First, data is received by the deduplication subsystem and broken into segments, and each segment is then tagged with some variant of a hash code. The hash code serves as a short identifier for a much larger segment of data, and is used as a component in a large index structure. The incoming segment's hash code is compared against existing entries in the index, and if no match is found, the segment is stored in a new entry containing both the hash code and the original data. Some virtual representation of the storage container exists as well, and the hash code is used within that virtual representation as a placeholder for that data segment. If the incoming hash code does match an existing index entry, then that hash code is simply placed into the virtual representation. When a request to access a storage location is received by the storage subsystem, it begins processing by looking within the corresponding virtual representation of the storage segment(s) within that container. The hash codes are retrieved and used to look up the original segments in the index. Finally, those segments are used to reconstruct the contents of the original storage location.
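The write and read paths described above can be illustrated with a minimal in-memory sketch. It is not a definitive implementation: the class name `DedupStore`, the use of SHA-256 as the hash code, the fixed 4096-byte segment length, and the dictionary-based index and virtual representation are all assumptions made for illustration.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed segment length, in bytes


class DedupStore:
    """Toy in-memory deduplicating store (hypothetical, for illustration)."""

    def __init__(self, segment_size=SEGMENT_SIZE):
        self.segment_size = segment_size
        self.index = {}       # hash code -> original segment data
        self.containers = {}  # container name -> ordered list of hash codes

    def write(self, name, data):
        """Break data into segments and store only segments not yet indexed."""
        codes = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            code = hashlib.sha256(segment).hexdigest()
            if code not in self.index:
                # No match: keep both the hash code and the original data.
                self.index[code] = segment
            # Match or not, the hash code is the placeholder for the segment.
            codes.append(code)
        self.containers[name] = codes

    def read(self, name):
        """Rebuild a container by looking up each hash code in the index."""
        return b"".join(self.index[code] for code in self.containers[name])

    def stored_bytes(self):
        """Bytes of segment data actually held in the index."""
        return sum(len(segment) for segment in self.index.values())
```

With a 4-byte segment length, for example, writing `b"AAAABBBBAAAA"` and then `b"BBBBCCCC"` to two containers leaves three index entries holding 12 bytes of segment data, although 20 bytes were written. Both containers still read back unchanged.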
There are a number of variations on that basic theme, including fixed- or variable-length segments, in-line or post-process deduplication, or file- versus block-based representation. In-line deduplication is done upon initial receipt of an IO request by the storage subsystem, while post-process deduplication is performed some time after the original data is stored. Post-process deduplication presents less performance overhead, at the cost of having to store all of the original data for some period of time. File-based deduplication works within a file system, searching for duplicate or similar files, while block-based deduplication treats the entire subsystem as a single data stream, without regard for higher-level structure.
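The practical difference between fixed- and variable-length segments can be sketched as follows. The example assumes one common way of choosing variable-length boundaries, namely cutting wherever a hash of the last few bytes meets a condition. The function names, the window size, the mask and the sample data are all hypothetical. It compares how many segments survive unchanged when a single byte is inserted at the front of the data.

```python
import hashlib


def fixed_segments(data, size=16):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_segments(data, window=4, mask=0x0F, max_size=64):
    """Cut wherever a hash of the last `window` bytes has its low bits clear.

    A cut is also forced once a segment reaches `max_size` bytes.
    """
    segments, start = [], 0
    for i in range(len(data)):
        tail = data[max(0, i - window + 1):i + 1]
        h = int.from_bytes(hashlib.sha256(tail).digest()[:4], "big")
        if (h & mask) == 0 or i - start + 1 >= max_size:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])
    return segments


def shared(a, b):
    """Number of distinct segments of `a` that also occur in `b`."""
    return len(set(a) & set(b))


# Deterministic pseudo-random sample data (an assumption for the demo),
# and the same data with one byte inserted in front of it.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
shifted = b"\x00" + original

fixed_shared = shared(fixed_segments(original), fixed_segments(shifted))
variable_shared = shared(variable_segments(original), variable_segments(shifted))
```

Because the fixed boundaries all move by one byte, the shifted copy shares essentially no fixed-length segments with the original. The content-defined boundaries realign after the first cut, so most variable-length segments still match.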
Data deduplication is extremely effective in a number of commonly used modern computational environments. Microsoft Exchange, for example, stores as many copies of a file as are sent for distribution. Virtualized server environments such as VMware's ESX server are often configured with a large number of virtual machines, which may be extremely similar to one another. In these types of situations, the actual amount of storage used can be greatly reduced, since all of the identical data segments across all of the files and virtual machines will occupy only a single entry in the hash index. As was mentioned above, deduplication ratios of 10:1 are often claimed as the average performance, meaning that for every 10 storage units used in the original data set, only one storage unit is used in the deduplicated set. It is very simple to come up with data sets that achieve much higher deduplication rates simply by including more duplicate data, e.g. by adding more virtual machines.
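The effect of adding more virtual machines on the deduplication ratio can be worked through with a small calculation. The model below is a simplifying assumption, not a measurement: every virtual machine is taken to consist of a number of segments common to all machines plus a number of segments unique to it, and index overhead is ignored. The function and parameter names are hypothetical.

```python
def dedup_ratio(num_vms, shared_segments, unique_segments_per_vm):
    """Ratio of segments in the original data set to segments actually stored."""
    # Every machine contributes its full image to the original data set.
    original = num_vms * (shared_segments + unique_segments_per_vm)
    # Shared segments are stored once; unique segments are stored per machine.
    stored = shared_segments + num_vms * unique_segments_per_vm
    return original / stored


# Assumed image layout: 95 shared segments and 5 unique segments per machine.
ratio_10_vms = dedup_ratio(10, 95, 5)    # 1000 / 145
ratio_100_vms = dedup_ratio(100, 95, 5)  # 10000 / 595
```

Under these assumptions the ratio rises from roughly 6.9:1 with 10 machines to roughly 16.8:1 with 100 machines, and it approaches 20:1 as more machines are added.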
The variations on the baseline approach all have different trade-offs and impacts on performance, in terms of both IO processing and deduplication effectiveness. Using smaller segments, for example, results in more of the segments matching, but the indexing overhead can grow until it overwhelms the benefit of the data reduction. Fundamentally, though, these differences are minor, and in general, all identical data is matched and reduced to a single copy. Some deduplication approaches may seek “almost-identical” segments, comparing two segments, finding that they are almost the same, and then storing just the differences, but these are functionally identical to the variable-length baseline.
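The segment-size trade-off can be illustrated with a rough estimate. The cost model is an assumption made for illustration: each index entry and each reference in the virtual representation is charged one 32-byte hash code, and the fraction of duplicate segments at each segment size is simply supplied as an input rather than measured.

```python
HASH_BYTES = 32  # assumed size of one hash code (e.g. a SHA-256 digest)


def stored_bytes(logical_bytes, duplicate_fraction, segment_size):
    """Estimate bytes kept: unique data + index hash codes + references."""
    total_segments = logical_bytes // segment_size
    unique_segments = round(total_segments * (1 - duplicate_fraction))
    data = unique_segments * segment_size     # one copy of each unique segment
    index = unique_segments * HASH_BYTES      # one hash code per index entry
    references = total_segments * HASH_BYTES  # one hash code per placeholder
    return data + index + references


ONE_MIB = 1024 * 1024
large = stored_bytes(ONE_MIB, 0.50, 4096)  # 50% of segments match
small = stored_bytes(ONE_MIB, 0.75, 64)    # 75% of segments match
```

In this sketch the 64-byte segments match more often, yet the estimate for them is 917,504 bytes against 536,576 bytes for the 4096-byte segments, because the hash codes themselves come to dominate the stored total.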
In summary, efficient deduplication systems and methods are desirable.