This invention relates generally to data deduplication for data storage and network transfers, and more particularly to techniques for transforming data that has been moved and intermingled so that the data that is the same can be identified and deduplicated.
Data deduplication (“DD”) is a data compression technique that eliminates duplicate copies of repeating data to improve storage utilization and to reduce the number of bytes that must be transmitted over a network. Deduplication is particularly important in enterprises having big data networks because of the massive amounts of data that must be transmitted over the network, stored, and backed up. Deduplication is typically performed in connection with a backup. In the deduplication process, chunks of data, or byte patterns, are identified by a fingerprint, such as a hash, that is unique to each chunk of data, and the fingerprints and chunks are stored. As the process continues, the fingerprints of other chunks are compared to the stored fingerprints, and whenever a match occurs the redundant chunk may be replaced with a small reference or pointer to the stored chunk. Since the same byte pattern may occur frequently, the amount of data that must be stored or transferred may be greatly reduced.
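The fingerprint-and-reference process described above may be sketched as follows. This is a minimal illustration only: the fixed chunk size and SHA-256 fingerprint are assumptions for clarity, whereas production deduplication systems typically use variable-size, content-defined chunking and their own fingerprint schemes.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed chunk size (real systems often vary this)

def deduplicate(data, store):
    """Split data into chunks; store each unique chunk once, keyed by its
    fingerprint, and return the sequence of fingerprints (references)."""
    refs = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()  # fingerprint unique to the chunk
        if fp not in store:
            store[fp] = chunk  # new byte pattern: store the chunk itself
        refs.append(fp)        # repeated pattern: only the small reference is kept
    return refs

def reconstruct(refs, store):
    """Rebuild the original data from the stored chunks."""
    return b"".join(store[fp] for fp in refs)
```

For highly repetitive data, the reference list grows with the input while the chunk store grows only with the number of distinct byte patterns, which is the source of the compression.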
There are certain data layouts that do not deduplicate very well because the files are constantly changing and being rearranged, making it difficult to identify redundant data. Cassandra data files are an example of files where deduplication problems occur, because the data in the files is constantly being merged and rearranged, and because redundant copies of the same data are replicated with different layouts on different nodes of a Cassandra cluster. Cassandra is a non-relational decentralized database that is designed to handle high incoming data volume with data arriving from many different locations. It has a massively scalable architecture with multiple nodes that share data around the cluster, so that the loss of a subset of the nodes does not result in a loss of the data, and nodes can be added without taking the cluster down. It also has multi-data center replication across multiple geographies and multiple cloud environments. Each node of the cluster is responsible for a different range of data, which causes partitions in data files to differ between nodes. Moreover, even if the files were identical between nodes, a typical Cassandra backup requires copying a snapshot of the SSTables in which the data are stored from all nodes to backup storage. This creates a problem with deduplication in a DDR deduplication appliance, which considers only fingerprints for deduplication across different streams. Similar data being written at the same time from different files may or may not deduplicate because of timing differences.
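The per-node partitioning described above can be illustrated with a hypothetical token function. Cassandra itself assigns partition keys to token ranges using its configured partitioner (Murmur3 by default); the SHA-256 hash, four-node cluster, and modulo assignment below are simplifying assumptions used only to show why different nodes end up holding different subsets, and hence different file contents.

```python
import hashlib

def token(partition_key):
    """Hypothetical token function: hash the key to a position on the ring.
    (Illustrative only; Cassandra's own partitioner differs.)"""
    return int.from_bytes(hashlib.sha256(partition_key).digest()[:8], "big")

# Each node owns a different range of the token ring, so the same set of
# rows is divided into different partitions, and therefore different
# SSTable contents, on different nodes.
NUM_NODES = 4
rows = [b"user:%d" % i for i in range(1000)]
per_node = {n: [] for n in range(NUM_NODES)}
for key in rows:
    per_node[token(key) % NUM_NODES].append(key)  # simplistic range assignment
```

Because each node's files contain a different, hash-determined subset of the rows, the byte streams backed up from different nodes differ even when the underlying row data is fully replicated.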
Another characteristic of Cassandra which can create deduplication problems is compaction. Compaction is a process for combining SSTables to consolidate data and to remove deleted data (tombstones) after an appropriate timeframe. If a tombstone is removed before it can be replicated, the value may remain on other nodes indefinitely and data that should no longer exist may be returned. The result of compaction is that data will be shifted around to different files, and potentially co-located with different data. The constant reordering of data on any given node due to compaction makes it extremely difficult to deduplicate the data, because the algorithms that identify chunk or segment boundaries are not aware that the data has been rearranged.
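The boundary-shift effect described above can be illustrated as follows. The SSTable contents, the 4 KB fixed-boundary chunking, and the small insertion at the front of the file are assumptions chosen for illustration; the point is that when compaction shifts largely identical data even slightly, chunk fingerprints computed at fixed offsets no longer match.

```python
import hashlib

def fingerprints(data, chunk_size=4096):
    """Fingerprint fixed-size chunks of a byte stream."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

# Hypothetical SSTable contents before and after a compaction that
# merges one small new row near the front of the file.
before = b"row1" * 1024 + b"row2" * 1024 + b"row3" * 1024
after  = b"row0" * 4 + before  # same rows, shifted forward by 16 bytes

common = set(fingerprints(before)) & set(fingerprints(after))
# Although nearly all of the bytes are identical, the shifted chunk
# boundaries mean that no chunk fingerprints match between the layouts.
```

Here, fewer than 0.2% of the bytes changed, yet the two files share no common chunk fingerprints, so a fingerprint-only deduplication appliance would treat the compacted file as entirely new data.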
It has been found with Cassandra that deduplication processes did not provide compression factors greater than low single digits, whether running deduplication between nodes known to have replicated data or performing repeated full backups of the same node known to have replicated copies of data, indicating that the deduplication processes had difficulty identifying redundant data in Cassandra files.
It is desirable to provide solutions that address the foregoing and other known problems of deduplicating Cassandra and other similar types of variable data files in which data is constantly changing, being reorganized, and being reordered with other data, and it is to these ends that the present invention is directed.