This disclosure generally relates to the field of data storage, and, more particularly, to reducing data duplication in storage.
To store data efficiently, remote storage providers employ data deduplication. Instead of maintaining separate copies of a large chunk of data (e.g., a file or a section of a large file), data deduplication eliminates the duplicates and keeps a single copy that is referenced from the metadata associated with each of the different users. Remote storage providers can employ post-process data deduplication or in-line data deduplication. In addition, a data source can perform data deduplication before the data is sent to the remote storage provider.
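The deduplication described above can be sketched as a small content-addressed store. This is only an illustrative sketch under stated assumptions, not any provider's implementation: the class name `DedupStore`, its `put` and `get` methods, the choice of SHA-256 as the fingerprint function, and the user and file names are all hypothetical. It roughly models in-line deduplication, in which a duplicate is detected before the chunk is written.

```python
import hashlib


class DedupStore:
    """Hypothetical toy store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes, stored once
        self.metadata = {}  # user -> {name: fingerprint}

    def put(self, user, name, data):
        """Store data for a user; return True only if a new copy was written."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.chunks
        if is_new:
            self.chunks[fingerprint] = data
        # Each user's metadata references the single shared copy.
        self.metadata.setdefault(user, {})[name] = fingerprint
        return is_new

    def get(self, user, name):
        """Resolve a user's reference back to the shared chunk."""
        return self.chunks[self.metadata[user][name]]


store = DedupStore()
payload = b"quarterly report" * 1000

first_write = store.put("alice", "report.doc", payload)  # new chunk is stored
second_write = store.put("bob", "copy.doc", payload)     # duplicate is only referenced
```

In this sketch both users can read the data back, but only one copy of the chunk occupies storage; the second `put` adds a metadata reference and nothing else.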
In an attempt to secure data, hash values are used to prove ownership of the data efficiently, because a hash value is substantially smaller than the data from which it is computed. But these shorter pieces of information, sometimes referred to as fingerprints, have vulnerabilities. The hash functions are publicly known. An attacker can generate numerous hash values with the publicly known hash functions and feign ownership of a file if any of the generated hash values happens to match a hash value held at a remote storage provider. The attacker can then present the matching hash value as proof of ownership and retrieve the entire file.
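The weakness described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class `NaiveProvider`, its `upload` and `claim` methods, the use of SHA-256, and the deliberately tiny, guessable input space (a four-digit value) that lets the search finish quickly. The point of the sketch is that a provider that accepts a fingerprint alone as proof of ownership cannot distinguish a real owner from anyone who happens to present a matching hash value.

```python
import hashlib


class NaiveProvider:
    """Hypothetical provider that treats a fingerprint as proof of ownership."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> stored data

    def upload(self, data):
        fingerprint = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(fingerprint, data)
        return fingerprint

    def claim(self, fingerprint):
        """Return the stored data to anyone who presents a matching fingerprint."""
        return self.chunks.get(fingerprint)


provider = NaiveProvider()
secret = b"PIN=4821"      # assumed low-entropy content of a legitimate user
provider.upload(secret)

# The attacker never uploaded anything. The attacker only generates hash
# values with the public hash function and presents each one as "proof".
recovered = None
for pin in range(10000):
    candidate = hashlib.sha256(f"PIN={pin:04d}".encode()).hexdigest()
    data = provider.claim(candidate)
    if data is not None:
        recovered = data
        break
```

In this toy the candidate inputs are small enough to enumerate, so the attacker's guess coincides with the stored content. The same `claim` call would release a large file just as readily to a party that obtained only its fingerprint by some other means, since nothing beyond the hash value is checked.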