Cryptographic hash functions are used in many applications, including detecting duplicate data and uniquely identifying files. In a content addressable storage (CAS) system, the hash value generated by a cryptographic hash function can be used to “fingerprint” data, allowing a large block of data to be identified by a much smaller hash value. Because a cryptographic hash function makes it very unlikely that two different blocks of data produce the same hash value, its use keeps the number of data collisions during storage low (a collision may result in, e.g., stored data being incorrectly overwritten with new data).
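As an illustration of fingerprinting, the following minimal sketch derives a fixed-size address from a block of arbitrary size. The choice of SHA-256 and the helper name `fingerprint` are assumptions made for this example only; the text above does not name a particular hash function.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest used as the content address of `data`.

    SHA-256 is an assumed choice for this sketch, not one named by the text.
    """
    return hashlib.sha256(data).hexdigest()


# A 4 MiB block is identified by a 32-byte digest (64 hex characters).
large_block = b"\x00" * (4 * 1024 * 1024)
address = fingerprint(large_block)
print(len(large_block), "bytes identified by a", len(address) // 2, "byte digest")
```

Identical blocks always yield the same address, which is what allows the address to stand in for the data.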
CAS systems often deduplicate data automatically. Deduplication is valuable because it is very common for the same data to be stored in multiple places, consuming large amounts of space. For example, virtual machine images may share a majority of their data (e.g., system files and installed applications). Thus, by storing identical data only once, considerable reductions in storage cost can be achieved. However, because a CAS system fundamentally represents a large value (the data) by a smaller value (the hash), distinct blocks of data can still map to the same hash value, and such collisions result in corrupt data.
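The deduplication described above can be sketched as a store keyed by content hash. The class `ContentStore`, its method names, and the use of SHA-256 are hypothetical and chosen only for illustration; the sketch assumes the hash is trusted as the sole identity of a block, which is exactly the assumption that a collision would violate.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory CAS that keeps one copy per distinct hash."""

    def __init__(self) -> None:
        self._blocks: dict[str, bytes] = {}
        self.logical_bytes = 0  # total bytes callers asked to store

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.logical_bytes += len(data)
        # Store the block only if this hash has not been seen before.
        self._blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]

    @property
    def physical_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


# Two virtual machine images that share their system and application blocks.
system_files = b"S" * 4096
applications = b"A" * 4096
image_a = [system_files, applications, b"user-a data"]
image_b = [system_files, applications, b"user-b data"]

store = ContentStore()
keys_a = [store.put(block) for block in image_a]
keys_b = [store.put(block) for block in image_b]
print("logical bytes:", store.logical_bytes, "physical bytes:", store.physical_bytes)
```

The shared blocks are written once, so the physical footprint is smaller than the logical amount of data stored.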
Additionally, attacks against a CAS system become possible when the underlying hash function is broken and computing colliding hashes becomes feasible. An attacker can create a bad block of data whose hash equals the hash of a block of data already in CAS storage and then inject the bad block into the CAS system. Thus, collisions can be constructed that allow an attacker to corrupt or substitute data in the system, resulting in reduced user confidence in this type of storage system.
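The attack can be illustrated with a toy example. The sketch below assumes a deliberately weakened hash, `weak_hash`, that keeps only the first 16 bits of a SHA-256 digest so that a colliding block can be found by exhaustive search in well under a second. The weakened hash, the helper `find_colliding_block`, and the dictionary-based store that overwrites on write are all hypothetical stand-ins; a real broken hash function would be attacked by other means, but the effect on the store is the same.

```python
import hashlib
from itertools import count


def weak_hash(data: bytes) -> str:
    """Stand-in for a broken hash: only 16 bits of SHA-256 (assumption)."""
    return hashlib.sha256(data).digest()[:2].hex()


def find_colliding_block(target: bytes) -> bytes:
    """Search for a different block with the same weak hash as `target`."""
    wanted = weak_hash(target)
    for i in count():
        candidate = b"bad block %d" % i
        if candidate != target and weak_hash(candidate) == wanted:
            return candidate


# A naive store that trusts the hash and overwrites on every write.
store: dict[str, bytes] = {}

good = b"legitimate block"
key = weak_hash(good)
store[key] = good

# The attacker crafts a colliding block and injects it into the store.
bad = find_colliding_block(good)
store[weak_hash(bad)] = bad

print("data under original key was replaced:", store[key] != good)
```

A reader who later requests the original key receives the attacker's block, which is the data substitution described above.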