Single instancing schemes can be used between clients and servers in a communication network to minimize the amount of data that travels over the network when the data is ultimately stored on a server. In a single instancing scheme, the client first transmits to the server a hash value of a chunk of a file (chunk-based hashing) or a hash value of an entire file (file-based hashing), and the server then compares the transmitted hash value with the server-stored hash values. If the transmitted hash value matches one of the server-stored hash values, the server informs the client that the data (e.g., chunk or file) is already stored on the server and does not need to be transmitted. This compare-by-hash mechanism therefore allows the server to determine, by hashing alone, whether the data that the client intends to transmit is already stored on the server.
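The compare-by-hash exchange described above can be illustrated with a minimal Python sketch. The class `SingleInstanceServer`, the function `client_send`, and the choice of SHA-256 are assumptions made for illustration only; the text does not specify a particular hash function or interface, and an actual scheme would exchange these messages over a network rather than through direct method calls.

```python
import hashlib


class SingleInstanceServer:
    """Hypothetical server that keeps one copy of each piece of data, keyed by hash."""

    def __init__(self):
        self.stored = {}  # hash value -> data

    def has_hash(self, hash_value: str) -> bool:
        # Compare the transmitted hash value with the server-stored hash values.
        return hash_value in self.stored

    def store(self, hash_value: str, data: bytes) -> None:
        self.stored[hash_value] = data


def client_send(server: SingleInstanceServer, data: bytes) -> int:
    """Offer the hash first; return the number of payload bytes actually sent."""
    hash_value = hashlib.sha256(data).hexdigest()  # assumed hash function
    if server.has_hash(hash_value):
        return 0  # server reports a match, so the data is not transmitted
    server.store(hash_value, data)
    return len(data)


server = SingleInstanceServer()
chunk = b"example chunk contents"
first = client_send(server, chunk)   # no match yet: the chunk is transmitted
second = client_send(server, chunk)  # match: only the hash value travels
```

In this sketch the second call sends no payload, which is the bandwidth saving that single instancing aims for.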
Single instancing schemes (chunk-based or file-based) offer significant potential savings in network and storage bandwidth, because data is not transferred across the network when the compared hash values match. However, the compare-by-hash mechanism in these schemes introduces the possibility of a hash collision, in which two different pieces of data produce the same hash value. In a hash collision scenario, the server detects an equality between the transmitted hash value of the data that the client intends to send and a stored hash value of different data that is already stored on the server. Because the hash values are equal, the server informs the client that the data corresponding to the transmitted hash value is already stored on the server. As a result, the client does not transmit its own, different piece of data, and the required storage of that data on the server never occurs.
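The collision scenario can be made concrete with a deliberately weak hash. The following Python sketch is illustrative only: `toy_hash` (a byte sum modulo 256) is a hypothetical stand-in chosen so that a collision is trivial to produce, and `upload` is an assumed name for the client-to-server exchange. Real schemes use cryptographic hashes for which collisions are far harder to find.

```python
def toy_hash(data: bytes) -> int:
    """Deliberately weak hash (byte sum mod 256), used only to force a collision."""
    return sum(data) % 256


server_store = {}  # hash value -> data held by the hypothetical server


def upload(data: bytes) -> bool:
    """Return True if the data was transmitted and stored, False if skipped."""
    key = toy_hash(data)
    if key in server_store:
        return False  # server claims the data is already stored
    server_store[key] = data
    return True


stored_first = upload(b"ab")   # stored normally
stored_second = upload(b"ba")  # different data, same hash value: skipped

# The server still holds only the first piece of data under that hash value.
held = server_store[toy_hash(b"ba")]
```

Here `b"ba"` is never stored, and a later request for it by hash value would return `b"ab"` instead.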
Hash collisions, like software errors and hardware errors, can result in data corruption. However, software errors and hardware errors are non-deterministic in nature. In contrast, hash collisions are deterministic in nature, which means that hackers can potentially exploit them against the data stored in a network device. For example, in a distributed file system (e.g., the Low Bandwidth File System or “LBFS”) where network devices (e.g., clients and/or servers) exchange data on demand, a hostile network device can inject invalid or corrupted data into a receiving network device before the valid data is transmitted to that device. In a hash collision scenario, the previously injected invalid data and the valid data to be transmitted have the same hash value. Because the receiving device detects the same hash value for both, it will not receive and will not store the valid data. In an archival file system, a hostile network device would likewise have to pre-inject the invalid data into the receiving network device before that device receives the valid data. If the invalid data and the valid data have the same hash value, the receiving network device will not receive and will not store the valid data.
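The deterministic nature of the attack can be sketched as follows. This Python example is a simplified illustration under stated assumptions: `short_hash` truncates SHA-256 to 16 bits so that a brute-force search finishes quickly, and `find_colliding_input` and the `junk-` payloads are hypothetical names. Strictly speaking, the search finds a second input matching a known target hash, which is one way a hostile device could prepare data to pre-inject; with a full-length hash the same search would be computationally infeasible by brute force.

```python
import hashlib
from itertools import count


def short_hash(data: bytes) -> bytes:
    """SHA-256 truncated to 16 bits; an assumption that keeps the search fast."""
    return hashlib.sha256(data).digest()[:2]


def find_colliding_input(target: bytes) -> bytes:
    """Brute-force a different input whose short hash equals that of the target."""
    wanted = short_hash(target)
    for i in count():
        candidate = b"junk-%d" % i
        if candidate != target and short_hash(candidate) == wanted:
            return candidate


valid = b"valid archive record"
invalid = find_colliding_input(valid)

# The hostile device pre-injects the invalid data into the receiving device.
receiver_store = {short_hash(invalid): invalid}

# Later, the valid data is offered by hash value and is rejected as a duplicate.
valid_accepted = short_hash(valid) not in receiver_store
```

Because the hash function is deterministic, the same colliding input works every time the valid data is offered, unlike a random software or hardware fault.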
One prior approach to reducing the likelihood of a hash collision is to use larger hash keys such as, for example, SHA-512 (Secure Hash Algorithm, 512 bits) or SHA-1024. Several archive vendors have adopted this approach of using larger hash keys to represent data content. However, this approach does not eliminate the above-discussed deterministic nature of the compare-by-hash mechanism, and it does not detect a hash collision condition. Therefore, the current technology is subject to at least the above constraints and deficiencies.
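The effect of a larger hash key on accidental collisions can be estimated with the standard birthday approximation, p ≈ 1 − exp(−n(n−1)/2^(b+1)) for n items and a b-bit hash. The Python sketch below is illustrative only: it assumes hash outputs are uniformly distributed, the function name `collision_probability` and the figure of 2^40 stored chunks are hypothetical, and the estimate says nothing about deliberately constructed collisions.

```python
import hashlib
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Birthday approximation, assuming uniformly distributed hash values."""
    exponent = num_items * (num_items - 1) / (2 * 2 ** hash_bits)
    return -math.expm1(-exponent)  # equals 1 - exp(-exponent), stable when tiny


num_chunks = 2 ** 40  # hypothetical number of stored chunks
for name in ("sha1", "sha256", "sha512"):
    bits = hashlib.new(name).digest_size * 8
    print(name, bits, collision_probability(num_chunks, bits))
```

The probability shrinks rapidly as the key grows but never reaches zero, and nothing in the calculation provides a way to notice a collision when one does occur.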