Many electronic information storage systems store files by first breaking them up into blocks called “chunks” based on their contents, and then storing only one copy of each identical chunk. This process of not storing duplicate copies of identical chunks achieves various storage efficiencies, as a file system typically includes a lot of duplicate content. Importantly, the system identifies identical chunks by comparing a cryptographic hash of the contents of the chunks. A client of the storage system that desires to write a file first communicates only the hashes of the chunks of the file to be written. The storage system responds by requesting the full contents of the chunks that are not already stored, again based solely on a comparison of hash values. Although in theory this system would fail to preserve data integrity when two different chunks hash to the same value (a “hash collision”), the probability of such a collision is so small as to be deemed virtually impossible. This approach of communicating hashes cuts down on the communication bandwidth used between the storage system and its clients during file write operations. Thus, this type of system has the advantages of reduced storage overhead and communication bandwidth when compared with other types of systems, and operates well in the domain of archival storage systems, where the interaction with the system is only through well controlled client software. However, moving this type of system into the domain of general file systems exposes a problem of data privacy.
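The write exchange described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular storage system: the names `ChunkStore`, `missing`, `put`, and `write_file` are hypothetical, SHA-256 is assumed as the cryptographic hash, and the file is assumed to have already been divided into chunks.

```python
import hashlib


def chunk_hash(chunk: bytes) -> str:
    """Name a chunk by the SHA-256 digest of its contents (assumed hash)."""
    return hashlib.sha256(chunk).hexdigest()


class ChunkStore:
    """Hypothetical in-memory store that keeps one copy per distinct chunk."""

    def __init__(self) -> None:
        self._chunks: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> list[str]:
        # Step 2 of a write: report which hashes are not yet stored.
        return [h for h in hashes if h not in self._chunks]

    def put(self, chunk: bytes) -> None:
        # Step 3 of a write: accept the full contents of a requested chunk.
        self._chunks[chunk_hash(chunk)] = chunk

    def stored_count(self) -> int:
        return len(self._chunks)


def write_file(store: ChunkStore, chunks: list[bytes]) -> int:
    """Write a pre-chunked file; return how many chunk bodies were sent."""
    by_hash = {chunk_hash(c): c for c in chunks}
    # Step 1: the client communicates only the hashes.
    needed = store.missing(list(by_hash))
    # Only chunks the store does not already hold cross the wire.
    for h in needed:
        store.put(by_hash[h])
    return len(needed)
```

In this sketch, writing the same two chunks a second time transmits no chunk bodies, and writing a file that shares one chunk with an earlier file transmits only the new chunk, which is the source of both the storage and the bandwidth savings.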
One might assume that such a system, which uses collision-resistant one-way hash values to effectively name its file chunks, is immune to a data privacy attack. However, a user of the system who can guess the contents of another user's file chunk can determine whether a chunk with that content exists in the system. The simplest mechanism is a read request that specifies the hash of the guessed contents: if the storage system has that chunk, it responds with the data; otherwise, it responds with an error. Even if the storage system provides an access control mechanism to prevent such read probing, two write-based attacks are still possible. The first involves attempting a write of the guessed chunk and observing whether the system requests the full chunk contents. The second applies when that low-level interface to the storage system is not available to the user: simply timing the storage of the guessed chunk indicates whether or not it is already present on the system, because a chunk that is already present need not be transmitted. An adversary can often narrow the contents of a file down to a limited set of possibilities, and being able to confirm the actual content from such a guessed set would be useful to that adversary.
For example, a file may consist of the simple message “The attack starts at dawn.” An adversary can create files with the messages “The attack starts at midnight.”, “The attack starts at noon.”, and “The attack starts at dawn.” Probing the file system with the hashes of these three files to learn which file is already stored reveals when the attack will occur.
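The probe in this example can be sketched in a few lines. The sketch below is illustrative only: `DedupStore`, `needs_chunk`, and `confirm_guess` are hypothetical names, SHA-256 is an assumed hash function, and each message is assumed to fit in a single chunk. Note that the adversary submits only hashes and never needs read access to the victim's file.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 is an assumption; any cryptographic hash behaves the same."""
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    """Hypothetical store that tells a writer which chunks it still needs."""

    def __init__(self) -> None:
        self._known: set[str] = set()

    def store(self, data: bytes) -> None:
        self._known.add(digest(data))

    def needs_chunk(self, chunk_digest: str) -> bool:
        # The reply that leaks information: False means "already stored".
        return chunk_digest not in self._known


def confirm_guess(store: DedupStore, candidates: list[bytes]) -> list[bytes]:
    """Return every candidate whose hash the store already holds."""
    return [c for c in candidates if not store.needs_chunk(digest(c))]


# The victim stores the real message.
store = DedupStore()
store.store(b"The attack starts at dawn.")

# The adversary probes with hashes of the guessed messages.
guesses = [
    b"The attack starts at midnight.",
    b"The attack starts at noon.",
    b"The attack starts at dawn.",
]
print(confirm_guess(store, guesses))  # -> [b'The attack starts at dawn.']
```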
An example of this problem is further illustrated in FIG. 1. As FIG. 1 illustrates, Alice and Bob each have a copy of a public memo. Each also has a private award letter detailing how many stock option shares he or she has been awarded (in this example, 1,000 for Alice and 20 for Bob). Say that Alice saves her copy of the public memo first. The memo is first divided into chunks A and B based on the content, in this case the header and the body. Alice first sends the hashes of the chunks, H(A) and H(B). The storage system does not have chunks with these hashes, so it asks Alice to send them (this is illustrated as the “send H(A), H(B)” box in FIG. 1). Alice proceeds to send chunks A and B. Bob later stores his copy of the public memo. When he sends the hashes of his chunks, H(A) and H(B), the storage system replies that Bob need not send the chunks (this is illustrated as the “got it!” box in FIG. 1). If Alice then stores her private award letter, she sends the hashes of its chunks A and C, namely H(A) and H(C). The storage system replies that it needs only the chunk labeled C, which Alice sends. When Bob later stores his private award letter, the storage system reports that he need only send the chunk labeled D. Now, if Bob wants to know whether anyone received 1,000 option shares, he can create a file with the same content as his private file, replacing 20 with 1,000, and attempt to store it. If the storage system does not ask for the chunk, Bob knows that someone got 1,000 option shares, which is more than he received. If the private file also contains the recipient's name, Bob can find out who got more options by replacing his name with the name of each of his co-workers in turn. If Bob does not have access to the low-level protocol, he can instead time how long it takes to store his guessed chunks.
What is needed are methods, computer readable media and computer systems that preserve the reduced data storage size and reduced data communication bandwidth provided by existing hash-based storage systems, yet at the same time provide the data privacy that the existing systems lack.