1. Problems with De-Duplication of Data Transmission
A security concern exists in network storage systems that employ de-duplication of data transmission. Generally, such a storage system contains a set of files or file pieces and indexes them by content (e.g., by a Secure Hash Algorithm (SHA-1) hash). A client of such a system can avoid transferring over the network files or file pieces that already exist in the system by first querying the system as to whether the content identifier (e.g., a SHA-1 hash) of each particular data piece exists, and then sending only the pieces of data that the storage system does not already have. The storage system can read the duplicate pieces of data referred to by the client out of its own storage instead of requiring the client to send them over the network.
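By way of illustration only, the query-then-send exchange described above might be sketched as follows. The class `DedupStore`, its methods `has` and `put`, and the function `upload` are hypothetical names introduced for this sketch and do not correspond to any particular system; SHA-1 is used as the example content identifier.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical; for illustration only)."""

    def __init__(self):
        self._pieces = {}  # maps SHA-1 hex digest -> piece bytes

    def has(self, digest):
        # Answers the client's de-duplication query for one content identifier.
        return digest in self._pieces

    def put(self, data):
        # Stores a piece under its content identifier and returns that identifier.
        digest = hashlib.sha1(data).hexdigest()
        self._pieces[digest] = data
        return digest


def upload(store, pieces):
    """Send only the pieces the store lacks; return the number of bytes sent."""
    sent = 0
    for piece in pieces:
        digest = hashlib.sha1(piece).hexdigest()
        if not store.has(digest):  # query first, by content identifier
            store.put(piece)       # transfer only on a miss
            sent += len(piece)
    return sent
```

In this sketch, a second upload that repeats an already-stored piece transfers nothing for that piece, which is the saving that transmission de-duplication is intended to provide.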
The security concern lies in the fact that clients of the system can “byte twiddle” to produce likely matches to files on the storage system and thereby deduce, from the storage system's responses to transmission de-duplication requests, whether a file or a piece of a file already exists on the system. For example, if a system stored slightly modified form letters for employees of a company describing the employees' bonuses for a year, a malicious client of the system, “Bob”, could (i) take his form letter and change the name on the letter from “Bob” to “Alice”, (ii) change the bonus from $10 to $11, and (iii) ask the storage system whether such a file already exists in the system. If so, Bob would have discovered Alice's bonus. If not, Bob could try $12, and so on, until the bonus is discovered.
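The probing in the form-letter example above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the letter template, the function names `letter`, `digest`, and `probe_bonus`, and the `exists` callback standing in for the storage system's de-duplication query are all hypothetical.

```python
import hashlib


def digest(text):
    # Content identifier for a whole file (SHA-1 is the example hash).
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def letter(name, bonus):
    # Hypothetical form-letter template; only the name and amount vary.
    return f"Dear {name}, your bonus for this year is ${bonus}."


def probe_bonus(exists, name, low, high):
    """Try each candidate amount until the store reports a match."""
    for bonus in range(low, high + 1):
        if exists(digest(letter(name, bonus))):
            return bonus  # a "yes" answer leaks the stored amount
    return None


# The set stands in for the storage system's index of content identifiers.
stored = {digest(letter("Alice", 42)), digest(letter("Bob", 10))}
guess = probe_bonus(stored.__contains__, "Alice", 10, 100)
```

Each failed guess costs the client only one query, so a small range of plausible amounts can be exhausted quickly.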
2. Prior Art Systems
Referring to FIG. 1, one prior art approach to addressing this security concern is to store access information in the storage system for each of the pieces of data. Before the storage system responds to the client that it already has a particular piece of data, the system first checks that the client has sufficient permission (i.e., read permission) to the data in question. For whole-file network de-duplication, this can be done through an Access Control List (ACL) check before responding to the client. For sub-file de-duplication systems, the check is more difficult because the data pieces are not each associated with a particular ACL, and each data piece may be part of many different files. In that case, the storage system must store, for each file piece, a member list identifying the files of which the piece is a part. The storage system must then check the ACLs of the member files to find at least one that grants sufficient permission (i.e., read permission). This method requires the maintenance of a list of members for each piece of data, and results in a slower de-duplication process because many ACLs may have to be checked for each piece.
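The sub-file check described above might look like the following sketch, which is offered only to make the cost of the approach concrete. The function name `may_confirm_piece` and the dictionary layouts for member lists and ACLs are assumptions made for this illustration, not details of any prior art system.

```python
def may_confirm_piece(client, piece_id, members, acls):
    """Return True only if the client may read at least one member file.

    members: piece identifier -> list of files containing that piece
    acls:    file identifier  -> {client name -> set of permissions}
    Both layouts are hypothetical.
    """
    for file_id in members.get(piece_id, ()):
        # One ACL lookup per member file; a widely shared piece means many lookups.
        if "read" in acls.get(file_id, {}).get(client, set()):
            return True
    return False  # unknown piece, or no readable member file


members = {"p1": ["alice.txt", "bob.txt"], "p2": ["alice.txt"]}
acls = {"alice.txt": {"alice": {"read"}}, "bob.txt": {"bob": {"read"}}}
```

Here the store would confirm piece `p1` to Bob, because `bob.txt` contains it, but would not confirm `p2`, which appears only in a file Bob cannot read.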
Thus, there is a need to avoid storing file member information for each file piece and to avoid checking ACLs on de-duplication hits. Therefore, a method and system for detecting malicious behavior in a series of data transmission de-duplication requests of a de-duplicated computer system are needed.