Using cloud storage as an archive backup resource allows users to store data off-site and minimize on-site storage resources, but cloud storage services may impose costs, especially if a large amount of data is archived. Data-optimization techniques such as deduplication and compression are often employed to reduce the amount of stored data; deduplication, for example, assigns one copy of a file (F) to multiple clients. A deduplication scheme stores only a single copy of repeating data and is most effective when applied across multiple users, which is a common scenario in cloud storage environments. However, certain side-channel attacks can be used to gain access to arbitrary-size files of other users based on small hash signatures of these files.
Most deduplication systems maintain a database containing a hash h(F) of every currently stored file (or file fragment) F. Stored along with this hash is an access-control list enumerating the clients that have uploaded F and thus have the right to retrieve it. When a client presents a file G for deduplication, the system checks whether its hash h(G) already exists in the database as the hash h(F) (=h(G)) of a previously stored file F. If so, G is presumed to be identical to F. In this case, G is not stored in the system, and typically is not uploaded from the client. Instead, G is mapped onto F, in the sense that the client is enrolled on the access-control list for F. There are at least three types of attacks against such deduplication systems: probing attacks, content-distribution network attacks, and exfiltration attacks.
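The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular product: the class name DedupStore, its methods, and the choice of SHA-256 for h are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy hash-indexed deduplication store (hypothetical, illustrative only)."""

    def __init__(self):
        self.blobs = {}  # h(F) -> contents of the single stored copy of F
        self.acl = {}    # h(F) -> set of client ids allowed to retrieve F

    @staticmethod
    def digest(data: bytes) -> str:
        # SHA-256 stands in for the hash function h; this is an assumption.
        return hashlib.sha256(data).hexdigest()

    def put(self, client: str, data: bytes) -> bool:
        """Present a file for deduplication.

        Returns True if the file was new and had to be stored, and False if
        it was mapped onto an existing copy.
        """
        h = self.digest(data)
        is_new = h not in self.blobs
        if is_new:
            self.blobs[h] = data
            self.acl[h] = set()
        # Either way, the presenting client is enrolled on the ACL.
        self.acl[h].add(client)
        return is_new

    def get(self, client: str, h: str) -> bytes:
        """Return the stored file if the client is on its access-control list."""
        if client not in self.acl.get(h, set()):
            raise PermissionError("client is not on the access-control list")
        return self.blobs[h]


store = DedupStore()
first = store.put("alice", b"quarterly report")   # new file, stored
second = store.put("bob", b"quarterly report")    # duplicate, mapped onto it
print(first, second, len(store.blobs))            # True False 1
```

Both clients end up on the access-control list of the same stored copy, and the return value of put is exactly the signal ("was an upload needed?") that the attacks described next rely on.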
In a probing attack, if a client presents G (or h(G)), and the system does not upload G, then the client learns that G is already present in the system, and belongs to another client. Thus, a side-channel reveals the repository contents of existing clients, and sometimes the mere existence of a file F can leak sensitive information. Additionally, an attacker can use probing to mount a form-filling attack. For example, suppose an attacker has access to a form F (e.g., a tax form) that a victim has filled in, including a particular field S (e.g., annual salary), and uploaded as a file F′. If the search space (entropy) for S is small enough, the attacker can learn S by repeating the following procedure: filling in known values (e.g., the victim's name and address), guessing a plausible value S* for S, constructing the associated filled-in form F*, and testing through deduplication whether F*=F′.
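The form-filling procedure can be illustrated with a small simulation. The sketch below assumes a service that answers, for a presented hash, whether an upload is needed, and assumes the attacker abandons each probe before uploading anything. The form layout, the function names, the salary range and the use of SHA-256 are hypothetical.

```python
import hashlib

# Hashes of files already held by the (simulated) deduplication service.
stored_hashes = set()


def h(data: bytes) -> str:
    # SHA-256 stands in for the hash function h; this is an assumption.
    return hashlib.sha256(data).hexdigest()


def fill_form(name: str, address: str, salary: int) -> bytes:
    """Render the hypothetical form F with the given field values."""
    return f"name={name}\naddress={address}\nsalary={salary}\n".encode()


def upload_needed(file_hash: str) -> bool:
    """Side channel: the service reveals whether it already holds the file."""
    return file_hash not in stored_hashes


# The victim uploads the filled-in form F' containing the secret field S.
stored_hashes.add(h(fill_form("A. Victim", "1 Main St", 73000)))


def guess_salary(name, address, candidates):
    """Try each plausible S*, build F*, and probe whether F* is already stored."""
    for s_star in candidates:
        f_star = fill_form(name, address, s_star)
        if not upload_needed(h(f_star)):
            return s_star  # F* = F', so S* = S
    return None


recovered = guess_salary("A. Victim", "1 Main St", range(50000, 100001, 1000))
print(recovered)  # 73000
```

The attack costs one probe per candidate value, which is why it is practical only when the search space for S is small.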
In a content-distribution network (CDN) attack, a client may be enrolled on the access-control list for a file F merely by presenting the hash h(F). In essence, h(F) is treated as a credential for access to F. Consequently, one client can provide access to a large file F to other clients merely by presenting them with the compact value h(F). To obtain the file F, a client can falsely “deduplicate” F by presenting h(F), thereby gaining access rights that permit retrieval of F. For example, if a user wants to distribute a bootlegged video F through a backup service, the user creates a free account, uploads the video, and makes the hash h(F) available to receivers. To obtain the video, a receiver sets up a free account, falsely “deduplicates” F by presenting h(F), and then retrieves F.
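The way a hash ends up acting as a credential can be shown with a short simulation. This is a sketch under stated assumptions: the service accepts a bare hash for deduplication, and the function names, account names and the use of SHA-256 are hypothetical.

```python
import hashlib

blobs = {}  # h(F) -> contents of F
acl = {}    # h(F) -> set of clients allowed to retrieve F


def upload(client: str, data: bytes) -> str:
    """Store a file (if new), enroll the uploader, and return h(F)."""
    file_hash = hashlib.sha256(data).hexdigest()
    blobs.setdefault(file_hash, data)
    acl.setdefault(file_hash, set()).add(client)
    return file_hash


def deduplicate_by_hash(client: str, file_hash: str) -> bool:
    """Client-side deduplication: the client presents only h(F).

    If the hash is known, the client is enrolled on the access-control
    list without ever showing that it holds the file itself.
    """
    if file_hash in blobs:
        acl[file_hash].add(client)
        return True
    return False


def retrieve(client: str, file_hash: str) -> bytes:
    if client not in acl.get(file_hash, set()):
        raise PermissionError("client is not on the access-control list")
    return blobs[file_hash]


# The distributor uploads a large file and publishes only its hash.
video = b"\x00" * 1_000_000
credential = upload("distributor", video)

# A receiver who knows nothing but the hash gains access to the whole file.
enrolled = deduplicate_by_hash("receiver", credential)
copy = retrieve("receiver", credential)
print(enrolled, len(credential), len(copy))  # True 64 1000000
```

Here a 64-character hexadecimal string is enough to obtain a one-megabyte file, which is the asymmetry the attack exploits.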
Malware often seeks to exfiltrate sensitive data from clients, but confronts the challenge of initiating high-bandwidth, outbound connections without triggering intrusion alerts. In an exfiltration attack, an existing deduplication system is exploited to create such a connection. A piece of malware can exfiltrate data F from a client via deduplication by instantiating F in a one-time content-distribution network. The result is a dropbox that has a compact access credential h(F) and is accessible from any client within the deduplication system.
These and other side-channel attacks represent sources of vulnerability in the deduplication systems currently implemented in cloud storage environments. Although certain preventative measures are available that require requesting users to prove ownership of, or authorization over, target files, most are complex, resource-intensive solutions that impose high overhead costs. Moreover, such solutions do not always provide absolute certainty of proof-of-ownership by the client and are susceptible to sophisticated malware attacks.