Many key operations in data mining involve computation of aggregate functions of data held by multiple parties, with one party trying to find out an answer to a query based on the information held by the other party. For example, in a scenario involving two parties, such as a client and a server, the client may be trying to determine a distance between a query vector held by the client and a vector in the server's database, for purposes such as assessing a similarity between the client's vector and the server's vector. Similarly, the client may be trying to retrieve nearest neighbors of the query vector from the server's database. Likewise, the client may be interested in retrieving vectors from the server that have a large enough number of unique elements. All of these scenarios involve applying aggregate functions of multi-party data—functions that perform a computation on corresponding elements of the client's query vector and the server's vector and aggregate the result of the computation over the length of the query vector and the server's vector.
These aggregate functions become challenging to compute when the privacy of the client's query and the privacy of the server's data need to be protected, and current approaches to implementing the functions while preserving that privacy are inadequate. Conventionally, the preferred solution to this challenge is to encrypt the client's query vector using a homomorphic cryptosystem and to transmit the encrypted data to the server. The server then performs the aggregate function computation using the additively and possibly multiplicatively homomorphic properties of the cryptosystem and returns the encrypted result to the client. Only the client has the private decryption key for the cryptosystem, so only the client can decrypt the aggregate result. The server performs computations only on encrypted data and thus does not discover the client's query. The drawback of this approach is that significant computational overhead is incurred owing to the encryption and the decryption, as well as to the transmission, storage, and computation of ciphertext data. Therefore, such encrypted-domain protocols are costly, require additional hardware resources, and reduce the speed with which the client obtains an answer to a query. Furthermore, such an approach compromises the privacy of the parties involved in the data exchange.
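The conventional encrypted-domain flow described above can be sketched with a toy additively homomorphic cryptosystem. The sketch below uses a textbook Paillier scheme with tiny hard-coded primes, which is an assumption made purely for illustration (the text does not name a specific cryptosystem), and it picks squared Euclidean distance as the aggregate function. All function names are hypothetical, the parameters are far too small to be secure, and `pow(lam, -1, n)` assumes Python 3.8 or later.

```python
import math
import random

def keygen(p=10007, q=10009):
    """Toy Paillier key generation; these tiny primes are illustrative only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # with g = n + 1, L(g^lam mod n^2) equals lam mod n
    return n, (lam, mu)   # public key n, private key (lam, mu)

def encrypt(n, m):
    """Encrypt integer m under public key n, using generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    """Decrypt ciphertext c; only the holder of (lam, mu) can do this."""
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def client_query(n, x):
    """Client side: encrypt each element of Xq and the sum of its squares."""
    return [encrypt(n, v) for v in x], encrypt(n, sum(v * v for v in x))

def server_squared_distance(n, enc_x, enc_xx, y):
    """Server side: compute Enc(sum (x_j - y_j)^2) without seeing any x_j.

    Multiplying ciphertexts adds plaintexts; raising a ciphertext to a
    constant multiplies its plaintext by that constant (all modulo n).
    """
    n2 = n * n
    acc = enc_xx
    for c, yj in zip(enc_x, y):
        acc = acc * pow(c, (-2 * yj) % n, n2) % n2   # adds -2 * x_j * y_j
    return acc * encrypt(n, sum(v * v for v in y)) % n2  # adds sum y_j^2

n, priv = keygen()
x_q = [3, 1, 4]          # client's query vector
y_i = [1, 5, 9]          # one vector in the server's database
enc_x, enc_xx = client_query(n, x_q)
enc_d = server_squared_distance(n, enc_x, enc_xx, y_i)
distance = decrypt(n, priv, enc_d)   # client recovers the squared distance
```

Even in this toy form, every vector element costs one modular exponentiation at the client and another at the server, and each ciphertext is twice the bit length of the modulus, which hints at the overhead noted above.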
The following examples illustrate the disadvantages of using an encrypted-domain protocol such as the one described above. The client's data is denoted by Xq and the server's data is denoted by Yi, where i=1, 2, . . . , N. Thus, in this scenario, the server has N items. The aggregate function is denoted by f(Xq, Yi). Using the homomorphic cryptosystem approach described above, the client can decrypt the result f(Xq, Yi) for each i. To illustrate why this approach compromises privacy, consider an example in which f(Xq, Yi) is the distance between Xq and Yi. Now, suppose the goal of the protocol is to deliver to the client the K nearest neighbors of Xq, while preventing the server from knowing Xq, and preventing the client from knowing anything about the faraway Yi. However, if the above approach is followed, the client discovers the distance of Xq not just from the K nearest neighbors, but from each and every Yi. Thus, the server's privacy is compromised and the client learns how the server's data is distributed with respect to Xq.
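The gap between what the client should learn and what it actually learns in this nearest-neighbor example can be made concrete in a few lines. The sketch below works entirely in plaintext and only contrasts the two client views; it is not a protocol, and the function names, the choice of squared Euclidean distance, and the sample data are assumptions made for illustration.

```python
def squared_distance(x, y):
    """Squared Euclidean distance, standing in for f(Xq, Yi)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def client_view_naive(x_q, database):
    """What the client can decrypt when f(Xq, Yi) is returned for every i."""
    return [squared_distance(x_q, y) for y in database]

def client_view_ideal(x_q, database, k):
    """What the stated goal permits: the K nearest items and nothing else."""
    order = sorted(range(len(database)),
                   key=lambda i: squared_distance(x_q, database[i]))
    return [database[i] for i in order[:k]]

x_q = [0, 0]
database = [[1, 0], [0, 2], [5, 5], [9, 9]]     # N = 4 server items

naive = client_view_naive(x_q, database)        # one distance per server item
ideal = client_view_ideal(x_q, database, k=2)   # only the 2 nearest items
```

In the naive view the client holds N distances, including those to the two faraway items, which is exactly the distribution information the server wanted to withhold.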
Consider a second example in which f(Xq, Yi) takes the value 0 if Yi has at least as many unique elements as Xq, and takes the value 1 otherwise. Suppose the goal of the protocol is to deliver to the client those Yi for which f(Xq, Yi)=0, while preventing the server from discovering Xq, and preventing the client from knowing anything about those Yi for which f(Xq, Yi)=1. Encrypted-domain protocols exist that operate on the histograms representing Xq and Yi, and return to the client the difference in the number of unique elements in Xq and Yi for all i. Because the client knows the number of unique elements in its own Xq, the client discovers not only which Yi have at least as many unique elements as Xq, but also the number of unique elements in each Yi. Accordingly, the server's privacy is compromised and the client learns how the server's data is distributed with respect to Xq. The client receives more information than it needs to answer the query, as the goal was only to deliver those Yi for which f(Xq, Yi)=0.
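The leakage in this second example follows from simple arithmetic, as the plaintext sketch below shows. The sign convention for the returned difference (unique elements of Yi minus unique elements of Xq) and all names are assumptions made for illustration; the existing protocols referred to above may define the difference differently.

```python
def unique_count(v):
    """Number of distinct elements in a vector."""
    return len(set(v))

def f(x_q, y_i):
    """Indicator from the text: 0 if Yi has at least as many unique elements."""
    return 0 if unique_count(y_i) >= unique_count(x_q) else 1

def returned_difference(x_q, y_i):
    """Value a histogram-based protocol is assumed to hand to the client."""
    return unique_count(y_i) - unique_count(x_q)

x_q = [1, 2, 2, 3]                                     # 3 unique elements
database = [[4, 4, 4, 4], [1, 2, 3, 4], [7, 8, 9, 9]]  # 1, 4, and 3 unique

indicators = [f(x_q, y) for y in database]             # all the client needs
differences = [returned_difference(x_q, y) for y in database]

# The client knows its own unique count, so it recovers the server's counts.
recovered = [d + unique_count(x_q) for d in differences]
```

Here `indicators` is the one bit per item that the query calls for, while `recovered` is the exact unique-element count of every server vector, including the one for which f(Xq, Yi)=1.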
To protect the server's privacy, a special encrypted domain protocol has been used to prevent the client from learning the value of f(Xq, Yi) for those signals Yi that are not the nearest neighbors of Xq, such as described by Shaneck et al., “Privacy preserving nearest neighbor search,” Machine Learning in Cyber Trust, Springer US, 2009, pp. 247-276, and by Qi et al., “Efficient privacy-preserving k-nearest neighbor search,” The 28th IEEE International Conference on Distributed Computing Systems, 2008. ICDCS'08, the disclosures of which are incorporated by reference. However, these encrypted domain protocols increase the ciphertext overhead, further compounding the speed and hardware resource problems described above.
Other approaches have been implemented to attempt to reduce the computational burden of the special encrypted domain protocol. For example, Boufounos and Rane, “Secure binary embeddings for privacy preserving nearest neighbors,” IEEE International Workshop on Information Forensics and Security (WIFS), 2011, the disclosure of which is incorporated by reference, describes a way to conduct a two-party protocol in which a client initiates a query on a server's database to discover vectors in the server's database that are within a predefined distance from the query. The protocol utilizes a locality-sensitive hashing scheme with a specific property: the Hamming distances between hashes of query vectors and server vectors are proportional to the distance between the underlying vectors if the latter distance is below a threshold. The hashes do not provide information about the latter distance if the latter distance is above the threshold. While addressing some of the concerns associated with the solutions described above, the protocol nevertheless requires significant additional computational overhead due to the need to obtain the hashes using computations in the encrypted domain.
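The hashing property described above (Hamming distance tracks the underlying distance only below a threshold) can be illustrated with a small plaintext experiment. The sketch below uses random Gaussian projections, a random dither, and a one-bit quantizer that alternates between 0 and 1 in steps of width `delta`; this construction, the dither range, and every parameter value are assumptions chosen for illustration and are not taken from the cited paper. The sketch also omits the encrypted-domain step that the text identifies as the source of overhead.

```python
import math
import random

def make_embedding(dim, num_bits, delta, seed=0):
    """Build a binary hash function from random projections and dither.

    Each bit is floor((a . x + w) / delta) mod 2, so the bit flips every
    time the dithered projection crosses a multiple of delta.
    """
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
    dither = [rng.uniform(0.0, 2.0 * delta) for _ in range(num_bits)]

    def embed(x):
        bits = []
        for row, w in zip(rows, dither):
            proj = sum(a * v for a, v in zip(row, x)) + w
            bits.append(math.floor(proj / delta) % 2)
        return bits

    return embed

def normalized_hamming(h1, h2):
    """Fraction of hash bits that differ."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2)) / len(h1)

embed = make_embedding(dim=8, num_bits=2000, delta=1.0)
query = [1.0] * 8
near = [1.0] * 7 + [1.1]     # Euclidean distance 0.1 from the query
far = [1.0] * 7 + [11.0]     # Euclidean distance 10 from the query

d_near = normalized_hamming(embed(query), embed(near))
d_far = normalized_hamming(embed(query), embed(far))
```

With these settings `d_near` stays small and grows with the underlying distance, while `d_far` sits near one half, the value expected from unrelated random bit strings, so the hashes say little about how far away a distant vector actually is.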
Accordingly, there is a need for a way to compute functions of multi-party data while preserving privacy of the parties and while reducing computational overhead of the computation.