In general, data classification is well known in the art. Of particular interest are binary classifiers. Such classifiers simply give a ‘yes’ or ‘no’ answer to indicate whether a particular sample belongs to a particular class.
Private data classification enables two parties, Alice (a client) and Bob (a server), to engage in a protocol in which Alice classifies data using Bob's classifier, such that Bob learns nothing, not even the classification result, and Alice learns nothing about Bob's classifier other than the answer to her classification request. Private data classification brings together the fields of machine learning and cooperative, secure multi-party computation, which is a sub-field of cryptography.
When data are collected, stored, and manipulated for various uses, privacy considerations are crucial. The area of privacy-preserving data mining addresses the need to perform specific data mining tasks, while maintaining the privacy of the data of all parties involved, so that no information is revealed, see Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in CRYPTO: Proceedings of Crypto, 2000.
One important task is that of data classification, where a large amount of labeled data is used to classify query data samples. The most common density-based classification technique is k-nearest neighbor (k-nn) classification.
K-nn classification can be used for data mining, pattern recognition, and data compression applications. Many of the applications that require classification are highly sensitive, where one party (Alice the client) has a private query data sample to be classified, and another party (Bob the server) has private labeled data and a private classifier. Neither party wants to reveal any private information during the classification.
For example, a client would like a laboratory to classify a biological sample without the laboratory knowing the result of the classification. In addition, the laboratory does not want to reveal its classification process. In another application, a client would like to have surveillance data, e.g., images in a video, analyzed without revealing the content of the video. In addition, the party that performs the analysis does not want to reveal its analysis tools.
It is desired to perform privacy-preserving k-nn classification, as well as Parzen window classification and other approximate k-nn classifications.
Classification Background
K-nn classification is a non-parametric method for estimating the value of a probability density function (pdf) at a particular query sample. Density estimation can be used for classification by taking a maximum posterior solution. That is, the density for each category is measured and the query sample is classified by the label corresponding to the maximum posterior solution. One variant of this method uses a Parzen window classifier. Both methods use a small neighborhood around the query sample to estimate the density of the pdf.
The size of the neighborhood can be defined by distance, as in Parzen window classification, or by order, as in k-nn classification. The order can be converted to a distance by determining the distance to the kth nearest neighbor and then applying the Parzen window method with that distance.
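As a concrete, plaintext (non-private) illustration of the two density-based classifiers described above, the following sketch implements both a k-nn vote among the k nearest labeled samples and a Parzen window vote among all labeled samples within a radius h. The function names, samples, and labels are invented for illustration only.

```python
from collections import Counter
import math

def knn_classify(samples, labels, query, k=3):
    # Order all labeled samples by distance to the query; vote among the k nearest.
    order = sorted(range(len(samples)),
                   key=lambda i: math.dist(samples[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def parzen_classify(samples, labels, query, h=1.0):
    # Vote among all labeled samples whose distance to the query is at most h.
    votes = Counter(labels[i] for i, s in enumerate(samples)
                    if math.dist(s, query) <= h)
    return votes.most_common(1)[0][0] if votes else None

samples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
labels = ['a', 'a', 'b', 'b', 'b']
print(knn_classify(samples, labels, (1.0, 1.1)))            # 'b'
print(parzen_classify(samples, labels, (0.1, 0.0), h=0.5))  # 'a'
```

The conversion from order to distance noted above amounts to calling the Parzen classifier with h set to the distance of the kth nearest neighbor.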
K-nn classification does not require training. However, the method is relatively slow because all the labeled samples in the classification database must be tested to determine whether any of the labeled samples are in the neighborhood of the query sample. To accelerate this process, approximate nearest neighbors can be used instead of exact nearest neighbors. Approximate nearest neighbor methods accelerate the classification at the expense of determining only approximate, rather than exact, nearest neighbors.
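One well-known family of approximate nearest neighbor methods is locality-sensitive hashing (LSH); the text above does not commit to a particular method, so the following random-hyperplane LSH sketch is only one possible illustration of the speed-for-accuracy trade-off. All names and data are invented for this example.

```python
import math
import random
from collections import defaultdict

random.seed(0)

def make_lsh(dim, bits=4):
    # Hash a vector to a tuple of sign bits against random hyperplanes;
    # nearby vectors tend to land in the same bucket.
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
    return lambda v: tuple(sum(a * b for a, b in zip(pl, v)) >= 0
                           for pl in planes)

samples = [(random.random(), random.random()) for _ in range(200)]
h = make_lsh(2)
buckets = defaultdict(list)
for s in samples:
    buckets[h(s)].append(s)

query = (0.5, 0.5)
# Only the query's bucket is scanned, not the whole database
# (falling back to a full scan if the bucket is empty).
candidates = buckets[h(query)] or samples
approx = min(candidates, key=lambda s: math.dist(s, query))
exact = min(samples, key=lambda s: math.dist(s, query))
```

The approximate answer may differ from the exact nearest neighbor, but far fewer distance computations are performed per query.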
It is desired to perform the above classifications in a secure manner so that the private query data sample is not revealed to the party performing the classification, and the classifier is not revealed to the party having the private query data sample.
Cryptographic Background
Secure Computation
Two-party private computation is described by A. C. Yao, “Protocols for secure computations,” in Proc. 23rd IEEE Symp. on Foundations of Comp. Science, pp. 160-164, 1982. Two parties, usually referred to as Alice and Bob, have private inputs (x, y), respectively. The parties want to evaluate a function ƒ using these inputs, such that at the end, Alice obtains ƒ(x, y) but has no other information on the input y. Bob should not learn anything from the evaluation. Assuming the existence of an oblivious transfer (OT) protocol, Yao gives a constructive answer to this problem. Yao describes how to transform a circuit for determining ƒ(x, y) into a protocol, such that if Alice and Bob follow the protocol, at the end of the protocol, Alice obtains ƒ(x, y) and Bob learns nothing.
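The oblivious transfer primitive that Yao's construction assumes can be illustrated with a toy RSA-based sketch in the style of the well-known Even-Goldreich-Lempel protocol: the sender holds two messages, the receiver learns exactly one of them, and the sender does not learn which. The parameters below are illustratively small and NOT secure; this is a sketch of the protocol flow only, not an implementation suitable for use.

```python
import random

p, q = 10007, 10009                     # toy primes (far too small for real use)
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))       # sender's RSA private exponent

def ot_sender_setup():
    # Sender publishes two random values x0, x1 along with (n, e).
    return random.randrange(n), random.randrange(n)

def ot_receiver_choose(b, x0, x1):
    # Receiver blinds the chosen x_b with a random secret k.
    k = random.randrange(n)
    v = ([x0, x1][b] + pow(k, e, n)) % n
    return k, v

def ot_sender_respond(m0, m1, x0, x1, v):
    # Sender unblinds both candidates; only the chosen one equals k.
    k0 = pow((v - x0) % n, d, n)
    k1 = pow((v - x1) % n, d, n)
    return (m0 + k0) % n, (m1 + k1) % n

def ot_receiver_recover(b, k, c0, c1):
    # Receiver removes the blinding from the chosen ciphertext.
    return ([c0, c1][b] - k) % n

x0, x1 = ot_sender_setup()
b = 1                                   # receiver's private choice bit
k, v = ot_receiver_choose(b, x0, x1)
c0, c1 = ot_sender_respond(42, 99, x0, x1, v)
assert ot_receiver_recover(b, k, c0, c1) == 99
```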
The OT protocol can be extended to any functionality in such settings, O. Goldreich, S. Micali, and A. Wigderson, “How to play any mental game or a completeness theorem for protocols with honest majority,” in ACM Symposium on Theory of Computing, pages 218-229, 1987. In principle, these constructions could provide a secure protocol for k-nn classification. However, the general constructions become impractical even for “simple” functions, and are hard to implement for most real-world applications.
In particular, when the database is huge, as is typically the case in data mining applications, the performance of the general constructions is completely prohibitive.
Efficient Private Data Mining
In an example application, multiple entities have databases of medical records. Sharing the databases can improve medical research. However, the privacy of the patients must not be compromised. In another application, competing companies may benefit from sharing market research databases. However, the companies do not wish to reveal any secret business information to their competitors.
One method provides a private protocol to evaluate a polynomial P(x) when Alice knows P and Bob knows x, M. Naor and B. Pinkas, “Oblivious polynomial evaluation,” in Proc. of the 31st Symp. on Theory of Computing (STOC), pages 245-254, 1999. Their result was used for a protocol for private computation of ID3 learning trees by Y. Lindell and B. Pinkas, “Privacy preserving data mining,” in CRYPTO: Proceedings of Crypto, 2000. A special case of secure polynomial evaluation is a secure dot-product, and various solutions to this problem are known, B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen, “On secure scalar product computation for privacy-preserving data mining,” in the 7th Annual International Conference in Information Security and Cryptology, 2004, and Y. Chang and C. J. Lu, “Oblivious polynomial evaluation and oblivious neural learning,” in ASIACRYPT: Advances in Cryptology, ASIACRYPT: International Conference on the Theory and Application of Cryptology, LNCS, 2001.
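A secure dot product can be realized, for example, with additively homomorphic encryption: Alice sends encryptions of her vector entries, Bob combines them under his private weights, and only Alice can decrypt the result. The referenced protocols differ in detail; the toy Paillier-style sketch below, with parameters far too small to be secure, is only one possible illustration.

```python
import math
import random

p, q = 10007, 10009                 # toy primes (NOT secure)
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                # Alice's private key: (lam, mu)

def encrypt(m):
    # Paillier encryption with generator g = 1 + n.
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    u = pow(c, lam, n2)
    return ((u - 1) // n * mu) % n

x = [3, 1, 4]                       # Alice's private query vector
y = [2, 7, 1]                       # Bob's private weights

cts = [encrypt(xi) for xi in x]     # Alice -> Bob: encryptions of x
acc = 1
for c, yi in zip(cts, y):
    # Bob: Enc(x)^y multiplies the plaintext by y; products add plaintexts.
    acc = (acc * pow(c, yi, n2)) % n2
print(decrypt(acc))                 # Alice decrypts: 3*2 + 1*7 + 4*1 = 17
```

Bob never sees x in the clear, and Alice learns only the dot product, not y.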
Another protocol for distributed k-means determines a minimum of a shared list of numbers by resorting to general Yao circuit evaluation, G. Jagannathan and R. N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,” in KDD '05: Proceedings of the eleventh ACM SIGKDD Int. Conf. on knowledge discovery in data mining, pages 593-599, 2005.
Yang et al. consider a scenario of a data miner that surveys a large number of customers to determine classification rules for data, while protecting their privacy, Z. Yang, S. Zhong, and R. N. Wright, “Privacy-preserving classification of customer data without loss of accuracy,” Proc. 5th SIAM Int. Conf. on Data Mining, Apr. 21, 2005. Yang and Wright describe a privacy-preserving protocol for determining a Bayesian network on vertically partitioned data, R. N. Wright and Z. Yang, “Privacy-preserving Bayesian network structure computation on distributed heterogeneous data,” in KDD '04: Proceedings of the tenth ACM SIGKDD Int. Conf. on knowledge discovery in data mining, pages 22-25, 2004.
One private distributed k-nn classifier is described by M. Kantarcioglu and C. Clifton, “Assuring privacy when big brother is watching,” in the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD2003), 2003. In that protocol, however, all parties know the query sample. That is, the query sample is not private. Furthermore, their classifier requires an untrusted, non-colluding third party. The only privacy provided is that Alice learns nothing about the databases of the other parties; Alice obtains only the final classification result.