The present invention relates generally to data processing techniques in programming systems having large-scale data sets and, in particular, to protection of privacy data in MapReduce systems.
MapReduce is a software architecture proposed by Google. The MapReduce architecture is employed for parallel computation on large-scale data sets (larger than 1 TB), in which the parallel computation is achieved by distributing the large-scale operations on the data set to individual nodes on the network. It has a wide range of application fields, such as Web access log analysis, document clustering, machine learning, data statistics, and statistics-based machine translation. For example, Hadoop is one implementation of MapReduce. More and more cloud service providers have deployed the MapReduce framework in their cloud computing systems.
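By way of illustration only, the map, shuffle, and reduce stages of such a computation may be sketched as follows. This is a minimal single-process sketch and not the interface of Hadoop or of any other MapReduce implementation. The names map_phase, shuffle, reduce_phase, and run_job, as well as the sample input splits, are hypothetical and chosen for this example. In an actual deployment each input split would be dispatched to a separate node on the network, whereas here the splits are processed sequentially.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (token, 1) pair for each whitespace-separated token."""
    return [(token.lower(), 1) for token in line.split()]


def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def run_job(splits):
    """Run the three stages over all splits (hypothetical driver).

    Assumption: each split stands in for the portion of the data set
    that a real system would assign to one node.
    """
    intermediate = []
    for split in splits:
        for line in split:
            intermediate.extend(map_phase(line))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample data: two splits of a Web access log.
splits = [
    ["GET /index.html", "GET /about.html"],
    ["GET /index.html"],
]
counts = run_job(splits)
```

In this sketch, counts maps each token to the number of times it occurs across all splits, so the per-page totals are obtained without any single stage having to read the entire data set.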
For users who adopt MapReduce computation provided by cloud computing services, the computing nodes of the cloud computing system are in the public domain. During the MapReduce computation process, the privacy data of the users is therefore also exposed to the public domain and is difficult to protect effectively. Consequently, many users would prefer to place the privacy data involved in MapReduce computation processes into a private domain for processing, for example, into the private cloud system of the enterprise.