The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for privately sharing semi-structured data, such as network structure data, for example.
The problem of privacy-preserving data mining has attracted considerable attention in recent years because of increasing concerns about the privacy of the underlying data. An important data domain that has emerged over the same period is that of graphs and structured data. Graphs are data structures used to represent complex systems using nodes and edges between nodes. An object, or a part of an object, is represented by a node and the interrelationship between two objects is represented by an edge. Many different types of data sets are naturally represented as graphs, such as Extensible Markup Language (XML) data sets, transportation network data sets, data sets representing traffic in IP networks, social network data sets, hierarchically structured data sets, and the like.
Existing work on graph privacy has focused on the problem of anonymizing nodes or edges of a single graph, in which the identity is assumed to be associated with individual nodes. Many approaches to graph privacy have been devised. For example, R. Agrawal et al., “Privacy-Preserving Data Mining,” Proceedings of the ACM SIGMOD Conference, pp. 439-450, 2000, establishes the field of privacy preserving data mining in the context of database mining. This paper describes how useful mining information can be extracted from randomized data. D. Agrawal et al., “On the Design and Quantification of Privacy Preserving Data Mining Algorithms,” Proceedings of the ACM PODS Conference, pp. 247-255, 2001, describes the tradeoffs between privacy and accuracy in data mining algorithms. This paper establishes a framework for quantification of privacy in the context of information theory.
As a further example, P. Samarati et al., “Protecting Privacy when Disclosing Information: k-Anonymity and its Enforcement Through Generalization and Suppression,” Proceedings of the IEEE Symposium on Research in Security and Privacy, May 1998, describes a methodology for reducing the granularity of the data so that each individual is indistinguishable from at least k other individuals. Moreover, V. Verykios et al., “State-of-the-Art in Privacy Preserving Data Mining,” SIGMOD Record 33(1): pp. 50-57, 2004, provides a survey of various privacy preserving data mining methodologies.
A key method in privacy preserving data mining is that of k-anonymity. In the k-anonymity method, the data is transformed such that each record is indistinguishable from at least k other records in the data set. Because of this transformation, it is much more difficult to use publicly available databases, or other available databases, to infer the identity of the individuals underlying the data. Most k-anonymization work is focused on continuous and categorical data domains (see P. Samarati et al., discussed above).
The key techniques used for anonymization are those of generalization and suppression. In the case of a multi-dimensional data set, the process of generalization refers to reducing the granularity of representation of the underlying data. For example, instead of specifying an age attribute exactly, one may choose to specify it only as a range. In suppression, one may choose to completely remove either a record or an attribute value from a record. The idea is to reduce the granularity of representation such that a given record cannot be distinguished from at least k records in the data set. This transformed data can then be used for privacy-preserving or other mining applications.
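By way of illustration only, generalization and suppression may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any of the cited references: the record layout (an `age` and a `zip` attribute), the ten-year range width, the three-character zip prefix, and the function names `generalize_age` and `anonymize` are all hypothetical, and a group is treated as acceptable when it contains at least k records.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def anonymize(records, k, width=10):
    """Generalize each record, then suppress records in groups smaller than k.

    Assumption: 'age' and 'zip' are the only identifying attributes.
    """
    generalized = [
        {"age": generalize_age(r["age"], width), "zip": r["zip"][:3] + "**"}
        for r in records
    ]
    # Count how many records share each generalized (age, zip) combination.
    counts = Counter((g["age"], g["zip"]) for g in generalized)
    # Suppression: drop any record whose group has fewer than k members.
    return [g for g in generalized if counts[(g["age"], g["zip"])] >= k]

records = [
    {"age": 31, "zip": "10001"},
    {"age": 35, "zip": "10002"},
    {"age": 38, "zip": "10003"},
    {"age": 52, "zip": "20001"},
]
released = anonymize(records, k=2)
```

In this sketch, the three records with ages in the thirties fall into one generalized group and are released, while the remaining record is alone in its group and is therefore suppressed.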
An alternative to data generalization and suppression is that of synthetic pseudo-data generation, which preserves the aggregate properties of the original data. One technique for performing such synthetic pseudo-data generation is described in C. C. Aggarwal, “A Condensation Based Approach to Privacy Preserving Data Mining,” Proceedings of the EDBT Conference, pp. 183-199, 2004. The process of synthetic pseudo-data generation requires creation of groups of tightly clustered records followed by estimation of the statistical properties of each of these clusters. These estimated statistical properties are used in order to generate the data records from each of the clusters. The core idea is that while the generated data is synthetic, it preserves the aggregate properties and can therefore be used in conjunction with data mining tasks, such as classification, which are dependent upon aggregate properties of the original data.
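By way of illustration only, the group-then-resample idea may be sketched as follows. This is a simplified sketch and does not reproduce the method of the cited reference: the greedy nearest-neighbor grouping, the use of an independent Gaussian per attribute (which ignores correlations between attributes), and the function names `make_groups` and `synthesize` are all assumptions made for the example.

```python
import random
import statistics

def make_groups(records, k):
    """Greedily form groups of k nearby records; leftovers join the last group."""
    remaining = list(records)
    groups = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Order the remaining records by squared distance to the seed record.
        remaining.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(r, seed)))
        groups.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    if remaining and groups:
        groups[-1].extend(remaining)
    elif remaining:
        groups.append(remaining)
    return groups

def synthesize(records, k, rng):
    """Replace each group with synthetic records drawn from its statistics."""
    synthetic = []
    for group in make_groups(records, k):
        columns = list(zip(*group))
        means = [statistics.fmean(c) for c in columns]
        stdevs = [statistics.pstdev(c) for c in columns]
        # Assumption: each attribute is sampled independently per group.
        for _ in group:
            synthetic.append(tuple(rng.gauss(m, s) for m, s in zip(means, stdevs)))
    return synthetic

rng = random.Random(0)
data = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (8.0, 9.0), (8.1, 9.2), (7.9, 8.8)]
synthetic = synthesize(data, k=3, rng=rng)
```

In this sketch, no synthetic record corresponds to any single original record, yet each group's per-attribute mean and spread are approximately retained, which is the property that aggregate-based mining tasks rely upon.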
Regardless of which anonymization technique is used, it should be appreciated that these known anonymization techniques operate only on a single individual graph. That is, the anonymization techniques are not applied to a plurality of graphs.