Recent years have seen a data explosion: 90% of today's data were produced in the last two years alone, and the volume of information now available is estimated to be on the order of zettabytes. These data come from deployed sensors, social networking sites, mobile phone applications, call detail records, electronic medical record (EMR) systems, e-commerce sites, etc.
Analyzing this wealth and volume of data offers companies remarkable opportunities for growth in various business sectors, including healthcare, telecommunications, banking and smarter cities management, among many others. However, the majority of these datasets are proprietary and many contain personal and/or business-sensitive information. Examples of sensitive data include patient records, special housing information, tax records, customer purchase records, mobile call detail records (CDR), etc. The very sensitive nature of such datasets prohibits their outsourcing for analytic and/or other purposes, unless privacy-enhancing technology is in place to offer sufficient protection.
Among the privacy-enhancing technologies that are available nowadays, the area of privacy-preserving data publishing aims at protecting privacy at a record level. This area comprises techniques for transforming, and subsequently publishing, person-specific data in a way that sensitive information about the individuals is protected, while the data remain useful to support intended purposes. The methods in this area can be categorized into perturbative, such as data masking, noise addition, micro-aggregation, data swapping and rounding, and non-perturbative, such as data suppression and data generalization. Perturbative methods distort the original data values and thereby fail to maintain data truthfulness. Furthermore, it has been proven that they typically lead to low data utility; hence, non-perturbative methods are generally preferred. These latter non-perturbative methods operate by changing the granularity at which data values are reported in the sanitized dataset, in a way that maintains data truthfulness at a record (individual) level. Among non-perturbative methods, data generalization is usually preferred over suppression, because it leads to datasets of higher utility.
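The difference between the two non-perturbative operations named above can be sketched in a few lines of Python. This is only an illustration: the helper names (`generalize_age`, `generalize_zip`, `suppress`), the ten-year age bands and the three-digit zip prefix are assumptions chosen for the example and are not taken from any particular anonymization method.

```python
# Illustrative sketch only; helper names and granularity choices are assumptions.

def generalize_age(age, width=10):
    """Report an exact age as a range of `width` years, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters and mask the rest, e.g. '10027' -> '100**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def suppress(_value):
    """Suppression withholds the value entirely."""
    return "*"

record = {"age": 34, "zip": "10027"}

# Generalization: coarser, but still truthful for the individual.
generalized = {
    "age": generalize_age(record["age"]),
    "zip": generalize_zip(record["zip"]),
}

# Suppression: truthful as well, but no information about the value remains.
suppressed = {key: suppress(value) for key, value in record.items()}

print(generalized)  # {'age': '30-39', 'zip': '100**'}
print(suppressed)   # {'age': '*', 'zip': '*'}
```

The generalized record still states something true and usable about the individual (an age band and a zip prefix), whereas the suppressed record states nothing, which is consistent with generalization usually yielding higher utility.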
The most popular non-perturbative model for privacy-preserving data publishing is k-anonymity. This model requires that at least k records, each corresponding to an individual in a released table, have the same values over a set of potentially identifying attributes, called quasi-identifiers. Unlike direct (or explicit) identifiers, such as names, social security numbers, credit card numbers, etc., which can be used in isolation to re-identify an individual, quasi-identifiers are seemingly innocuous attributes (e.g., zip code, gender, date of birth, etc.) which, when used in combination, may lead to identity disclosure. k-anonymity thwarts identity disclosure attacks by guaranteeing that an attacker cannot re-identify an individual in a released dataset with a probability above 1/k, where k is an owner-specified parameter. The k-anonymity model, which was originally proposed for relational data, has since been adapted to various kinds of data, including set-valued data, mobility, longitudinal and time-series data, data streams, social graphs, and textual data, and has been implemented in several real-world systems.
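As a minimal sketch of the k-anonymity requirement described above, the following Python snippet groups the records of a small table by their quasi-identifier values and checks that every group contains at least k records. The function names, the attribute names and the sample table are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter

def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    return all(n >= k for n in group_sizes(records, quasi_identifiers).values())

def max_reidentification_probability(records, quasi_identifiers):
    """Worst-case chance of singling out a record: 1 / (smallest group size)."""
    sizes = group_sizes(records, quasi_identifiers)
    return 1 / min(sizes.values()) if sizes else 0.0

# Hypothetical released table with generalized quasi-identifiers.
table = [
    {"zip": "100**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "101**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "101**", "age": "40-49", "diagnosis": "diabetes"},
]
qi = ["zip", "age"]

print(is_k_anonymous(table, qi, k=2))               # True
print(is_k_anonymous(table, qi, k=3))               # False
print(max_reidentification_probability(table, qi))  # 0.5
```

With k = 2, each combination of zip code and age range is shared by two records, so an attacker who knows these values for an individual cannot pick out that individual's record with probability above 1/2.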
Although many k-anonymity approaches have been proposed for protecting different data types, all existing solutions offer protection for only one specific kind of data, e.g., relational data tables, transaction (set-valued) data, social graphs, or temporal data.
It would be highly desirable to provide a single approach for anonymizing records of individuals that does not consider one specific kind of data in isolation but, instead, protects datasets in which records comprise two different kinds of data: a relational part and a transaction (set-valued) part. Such an approach is challenging, however. For example, attackers may exist whose knowledge spans these two kinds of data, i.e., they may know certain relational attribute-value pairs (e.g., some demographics) of an individual together with some items of a set-valued attribute (e.g., a set of products that this individual has purchased). In this context, anonymizing records of individuals that comprise two different kinds of data is a very challenging task, particularly because:
1) Anonymizing each kind of data separately (e.g., by using existing k-anonymization techniques that are relevant to that kind of data) does not offer privacy protection to the individuals from attackers whose knowledge spans the two (or more) kinds of data.
2) Constructing an optimal solution with minimum information loss is an NP-hard problem.
3) Popular multi-objective optimization strategies, such as the lexicographic approach, the conventional weighted-formula approach or a Pareto-optimal approach, are not applicable to these problems. In fact, good anonymization decisions taken on one kind of data may prove disastrous for the other kind of data.
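The first difficulty listed above can be illustrated with a small, entirely hypothetical dataset. In the Python sketch below, each record has a relational part (`rel`) and a set-valued part (`items`). The relational part on its own is 2-anonymous, and so is the set-valued part, yet an attacker who knows one attribute-value pair and one purchased item narrows the candidates down to a single record. The record layout, the function name `matching_records` and all values are assumptions made for illustration.

```python
def matching_records(records, known_attrs=None, known_items=()):
    """Return the records consistent with what the attacker knows."""
    known_attrs = known_attrs or {}
    return [
        r for r in records
        if all(r["rel"].get(a) == v for a, v in known_attrs.items())
        and set(known_items) <= r["items"]
    ]

# Hypothetical release: each part was anonymized separately.
released = [
    {"rel": {"age": "30-39", "zip": "100**"}, "items": {"bread", "milk"}},
    {"rel": {"age": "30-39", "zip": "100**"}, "items": {"beer", "wine"}},
    {"rel": {"age": "40-49", "zip": "101**"}, "items": {"bread", "milk"}},
    {"rel": {"age": "40-49", "zip": "101**"}, "items": {"beer", "wine"}},
]

# Knowledge restricted to one kind of data leaves two candidates each time.
by_rel = matching_records(released, known_attrs={"age": "30-39"})
by_items = matching_records(released, known_items={"milk"})

# Knowledge spanning both kinds of data isolates exactly one record.
by_both = matching_records(
    released, known_attrs={"age": "30-39"}, known_items={"milk"}
)

print(len(by_rel), len(by_items), len(by_both))  # 2 2 1
```

Each part satisfies k = 2 in isolation, but the combination of the two parts is unique for every record, so the guarantee of 1/k no longer holds against an attacker whose knowledge spans both.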