Big data, open data, and data protection are three terms that represent the current tension between individual privacy, data-driven business, and politics. Every second, a tremendous amount of data is produced, consumed, and analyzed. Data analysis promises benefits such as process improvement, insights, and predictions of trends. For instance, it has been estimated that the European government could save more than 100 billion Euros in operational efficiency improvements alone by analyzing the “right” data. Corporations, academia, and consumers can thus profit from effective data analytics. One of the biggest challenges is getting access to this “right” data.
Currently, some governments provide globally accessible data for analysis, for example through the European Open Data Portal or the Federal Statistical Office of Germany. Sharing data, however, may harm the individuals whose personally identifiable information (“PII”) is contained in the shared data. A common example is a hospital dataset of diabetes patients, where even the knowledge that an individual is present in the dataset may cause harm to that individual. This is because an employee with diabetes can cause higher costs for his or her company. Accordingly, such an insight about a person applying for a job may cause the hiring company to prefer other job candidates. This is in tension with the fact that good datasets about patients with diabetes may provide important information regarding side effects and possible treatments.
State-of-the-art intrusion detection systems (“IDS”) operate on data collected from various sensors. Sensors extract log data from information systems, network equipment, or cyber-physical devices and send it to analytics systems. SAP Enterprise Threat Detection is one example of such a system. A challenge arising in such a setup, especially when the analytics system is operated in the cloud, is the processing of personal data contained within the extracted logs. As a result, information of tremendous value for detecting threats and intrusions in the over 96,000 SAP systems often goes unused.
Providing public access to personal data, such as SAP log data, is unlawful under certain circumstances in some jurisdictions, for example in Germany under the German Federal Data Protection Act (“Bundesdatenschutzgesetz”). Even where personal data may be lawfully provided to data analysts, for example under a data protection agreement, employees may be uncomfortable knowing that their activities could be directly analyzed and that they could potentially be personally identified.
Simply removing names and other PII from data (“de-identification”) is generally not sufficient. In several well-known cases, de-identified data has been re-identified using publicly available information. For example, after Netflix and AOL published de-identified data, reporters and researchers were able to re-identify users and link them to their usage and searches using information available on the Internet.
This problem is not restricted to the analysis of network traffic; it is also highly relevant to business process data. If business process data is analyzed for insights on efficiency and the like, the employees whose data is being analyzed might be uncomfortable if the data can later be attributed to them.
Conventional data anonymization systems offer mechanisms to satisfy data protection requirements. One such mechanism is data perturbation, which involves modifying data in such a way as to prevent re-identification. Data perturbation can be performed to a greater or lesser degree, resulting in correspondingly greater or lesser privacy. However, in some cases, perturbed data deviates so much from the original data that it becomes essentially useless to a data analyst. The level of usefulness of perturbed data is referred to as its “utility”. In many cases, the greater the privacy, the lower the utility, and vice versa.
Epsilon-differential privacy is a mathematical definition of privacy that uses a privacy parameter, epsilon, to quantify the privacy risk posed by releasing perturbed sensitive data. A smaller epsilon corresponds to stronger privacy. Perturbed data can be provided in an “interactive” manner, meaning that the original data is perturbed “on the fly” in response to an interactive query. This kind of interactive perturbation can be done under epsilon-differential privacy for a specific value of the privacy parameter, but in practice, the interactive approach is limited in performance as well as in the number and types of queries that can be performed. By contrast, non-interactive or one-time perturbation is done prior to analysis rather than in response to a query. With non-interactive perturbation, an entire perturbed dataset can be provided to data analysts, who can then perform any number of arbitrary queries on it.
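As an illustration of the interactive approach, the following minimal Python sketch answers a single counting query by adding Laplace noise scaled to epsilon. The Laplace mechanism is a standard technique assumed here for illustration only; it is not taken from any of the systems discussed in this section. The record layout, the function names laplace_noise and noisy_count, and the epsilon values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-mean Laplace noise with the given scale.

    The difference of two independent exponential variates with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Answer one counting query with Laplace noise calibrated to epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical log records: (user, number of failed logins).
records = [("user%d" % i, i % 7) for i in range(1000)]
rng = random.Random(42)

for epsilon in (0.1, 1.0, 10.0):
    answer = noisy_count(records, lambda r: r[1] >= 5, epsilon, rng)
    print("epsilon=%.1f noisy count=%.2f" % (epsilon, answer))
```

In this sketch, a smaller epsilon yields a larger noise scale and therefore a less accurate answer, which reflects the trade-off between privacy and utility. Because every answered query consumes part of the privacy budget, the number of queries that can be answered at a given epsilon is limited.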
Frank McSherry described an interactive differential privacy approach in a 2009 paper entitled “An Extensible Platform for Privacy-Preserving Data Analysis”. The paper described perturbing the individual queries that are needed to perform intrusion detection. Such systems have several problems. The first problem relates to the general nature of perturbation, namely the inaccuracy of perturbed data, e.g., a completely wrong ordering of events due to excessively perturbed timestamps. The second problem is that the data analysis must be expressed in a high-level query language, whose expressiveness is inherently restricted. This problem is specific to the interactive differential privacy approach, which is incorporated into the Privacy Integrated Queries (PINQ) platform from the McSherry paper.
A non-interactive input perturbation approach is described in a 2014 paper by Ulfar Erlingsson et al. entitled “RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response”. The RAPPOR mechanism transforms values within a dataset into a Bloom filter and randomizes the bits of the filter using randomized response. This means that the sample space in RAPPOR is a set of Bloom filters, which allows data mining algorithms to be applied immediately after the randomization of the dataset, without any further modification of the algorithm steps or the randomized dataset. This random transformation, however, renders the dataset of little use for many purposes, and advanced statistical decoding techniques must be applied to recover statistical information from the randomized Bloom-filter-based RAPPOR responses. Moreover, RAPPOR randomizes only discrete values and cannot operate on both discrete and continuous data with fine-grained utility guarantees. Furthermore, in the case of RAPPOR, the differential privacy definition requires that a neighboring dataset differ by the modification of only a single record.
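The idea of encoding a value into a Bloom filter and then randomizing its bits can be sketched as follows. This is a deliberately simplified, single-stage illustration under stated assumptions, not the published RAPPOR mechanism, which includes further steps. The hash construction, the function names, the filter size, and the randomization probability f are all hypothetical choices made for this example.

```python
import hashlib
import random


def bloom_bits(value, num_bits, num_hashes):
    """Encode a string value as a Bloom filter, returned as a list of 0/1 bits."""
    bits = [0] * num_bits
    for i in range(num_hashes):
        # Derive the i-th hash by prefixing the value with the hash index.
        digest = hashlib.sha256(("%d:%s" % (i, value)).encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits


def randomize_bits(bits, f, rng):
    """Randomized response per bit: with probability f, replace the bit
    by a fair coin flip; otherwise report the true bit."""
    noisy = []
    for bit in bits:
        if rng.random() < f:
            noisy.append(1 if rng.random() < 0.5 else 0)
        else:
            noisy.append(bit)
    return noisy


def estimate_true_bit_counts(reports, f):
    """Estimate, per bit position, how many reporters truly had the bit set.

    For n reports, the expected observed count is t * (1 - f) + n * f / 2,
    where t is the true count, so t is recovered by inverting that relation.
    """
    if not 0.0 <= f < 1.0:
        raise ValueError("f must be in [0, 1)")
    n = len(reports)
    num_bits = len(reports[0])
    observed = [sum(report[j] for report in reports) for j in range(num_bits)]
    return [(count - n * f / 2.0) / (1.0 - f) for count in observed]


# Hypothetical usage: 2000 reporters all hold the same value.
rng = random.Random(1)
true_bits = bloom_bits("host-17", num_bits=16, num_hashes=2)
reports = [randomize_bits(true_bits, 0.5, rng) for _ in range(2000)]
estimates = estimate_true_bit_counts(reports, 0.5)
print(true_bits)
print([round(e) for e in estimates])
```

A single randomized report reveals little about the underlying value, and aggregate statistics only become visible after many reports are combined and the effect of the randomization is corrected for, which is why a decoding step is needed.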
Much effort has been invested in improving utility; see, e.g., the 1998 paper by Gouweleeuw et al. entitled “Post Randomization for Statistical Disclosure Control: Theory and Implementation”, which describes the post-randomization method (PRAM). PRAM applies randomized response to an existing dataset after the data has been generated or provided.
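Post-randomization of a categorical attribute can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure from the cited paper: each published value is drawn according to a row of a transition matrix that gives the probability of reporting each category for a given original category. The function name pram, the event categories, and the transition probabilities are hypothetical.

```python
import random


def pram(values, transition, rng):
    """Apply post-randomization to a list of categorical values.

    `transition[a][b]` is the probability that original category `a`
    is published as category `b`; each row must sum to 1.
    """
    for original, row in transition.items():
        if abs(sum(row.values()) - 1.0) > 1e-9:
            raise ValueError("row for %r does not sum to 1" % (original,))
    perturbed = []
    for value in values:
        row = transition[value]
        categories = list(row)
        weights = [row[c] for c in categories]
        # Draw the published category according to the row's probabilities.
        perturbed.append(rng.choices(categories, weights=weights, k=1)[0])
    return perturbed


# Hypothetical event types in an existing log dataset.
categories = ["login", "logout", "file_access"]
# Keep the original category with probability 0.8, otherwise switch
# to one of the two other categories with probability 0.1 each.
transition = {
    a: {b: (0.8 if a == b else 0.1) for b in categories} for a in categories
}

rng = random.Random(3)
original = ["login"] * 6 + ["logout"] * 3 + ["file_access"]
print(pram(original, transition, rng))
```

Because the transition probabilities are known, an analyst can correct aggregate frequencies for the expected effect of the randomization, while individual published values remain deniable.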