The anonymization of datasets that include privacy-critical information is of great relevance in many common business data processing scenarios. Generally, anonymizing datasets is accomplished using approximation techniques. These techniques are still very demanding in terms of computational resources, both because of the characteristics of existing algorithms and because of the typically large size of the datasets to be processed. Consequently, anonymization is typically done off-line (i.e., not in real time).
The increasing availability of large and diverse datasets (e.g., Big Data representing customer data, transactions, demographics, product ratings, and the like) helps businesses gain insight into their markets and customers and predict future developments. Fully exploiting big data, however, raises various issues related to the possible disclosure of sensitive or private information. In particular, big data often contains a large amount of personal information, which is subject to multiple and stringent privacy regulations (the EU Data Protection Directive, HIPAA, and the like). Data protection and privacy regulations impose strong constraints on the usage and transfer of personal information, which makes handling the data complex, costly, and risky from a compliance point of view. As a consequence, personal data is often classified as confidential information, and only a limited number of business users (e.g., high-level managers) have access to the data, and then only under specific obligations (e.g., within the perimeter of the company network, no transfer to mobile devices, and the like). However, many business applications (e.g., business analytics and reporting, recommendation systems) do not need all the personal details about specific individuals, and an anonymized version of the dataset is still an asset of significant value that can address the business requirements in most cases.
Anonymization may increase protection, lower the privacy risk, and enable a wider exploitation of data. However, anonymization techniques are typically computationally intensive. As a result, anonymization is conventionally limited to off-line scenarios or small datasets, which diminishes its business impact by ruling out more advanced applications (e.g., real-time analytics and on-demand data services).
In conventional technologies, querying a large database and extracting an anonymized dataset in real time is not possible, and most anonymization processes are run off-line (i.e., as batch processes). Typically, users are prevented from retrieving data from a database as soon as that database contains personal information of any kind, even if only in a specific table. Therefore, a need exists for processing large volumes of data in real time and for anonymizing the data, as necessary, in real time.