This disclosure is directed to a method and system for identifying and blocking privacy vulnerabilities in data streams.
Due to worldwide and local privacy regulations, such as the EU Data Privacy Act and the US HIPAA guidelines, person-specific data have to be properly de-identified before being shared with third parties.
Several real-world applications require privacy protection for voluminous streaming data. For example, modern health-related information systems are being designed to handle real-time person-specific data, which are either provided directly by patients or collected through sensors connected to the patients, and to offer such data with a small delay to different data consumers, while remaining compliant with existing data privacy regulations and state-of-the-art privacy offerings.
Existing privacy solutions are not designed to anonymize such massive, fast-arriving datasets in a streaming, online fashion while protecting against the various types of privacy vulnerabilities that the datasets may contain. Furthermore, discovering the privacy vulnerabilities in such datasets is a non-trivial task that requires new approaches.
Existing algorithms for identifying vulnerabilities (in the form of sample uniques) in relational tables are too slow, cannot scale to medium-sized datasets (in terms of columns and rows), or require a prohibitively large amount of memory to operate. They are also inapplicable to data streams, as they require access to the entire dataset.
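For illustration only, the following minimal sketch shows one naive way sample uniques could be found in a small relational table. It assumes the usual meaning of the term, namely a combination of attribute values that occurs in exactly one record of the dataset. The function name `find_sample_uniques`, the parameter `max_size`, and the example columns are hypothetical and are not taken from this disclosure or from any existing algorithm referred to above. The sketch enumerates every subset of columns and counts value combinations over all rows, which makes the scaling and full-dataset-access limitations noted above concrete.

```python
from collections import Counter
from itertools import combinations


def find_sample_uniques(rows, columns, max_size=None):
    """Naive batch search for sample uniques (illustrative assumption only).

    Returns a list of (column_names, values, row_index) tuples, one for each
    row whose values on the given column subset occur exactly once.
    """
    if max_size is None:
        max_size = len(columns)
    uniques = []
    # The number of column subsets grows exponentially with the column count.
    for size in range(1, max_size + 1):
        for subset in combinations(range(len(columns)), size):
            # Counting requires a full pass over the entire dataset,
            # which is not possible when records arrive as a stream.
            counts = Counter(tuple(row[i] for i in subset) for row in rows)
            for idx, row in enumerate(rows):
                key = tuple(row[i] for i in subset)
                if counts[key] == 1:
                    names = tuple(columns[i] for i in subset)
                    uniques.append((names, key, idx))
    return uniques


# Hypothetical example data.
columns = ("age", "zip", "gender")
rows = [
    (34, "10001", "F"),
    (34, "10001", "M"),
    (34, "10002", "F"),
    (51, "10002", "F"),
]
single_column_uniques = find_sample_uniques(rows, columns, max_size=1)
```

In this example, the record with age 51 and the record with gender "M" are each unique on a single attribute, whereas the first record becomes unique only when zip and gender are considered together. Because every column subset is examined over the complete table, the approach is practical only for very small inputs.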