Businesses increasingly seek to make data-driven decisions using machine learning methods. Unfortunately, many organizations that wish to adopt machine learning techniques to analyze data and base decisions on that analysis face various challenges. For example, some organizations do not have the resources to collect large datasets that are relevant to their business. Others struggle to hire enough data scientists and/or data scientists with the appropriate skills. Another significant challenge for many organizations is that the data they wish to analyze may include sensitive data (e.g., information that is proprietary, confidential, subject to a protection order or secrecy order, and/or accessible only with special or government clearance) or private data (e.g., personal data containing identifying particulars of an individual or entity).
More specifically, for new or resource-constrained organizations (e.g., a new tech startup), one barrier to analyzing data is not having enough of it. Many machine learning techniques, both recent and traditional, assume a large number of data points of the kind that typically comes only with a large user base. For example, the recently published AlphaGo system samples 30 million data points after analyzing millions of games (see Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, 529(7587):484-489, January 2016), and the image classification network of Krizhevsky et al. was trained on images drawn from ImageNet, a publicly available dataset of more than 15 million labeled images (see Krizhevsky et al., “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems 25, pages 1097-1105, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Curran Associates, Inc., 2012). While many businesses may not want, or be able, to perform such sophisticated analysis on such large amounts of data, the general trend in machine learning is to use more data.
Additionally, organizations wishing to scale their data science efforts must increase the number of people who can work with their data. This difficulty may be compounded by the fact that the data to be analyzed may contain sensitive and/or private information that should not be freely shared with individuals who are not authorized to view it (e.g., a team of data scientists). Thus, to be able to share data for analysis, the organization would need to somehow anonymize the sensitive and/or private information or remove portions of it entirely. Both anonymizing sensitive and/or private information and omitting portions of it are non-trivial and error-prone tasks.
Anonymizing person-specific data is one option that allows organizations to publish data without leaking sensitive information such as names or social security numbers. However, deciding which information to anonymize and which to share is a non-trivial task. For example, organizations in the past have freely released the date of birth, gender, and zip code of their customers. Alarmingly, these three pieces of information, taken together, uniquely identify at least 87% of the United States population. Furthermore, it may be possible to cross-reference information from multiple sources to de-anonymize additional information.
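By way of illustration only, the re-identification risk posed by a combination of date of birth, gender, and zip code can be estimated by counting how many records carry a combination that no other record shares. The following Python sketch assumes a small, invented set of records; the function name unique_fraction and the sample values are hypothetical and are not drawn from any actual dataset.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, gender, zip_code).
# These values are invented purely for illustration.
records = [
    ("1980-02-14", "F", "02139"),
    ("1980-02-14", "F", "02139"),
    ("1975-07-01", "M", "02139"),
    ("1991-11-30", "F", "94107"),
    ("1991-11-30", "M", "94107"),
]


def unique_fraction(rows):
    """Return the fraction of rows whose attribute combination occurs exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows) if rows else 0.0


fraction = unique_fraction(records)
print(f"{fraction:.0%} of the sample records have a unique combination")
```

A record with a unique combination can be singled out by anyone holding a second source, such as a public list, that pairs the same three attributes with a name.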
Omitting sensitive data is another option that endeavors to protect the security and/or privacy of certain information. In particular, a conventional k-anonymity scheme deliberately omits individual entries of data in rows of a database to ensure that any row of data is indistinguishable from at least k−1 other rows. While this provides some degree of security, it nonetheless fundamentally changes the structure of the data, forcing anyone working with the data to change their approach.
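By way of illustration only, the omission of entries under a k-anonymity scheme may be sketched as follows. This Python sketch is a simple greedy suppression procedure written under stated assumptions; it is not the method of any particular reference and makes no attempt to minimize information loss. The names suppress_to_k_anonymity, is_k_anonymous, and column_order, the "*" placeholder, and the sample rows are all hypothetical.

```python
from collections import Counter

SUPPRESSED = "*"  # placeholder written in place of an omitted entry (an assumption)


def is_k_anonymous(rows, k):
    """Return True if every distinct row occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())


def suppress_to_k_anonymity(rows, k, column_order):
    """Greedily omit entries, one column at a time, in rows shared by fewer than k records.

    Rows that are still rare after every listed column has been tried are dropped.
    """
    rows = [tuple(row) for row in rows]
    for col in column_order:
        counts = Counter(rows)
        rows = [
            row if counts[row] >= k
            else row[:col] + (SUPPRESSED,) + row[col + 1:]
            for row in rows
        ]
    counts = Counter(rows)
    return [row for row in rows if counts[row] >= k]


# Hypothetical toy rows: (birth_year, gender, zip_code), invented for illustration.
rows = [
    ("1980", "F", "02139"),
    ("1980", "F", "02139"),
    ("1980", "F", "02141"),
    ("1975", "M", "94107"),
    ("1975", "M", "94110"),
    ("1991", "F", "94107"),
]

# Omit zip code first, then birth year, then gender.
anonymized = suppress_to_k_anonymity(rows, k=2, column_order=(2, 0, 1))
for row in anonymized:
    print(row)
```

In the resulting rows, several entries are replaced by the placeholder, so any downstream analysis must be written to tolerate missing values; this is the structural change to the data noted above.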