This application relates generally to publishing data. More specifically, the disclosure provided herein relates to efficient publication of sparse data.
Many entities maintain or use data that include sensitive information about clients, customers, users, and the like. These data can be valuable to the entities. For example, the data can be analyzed to determine usage patterns or trends, to identify and/or define audiences and potential audiences, to identify business development or improvement opportunities, and/or for other purposes. These data also can be valuable to the entities as a product that can be sold, leased, and/or otherwise shared with other entities for their own analysis, storage, and/or use.
One problem with using, storing, selling, or otherwise releasing these data is that the data often include sensitive information. For example, entities sometimes store detailed demographic information about customers such as income information, shopping and purchasing histories, and the like. The data and associated sensitive information can include enough detail that third parties are able to apply analysis and data mining techniques to determine identities of one or more customers and their associated demographic information. As such, the privacy of customers can be compromised by releasing the data.
To address these and other concerns, various laws and regulations have been crafted to govern how data can be published or used without compromising privacy or security of customers or other entities. Various methods are used to release the data in accordance with these laws and regulations, many of which require extensive consumption of resources. For example, some anonymization schemes used to enforce privacy on released data include adding noise to the values of the released data, a process that requires modification of a large number of values associated with the data. For small data sets, these schemes are reasonable, but for large data sets these schemes can become unduly burdensome for the data owner and can make use or sharing of the data impracticable.
Another challenge arises when releasing or using sets or collections of sparse data, i.e., data or data sets in which a large proportion, a majority, and/or a vast majority of the entries are zero-valued. For example, a data set of ten million commuters living and working across one million locations can result in a contingency table of one million home locations by one million work locations, or 1,000,000,000,000 entries. If each commuter is counted in a single entry, at most ten million of those entries can be nonzero, so the vast majority will have values of “0.” Simply storing this data set in full would consume an enormous amount of computing and/or storage resources, while adding noise to each entry or cell in this hypothetical contingency table would consume a dramatically greater amount of computing and storage resources, making such an approach for protecting privacy unwieldy or even unmanageable.
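The arithmetic behind the commuter example can be illustrated with a short Python sketch. It is offered only as an illustration and is not part of the disclosed method. The sketch assumes that each commuter is counted in exactly one (home, work) cell, and the names `sparse_contingency_table`, `NUM_COMMUTERS`, and `NUM_LOCATIONS` are hypothetical. The toy table at the end uses randomly generated pairs at a reduced scale so that the sketch runs quickly.

```python
import random
from collections import Counter

# Figures taken from the commuter example above.
NUM_COMMUTERS = 10_000_000
NUM_LOCATIONS = 1_000_000

# A dense contingency table has one cell per (home, work) pair.
total_cells = NUM_LOCATIONS * NUM_LOCATIONS          # 10**12 cells

# Assumption: each commuter falls in exactly one cell, so the number of
# nonzero cells cannot exceed the number of commuters.
max_nonzero = min(NUM_COMMUTERS, total_cells)
max_density = max_nonzero / total_cells              # at most 1 in 100,000


def sparse_contingency_table(pairs):
    """Count (home, work) pairs, storing only cells with a nonzero count."""
    return Counter(pairs)


# Scaled-down toy data (hypothetical, randomly generated).
rng = random.Random(0)
toy_locations = 1_000
toy_commuters = 10_000
pairs = [
    (rng.randrange(toy_locations), rng.randrange(toy_locations))
    for _ in range(toy_commuters)
]
table = sparse_contingency_table(pairs)

print(f"dense cells in full example: {total_cells:,}")
print(f"maximum nonzero cells:       {max_nonzero:,}")
print(f"toy table stores {len(table):,} of {toy_locations ** 2:,} cells")
```

Storing only the nonzero cells keeps memory proportional to the number of commuters rather than to the square of the number of locations. A scheme that adds noise to every cell, by contrast, would have to touch all 10^12 cells of the dense table.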