The increase in computer use has resulted in an increase in available data. Companies are currently taking advantage of opportunities to monetize their data by selling or sharing it with third parties, such as advertisers, and by participating in collaborative data sharing initiatives, such as collaborative security. Transfer or sharing of the data can provide benefits to the data holder as well as the data recipient. For example, a data holder, such as a social network, may provide its data to the data recipient in exchange for monetary value, and the data recipient can utilize the data for providing a new service, starting a new company, or conducting research, among other opportunities.
However, data often includes inconsistencies, conflicts, and errors, which can increase data processing costs and have a negative impact on data analytics. Thus, data recipients may end up spending more time and money than expected to clean data acquired from another party prior to use. Determining the quality of a dataset prior to obtaining the data can help a business to make an informed determination regarding whether or not to acquire the dataset.
Conventional means to determine data quality and automatically clean the data exist. In one approach, audits are used to assess the quality of data held by a third party. During an audit, an individual or organization obtains full access to the data and directly examines the quality of the data. Another approach includes sharing data snippets that reflect the quality of the overall dataset to which the data snippets belong. However, both approaches breach the privacy of the data, because each exposes actual data to an outside party. Further, a different approach includes authorizing potential clients to request computation of certain data quality metrics, but the data quality metrics are not kept private, and the requests allow the data holder to obtain information regarding a potential recipient of the data.
A further approach, known as Private Set Intersection (PSI), attempts to conduct a privacy-preserving data quality assessment. PSI allows two parties to compute the intersection of their data while protecting the privacy of the data for each party. A variant, Private Set Intersection Cardinality (PSI-CA), reveals to each party only the cardinality of the data set intersection. However, both PSI approaches have extremely high overhead and are not practical for computing multiple data quality metrics.
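For illustration only, the difference between what PSI and PSI-CA reveal can be sketched in plaintext as follows. This is not a privacy-preserving protocol: it simply computes, in the clear, the result each approach is intended to disclose. The function names and the sample IP-address sets are hypothetical and do not come from any PSI library.

```python
def psi_result(holder_items, recipient_items):
    """Result a PSI protocol is meant to reveal: the items both parties hold.

    Computed here in plaintext purely to show the output; a real PSI
    protocol would produce this without either party seeing the other's
    non-matching items.
    """
    return set(holder_items) & set(recipient_items)


def psi_ca_result(holder_items, recipient_items):
    """Result a PSI-CA protocol is meant to reveal: only the intersection size."""
    return len(psi_result(holder_items, recipient_items))


# Hypothetical example: two parties each hold a set of IP addresses.
holder = {"10.0.0.1", "10.0.0.7", "192.168.1.5"}
recipient = {"10.0.0.7", "172.16.0.3", "192.168.1.5"}

shared_items = psi_result(holder, recipient)     # the common items themselves
shared_count = psi_ca_result(holder, recipient)  # only how many items are common
```

In this sketch, PSI would disclose the two common addresses, whereas PSI-CA would disclose only that two addresses are common.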
Therefore, there is a need for efficiently determining the quality of a data set without disclosing the actual data to a potential recipient. Preferably, the data quality metric is provided as a private data element that cannot be seen by third parties.