Sharing private data is increasingly necessary for scientific progress in fields that are dominated by sensitive information. In many instances, the potential value of sharing data depends on the overlap between private datasets. The amount of overlap may factor into a decision to work through the institutional, legal, or ethical regulations governing access to private data.
Overlap estimates of the contents of databases may be used to provide evidence of non-independent samples in private databases such as rare disease registries. Such evidence has previously been difficult to generate without sharing the data. If the degree of overlap between private datasets could be estimated while maintaining the privacy of the underlying data, the benefits of investing in formal data sharing could be determined.
Existing methods of estimating overlap between private datasets either calculate exact overlap or estimate overlap using a Bloom filter. The methods that calculate exact overlap focus on public identifiers that retain a one-to-one (or nearly one-to-one) relationship to the corresponding private identifiers. These efforts attempt to secure the private information in the database by using a one-way hash to generate the public identifiers, so that it is non-trivial to recover a private identifier from a list of public identifiers. However, this strategy is vulnerable to enumeration attacks: an adversary hashes valuable, common, or otherwise interesting private identifiers and searches the published list of public identifiers for the resulting hashes, thereby testing each candidate's membership in the private dataset. Comprehensive attacks, which exhaustively enumerate all possible private identifiers, may likewise expose private information in a database encoded with certain hash functions. Unfortunately, such comprehensive attacks work against any method that exactly computes the overlap between private datasets, including other cryptographic protocols such as garbled circuits. Countering these attacks requires keeping the public identifiers or the hash function secret, or password-protecting the public database, a solution that is acceptable in trusted environments but inadequate for public, untrusted ones. Overlap estimates based on Bloom filters, meanwhile, may be inaccurate with larger datasets and may lack a statistical framework for calculating significance, error, and information risk.
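The enumeration attack described above can be sketched in a few lines. This is a hypothetical illustration, not the scheme of any particular database: the identifiers, the candidate list, and the choice of SHA-256 as the one-way hash are all assumptions. An adversary who knows the (public) hash function simply hashes candidate private identifiers of interest and checks them against the published list:

```python
import hashlib

def public_id(private_id: str) -> str:
    """One-way hash of a private identifier (the value that gets published)."""
    return hashlib.sha256(private_id.encode()).hexdigest()

# Published list of public identifiers derived from a private database
# (illustrative contents).
published = {public_id(x) for x in ["aspirin", "ibuprofen", "caffeine"]}

# Enumeration attack: no secret is needed, because both the hash function
# and the public identifiers are public. Hash each candidate private
# identifier and test whether it appears in the published list.
candidates = ["aspirin", "morphine", "caffeine"]
recovered = [c for c in candidates if public_id(c) in published]
print(recovered)  # membership of "aspirin" and "caffeine" is revealed
```

Because the attack needs only the published identifiers and the hashing method, keeping either one secret is the only defense for exact-overlap schemes, which is precisely the limitation noted above.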
Several new ways of analyzing screening data have been developed. These include strategies for identifying active molecules from primary screens that leverage information from fingerprints, scaffold groupings, economic modeling, and improved processing of raw data; automated strategies for organizing screening data into workflows; and a series of powerful strategies for visualizing how biological activity maps to chemical space. Finding ways to securely share relevant information about screening data while keeping the chemical structures blinded could open the door to valuable collaborative work, with the hope of both reducing cost and improving the value extracted from chemical data.
A suitable method of securely sharing databases should be sufficiently informative about the private datasets to enable computation of their overlap. Even a noisy estimate of overlap is informative if it is accurate enough to guide decisions about data sharing. The method's estimates of overlap should be stable, with a tunable accuracy at the same resolution for both small and large datasets. The method should be secure, making it difficult to determine the membership of any specific entity in a private dataset. Critically, security considerations suggest that overlap estimates should not be exact, but instead should be intrinsically noisy. Finally, the method should be public, relying on neither secret database passwords, private hash functions, nor back-and-forth message passing, with summaries publishable in public spaces.
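To make the idea of an intrinsically noisy overlap estimate concrete, the following sketch uses a Bloom filter, one of the existing approaches mentioned earlier. The filter size, hash count, and datasets are illustrative assumptions. Each party publishes only a bit array; overlap is then estimated by inclusion-exclusion on the standard Bloom-filter cardinality estimate, n ≈ -(M/K)·ln(1 - X/M), where X is the number of set bits:

```python
import hashlib
import math

M = 1024  # filter size in bits (assumed parameter)
K = 4     # number of hash functions (assumed parameter)

def positions(item: str):
    """K bit positions for an item, derived from salted SHA-256 hashes."""
    return [int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % M
            for i in range(K)]

def make_filter(items):
    """Build a Bloom filter (bit list) from an iterable of items."""
    bits = [0] * M
    for item in items:
        for p in positions(item):
            bits[p] = 1
    return bits

def estimate_cardinality(bits):
    """Standard estimate of the number of inserted items: -(M/K)*ln(1 - X/M)."""
    x = sum(bits)
    return -(M / K) * math.log(1 - x / M)

def estimate_overlap(bits_a, bits_b):
    """Inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B| (union = bitwise OR)."""
    union = [a | b for a, b in zip(bits_a, bits_b)]
    return (estimate_cardinality(bits_a) + estimate_cardinality(bits_b)
            - estimate_cardinality(union))

a = make_filter(f"item{i}" for i in range(100))       # 100 items
b = make_filter(f"item{i}" for i in range(50, 150))   # 100 items, 50 shared
print(round(estimate_overlap(a, b)))  # noisy estimate of the true overlap (50)
```

The estimate is inherently noisy because hash collisions blur the bit counts, illustrating the kind of approximate answer described above, though, as noted, Bloom filters degrade on larger datasets and lack the accompanying statistical framework.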
Therefore, there is a need for a method of encrypting private datasets that relies neither on the secrecy of the public identifiers nor on the secrecy of the method used to generate them, and that is resistant to various methods of deducing elements of the private dataset, such as private identifiers.