With each generation of genomic sequencing systems, an unprecedented amount of molecular information becomes available for clinical diagnosis and treatment planning. This is particularly the case in oncology, where high-throughput molecular profiling is the key driver towards personalization of diagnostics, therapy selection and response assessment. As such, decision support systems for generating and providing informatics using sequence data of patient DNA can be critically important tools.
When genomic information is used to make clinical decisions, it is typically used in a linear fashion, where a set of standard analysis steps is performed sequentially to transform the raw genomic sequence into a particular set of informatics products, e.g., biomarkers, genetic variants, or other patient-specific features. This set of genomic information is then processed specifically for one patient within a single study, or for the purpose of answering one clinical question. Thus, conventional genomic informatics are generated without providing an opportunity to explore multiple clinical options, or to assess patient-specific data in the context of population-based cohorts that include other patients' genomic information.
One of the largely unmet requirements for conventional genomic informatics systems is the need to provide genomic information security and privacy. Genomic or any information that can be used to identify patients is subject to government and industry regulation. Institutions and individuals that hold patient-identifying information, including those in research and clinical settings, are further required to notify patients in the event that data security or privacy is breached. As a result, genomic data holders are reluctant to share patient-specific genomic information.
Conventional systems and methods are limited in their capacity to protect genomic information with sufficient granularity and contextual sophistication. For example, as shown in FIG. 1, conventional data releases that include patient-specific genomic information focus on the proper credentialing of researchers, clinicians, and their affiliated institutions. Once the individual or institution is granted access, the genomic information in its raw form and the responsibility for protecting it are passed on to the recipient. This approach, by far the most common, has several serious drawbacks from a security point of view. For example, it assumes that security can be maintained in subsequent distribution or dissemination of the information through transitive trust relationships when, in fact, subsequent recipients of the information can cause inadvertent privacy breaches.
This approach also does not provide any detailed audit information on individual instances of information access, even though it requires that all of the data for a large study be centralized. In this approach, the security or privacy policies governing the use of the information cannot change dynamically or are simply not enforced, and often the recipients must be trusted to enforce the policies themselves.
Another conventional method provides an application programming interface (API) that gives secure access to the information via trusted data stores. This approach has two significant advantages. First, if the recipients are required to authenticate themselves, it is possible to discriminate amongst different recipients and apply potentially different policies to their accesses. Second, this method can provide fine-grained access: recipients can ask only for the data required for a particular task. The information disclosed, or potentially disclosed, can also be metered and audited, and this level of auditing allows potential sources of leaks to be identified.
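By way of illustration only, the API-based approach described above might be sketched as follows. The sketch is not taken from any particular conventional system: the class name TrustedGenomeStore, the token-based stand-in for authentication, the policy table, and the variant identifier "rs0001" are all hypothetical assumptions chosen to show per-recipient policies, fine-grained requests, and an audit trail in one place.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TrustedGenomeStore:
    """Hypothetical trusted data store; every name here is illustrative."""
    genomes: dict   # genome id -> {variant id: genotype}
    policies: dict  # recipient token -> set of variant ids it may read
    audit_log: list = field(default_factory=list)

    def get_variant(self, token, genome_id, variant_id):
        # Per-recipient policy check (the token stands in for real authentication).
        allowed = variant_id in self.policies.get(token, set())
        # Record every access attempt, whether or not it is granted.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "recipient": token,
            "genome": genome_id,
            "variant": variant_id,
            "granted": allowed,
        })
        if not allowed:
            raise PermissionError(f"{token!r} may not read {variant_id!r}")
        # Fine-grained access: only the single requested value is returned.
        return self.genomes[genome_id].get(variant_id)


store = TrustedGenomeStore(
    genomes={"G1": {"rs0001": "A/G"}, "G2": {"rs0001": "G/G"}},
    policies={"clinic-token": {"rs0001"}},
)
print(store.get_variant("clinic-token", "G1", "rs0001"))
```

In this sketch, each call discloses a single value and leaves one audit record, so the disclosure can be metered per recipient and a refused request is still logged.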
On the other hand, this approach still returns sensitive data into an untrusted environment. To see where this might pose a problem, consider a simple example: given a list of genome identifiers, determine the number of genomes that have a particular variant. Suppose that the API simply returns variants by identifier. Under these circumstances, the untrusted code will learn the individual variants for each of the genomes in the list, which were presumably chosen based on phenotypical characteristics. The associations between those variants and the phenotypical characteristics may thus be compromised, thereby weakening security.
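The counting example above might be sketched as follows, purely for illustration. The data, the function get_variants (a stand-in for an API that returns variants by identifier), and the variant labels are hypothetical assumptions. The sketch shows that the untrusted code needs only a single count, yet it receives the full variant set of every genome it queries.

```python
# Hypothetical data held by the trusted store: genome id -> set of variants.
GENOMES = {
    "G1": {"var_A", "var_B"},
    "G2": {"var_B"},
    "G3": {"var_A", "var_C"},
}


def get_variants(genome_id):
    """Stand-in for an API call that returns variants by identifier."""
    return set(GENOMES[genome_id])


def count_carriers_untrusted(genome_ids, variant):
    """Runs in the recipient's (untrusted) environment."""
    seen = {}  # everything the untrusted code learns along the way
    count = 0
    for gid in genome_ids:
        variants = get_variants(gid)  # full variant set leaves the store
        seen[gid] = variants
        if variant in variants:
            count += 1
    return count, seen


count, leaked = count_carriers_untrusted(["G1", "G2", "G3"], "var_A")
print(count)   # the only value the task required
print(leaked)  # per-genome variants now held outside the trusted store
```

Because the list of identifiers was itself selected on phenotypical grounds, the contents of `leaked` tie individual variants to that selection, even though the stated task called only for `count`.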