The science of bioinformatics applies sophisticated analytic techniques to biological data, such as genome sequences, to better understand the underlying biology. “Next-generation” sequencing systems perform chemical analysis of a sample containing nucleic acid and generate many sequence “reads,” i.e., short nucleic-acid segments typically less than 1000 base pairs (bp) in length. Overlapping reads are aligned to a reference sequence (such as a genome) to reveal important genetic or structural information (e.g., biomarkers for disease). Ultimately, the goal of sequence alignment is to combine the set of nucleic-acid reads produced by the sequencer into a longer contiguous sequence (a “contig”) or even the entire genome of the sample source. Because the sequence data from next-generation sequencers often comprises millions of shorter sequences that together represent the totality of the target sequence, aligning the reads is complex and computationally expensive. Systems that perform this type of alignment may represent sequences as graph data structures (e.g., directed acyclic graphs); a representative system is described in U.S. Pat. No. 9,390,226, the entire disclosure of which is hereby incorporated by reference.
Graph-based references may be quite valuable, because they represent the results of multiple sequencing efforts that have been analyzed to identify variants (e.g., single-nucleotide polymorphisms (SNPs), structural variants, insertions and deletions) among different individuals of the same species. Candidate sequences, which may be very short “k-mers” (sequences of length k bp, where k is generally less than 100 and often less than 20) or longer reads, are analyzed against a reference sequence using an alignment tool, which determines the degree of similarity between the candidate sequence and the reference sequence over the entirety of the latter. That is, the alignment tool finds the best match between an input segment and the reference sequence wherever this match occurs, and reports a score indicating the quality of the match.
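By way of illustration only, the following sketch shows one conventional way an alignment tool might find the best match of an entire candidate sequence anywhere within a linear reference and report a score. The function name `best_match_score` and the scoring values (+1 for a match, -1 for a mismatch or gap) are assumptions chosen for the example; they are not taken from any particular alignment tool, and alignment against a graph reference is more involved than this linear case.

```python
def best_match_score(candidate, reference, match=1, mismatch=-1, gap=-1):
    """Best score for aligning the whole candidate anywhere in the reference.

    Hypothetical helper: the candidate is aligned end to end, while the
    alignment may start and end at any position of the reference.
    """
    # Row for the empty candidate prefix: starting anywhere in the
    # reference costs nothing.
    prev = [0] * (len(reference) + 1)
    for i in range(1, len(candidate) + 1):
        # Candidate bases placed before any reference base count as gaps.
        curr = [prev[0] + gap]
        for j in range(1, len(reference) + 1):
            same = candidate[i - 1] == reference[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # candidate base aligned to a gap
            left = curr[j - 1] + gap  # reference base skipped inside the match
            curr.append(max(diag, up, left))
        prev = curr
    # The alignment may end anywhere in the reference.
    return max(prev)


if __name__ == "__main__":
    print(best_match_score("ACGT", "TTACGTTT"))
    print(best_match_score("ACGT", "TTACTTTT"))
```

Under these assumed scoring values, a candidate that occurs exactly in the reference receives a score equal to its length, and each mismatch or gap lowers the score.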
Although service bureaus that accept candidate sequences and perform alignments against proprietary reference sequences can easily maintain the physical security of those references, the references nonetheless remain vulnerable to illicit reconstruction by intruders who may, for example, submit candidate sequences structured so that the resulting alignment provides information about the reference sequence graph. In sufficient quantity, such information can permit reconstruction of all or part of the graph. If a reference graph is made available publicly (e.g., for use with a proprietary alignment tool), those with access may simply copy or modify the graph in violation of contractual or other legal obligations.
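The vulnerability described above can be made concrete with a deliberately simplified sketch. It assumes a toy linear reference (not a graph), a hypothetical scoring service `make_score_service` that returns only an ungapped match count, and a short seed already known to occur in the reference; none of these names or simplifications come from an actual service. Even so, the reported scores alone suffice to recover the hidden sequence one base at a time.

```python
def make_score_service(hidden_reference):
    """Hypothetical service: exposes only alignment scores, never the reference."""
    def score(candidate):
        k = len(candidate)
        # Ungapped score: most matching positions over every placement.
        return max(
            (sum(a == b for a, b in zip(candidate, hidden_reference[i:i + k]))
             for i in range(len(hidden_reference) - k + 1)),
            default=0,
        )
    return score


def extend_right(seed, score, max_len):
    """Grow a known seed using only the scores returned by the service."""
    known = seed
    while len(known) < max_len:
        for base in "ACGT":
            # A perfect score means known + base occurs in the reference.
            if score(known + base) == len(known) + 1:
                known += base
                break
        else:
            break  # no base yields a perfect score, so extension stops
    return known


if __name__ == "__main__":
    service = make_score_service("GATTACAGGCT")
    print(extend_right("GAT", service, max_len=20))
```

In this toy setting each added base costs at most four queries. Where the reference contains repeats, the recovered string is still a substring of the reference but may follow a different occurrence than the one from which the seed was drawn.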
Generating graph-based genome references is a time- and resource-intensive process. Graph genome content is not easily protectable, particularly when results generated by querying the graph are shared. Security methods are required to protect ownership of this shared resource and to dissuade malicious extraction of the data stored within.