New advances in human DNA sequencing and computing technology are coming together to allow rapid identification of the complete genomes of individuals with striking potential for transforming personalized medicine. At present, key limitations to realizing the full potential of whole genome sequencing (WGS) include the time, expense, and reliability of processing DNA sequence data to generate the list of all variations present in an individual and to reveal those alterations that cause disease or modify the risk of disease. Advances in the analysis of genome sequences are, thus, critical for diagnosis and potentially informative for treatment. Accordingly, bioinformatics and innovation in computational methods are key drivers for improving the analysis and interpretation of DNA sequence data.
The current price of WGS is approaching $1,000 for the laboratory procedures, but the time and expense of computational analyses and genetic interpretation remain significantly higher. Currently, the most widely used approaches report the set of differences between an individual and a global reference genome sequence, as most of the 3 billion positions are the same. To obtain the set of differences, these approaches rely on a two-step process: (1) read mapping and (2) variant calling. The data produced in WGS are a collection of millions of short (e.g., ~150-letter) “reads”, where each letter is one of the four canonical DNA nucleotides (A, C, G or T). These “reads” are then placed onto (“aligned against”) a reference genome sequence in a manner akin to placing puzzle pieces onto a picture of the puzzle. Read mapping algorithms take a single read and scan through a reference genome to find where the read fits best. For each read there are billions of possible positions to be considered. Once all reads have been placed, there are usually redundant overlapping reads (on average 30-100 copies per position). An algorithm is then applied to pass over the 3 billion characters of the reference genome and evaluate statistically whether the mapped read data indicate a complete match to the reference genome at each position, or whether there is evidence that the individual being sequenced differs from the reference either partially or completely.
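The two-step process described above can be illustrated with a deliberately naive sketch. It is not how production mappers or variant callers work (those use indexed references, gapped alignment, and statistical genotype models); the function names `map_read` and `call_variants`, the thresholds `min_depth` and `min_fraction`, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_read(read, reference):
    """Step 1 (read mapping): brute-force, ungapped placement.

    Returns (position, mismatches) for the first placement with the
    fewest mismatches. Real mappers use an index instead of a full scan.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def call_variants(reads, reference, min_depth=2, min_fraction=0.8):
    """Step 2 (variant calling): pile up mapped reads and report positions
    where the dominant observed base differs from the reference.

    min_depth and min_fraction are illustrative thresholds, not a
    statistical model.
    """
    pileup = defaultdict(Counter)
    for read in reads:
        pos, _ = map_read(read, reference)
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = {}
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and n / depth >= min_fraction:
            variants[pos] = (reference[pos], base)
    return variants


# Toy data: the sequenced individual carries T instead of A at position 7.
reference = "ACGTTGCAAGCTTAGC"
reads = ["GTTGCTAG", "TGCTAGCT", "GCTAGCTT"]
print(call_variants(reads, reference))
```

The full scan in `map_read` is what makes the real problem expensive: every read is compared against every candidate position, which is why practical tools rely on precomputed indexes.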
However, the mapping of reads remains a significant challenge. Most of the single nucleotide variations from the reference genome present in any individual are polymorphic, meaning that many other individuals have the same sequence variation. It is estimated that roughly 85% of the variants found in each individual are polymorphic and 15% are rare variants. This variation can complicate alignment to the reference when sequence differences from the reference exist, potentially resulting in a failure to place the observed DNA sequence at the correct location. This reflects an allele bias in the reference: reads that perfectly match the reference are handled better than those that do not. Thus, the reference allele bias is a key source of errors in variant detection associated with the widely used practice of WGS analysis. The current approach to this problem is to select a reference sequence that is as similar to the new sequence as possible, and multiple approaches of this kind have been introduced (for example, using an ethnically similar reference genome for mapping). Such approaches are fundamentally oriented to using a reference which accounts for the common variants that are observed in a population. As a result, large-scale efforts are underway to provide a comprehensive inventory of human genetic variation, and to identify rare variants causal for genetic disorders and contributory to diverse diseases; examples include the dbSNP database, the HapMap project, and the 1000 Genomes project. In addition to these efforts, the recognition of the reference allele bias has also motivated efforts to develop alternative computational approaches to represent a reference genome that can efficiently include these known polymorphic locations and, therefore, allow improved read mapping. Such an approach would allow read mapping software to consider all recurring variations at each position, thereby eliminating or reducing the reference allele bias.
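The reference allele bias can be made concrete with a small sketch. Here a reference is modeled as a list of sets of acceptable bases per position; a conventional linear reference has exactly one base per set, while a variant-aware reference adds known polymorphic alleles. The representation, the function names, and the polymorphism at position 7 are hypothetical and chosen only to illustrate the idea.

```python
def mismatches(read, ref_sets, pos):
    """Count read bases that are not among the accepted bases at each position."""
    return sum(
        1 for offset, base in enumerate(read)
        if base not in ref_sets[pos + offset]
    )


def best_placement(read, ref_sets):
    """Return (mismatches, position) of the best ungapped placement."""
    last_start = len(ref_sets) - len(read)
    return min((mismatches(read, ref_sets, p), p) for p in range(last_start + 1))


# Linear reference: one accepted base per position.
linear = [{base} for base in "ACGTTGCAAGCTTAGC"]

# Variant-aware reference: same sequence, plus a hypothetical known
# A/T polymorphism at position 7.
aware = [set(s) for s in linear]
aware[7].add("T")

read = "GTTGCTAG"  # carries the alternate allele T at position 7

print(best_placement(read, linear))  # the alternate allele is penalized as a mismatch
print(best_placement(read, aware))   # the same read now matches without penalty
```

Against the linear reference, a read carrying the alternate allele is scored worse than a read carrying the reference allele even though both are correct observations; once the polymorphism is part of the reference, the penalty disappears.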
Another drawback to the existing alignment strategies is that the available reference genomes do not account for the existing variation that may be relevant for the comparison. For example, there is currently a global human reference genome (GRCh38) that is based on the combined genetic material of 13 anonymous individuals. A key limitation of the primary human reference genome is that it is but a single reference and does not account for known variations. For instance, if 51% of source people have an A at position X and 49% of source people have a C, the current reference genome would only report an A at that position. Beyond single character changes, the current reference genome also fails to allow for larger-scale properties, such as regions with variable numbers of repetitions or positions known to have variable structural rearrangements.
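The 51%/49% example above can be sketched in a few lines. A single-sequence reference keeps only the most common allele, whereas a frequency-aware representation retains both; the function names and the sample of 100 observations are assumptions for illustration.

```python
from collections import Counter


def consensus_base(alleles):
    """What a single linear reference records: only the most common allele."""
    return Counter(alleles).most_common(1)[0][0]


def allele_frequencies(alleles):
    """What a variation-aware reference could record: every allele and its frequency."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# Hypothetical observations at position X across 100 source genomes.
observed = ["A"] * 51 + ["C"] * 49

print(consensus_base(observed))
print(allele_frequencies(observed))
```

The consensus discards the C allele entirely even though nearly half of the sampled genomes carry it.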
As indicated above, clinical WGS is dependent upon access to a reference genome, which provides the framework upon which sequence variation can be organized and reported. Much like solving a jigsaw puzzle, efficiency is significantly improved by the availability of a picture of the completed jigsaw to guide placement. As shown in FIGS. 1A-1D, the combined steps of alignment and variation calling are far from optimal as they fail to account for all available information and can produce multiple, distinct results. Thus, any calling procedure introduces biases based on the reference genome used.
As an alternative to the current text-based reference genomes, graph-based models of DNA sequences have been explored. Graph models generally represent data using the concept of nodes and edges, where classically a node represents an observed property (e.g., a nucleotide at one position in a DNA sequence) and an edge represents movement from the previous position to the next. A variety of graph types have been introduced in the computer science field. Several common graph structures have been compared for their relevance to DNA sequence analysis (Kehr, B. et al. (2014) BMC Bioinformatics 15:99; incorporated herein by reference in its entirety). Common graph types such as De Bruijn graphs and string graphs have been used in procedures for de novo read assembly (see, e.g., Flicek, P. et al. (2009) Nature Methods 6(11 Suppl):S6-S12; incorporated herein by reference in its entirety).
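As a minimal sketch of one of the graph types mentioned above, the following builds a De Bruijn graph from short reads. It follows one common convention in which nodes are (k-1)-mers and each k-mer observed in a read contributes a directed edge from its prefix to its suffix; other conventions place k-mers on the nodes. The function name and toy reads are assumptions for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph as an adjacency map.

    Nodes are (k-1)-mers; every k-mer seen in a read adds an edge
    from its first k-1 letters to its last k-1 letters.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; shared k-mers collapse onto the same edges.
graph = de_bruijn(["ACGTC", "CGTCA"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Because overlapping reads contribute identical k-mers, redundancy in the read set is absorbed by the graph, which is why this structure is popular for de novo assembly.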
A graph-based reference genome has a number of advantages. It can represent all polymorphisms (recurrent variations) concurrently. Polymorphisms can be associated with a positional probability, allowing correlation between positions, as well as ethnic population differences (Dilthey, A. et al. (2015) Nature Genetics 47(6):682-8; incorporated herein by reference in its entirety), to be represented in a single graph. Importantly, graph models can assign a universally unique ID to each location (Paten, B. et al. (2014) arXiv:1404.5010; incorporated herein by reference in its entirety) that remains valid as more insertions and deletions are discovered, allowing a reference to be updated as new data become available.
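The following sketch shows, under simplifying assumptions, how a graph reference can hold both alleles of a polymorphism at once, attach a probability to each, and give every segment its own identifier. The node IDs, the dictionary layout, and the probabilities are hypothetical and do not correspond to any published graph genome format.

```python
# Each node: id -> (sequence, {successor id: probability of taking that edge}).
# Node ids are arbitrary stable labels, independent of linear coordinates.
graph = {
    "n1": ("ACGTTGC", {"n2": 0.51, "n3": 0.49}),  # branch point of an A/C polymorphism
    "n2": ("A", {"n4": 1.0}),
    "n3": ("C", {"n4": 1.0}),
    "n4": ("AGCTTAGC", {}),                       # sink: no successors
}


def haplotypes(graph, node, prefix="", prob=1.0):
    """Enumerate every path from `node` to a sink with its path probability."""
    seq, successors = graph[node]
    prefix += seq
    if not successors:
        yield prefix, prob
        return
    for nxt, p in successors.items():
        yield from haplotypes(graph, nxt, prefix, prob * p)


paths = dict(haplotypes(graph, "n1"))
for sequence, probability in paths.items():
    print(sequence, probability)
```

Adding a newly discovered variant means adding a node and edges; existing node IDs are untouched, which is the property that makes stable identifiers attractive. Exhaustive path enumeration as done here is only feasible for toy graphs, since the number of paths grows exponentially with the number of variant sites.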
While a full reference genome based on graphs has not yet been created, efforts are underway to develop a human genome variation map as a mathematical graph based on a De Bruijn graph, which represents DNA sequence as nodes. De Bruijn graphs are a compressed data structure. However, in most cases an adjunct data structure, such as read pair information, is needed to map reads onto the graph.
Despite the advances in generating graph-based models of reference sequences, many challenges remain. For example:
Linear representations allow for an unambiguous co-ordinate system and, thus, defining distances is simple. Within a flexible graph the represented components are relative and, therefore, searching for a particular region within the genome is a more complex problem.
Current annotations use the fixed linear co-ordinates system, which may occur on multiple paths through the graph. Accordingly, a graph reference genome would need to link the annotations to all subgraphs that represent that section of the reference.
The creation of randomized examples to serve as controls in an experiment is difficult.
Updating the graph reference genome as new information is discovered can be difficult due to the relative and flexible nature of graph structures.
File formats and incompatibility with current software and tools that are using the linear data structure create adoption problems.
Graphs have only recently been used to perform read alignment on genomic data and, therefore, there are fewer algorithms developed that utilize graphs.
Ease of visualization for large datasets. It can be difficult to view complicated non-planar data structures in a simple planar form.
Another significant challenge with the current graph model methods is their enormous storage requirement for computational analysis. Whereas DNA is diploid for humans (i.e., two copies of each chromosome with one from each parent), each reference genome is haploid. Additionally, graph models do not account for the diversity of DNA sequence variation observed within and between populations. It has been shown that using population reference De Bruijn graphs that combine multiple reference sequences as well as SNPs and indels improves the accuracy of alignment algorithms (Dilthey, A. et al. (2015) Nature Genetics 47(6):682-8). However, the human genomes used by popular algorithms such as BOWTIE are approximately 2.3 GB in size without any “-omics” data. Hence, storing a complete picture of the human genome using current graph methods requires enormous storage and RAM. Thus, there remains a need for systems and methods to provide a reference genome that would capture all polymorphisms, such as single nucleotide changes, small insertions or deletions, and larger structural changes such as regional duplications, inversions, or translocations, as well as the correlations between all such variations, without creating an undue requirement for data storage and processing capacity.
A key challenge in genome analysis is the computational scale of the problems. Processing WGS can take multiple days on a multi-CPU supercomputer. In 2011, D-Wave Systems announced the first commercial quantum annealer, a new approach to supercomputing. Quantum computers optimize a function describing the system over a set of candidate states using fluctuations (i.e., changes in the energy at points of the separable complex Hilbert space where the function operates). A quantum computer exploits superposition and entanglement, enabling it to consider all possible states simultaneously. To illustrate, there are over 6 billion base pairs in a human cell, and each location comprises one of four nucleotides; hence, there are roughly 4^(6 billion) possible states. These numbers are simply too big for classical computers unless significant data compression techniques are used, which leads to loss or distortion of information. This new development in technology offers a chance to transform DNA analysis: to accelerate and improve the quality of clinical results by simultaneously evaluating all possible states for reference DNA. There is, however, a major challenge related to the technology. The most advanced processors can only handle 256 qubits, which limits the capacity of the system to work with only small graphs composed of a few hundred nodes. Accordingly, there is a need for a graph representation of the genome which sharply reduces the size. Likewise, there is a need for a graph representation that can be analyzed using parallel computing techniques that are currently available until quantum computers become more powerful.
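To give a sense of scale for the state count discussed above (four possible nucleotides at each of roughly 6 billion positions, i.e., 4 raised to the power of 6 billion), the short calculation below works with logarithms rather than the number itself. The variable names are illustrative, and the 256-qubit figure is taken from the text above.

```python
import math

positions = 6_000_000_000   # approximate base pairs in a diploid human cell
alphabet = 4                # A, C, G, T

# Number of decimal digits in alphabet ** positions, via logarithms,
# because the number itself is far too large to write out.
decimal_digits = math.floor(positions * math.log10(alphabet)) + 1

# Bits needed just to label a single state: log2(4) = 2 bits per position.
bits_per_state = positions * int(math.log2(alphabet))
gigabytes_per_state = bits_per_state / 8 / 1e9

print(f"4^(6 billion) has about {decimal_digits:.2e} decimal digits")
print(f"one state label needs {bits_per_state:,} bits (~{gigabytes_per_state} GB)")

# For comparison, 256 two-level qubits span 2**256 basis states,
# a number with only 78 decimal digits.
print(len(str(2 ** 256)), "decimal digits in 2**256")
```

The gap between a number with billions of digits and one with fewer than a hundred is the quantitative reason a much more compact graph representation is needed before such hardware can be applied.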
Other approaches to develop a graph representation of the genome include the DISCOVAR algorithm, which uses a De Bruijn graph in which observed DNA sequences are represented as edges (Weisenfeld, N. et al. (2014) Nature Genetics 46(12):1350-55; incorporated herein by reference in its entirety). However, the assembly graph created using DISCOVAR does not give a probabilistic weighting to the edges in accordance with other “-omics” data available. The assembly graph lacks the flexibility to select known variants and therefore to create an updated reference genome based on known variations between populations, etc. The assembly graph is a unipath graph, which is a directed graph derived from the k-mer graph where each node represents a k-mer sequence. Furthermore, as a k-mer graph, the DISCOVAR procedure produces a compressed graph and requires an adjunct data structure (such as read pair information) to perform read alignment and is, therefore, not suitable for use in quantum computing.
Despite the advances of the art in generating reference genomic sequences for assembly of short reads, there remains a need to produce a reference graph that represents all relevant polymorphisms simultaneously in an efficient and compact manner to optimize data storage and processing capacity, such as through quantum computing. The present disclosure provides methods and systems that address these and related needs.