The completion of the first human reference genome enabled the discovery of the whole catalogue of human genes, ushering in a new era of genomics research to discover the molecular basis of disease. One approach for genomic analyses relies on reference-guided alignment, the accurate mapping of relatively short sequence reads generated by Next Generation Sequencing (NGS) technologies to a reference genome. Reference-guided alignment has led to the development of new alignment and variant identification algorithms suited for these new high-throughput technologies, with an emphasis on speed and efficiency.
Alignment algorithms for reference-guided alignment seek to identify the most likely position on a reference genome from which a particular sequence read originated. To that end, these algorithms tolerate some deviation in nucleotide sequence from the reference at a mapped position, accommodating both unknown variation and error in the sequence read. However, this strategy leads to a phenomenon known as “reference bias,” in which the algorithm may force placement of sequence reads against the reference and, in so doing, misalign them. Further, sequence reads from a sample having regions of low homology with the reference genome may result in a high percentage of unaligned reads. These factors lead to low discovery and concordance rates for high-complexity variants, such as short insertions and deletions (“indels”), inversions, duplications, tandem repeats, and mobile element insertions.
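By way of illustration only, the mismatch-tolerant placement described above may be sketched as follows. The sketch assumes a deliberately simplified, ungapped comparison that counts substitutions at every offset; practical aligners typically use indexed seeding and gapped scoring instead. The function name `best_ungapped_hit`, the parameter `max_mismatches`, and the toy sequences are hypothetical and do not correspond to any particular aligner.

```python
from typing import Optional, Tuple


def best_ungapped_hit(
    reference: str, read: str, max_mismatches: int
) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) for the offset with the fewest
    substitutions, or None if even the best offset exceeds the budget.

    Hypothetical helper: ungapped comparison only, first offset wins ties.
    """
    best: Optional[Tuple[int, int]] = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Count positions where the read deviates from the reference.
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    if best is None or best[1] > max_mismatches:
        return None  # read is reported as unaligned
    return best


# Toy data (assumed for illustration): the read differs from the
# reference at offset 4 by a single substitution.
REFERENCE = "ACGTTGCAACGGTCA"
READ = "TGCTAC"

placed = best_ungapped_hit(REFERENCE, READ, max_mismatches=2)
unplaced = best_ungapped_hit(REFERENCE, READ, max_mismatches=0)
```

With a budget of two mismatches the read is placed at offset 4 with one mismatch, whereas a budget of zero leaves it unaligned. Because the tolerance absorbs any deviation within the budget, a read carrying a true variant is scored the same way as a read carrying a sequencing error, which is one way to see how forced placement against a linear reference can arise.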
Accounting for variations within the reference itself can solve these issues to some extent. For example, sequence reads may be aligned against a reference sequence construct that accounts for one or more variants, which may be referred to as a graph reference sequence construct, or simply a graph reference. In contrast to a linear reference sequence, a graph reference can incorporate myriad types of information about genomic variations, minimizing the effects of reference bias. In a graph reference, each variation may be represented by a respective path through the graph underlying the graph reference. Thus, known variations can be reliably accounted for and identified by aligning reads containing a known variation to a sequence path that includes that variation. This aids the accurate placement of sequence reads, and further allows for variations in a sample to be identified simply by noting the primary path on which the majority of sequence reads lie. Further, this improved alignment allows for more unknown variations to be discovered than by traditional means.
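As a minimal sketch of the path-based representation described above, a graph reference may be modeled as a directed acyclic graph whose nodes carry sequence fragments, with each variant contributing an alternative branch. The node layout, the names `NODES`, `EDGES`, `enumerate_paths`, `spell`, and `paths_containing`, and the exact-substring matching are assumptions made for illustration; they are not a description of any specific graph aligner.

```python
from typing import Dict, List

# Hypothetical toy graph reference with one single-nucleotide variant:
# node 1 carries the reference allele "A", node 2 the alternate allele "G".
NODES: Dict[int, str] = {0: "ACGT", 1: "A", 2: "G", 3: "TTCA"}
EDGES: Dict[int, List[int]] = {0: [1, 2], 1: [3], 2: [3], 3: []}


def enumerate_paths(edges: Dict[int, List[int]], start: int, end: int) -> List[List[int]]:
    """Return every node path from start to end in a DAG (depth-first)."""
    if start == end:
        return [[start]]
    paths: List[List[int]] = []
    for successor in edges[start]:
        for tail in enumerate_paths(edges, successor, end):
            paths.append([start] + tail)
    return paths


def spell(nodes: Dict[int, str], path: List[int]) -> str:
    """Concatenate the node sequences along a path."""
    return "".join(nodes[node] for node in path)


def paths_containing(read: str, nodes: Dict[int, str], edges: Dict[int, List[int]],
                     start: int, end: int) -> List[List[int]]:
    """Return the paths whose spelled sequence contains the read exactly."""
    return [p for p in enumerate_paths(edges, start, end) if read in spell(nodes, p)]


all_paths = enumerate_paths(EDGES, 0, 3)
variant_hits = paths_containing("GTGTT", NODES, EDGES, 0, 3)
```

In this toy graph the two paths spell `ACGTATTCA` and `ACGTGTTCA`. A read `GTGTT` carrying the alternate allele matches only the path through node 2, so the variant is identified by the path on which the read lies rather than being treated as a mismatch against a linear reference.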
However, as the number of variants in a graph reference increases, the number of paths through the graph that must be evaluated during alignment also increases. In complex or dense regions of the graph, this can lead to a combinatorial explosion, rendering sequence read alignment computationally intractable. One approach to solving this problem is to reduce the number of variants represented by a graph, but this results in a loss of sensitivity. Accordingly, there is a need for improvements in aligning sequence reads to graph references.
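The growth in the number of paths can be made concrete with a small sketch. It assumes an idealized region in which biallelic variant sites occur in series, so that each site doubles the number of distinct paths; real graphs are less regular. The names `build_bubble_chain` and `count_paths` and the node numbering are hypothetical.

```python
from typing import Dict, List


def build_bubble_chain(n_sites: int) -> Dict[int, List[int]]:
    """Build DAG edges for n_sites biallelic sites arranged in series.

    Assumed layout: site i has an anchor node 3*i and two allele nodes
    3*i + 1 and 3*i + 2, both rejoining at the next anchor 3*(i + 1).
    """
    edges: Dict[int, List[int]] = {}
    for i in range(n_sites):
        anchor, ref, alt, nxt = 3 * i, 3 * i + 1, 3 * i + 2, 3 * (i + 1)
        edges[anchor] = [ref, alt]
        edges[ref] = [nxt]
        edges[alt] = [nxt]
    edges[3 * n_sites] = []  # final anchor has no successors
    return edges


def count_paths(edges: Dict[int, List[int]], start: int, end: int) -> int:
    """Count start-to-end paths in a DAG using memoized recursion."""
    memo: Dict[int, int] = {}

    def visit(node: int) -> int:
        if node == end:
            return 1
        if node not in memo:
            memo[node] = sum(visit(successor) for successor in edges[node])
        return memo[node]

    return visit(start)


# Path counts for 1, 10, and 20 variant sites in series.
counts = {n: count_paths(build_bubble_chain(n), 0, 3 * n) for n in (1, 10, 20)}
```

Under this assumption, n sites in series yield 2^n paths, so twenty sites within a region already produce more than a million candidate paths. Counting paths is inexpensive with memoization, but an alignment procedure that scores a read against each path individually would face that full number, which is the combinatorial explosion referred to above.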