Genetics has evolved from an analytical science to an information science. Whereas scientists previously struggled with how to extract and identify nucleic acids, such techniques are now trivial. Next-generation sequencing (e.g., whole-transcriptome shotgun sequencing, pyrosequencing, ion semiconductor sequencing, sequencing by synthesis) can generate millions of reads, covering an entire genome, in just a few days. To achieve this throughput, NGS sequencing uses massive parallelization on smaller nucleic acid sequences that together make up a larger body of genetic information, e.g., a chromosome or a genome. Starting from a genetic sample, the nucleic acids (e.g., DNA) are broken up, amplified, and read with extreme speed. In light of these capabilities, scientists now struggle with how to (inexpensively) align the reads to identify loci in the sequence that indicate a disease or a risk of a disease.
State-of-the-art alignment methods use massive computing power to align overlapping reads to a reference to produce a sequence that can be probed for important genetic or structural information (e.g., biomarkers for disease). Ultimately, the goal of sequence alignment is to combine the set of nucleic acid reads produced by the sequencer to achieve a longer read (i.e., a contig) or even the entire genome of the subject based upon a genetic sample from that subject. Because the sequence data from next generation sequencers often comprises millions of shorter sequences that together represent the totality of the target sequence, aligning the reads is complex and computationally expensive. Additionally, in order to minimize sequence distortions caused by random sequencing errors (i.e., incorrect sequencing machine outputs), each portion of the probed sequence is sequenced multiple times (e.g., 2 to 100 times, or more) to minimize the influence of any random sequencing errors on the final alignments and output sequences generated. Finally, once all of the data corresponding to all of the nucleic acid reads is collected, the reads are aligned against a single reference sequence, e.g., GRCh37, in order to determine all (or some part of) the subject's sequence. In many instances, the individual reads are not actually displayed, but rather an aligned sequence is assembled into a sequence, and the sequence is provided as a data file.
Typically a sequence alignment is constructed by aggregating pairwise alignments between two linear strings of sequence information. As an example of alignment, two strings, S1 (SEQ ID NO. 12: AGCTACGTACACTACC) and S2 (SEQ ID NO. 13: AGCTATCGTACTAGC) can be aligned against each other. S1 typically corresponds to a read and S2 correspond to a portion of the reference sequence. With respect to each other, S1 and S2 can consist of substitutions, deletions, and insertions. Typically, the terms are defined with regard to transforming string S1 into string S2: a substitution occurs when a letter or sequence in S2 is replaced by a different letter or sequence of the same length in S1, a deletion occurs when a letter or sequence in S2 is “skipped” in the corresponding section of S1, and an insertion occurs when a letter or sequence occurs in S1 between two positions that are adjacent in S2. For example, the two sequences S1 and S2 can be aligned as below. The alignment below represents thirteen matches, a deletion of length one, an insertion of length two, and one substitution:
(SEQ ID NO. 12)(S1) AGCTA-CGTACACTACC (SEQ ID NO. 13)(S2) AGCTATCGTAC--TAGC
One of skill in the art will appreciate that there are exact and approximate algorithms for sequence alignment. Exact algorithms will find the highest scoring alignment, but can be computationally expensive. The two most well-known exact algorithms are Needleman-Wunsch (J Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol, 147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3), 705-708, 1982) reduces the calculation time from O(m2n) to O(mn) where m and n are the sequence sizes being compared and is more amendable to parallel processing. In the field of bioinformatics, it is Gotoh's modified algorithm that is often referred to as the Smith-Waterman algorithm. Smith-Waterman approaches are being used to align larger sequence sets against larger reference sequences as parallel computing resources become more widely and cheaply available. See, e.g., Amazon.com's cloud computing resources available at http://aws.amazon.com. All of the above journal articles are incorporated herein by reference in their entireties.
The Smith-Waterman (SW) algorithm aligns linear sequences by rewarding overlap between bases in the sequences, and penalizing gaps between the sequences. Smith-Waterman also differs from Needleman-Wunsch, in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence. Furthermore, because SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.
The SW algorithm is easily expressed for an n×m matrix H, representing the two strings of length n and m, in terms of equation (1) below:Hk0=H0l=0 (for 0≦k≦n and 0≦l≦m)Hij=max{Hi−1,j−1+s(ai,bj),Hi−1,j−Win,Hi,j−1−Wdel,0} (for 1≦i≦n and 1≦j≦m)  (1)In the equations above, s(ai,bj) represents either a match bonus (when ai=bj) or a mismatch penalty (when ai≠bj), and insertions and deletions are given the penalties Win and Wdel, respectively. In most instance, the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high-to-low, right-to-left in the matrix, thus identifying the alignment.
Once the matrix has been fully populated with scores, the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm will backtrack based on which of the three values (Hi−1,j−1, Hi−1,j, or Hi,j−1) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached. See, e.g., FIG. 3B, which does not represent the prior art, but illustrates the concept of a backtrack, and the corresponding local alignment when the backtrack is read. Accordingly, the “best alignment,” as determined by the algorithm, may contain more than the minimum possible number of insertions and deletions, but will contain far less than the maximum possible number of substitutions.
When applied as SW or SW-Gotoh, the techniques use a dynamic programming algorithm to perform local sequence alignment of the two strings, S and A, of sizes m and n, respectively. This dynamic programming technique employs tables or matrices to preserve match scores and avoid recomputation for successive cells. Each element of the string can be indexed with respect to a letter of the sequence, that is, if S is the string ATCGAA, S[1]=A, S[4]=G, etc. Instead of representing the optimum alignment as Hi,j (above), the optimum alignment can be represented as B[j,k] in equation (2) below:B[j,k]=max(p[j,k],i[j,k],d[j,k],0) (for 0<j≦m,0<k≦n)  (2)The arguments of the maximum function, B[j,k], are outlined in equations (3)-(5) below, wherein MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, DELETION_PENALTY, and OPENING_PENALTY are all constants, and all negative except for MATCH_BONUS. The match argument, p[j,k], is given by equation (3), below:p[j,k]=max(p[j−1,k−1],i[j−1,k−1],d[j−1,k−1])+MISMATCH_PENALTY, if S[j]≠A[k]=max(p[j−1,k−1],i[j−1,k−1],d[j−1,k−1])+MATCH_BONUS, if S[j]=A[k]  (3)the insertion argument i[j,k], is given by equation (4), below:i[j,k]=max(p[j−1,k]+OPENING_PENALTY,i[j−1,k],d[j−1,k]+OPENING_PENALTY)+INSERTION_PENALTY  (4)and the deletion argument d[j,k], is given by equation (5), below:d[j,k]=max(p[j,k−1]+OPENING_PENALTY,i[j,k−1]+OPENING_PENALTY,d[j,k−1])+DELETION_PENALTY  (5)For all three arguments, the [0,0] element is set to zero to assure that the backtrack goes to completion, i.e., p[0,0]=i[0,0]=d[0,0]=0.
The scoring parameters are somewhat arbitrary, and can be adjusted to achieve the behavior of the computations. One example of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge, Mass.: The MIT Press, 2002) for DNA would be:
MATCH_BONUS: 10
MISMATCH_PENALTY: −20
INSERTION_PENALTY: −40
OPENING_PENALTY: −10
DELETION_PENALTY: −5
The relationship between the gap penalties (INSERTION_PENALTY, OPENING_PENALTY) above help limit the number of gap openings, i.e., favor grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost. Of course, alternative relationships between MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, OPENING_PENALTY and DELETION_PENALTY are possible.
Once the alignment is complete, the aligned sequences can be assembled to produce a sequence that can be compared to a reference (i.e., a genetic standard) to identify variants. The variants can provide insight regarding diseases, stages of disease, recurrence and the like. In the case of amino acid alignments, the assembled amino acid sequences can be compared to a standard to determine evolutionary information about the protein, or functional information about the protein.
A limitation of state-of-the-art alignment methods, e.g., Smith-Waterman, is that the alignment algorithms have difficulty aligning smaller reads, e.g., between 20 and 1000 base pairs (bp) in the presence of structural variations that are larger than or of similar size to the read. Structural variations are typically large sequence deviations, e.g., 100 bp or more, e.g., typically between 1 kilobase and 3 megabases in length. Structural variants can include duplications, inversions, translocations or genomic imbalances (insertions and deletions), and by definition they span more than several base pairs. Commonly known structural variations include copy-number variants whereby an abnormal number of copies of a specific genomic area are duplicated in a region of a chromosome. Such variations have been linked to cancer as well as autoimmune and neurological disorders.
When shorter reads, representing a portion of a sequence containing a structural variation, are aligned to a reference sequence using state-of-the-art techniques, the reads are often discarded as errant because the alignment scores of the reads against the reference are below a threshold for meaningful reads, since the structural variation isn't present in the reference. In other instances, the reads align, but the specific sequence of the read is discounted because the alignment score is low enough (and the aligned sequence is thus presumed to be “noisy” enough) that it is unclear whether a particular base in the sequence is a result of a mutation in the structural variation or just a misread of the “normal” structural variation.
An additional problem is presented when a mutation or variant, e.g., a small indel or polymorphism, is located in the read in close proximity to a structural variation. The difficulty of aligning the structural variation may cause reads containing both the mutation and the structural variation to be discarded as “unalignable.” The more such reads are discarded, the more likely it becomes that the mutation gets missed entirely. As a result, meaningful rare variants or mutations (used interchangeably here) close to structural variations can be rejected due to the low alignment score associated with the structural variation. These overlooked variants may have (undiscovered) roles in regulating disease.
Because failing to detect rare variants in proximity to structural variations significantly limits the quality of genetic analysis, there is a need for sequence alignment techniques that can account for structural variations, resulting in better alignment of rare mutations or variants.