Advances in sequencing technology make it possible to sequence the genome of an individual in a week or less. Typically, a genetic sample is isolated, broken into pieces, amplified, and then sequenced on a high-throughput system such as Illumina™ sequencing (Illumina, Inc., San Diego, Calif.). This process produces a huge number of sequence reads that must subsequently be assembled to create the sequence. The sequence provides very little useful information by itself, however, because the diagnostic and predictive value of the information depends upon the relative position of the sequence in the genome. That is, it is only possible to determine the presence of, e.g., a marker for a disease, when the relative position of the sequence within the genome is known. Additionally, once the sequence at a particular position is known, higher-level information, such as phenotypes, allelic identity, genotypes, etc., can be determined.
Because of the size and variability within each genome, locating where each read belongs is a substantial hurdle to genotyping in next generation sequencing (N.G.S.) The problem of determining a genotype based on a set of reads is naturally framed in evidential terms. Each read provides evidence of some genotype or set of genotypes, and combining the evidence from all the reads lets us conclude something about the subject's genotype. Both parts of this process—interpreting what a single read evinces about the subject's genome and aggregating the evidence of many reads—present problems, however. Furthermore, when dealing with actual genetic data, the sheer number of reads, and the presence of larger structural variations within the genetic data create a challenge analogous to a jigsaw puzzle with millions of pieces and little variations between pieces.
State-of-the-art alignment methods use massive computing power to align overlapping reads to a reference to produce an assembled sequence that can be probed for important genetic or structural information (e.g., biomarkers for disease). Ultimately, the goal of sequence alignment is to combine the set of nucleic acid reads produced by the sequencer to achieve a longer read (i.e., a contig) or even the entire genome of the subject based upon a genetic sample from that subject. Because the sequence data from next generation sequencers often comprises millions of shorter sequences that together represent the totality of the target sequence, aligning the reads is complex and computationally expensive. Additionally, in order to minimize sequence distortions caused by random sequencing errors (i.e., incorrect sequencing machine outputs), each portion of the probed sequence is sequenced multiple times (e.g., 2 to 100 times, or more) to minimize the influence of any random sequencing errors on the final alignments and output sequences generated. Finally, once all of the data corresponding to all of the nucleic acid reads is collected, the reads are aligned against a single reference sequence, e.g., GRCh37, in order to determine all (or some part of) the subject's sequence. In many instances, the individual reads are not actually displayed, but rather an aligned sequence is assembled into a sampled sequence, and the sampled sequence is provided as a data file.
Typically a sequence alignment is constructed by aggregating pairwise alignments between two linear strings of sequence information. As an example of alignment, two strings, S1 (SEQ ID NO. 20: AGCTACGTACACTACC) and S2 (SEQ ID NO. 21: AGCTATCGTACTAGC) can be aligned against each other. S1 typically corresponds to a read and S2 correspond to a portion of the reference sequence. With respect to each other, S1 and S2 can include substitutions, deletions, and insertions. Typically, the terms are defined with regard to transforming string S1 into string S2: a substitution occurs when a letter or sequence in S2 is replaced by a different letter or sequence of the same length in S1, a deletion occurs when a letter or sequence in S2 is “skipped” in the corresponding section of S1, and an insertion occurs when a letter or sequence occurs in S1 between two positions that are adjacent in S2. For example, the two sequences S1 and S2 can be aligned as below. The alignment below represents thirteen matches, a deletion of length one, an insertion of length two, and one substitution:
(SEQ ID NO. 20)(S1) AGCTA-CGTACACTACC (SEQ ID NO. 21)(S2) AGCTATCGTAC--TAGC
One of skill in the art will appreciate that there are exact and approximate algorithms for sequence alignment. Exact algorithms will find the highest scoring alignment, but can be computationally expensive. The two most well-known exact algorithms are Needleman-Wunsch (J Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol, 147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3), 705-708, 1982) reduces the calculation time from O(m2n) to O(mn) where m and n are the sequence sizes being compared and is more amendable to parallel processing. In the field of bioinformatics, it is Gotoh's modified algorithm that is often referred to as the Smith-Waterman algorithm. Smith-Waterman approaches are being used to align larger sequence sets against larger reference sequences as parallel computing resources become more widely and cheaply available. See, e.g., Amazon.com's cloud computing resources available at aws.amazon.com. All of the above journal articles are incorporated herein by reference in their entireties.
The Smith-Waterman (SW) algorithm aligns linear sequences by rewarding overlap between bases in the sequences, and penalizing gaps between the sequences. Smith-Waterman also differs from Needleman-Wunsch, in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence. Furthermore, because SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.
The SW algorithm is easily expressed for an n×m matrix H, representing the two strings of length n and m, in terms of equation (1) below:Hk0=H0l=0 (for 0≤k≤n and 0≤l≤m)Hij=max{Hi-1,j-1+s(ai,bj),Hi-1,j−Win,Hi,j-1−Wdel,0} (for 1≤i≤n and 1≤j≤m)  (1)In the equations above, s(ai,bj) represents either a match bonus (when ai=bj) or a mismatch penalty (when ai≠bj), and insertions and deletions are given the penalties Win and Wdel, respectively. In most instances, the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high-to-low, right-to-left in the matrix, thus identifying the alignment.
Once the matrix has been fully populated with scores, the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm will backtrack based on which of the three values (Hi-1,j-1, Hi-1,j, or Hi,j-1) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached. See, e.g., FIG. 3 part (B), which does not represent the prior art, but illustrates the concept of a backtrack, and the corresponding local alignment when the backtrack is read. Accordingly, the “best alignment,” as determined by the algorithm, may contain more than the minimum possible number of insertions and deletions, but will contain far less than the maximum possible number of substitutions.
When applied as SW or SW-Gotoh, the techniques use a dynamic programming algorithm to perform local sequence alignment of the two strings, S and A, of sizes m and n, respectively. This dynamic programming technique employs tables or matrices to preserve match scores and avoid recomputation for successive cells. Each element of the string can be indexed with respect to a letter of the sequence, that is, if S is the string ATCGAA, S[1]=A, S[4]=G, etc. Instead of representing the optimum alignment as Hi,j (above), the optimum alignment can be represented as B[j,k] in equation (2) below:B[j,k]=max(p[j,k],i[j,k],d[j,k],0) (for 0<j≤m,0<k≤n)  (2)The arguments of the maximum function, B[j,k], are outlined in equations (3)-(5) below, wherein MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, DELETION_PENALTY, and OPENING_PENALTY are all constants, and all negative except for MATCH_BONUS. The match argument, p[j,k], is given by equation (3), below:
                                                                        p                ⁡                                  [                                      j                    ,                    k                                    ]                                            =                            ⁢                                                max                  ⁡                                      (                                                                  p                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              i                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              d                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                                      )                                                  +                                                                                                      ⁢                              MISMATCH_PENALTY                ,                                                      if                    ⁢                                                                                  ⁢                                          S                      ⁡                                              [                        j                        ]                                                                              ≠                                      A                    ⁡                                          [                      k                      ]                                                                                                                                              =                            ⁢                                                max                  ⁡                                      (                                                                  p                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              i                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              d                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                                      )                                                  +                                                                                                      ⁢                              MATCH_BONUS                ,                                                      if                    ⁢                                                                                  ⁢                                          S                      ⁡                                              [                        j                        ]                                                                              =                                      A                    ⁡                                          [                      k                      ]                                                                                                                              (        3        )            the insertion argument i[j,k], is given by equation (4), below:i[j,k]=max(p[j−1,k]+OPENING_PENALTY,i[j−1,k],d[j−1,k]+OPENING_PENALTY)+INSERTION_PENALTY  (4)and the deletion argument d[j,k], is given by equation (5), below:d[j,k]=max(p[j,k−1]+OPENING_PENALTY,i[j,k−1]+OPENING_PENALTY,d[j,k−1])+DELETION_PENALTY  (5)For all three arguments, the [0,0] element is set to zero to assure that the backtrack goes to completion, i.e., p[0,0]=i[0,0]=d[0,0]=0.
The scoring parameters are somewhat arbitrary, and can be adjusted to achieve the behavior of the computations. One example of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge, Mass.: The MIT Press, 2002) for DNA would be:
MATCH_BONUS: 10
MISMATCH_PENALTY: −20
INSERTION_PENALTY: −40
OPENING_PENALTY: −10
DELETION_PENALTY: −5
The relationship between the gap penalties (INSERTION_PENALTY, OPENING_PENALTY) above help limit the number of gap openings, i.e., favor grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost. Of course, alternative relationships between MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, OPENING_PENALTY and DELETION_PENALTY are possible.
Once the alignment is complete, the aligned sequences can be assembled to produce a sequence that can be compared to a reference (i.e., a genetic standard) to identify variants. Only after the assembled reads are compared to the reference, is it possible to make determinations of the genotype of the sample. After comparing the assembled sequence to the reference sequence, the differences are catalogued and then compared to reference mutation files such as variant call format (VCF) files or a single nucleotide polymorphism database (dbSNP). This standard method of genotyping is time consuming, however, and often requires massive duplication of read coverage to assure that sequencing/amplification errors are not mistaken for true mutations.
The entire genotyping process is further complicated by the presence of structural variations in the genetic sample, i.e., longer (250 bp or more, e.g, 1000 bp or more) sequences that are inserted into or deleted from the “regular” genome. There may also be duplications, inversions, or translocations. In many instances, a sample is “called” for one genotype rather because of the presence of such structural variations, mutations, inversions, translocations, etc. In other instances, mutations within the structural variation result in a sample being “called” for yet a different genotype. In other instances, the sequences of interest have moved (relative to the “normal” position) because of their proximity to a structural variation.
Incorporating structural variations into the state-of-the art genotyping methods is very difficult, however, because there are many known variants from which any particular read might originate. For each N structural variations, there are approximately 2N different references that have to be compared to the assembled sequence in order to genotype the sequence. In other words, to accommodate one structural variant, the assembled sequence must be compared to at least 2 separate reference sequences to genotype the sample, but accommodating 20 possible structural variations requires comparing the sequence to roughly one million different references. Even using state-of-the-art methods with parallel computing, this is a very expensive proposition. Furthermore, this combinatorial explosion makes it impossible to feasibly genotype longer sequences that include hundreds of possible structural variations. Thus, approximations are made in the name of reducing computational time. In other words, current methods do not fairly represent the structural variations that are commonly found in many genomes.