Many diseases result from inherited or random mutations in a patient's genetic sequence. Additionally, in diseases such as cancer, advanced stages of disease may manifest as new changes in the genetic sequence of diseased cells. Accordingly, there is increased interest in sequencing diseased cells, e.g., from a biopsy or freely circulating, to determine the type or stage of a disease. Thus, patients who have undergone treatment for a disease may have new biopsy samples sequenced to monitor disease recurrence and/or progression. Such monitoring allows for early intervention in the event of recurrence, and also avoids unnecessary treatment when changes are not detected.
While there are many diseases that can be typed and tracked with genetic screening, cancer mutation screening has received the most attention. In some instances, the type of cancer can be immediately identified because of one tell-tale mutation, such as BRCA1. In most instances, however, cancer typing involves discovering and analyzing several sequences from a patient. Because these samples originate from the same patient, the samples are not independent of each other, but rather, are interrelated both developmentally and structurally. Furthermore, in most instances, accurate typing of a tumor requires knowledge of three sequences: the subject's healthy sequence (as found in non-cancerous parts of the body), the sequence of a major cancerous clone, and the sequences of minor clones (which may often be metastatic).
The prospect of sequencing several samples to obtain a complete picture of a disease is less intimidating due to recent advances in genetic sequencing. Next-generation sequencing (e.g., whole-transcriptome shotgun sequencing, pyrosequencing, ion semiconductor sequencing, sequencing by synthesis) can generate millions of reads, covering an entire genome, in just a few days. To achieve this throughput, NGS sequencing uses massive parallelization on smaller nucleic acid sequences that together make up a larger body of genetic information, e.g., a chromosome or a genome. Starting from a genetic sample, the nucleic acids (e.g., DNA) are broken up, amplified, and read with extreme speed. Once the reads are produced, the reads are aligned to a reference genome, e.g., GRCh37, with a computer to produce longer, assembled sequences, known as contigs. Because the sequence data from next generation sequencers often comprises millions of shorter sequences that together represent the totality of the target sequence, aligning the reads is complex and computationally expensive. Additionally, in order to minimize sequence distortions caused by random sequencing errors (i.e., incorrect sequencing machine outputs), each portion of the probed sequence is sequenced multiple times (e.g., 2 to 100 times, or more) to minimize the influence of any random sequencing errors on the final alignments and output sequences generated.
Once all of the data, corresponding to all of the nucleic acid reads is collected, and the reads are aligned against the reference, the reads are assembled and compared to the reference, as well as to each other, to determine the relationship between the samples. The workflow for this analysis is shown pictorially in FIG. 1. Each assembled read is typically compared to the reference whereupon the variations between the assembled sequence and the reference are cataloged in a file, known as a variant file, which may be in one of several accepted formats. These variant files can then be compared to each other, in order to determine how different the genetic material varies between cells with varying stages of the disease. The variant files can also be the basis for later comparison with new samples from the patient to screen for recurrence or disease progression.
The workflow illustrated in FIG. 1 suffers from several drawbacks. Because there may be millions of genetic differences between the reference sequence and the sequences of the samples of the patient, it is often very difficult to pinpoint the key differences between the non-diseased and diseased tissues. In theory, this problem can be avoided by comparing the sequences of the diseased and non-diseased samples directly, however the use of the reference sequence for the original alignment “infects” the downstream analysis. Typically, certain portions of the patient's samples that did not align with the reference are treated as equivalent mutations in the variant files, even though they are not, in fact, equivalent. Furthermore, structural variations between the reference sequence and the patient's samples, and between the patient's samples themselves, result in variant files with differing indexes for the same (or similar) mutations. Especially in the instance of recurrence screening, the lack of a stable index makes it very difficult to identify new smaller mutations.
Typically a sequence alignment is constructed by aggregating pairwise alignments between two linear strings of sequence information, one of which is a standard reference. As an example of alignment, two strings, S1 (SEQ ID NO. 15: AGCTACGTACACTACC) and S2 (SEQ ID NO. 16: AGCTATCGTACTAGC) can be aligned against each other. S1 typically corresponds to a read and S2 correspond to a portion of the reference sequence. With respect to each other, S1 and S2 can consist of substitutions, deletions, and insertions. Typically, the terms are defined with regard to transforming string S1 into string S2: a substitution occurs when a letter or sequence in S2 is replaced by a different letter or sequence of the same length in S1, a deletion occurs when a letter or sequence in S2 is “skipped” in the corresponding section of S1, and an insertion occurs when a letter or sequence occurs in S1 between two positions that are adjacent in S2. For example, the two sequences S1 and S2 can be aligned as below. The alignment below represents thirteen matches, a deletion of length one, an insertion of length two, and one substitution:
(S1)(SEQ ID NO. 15)AGCTA−CGTACACTACC (S2)(SEQ ID NO. 16)AGCTATCGTAC−−TAGC
One of skill in the art will appreciate that there are exact and approximate algorithms for sequence alignment. Exact algorithms will find the highest scoring alignment, but can be computationally expensive. The two most well-known exact algorithms are Needleman-Wunsch (J Mol Biol, 48(3):443-453, 1970) and Smith-Waterman (J Mol Biol, 147(1):195-197, 1981; Adv. in Math. 20(3), 367-387, 1976). A further improvement to Smith-Waterman by Gotoh (J Mol Biol, 162(3), 705-708, 1982) reduces the calculation time from O(m2n) to O(mn) where m and n are the sequence sizes being compared and is more amendable to parallel processing. In the field of bioinformatics, it is Gotoh's modified algorithm that is often referred to as the Smith-Waterman algorithm. Smith-Waterman approaches are being used to align larger sequence sets against larger reference sequences as parallel computing resources become more widely and cheaply available. See, e.g., Amazon.com's cloud computing resources available at http://aws.amazon.com. All of the above journal articles are incorporated herein by reference in their entireties.
The Smith-Waterman (SW) algorithm aligns linear sequences by rewarding overlap between bases in the sequences, and penalizing gaps between the sequences. Smith-Waterman also differs from Needleman-Wunsch, in that SW does not require the shorter sequence to span the string of letters describing the longer sequence. That is, SW does not assume that one sequence is a read of the entirety of the other sequence. Furthermore, because SW is not obligated to find an alignment that stretches across the entire length of the strings, a local alignment can begin and end anywhere within the two sequences.
The SW algorithm is easily expressed for an n×m matrix H, representing the two strings of length n and m, in terms of equation (1) below:Hk0=H0l=0(for 0≤k≤n and 0≤l≤m)Hij=max{Hi-1,j-1+s(ai,bj),Hi-1,j−Win,Hi,j-1−Wdel,0}(for 1≤i≤n and 1≤j≤m)  (1)In the equations above, s(ai,bj) represents either a match bonus (when ai=bj) or a mismatch penalty (when ai≠bj), and insertions and deletions are given the penalties Win and Wdel, respectively. In most instance, the resulting matrix has many elements that are zero. This representation makes it easier to backtrace from high-to-low, right-to-left in the matrix, thus identifying the alignment.
Once the matrix has been fully populated with scores, the SW algorithm performs a backtrack to determine the alignment. Starting with the maximum value in the matrix, the algorithm will backtrack based on which of the three values (Hi-1,j-1, Hi-1,j, or Hi,j-1) was used to compute the final maximum value for each cell. The backtracking stops when a zero is reached. See, e.g., FIG. 4 part (B), which does not represent the prior art, but illustrates the concept of a backtrack, and the corresponding local alignment when the backtrack is read. Accordingly, the “best alignment,” as determined by the algorithm, may contain more than the minimum possible number of insertions and deletions, but will contain far less than the maximum possible number of substitutions.
When applied as SW or SW-Gotoh, the techniques use a dynamic programming algorithm to perform local sequence alignment of the two strings, S and A, of sizes m and n, respectively. This dynamic programming technique employs tables or matrices to preserve match scores and avoid recomputation for successive cells. Each element of the string can be indexed with respect to a letter of the sequence, that is, if S is the string ATCGAA, S[1]=A, S[4]=G, etc. Instead of representing the optimum alignment as Hi,j (above), the optimum alignment can be represented as B[j,k] in equation (2) below:B[j,k]=max(p[j,k],i[j,k],d[j,k],0)(for 0<j≤m,0<k≤n)  (2)The arguments of the maximum function, B[j,k], are outlined in equations (3)-(5) below, wherein MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, DELETION_PENALTY, and OPENING_PENALTY are all constants, and all negative except for MATCH_BONUS. The match argument, p[j,k], is given by equation (3), below:
                                                                        p                ⁡                                  [                                      j                    ,                    k                                    ]                                            =                            ⁢                                                max                  ⁡                                      (                                                                  p                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              i                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              d                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                                      )                                                  +                                                                                                      ⁢                              MISMATCH_PENALTY                ,                                                      if                    ⁢                                                                                  ⁢                                          S                      ⁡                                              [                        j                        ]                                                                              ≠                                      A                    ⁡                                          [                      k                      ]                                                                                                                                              =                            ⁢                                                max                  ⁡                                      (                                                                  p                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              i                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                    ,                                              d                        ⁡                                                  [                                                                                    j                              -                              1                                                        ,                                                          k                              -                              1                                                                                ]                                                                                      )                                                  +                                                                                                      ⁢                              MATCH_BONUS                ,                                                      if                    ⁢                                                                                  ⁢                                          S                      ⁡                                              [                        j                        ]                                                                              =                                      A                    ⁡                                          [                      k                      ]                                                                                                                              (        3        )            the insertion argument i[j,k], is given by equation (4), below:i[j,k]=max(p[j−1,k]+OPENING_PENALTY,i[j−1,k]d[j−1,k]+OPENING_PENALTY)+INSERTION_PENALTY  (4)and the deletion argument d[j,k], is given by equation (5), below:d[j,k]=max(p[j,k−1]+OPENING_PENALTY,i[j,k−1]+OPENING_PENALTY,d[j,k−1])+DELETION_PENALTY  (5)For all three arguments, the [0,0] element is set to zero to assure that the backtrack goes to completion, i.e., p[0,0]=i[0,0]=d[0,0]=0.
The scoring parameters are somewhat arbitrary, and can be adjusted to achieve the behavior of the computations. One example of the scoring parameter settings (Huang, Chapter 3: Bio-Sequence Comparison and Alignment, ser. Curr Top Comp Mol Biol. Cambridge, Mass.: The MIT Press, 2002) for DNA would be:
MATCH_BONUS: 10
MISMATCH_PENALTY: −20
INSERTION_PENALTY: −40
OPENING_PENALTY: −10
DELETION_PENALTY: −5
The relationship between the gap penalties (INSERTION_PENALTY, OPENING_PENALTY) above help limit the number of gap openings, i.e., favor grouping gaps together, by setting the gap insertion penalty higher than the gap opening cost. Of course, alternative relationships between MISMATCH_PENALTY, MATCH_BONUS, INSERTION_PENALTY, OPENING_PENALTY and DELETION_PENALTY are possible.
While the alignment methods, described above, have been useful for assembling reads produced with next-generation sequencing techniques, they are complex and time-consuming. Additionally, these techniques are ill-suited for identifying the important nuances between diseased cells of varying disease states because the uncertainty due to aligning the reads to a common reference often drowns out small changes in the genome.