Methods to sequence or identify significant fractions of the human genome and genetic variations within those segments are becoming commonplace. However, unraveling the functional meaning of sequence differences in individuals remains a major impediment to understanding the health implications of the variations found in every human being. Sequencing is an important first step that allows geneticists and physicians to develop a full functional understanding of sequence data.
Next-generation sequencing (NGS) technologies include instruments capable of sequencing more than 10^14 kilobase-pairs (kbp) of DNA per instrument run. Sequencing typically produces a large number of independent reads, each representing between 10 and 1000 bases of the nucleic acid. Nucleic acids are generally sequenced redundantly for confidence, with replicates per unit area being referred to as the coverage (i.e., “10× coverage” or “100× coverage”). Thus, a multi-gene genetic screening can produce millions of reads.
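The relationship between coverage and read count can be illustrated with a short calculation. The sketch below assumes the common working definition of mean coverage as total sequenced bases divided by the length of the targeted region; the function names and the example numbers are hypothetical and are not taken from any particular instrument or screening panel.

```python
import math


def mean_coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average depth: total sequenced bases divided by the target size in bases."""
    if target_length <= 0 or read_length <= 0:
        raise ValueError("lengths must be positive")
    return num_reads * read_length / target_length


def reads_needed(coverage: float, read_length: int, target_length: int) -> int:
    """Smallest number of reads that reaches the requested mean coverage."""
    if target_length <= 0 or read_length <= 0:
        raise ValueError("lengths must be positive")
    return math.ceil(coverage * target_length / read_length)


if __name__ == "__main__":
    # Hypothetical panel: 1,000,000 targeted bases, 100-base reads.
    print(mean_coverage(1_000_000, 100, 1_000_000))
    print(reads_needed(100, 100, 1_000_000))
```

Under these assumed numbers, 100× coverage of a one-megabase target with 100-base reads already requires one million reads, which is consistent with a multi-gene screening producing millions of reads.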
When a genetic screening is done for a person, the resulting reads can be compared to a reference, such as a published human genome. This comparison generally involves either assembling the reads into a contig and aligning the contig to the reference or aligning each individual read to the reference.
Assembling reads into a contig and aligning the contig to a reference produces unsatisfactory results due to the algorithms used for contig assembly. Generally, algorithms for contig assembly assess a read using certain quality criteria. Reads that meet the threshold set by those criteria are treated as legitimate and used to assemble the contigs, while reads that fall short of it are excluded from the contig assembly process. Depending on the threshold level of the algorithm, as many as 10% of legitimate sequence reads are excluded from further analysis.
Additionally, aligning the resulting contig to a reference is error prone due to a tradeoff, inherent in doing an alignment, between whether mismatches or gaps (insertions/deletions, or “indels”) are favored. When one sequence is aligned to another, if one sequence does not match the other perfectly, either gaps must be introduced into the sequences until all the bases match or mismatched bases must appear in the alignment. Existing approaches to alignment involve algorithms with good mismatch sensitivity at the expense of indel sensitivity or good indel sensitivity at the expense of mismatch sensitivity. For example, if an alignment is to detect mismatches with sufficient fidelity, then it is likely that some indels will be missed.
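The tradeoff between mismatches and gaps can be seen in a minimal global alignment sketch. The code below uses a textbook Needleman-Wunsch recurrence with a linear gap penalty; it is offered only as an illustration of the tradeoff, not as the algorithm used by any particular aligner. The function names, the scoring values, and the two toy sequences are all assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b), with '-' marking a gap.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def count_events(top, bottom):
    """Count (mismatched columns, gapped columns) in an alignment."""
    gaps = sum(1 for x, y in zip(top, bottom) if x == "-" or y == "-")
    mismatches = sum(
        1 for x, y in zip(top, bottom) if x != "-" and y != "-" and x != y
    )
    return mismatches, gaps


if __name__ == "__main__":
    seq_a, seq_b = "ACGTTT", "CGTTTA"  # toy sequences, shifted by one base
    for gap_penalty in (-1, -5):
        s, top, bottom = global_align(seq_a, seq_b, gap=gap_penalty)
        print(gap_penalty, s, top, bottom, count_events(top, bottom))
```

With the cheap gap penalty, the same pair of sequences is explained by two gaps and no mismatches; with the expensive gap penalty, it is explained by four mismatches and no gaps. A single scoring scheme therefore commits the alignment to one interpretation or the other.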
Even where assembly provides accurate detection of variants (e.g., substitutions or indels), these methods are often computationally intractable for high throughput data analysis because each read must be compared to every other read in a dataset to determine sequence overlap and build contigs.
Another sequence assembly technique involves aligning each individual read to a reference. This assembly technique is problematic because very short reads (e.g., 50 bp or less) may align well in a number of places on a very long reference (e.g., 5 million bp). With a number of equally good positions to align to, aligning a read to a reference offers little positional accuracy. Also, particularly with very short reads, long indels can be difficult or impossible to detect. Due in part to the tradeoff between substitution sensitivity and indel sensitivity, certain mutation patterns are particularly difficult to detect. Indels near the ends of reads are sometimes incorrectly interpreted as short strings of mismatched bases. Substitutions near indels are often interpreted incorrectly, as well.
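The positional ambiguity of short reads can be illustrated with a simple exact-match search. This is a sketch only: real aligners tolerate mismatches and use indexed references, whereas the hypothetical `exact_hits` function below just scans a toy reference string that was constructed for this example to contain a repeated segment.

```python
def exact_hits(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where read matches reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Resume one base further on so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return hits


if __name__ == "__main__":
    # Toy reference (an assumption for illustration) with "GATTACA" occurring twice.
    reference = "GATTACAGGCCGATTACATT"
    print(exact_hits(reference, "GATTACA"))    # short read: two equally good positions
    print(exact_hits(reference, "GATTACAGG"))  # longer read: a single position
```

In this toy case the seven-base read has two equally good placements, while extending it by two bases resolves the ambiguity.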
Existing methods of read assembly do not offer the positional accuracy of a contig-based alignment while including detailed information from each read. Further, due to limitations in alignment algorithms, existing methods do a poor job of correctly interpreting certain mutations (e.g., indels near the ends of reads, substitutions near indels).