Analysis of the subtleties of the voluminous amounts of genetic information will continue to have profound effects on the personalization of medicine. For example, this advanced genetic knowledge of patients has had, and will continue to have, a broad impact on the ability to diagnose diseases, identify predispositions to diseases or other genetically impacted disorders, and identify reactivity to given drugs or other treatments, whether adverse or beneficial.
Before one can begin to interpret genetic data from a patient, one must first obtain the genetic information from that patient. Technologies have been developed that allow for broad screening of large swaths of a patient's genetic code by identifying key signposts in that code and using this fragmented data as a general interpretation mechanism, e.g., using libraries of known genetic variations, such as SNPs or other polymorphisms, and correlating the profile of such variations against profiles that have a suspected association with a given disease or other phenotype.
Rather than rely upon disparate pieces of information from the genetic code, it would be of far more value to be able to rely upon the entire text of a patient's genetic code in making any interpretations from that code. To use the analogy of a novel, one gains a substantially deeper understanding of all the elements of the novel, e.g., plot, characters, setting, etc., by reading the entire text, rather than by picking out individual words from disparate pages or chapters of the novel.
Technologies related to analysis of biological information have advanced rapidly over the past decade. In particular, with the improved ability to characterize genetic sequence information, identify protein structure, elucidate biological pathways, and manipulate any or all of these, has come the need for improved abilities to derive and process this information.
In the field of genetic analysis, for example, ever faster methods of obtaining nucleic acid sequence information require different, and oftentimes better, methods and processes for processing the raw genetic information that they generate. This progress has been evidenced in the improvements applied to separations-based Sanger sequencing, where improvements in throughput and read-length have come not only through multiplexing of multi-capillary systems, but also from improvements in base calling processes that are applied to the data derived from the capillary systems. With shifts in the underlying technology surrounding genetic analysis comes a necessity for a shift in the methods and processes for processing the information from these systems. The present invention provides solutions to these and other problems.
In a resequencing study, a researcher gains insight for a biological question by comparing how the sequence of a sample differs from a reference genome. The positions in the genome where the differences occur are typically not known, and may be detected by randomly sequencing small fragments of the sample and comparing them to the reference, a method known as shotgun resequencing. The locations of the randomly sequenced fragments are not known, so an initial step in resequencing is to align the reads to their homologous locations on the reference genome.
Since the sample genome has mutations and reads have errors, homology is typically defined as the most similar sequence in the reference to the read and formulated as the highest scoring local alignment. Such an alignment may be found directly with Smith-Waterman alignment (Smith, et al. (1981) J. Mol. Biol. 147:194-197); however, this is computationally prohibitive, so heuristics are necessary. A successful heuristic is sensitive to genomic variation and sequencing error; as sequencing methods have changed, methods for aligning reads have evolved in step. Initial resequencing projects, such as the International HapMap Project (Nature 409:31-46 (2001)), used Sanger sequencing (Sanger, et al. (1977) Proc. Natl. Acad. Sci. 74:5463-5467). The instrument frequently used for Sanger sequencing, the ABI 3730, produces reads roughly 1000 bases long with an accuracy over 99.5% that are rapidly aligned using MEGABLAST [25], cross_match [11], and/or BLAT [14]. Each of these methods employs a Seed-and-Extend heuristic search, in which short exact matches (words) of 8-11 bases are found using a hash table of the genome, and a detailed alignment is performed around regions that contain a sufficiently high number of exact matches. With reads of several hundred bases that are highly accurate, this is sufficient for aligning reads from sample genomes within the 0-1% range of common human genetic variation.
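The Seed-and-Extend idea described above can be illustrated with a minimal Python sketch. It is not the implementation used by MEGABLAST, cross_match, or BLAT; the function names, the scoring values (match=1, mismatch=-1, gap=-2), the `min_seeds` threshold, and the `pad` window are all assumptions chosen for illustration. The sketch hashes every k-base word of the reference, counts exact word hits per diagonal, and runs a Smith-Waterman scoring pass only on windows around diagonals with enough hits.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Hash table mapping each k-base word to its start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a against b (linear gap penalty, assumed values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def seed_and_extend(read, reference, index, k, min_seeds=2, pad=5):
    """Return (score, approximate reference start) of the best candidate, or None."""
    # Seed: count exact word matches per diagonal (reference pos minus read pos).
    diagonals = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            diagonals[pos - offset] += 1
    best = None
    # Extend: detailed alignment only around diagonals with enough seed hits.
    for diag, hits in diagonals.items():
        if hits < min_seeds:
            continue
        start = max(0, diag - pad)
        end = min(len(reference), diag + len(read) + pad)
        score = smith_waterman_score(read, reference[start:end])
        if best is None or score > best[0]:
            best = (score, max(0, diag))
    return best
```

For a read copied from a reference with a single substituted base, the words that overlap the substitution find no hit, while the remaining words all vote for the same diagonal, so only one small window is scored in detail rather than the whole reference.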
The methods developed to align reads produced by Sanger sequencing do not perform well on reads produced by Second-Generation massively parallel sequencing platforms such as the Illumina HiSeq (San Diego, Calif., USA) and Life SOLiD (Foster City, Calif., USA). Both platforms read bases four orders of magnitude faster than the state of the art in Sanger sequencing; however, the reads have lower accuracy and shorter length: 100 bases in the Illumina HiSeq and 75 bases in SOLiD 4. These platforms use the technique of amplified and cycled (AC) sequencing, where millions of short templates are amplified while kept spatially separate, and then sequenced in controlled cycles of base interrogating reactions and imaging (Margulies, et al. (2005) Nature 437:376-380; Bentley, et al. (2008) Nature 456:53-59).
The initial methods developed for aligning AC sequenced reads, such as Eland (Illumina), SOAP (Li, et al. (2009) Bioinformatics 25:1966-1967), and MAQ (Li, et al. (2008) Genome Research 18:1851-1858), were based on hashing methods but achieved much faster performance than MEGABLAST or BLAT by bounding the number of differences allowed in a match between a read and the genome. A major algorithmic breakthrough for aligning AC reads was developed in the Bowtie alignment program (Langmead, et al. (2009) Genome Biology 10:R25) by using the Burrows-Wheeler Transformation (BWT) with a Ferragina-Manzini (FM) index (Ferragina, et al. (2000) In Proc. of the 41st IEEE Symposium on Foundations of Computer Science, pages 390-398) of a genome, rather than hash tables, to detect matches between a read and a genome. The BWT-FM index, described in detail below, supports O(Q) time queries for counting the number of times a query string is present in a target, where Q is the length of the query string. Reads that map to the genome without differences are found very quickly, and reads that have low error rates in the 5′ end may have their prefix mapped to a very small number of candidate positions in the genome before scoring each alignment (Li, et al. (2008) Genome Research 18:1851-1858). As a result, the Bowtie method was one to two orders of magnitude faster than hash-based methods (Langmead, et al. (2009) Genome Biology 10:R25). Other mapping algorithms such as MAQ and SOAP were revised to run queries using BWT-FM indices as well, with similar speedup (Li, et al. (2009) Bioinformatics 25:1754-1760; Li, et al. (2009) Bioinformatics 25:1966-1967). These methods have been robust in aligning data produced by recent updates of AC sequencing instruments, which have doubled the length of reads and increased throughput by nearly two orders of magnitude since the BWT-FM based methods were introduced.
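To make the O(Q) counting query concrete, the following is a small Python sketch of backward search over a BWT-FM index. It is an illustration under simplifying assumptions, not the Bowtie implementation: the names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built by naive sorting, the occurrence table is stored in full rather than sampled, and `$` is assumed not to occur in the target text.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM index: C table, full occurrence table, and BWT length."""
    text += "$"  # sentinel, assumed absent from the text and smaller than any base
    # Naive suffix array; adequate for a sketch, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    # C[c]: number of characters in the text strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(bwt)


def count_occurrences(query, fm_index):
    """Count exact occurrences of query with one table lookup pair per character."""
    C, occ, n = fm_index
    lo, hi = 0, n  # half-open range of sorted suffixes matching the current suffix
    for ch in reversed(query):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

As a usage example, `count_occurrences("TTA", build_fm_index("GATTACATTAG"))` returns 2. The loop performs a constant amount of work per query character regardless of the target length, which is the property that the O(Q) bound refers to.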
Advances in isolation and imaging of single molecules have facilitated the development of methods for sequencing single molecules. The Pacific Biosciences® Single-Molecule Real-Time (SMRT®) sequencing platform produces reads by detecting fluorescently labeled nucleotides as a template sequence is replicated by DNA polymerase (Korlach, et al. (2008) Nucleosides, Nucleotides and Nucleic Acids 27:1072-1083; Eid, et al. (2009) Science 323:133-138; Levene, et al. (2003) Science 299:682-686, the disclosures of which are incorporated herein by reference in their entireties for all purposes). The polymerase and template are bound to the bottom of a zero-mode waveguide (ZMW) that limits the detection volume to the zeptoliter scale, allowing the signal from the incorporated nucleotides to be distinguished from the background signal of nucleotides in solution. An alternative method has been shown to detect bases that have been cleaved by an exonuclease and pass through a protein nanopore by monitoring modulations of ionic current across the pore (Clarke, et al. (2009) Nature Nanotechnology 4:265-270, incorporated herein by reference in its entirety for all purposes). Finally, a method has been recently demonstrated for identifying bases that have translocated through a nanopore fabricated in a graphene membrane (Garaj, et al. (2010) Nature 467:190-193). The mapping methods written for AC sequencing reads will likely not work well on single molecule sequencing (SMS) reads, and older alignment methods such as BLAST will be too slow. Thus, there is a need for the development of new mapping methods for single molecule sequencing reads.