Advances in biomolecule sequence determination, in particular with respect to nucleic acid and protein samples, have revolutionized the fields of cellular and molecular biology. Facilitated by the development of automated sequencing systems, it is now possible to sequence an entire genome, for example, of a micro-organism. However, the quality of the sequence information must be carefully monitored, and may be compromised by many factors related to the biomolecule itself or the sequencing system used, including the composition of the biomolecule (e.g., base composition of a nucleic acid molecule), experimental and systematic noise, variations in observed signal strength, and differences in reaction efficiencies. As such, processes must be implemented to analyze and improve the quality of the data from such sequencing technologies.
Besides affecting overall accuracy of sequence reads generated, these factors can complicate designation of a base-call as a true variant or, alternatively, a mis-call (e.g., insertion, deletion, or mismatch error in the sequence read). For example, in a diploid organism a chromosome can have loci that differ in sequence from the homologous chromosome. It is important to be able to call whether such a locus is a true variation between the homologues, or is a sequencing error. Furthermore, a viral population in an individual can have many variations between individual viral genomes in the population, especially in highly mutable viruses such as HIV. Being able to identify different sequencing reads that have different origins (e.g., different chromosome or genome origins) is key to being able to accurately characterize a nucleic acid. For a theoretical sequencing platform that generates reads that are 100% accurate, the reads can simply be compared to one another with simple string matching algorithms. Any difference between the reads is indicative of a true variant, and therefore, a different origin. However, any real-world raw sequencing data is likely to contain errors, so a simple string matching algorithmic approach will not be sufficient in most cases.
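As a minimal illustration of why exact string matching breaks down on error-containing reads, the following Python sketch compares two reads first by exact equality and then by edit distance. The function names, the example reads, and the `max_errors` threshold are hypothetical and chosen only for illustration; edit distance stands in here for any simple error-tolerant comparison and is not presented as the method of this disclosure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of insertions,
    deletions, and mismatches needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # base deleted from a
                curr[j - 1] + 1,                   # base inserted into a
                prev[j - 1] + (base_a != base_b),  # match or mismatch
            ))
        prev = curr
    return prev[-1]


def same_origin_exact(a: str, b: str) -> bool:
    """Only valid for hypothetical, 100% accurate reads."""
    return a == b


def same_origin_tolerant(a: str, b: str, max_errors: int) -> bool:
    """Hypothetical rule: allow up to `max_errors` differences."""
    return edit_distance(a, b) <= max_errors


# Two reads of the same (assumed) template; the second has a one-base deletion.
read_1 = "ACGTACGTAC"
read_2 = "ACGTCGTAC"
exact = same_origin_exact(read_1, read_2)           # False
tolerant = same_origin_tolerant(read_1, read_2, 1)  # True
```

Note that the tolerant comparison only trades one problem for another: a read carrying a true single-base variant would also pass the `max_errors = 1` test, so tolerance alone cannot separate variants from mis-calls.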
Sequencing applications generally fall into two categories, de novo assembly and re-sequencing. Both efforts require highly automated, accurate assembly of nucleic acid fragments into contigs. They differ from one another in that de novo assembly is performed using overlapping reads, while re-sequencing assumes knowledge of a reference sequence and maps reads to the reference. Although establishing relative read position is significantly easier for re-sequencing, the subsequent task of calling a consensus base for each aligned column in the contig or alignment is still challenging.
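The per-column consensus step can be sketched in its simplest form as a plurality vote over reads that have already been aligned. This Python sketch assumes equal-length aligned reads with `-` marking gaps; the function name and the example reads are hypothetical, and a plain vote is far simpler than the consensus algorithms cited in this background.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Call the most frequent symbol in each aligned column.

    Columns whose most frequent symbol is a gap are dropped from the
    consensus. Ties are resolved by whichever symbol is seen first,
    which is an arbitrary choice made for this sketch.
    """
    if len({len(read) for read in aligned_reads}) > 1:
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Four aligned reads: one carries a mismatch, one a deletion.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",
    "ACGT-C",
]
consensus = column_consensus(reads)  # "ACGTAC"
```

A vote of this kind treats every minority base as an error, which is exactly why it struggles when a minority base is a true variant rather than a mis-call.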
The standard of sequencing accuracy was set to 99.99% by the National Human Genome Research Institute (NHGRI) in 1998. While a single base-call for each position in a template may not achieve such accuracy, with increases in coverage, multiple overlapping sequencing reads for a template sequence having lower raw read accuracy can be used to determine a consensus sequence with acceptably high accuracy. Consensus calling algorithms attempt to distinguish sequencing error from variants (e.g., SNPs) using multiple “queries” for a given position. A variety of such algorithms have been developed to address changes in sequencing coverage, error profiles, and information accompanying base-calls as new sequencing systems are developed, e.g., Li, et al. (2008) Genome Res. 18:1851-1858; and Chen, et al. (2007), “PolyScan: An automatic indel and SNP detection approach to the analysis of human resequencing data,” Genome Res. 17(5):659-666. Other methods and algorithms that may be used with or are otherwise related to the methods provided herein are found in G. A. Churchill, M. S. Waterman (1992) “The Accuracy of DNA Sequences: Estimating Sequence Quality,” Genomics 14: 89-98; M. Stephens, et al. (2006) “Automating sequence-based detection and genotyping of SNPs from diploid samples,” Nat. Genet., 38: 375-381; Li, et al. (2008) “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research 18(11):1851-8; and Chen, et al. (2007), Genome Research, 17(5):659-666, the disclosures of which are incorporated herein by reference in their entireties for all purposes. Additional methods and algorithms that may be used with or are otherwise related to the methods provided herein are found in U.S. patent application Ser. No. 13/468,347, filed May 10, 2012; U.S. Patent Publication Nos. 2009-0024331 and 2010-0169026; and U.S. Patent Application Publication Nos. 2011-0257889 and 2011-0256631, both published Oct. 20, 2011, all of which are incorporated by reference herein in their entireties for all purposes.
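The relationship between coverage and consensus accuracy described above can be made concrete with a simple worked model. The Python sketch below assumes that each base-call at a position is correct independently with the same probability, and that the consensus is correct only when a strict majority of calls is correct (a pessimistic case in which all erroneous calls agree on the same wrong base). Both assumptions, and the function names, are illustrative; real error profiles are neither independent nor uniform.

```python
from math import comb


def consensus_accuracy(raw_accuracy: float, coverage: int) -> float:
    """Probability that a strict majority of `coverage` independent
    base-calls is correct, under the binomial model described above."""
    needed = coverage // 2 + 1
    error = 1.0 - raw_accuracy
    return sum(
        comb(coverage, k) * raw_accuracy**k * error**(coverage - k)
        for k in range(needed, coverage + 1)
    )


def min_odd_coverage(raw_accuracy: float, target: float = 0.9999,
                     limit: int = 999):
    """Smallest odd coverage whose modeled consensus accuracy reaches
    `target`, or None if no coverage up to `limit` does."""
    for coverage in range(1, limit + 1, 2):
        if consensus_accuracy(raw_accuracy, coverage) >= target:
            return coverage
    return None


# With 90% raw accuracy, three reads already give
# 0.9**3 + 3 * 0.9**2 * 0.1 = 0.972 under this model.
three_fold = consensus_accuracy(0.9, 3)
needed_for_target = min_odd_coverage(0.9)
```

Only odd coverages are searched so that a strict majority always exists. Under this model, reads that are no better than a coin flip (raw accuracy 0.5) never reach the target at any coverage.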
Most third party genome assemblers, e.g., Celera® Assembler®, assume that the overlap between the reads can be detected with high identity. For example, an overlap might be called when the identity in the alignment between two reads is above 94%. While it is not necessary to assemble the sequence of an entire genome using such stringent requirements (e.g., the ALLORA assembler from Pacific Biosciences, Menlo Park, Calif., can use reads that only have 70% identity between each other), it remains preferable to construct inputs whose overlap can be detected with high identity before passing them to a third party assembler. Moreover, when there are repeats in a genome, it is also favorable to generate input that can clearly distinguish the different repeats. Finally, it is also preferable that some artifacts due to sequencing reactions, e.g., chimeric reads and high quality region identification errors, can be filtered out before the assembly step.
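To make the identity thresholds concrete, the following Python sketch computes the identity of a pairwise overlap alignment and checks it against the 94% and 70% figures mentioned above. It assumes the two reads are already aligned, with `-` marking gaps, and defines identity as matching columns divided by total alignment columns; assemblers may define identity differently, and the function names and example alignment are hypothetical.

```python
def alignment_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Fraction of alignment columns in which both reads show the same base.
    Gap columns count toward the length but never as matches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return matches / len(aligned_a)


def overlap_accepted(aligned_a: str, aligned_b: str, min_identity: float) -> bool:
    """Hypothetical acceptance rule: identity must exceed `min_identity`."""
    return alignment_identity(aligned_a, aligned_b) > min_identity


# A 10-column overlap with one gap and one mismatch: 8 of 10 columns match.
overlap_a = "ACGT-ACGTA"
overlap_b = "ACGTTACGAA"
identity = alignment_identity(overlap_a, overlap_b)        # 0.8
strict = overlap_accepted(overlap_a, overlap_b, 0.94)      # False
permissive = overlap_accepted(overlap_a, overlap_b, 0.70)  # True
```

The same overlap is therefore rejected under a 94% requirement and accepted under a 70% requirement, which is the gap that higher-identity inputs are meant to close.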
Sequencing technologies that combine reads from libraries of different lengths of DNA have been developed to generate reads that can satisfy the more stringent input requirements for third party assemblers. However, most of these methods require preparation and separate sequencing of multiple DNA libraries.
Single Molecule, Real-Time (SMRT®) DNA sequencing provides a method to generate sequencing reads that are much longer than those possible with second-generation methods or even Sanger sequencing, thereby facilitating a more effective pathway towards de novo genome assemblies and genome finishing (see, e.g., Rasko, Webster et al., 2011; English, Richards, et al., 2012). For typical bacterial genome sizes (1-10 Mb), hybrid assembly approaches have been described which utilize the long SMRT® sequencing reads in conjunction with shorter reads (from SMRT® circular consensus sequencing reads or second-generation sequencing methods) for generating, for the first time, finished high-quality genome assemblies in automated workflows (Bashir, Klammer et al., 2012; Koren, Schatz et al., 2012; Ribeiro, Przybylski, et al., 2012). While these strategies have been applied successfully to a variety of microbes and also to eukaryotic organisms, the hybrid assembly principle requires the preparation of at least two different sequencing libraries and several types of sequencing runs (and sometimes several different sequencing methods). For more efficient and cost-effective genome closing, a homogeneous workflow requiring only one library and one sequencing method would be desirable.
The discussion of the background herein is included to explain the context of the technology. This is not to be taken as an admission that any of the material referred to was published, known, or part of the common general knowledge as at the priority date of any of the claims found appended hereto.
Throughout the description and claims of the specification, the word “comprise” and variations thereof, such as “comprising” and “comprises”, are not intended to exclude other additives, components, integers or steps.