The present invention relates to the field of genomic sequencing and assembly, providing methods for assembling a genome using a shot-gun data set of DNA sequences.
The advent of automated DNA sequencing (Smith, L. M. et al., 1986, Nature 321, 674-679) and recent advances in scaling up this technology (Adams et al., 1994, Nature 368, 474-475) have led to new approaches for determining the genome sequences of humans and other organisms. Classic approaches have included shot-gun sequencing of lambda inserts (~10-20 kbp) or cosmids (~40 kbp), and this approach has been successfully applied to certain large-scale projects, such as the genomes of Escherichia coli (Sofia et al., 1994, Nucleic Acids Res 22, 2576-2586), Saccharomyces cerevisiae (Levy, 1994, Yeast, 10, 1689-1706), Caenorhabditis elegans (Sulston et al., 1992, Nature 356, 37-41), and humans (Martin-Gallardo et al., 1992, Nature Genet 1, 34-39; McCombie et al., 1992, Nature Genet 1, 348-353). The shot-gun strategy is based on sequencing many small (on the order of 1-2 kbp), randomly chosen subclones of the lambda or cosmid clones. Typically, 300-500 bp of sequence is determined from one or both ends of the subclones. Enough sequence fragments must be obtained, however, and their positions must be distributed evenly enough, that all or most of the target DNA is covered by at least one fragment. Ideally, each position should be covered by as many fragments as are required to resolve ambiguities and provide confidence in the accuracy of the final consensus sequence.
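By way of illustration only, the coverage requirement described above can be estimated with a simple calculation. The sketch below is not taken from the cited references; it assumes that fragments are placed uniformly at random on the target, so that the number of fragments covering a given base is approximately Poisson distributed. The function names and the example numbers are hypothetical.

```python
import math


def coverage_depth(num_fragments: int, read_length: int, target_length: int) -> float:
    """Average number of fragments covering each base: c = N * L / G."""
    return num_fragments * read_length / target_length


def expected_fraction_covered(num_fragments: int, read_length: int, target_length: int) -> float:
    """Expected fraction of the target covered by at least one fragment.

    Assumption: uniformly random fragment placement, so a given base is
    missed by every fragment with probability of roughly exp(-c).
    """
    c = coverage_depth(num_fragments, read_length, target_length)
    return 1.0 - math.exp(-c)


# Hypothetical example: 800 fragments of 500 bp sampled from a 40 kbp cosmid.
depth = coverage_depth(800, 500, 40_000)
fraction = expected_fraction_covered(800, 500, 40_000)
```

Under these assumptions the example gives a tenfold average depth, at which nearly every base is expected to be covered at least once. Real libraries deviate from uniform sampling, so the estimate is optimistic.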
Sequence assembly algorithms have been developed to reconstruct the original sequence from these fragments. The reconstruction process involves finding pairs of fragments that overlap, merging as many fragments together as possible, and creating a consensus sequence from the merged fragments. In the absence of complicating factors, sequence fragment assembly is a relatively straightforward programming problem. There are assembly algorithms available, including XBAP (Gleeson and Staden, 1991, Comput Appl Biosci 7, 398), AutoAssembler (Applied Biosystems, Foster City, Calif.), GCG (Dolz, 1994, Methods Mol Biol 24, 9-23), the TIGR Assembler, and several others (reviewed in Miller and Powell, 1994, J Comp Bio 1, 257-269). Each of these algorithms has certain advantages and disadvantages and works reasonably well on lambda- or cosmid-sized assemblies. However, these algorithms typically do not perform well on reconstructions that are an order of magnitude or more larger in scale. A more recent assembly program, FALCON (Gryan, 1994, Genome Sequencing and Analysis Conference VI, abstract E/B-3 (Mary Ann Liebert, Inc., Larchmont, N.Y.)), uses a fast pairwise comparison method similar to that of the TIGR Assembler and can handle larger assembly projects, but still has a practical upper limit of about 10 Mbp. While certain other assembly programs could possibly be scaled up with some redesign effort, what many of these programs lack is a strategy to detect and handle repeat regions.
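The overlap-and-merge process described above can be sketched, for illustration only, as a toy greedy procedure. This sketch is not the algorithm of any of the programs named above. It assumes error-free fragments from a single strand, exact suffix-to-prefix matches, and no fragment wholly contained in another, and it omits consensus calling altogether. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap_length(a, b, min_overlap)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical fragments that tile a 12-base target.
result = greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"])
```

A greedy merge of this kind examines every pair of sequences on each pass and has no means of distinguishing a true overlap from one induced by a repeat, which are the scaling and repeat problems discussed in this section.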
The lack of an adequate assembly engine has been one of the driving forces for maintaining a cosmid-based genome sequencing strategy, even for megabase-scale projects. The major problems to be solved in reconstructing much larger assemblies include (1) a dramatic increase in the number of pairwise comparisons required, (2) the increased likelihood of encountering repetitive DNA that confounds true fragment overlaps with false overlaps, and (3) the increased probability of encountering chimeric clones that can cause false overlaps or mask true overlaps. In addition, many commentators have stated that whole-genome shot-gun sequencing and assembly cannot be done for large genomes that contain repetitive elements (for example, see Green, 1997, Genome Research 7:410-417).
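The growth in the number of pairwise comparisons noted in item (1) can be illustrated with a short calculation. The sketch below assumes a naive scheme in which every fragment is compared with every other fragment once. The genome size, coverage, and read length are hypothetical round numbers chosen for illustration, not figures from this disclosure.

```python
def num_fragments(target_bp: int, coverage: int, read_bp: int) -> int:
    """Number of fragments needed for the given average coverage."""
    return target_bp * coverage // read_bp


def pairwise_comparisons(n: int) -> int:
    """Number of unordered fragment pairs: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical inputs: 10-fold coverage with 500 bp fragments.
cosmid_pairs = pairwise_comparisons(num_fragments(40_000, 10, 500))
genome_pairs = pairwise_comparisons(num_fragments(120_000_000, 10, 500))
growth = genome_pairs / cosmid_pairs
```

With these assumed inputs, a 40 kbp cosmid yields 800 fragments and 319,600 pairs, whereas a 120 Mbp target yields 2.4 million fragments and roughly 2.9 x 10^12 pairs. The target is 3,000 times longer, but the number of comparisons grows by about nine million-fold.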
The present invention is based on the sequencing and assembly of the Drosophila melanogaster genome. During the sequencing of the genome, it became clear that existing methods used for assembling a genome using sequence information obtained from fragments of the genome would not work with large genomes and/or genomes with any significant degree of repeat sequence structure. The present invention provides a method and system that can be used to assemble the genome of any organism using a shot-gun data set of sequenced fragments.
The assembler of the present invention allows for the assembly of large and complex genomes (of greater than 10 Mbp and up to 5 Gbp) using a total shot-gun approach, even when there are significant amounts of repetitive DNA in the genome being assembled. Herein, we explain how the present assembler handles these complicating factors.
The present invention provides a new approach to assembling large genomes using a shot-gun data set of fragments. The basic design methodology of one embodiment of the assembler of the present invention is to:
1) Identify the repeats and leave them for last;
2) Assemble the unambiguously assemblable fragments first; and
3) Use all data available, including mate pair data, to resolve the repeat regions.
In one embodiment, the assembler of the present invention includes the following modules performing specific stages in the assembly of large genomes:
1) Screener: The Screener module identifies all sequences in the raw data which contain heterochromatic DNA or a known repetitive element. Examples of repetitive elements include rDNA and known interspersed repeats such as transposons.
2) Overlapper: The Overlapper module performs an "all against all" comparison of all fragments and their complementary sequences output by the Screener module. The Overlapper module identifies all overlapping fragments that meet a minimal overlap value within a maximum percentage of mismatch. For example, the Overlapper module can be set to identify all fragments that have overlapping sequences of at least about 40 bases and allowing no more than about 6% mismatching.
3) Unitigger: Using the fragment overlap data produced by the Overlapper module, the Unitigger module identifies all consistent sub-assemblies of fragments ("unitigs") and further identifies all sub-assemblies that cover unique DNA ("U-unitigs").
4) Scaffolder: The Scaffolder module uses the U-unitig data in conjunction with mate pair data to generate an ordered framework ("scaffold") of U-unitigs for the genome. Input to the Scaffolder module includes mate pair data from fragment pairs of different insert sizes. For example, fragment pairs can be generated from 2 kbp, 10 kbp, and 150 kbp clone inserts. The Scaffolder module confirms the ordering of U-unitigs through the use of two or more confirming mate pairs. In a scaffold, any set of overlapping U-unitigs is merged to form a single contig.
5) Repeat Resolution: The Repeat Resolution module fills gaps between the contigs produced by the Scaffolder module. The Repeat Resolution module employs an iterative process that first uses the highest quality data, such as doubly anchored mate pairs, and then proceeds to use data with more uncertainty, such as overlap-path-confirmed, singly anchored mate pairs. Finally, the Repeat Resolution module employs a "greedy path" completion strategy that uses quality values to fill the repeat-induced gaps.
6) Consensus: The Consensus module generates a consensus sequence of the genome based on the information generated by the Scaffolder and Repeat Resolution modules. The Consensus module builds a Bayesian single nucleotide polymorphism ("SNP") consensus sequence using quality values obtained through the assembly process.
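The example Overlapper setting given in the list above (an overlap of at least about 40 bases with no more than about 6% mismatch) can be illustrated with a minimal sketch. The sketch is not the Overlapper module itself. It assumes an ungapped, position-by-position comparison of the suffix of one fragment with the prefix of another, whereas a practical overlap detector would generally also allow insertions and deletions. All function and constant names are hypothetical.

```python
MIN_OVERLAP = 40              # example minimum overlap, in bases
MAX_MISMATCH_FRACTION = 0.06  # example maximum mismatch fraction (6%)

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complementary-strand sequence, read in the opposite direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_overlap(a: str, b: str,
                 min_overlap: int = MIN_OVERLAP,
                 max_mismatch: float = MAX_MISMATCH_FRACTION):
    """Longest suffix(a)/prefix(b) overlap meeting both thresholds.

    Returns (overlap_length, mismatches), or None if no overlap qualifies.
    Assumption: ungapped comparison, so only substitutions are counted.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatch * k:
            return k, mismatches
    return None


# Hypothetical fragments sharing a 50-base region.
shared = "AT" * 25
left = "G" * 30 + shared
right = shared + "C" * 30
hit = best_overlap(left, right)
```

To cover both strands, as the Overlapper description requires, the same test would also be applied to `reverse_complement(right)`; running it over every fragment pair gives the "all against all" comparison.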
A detailed description of each of these elements and the operation of the algorithm is provided below. All references cited herein are incorporated by reference in their entirety.
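As one illustration of the Scaffolder rule stated in the module list above, under which the ordering of U-unitigs is confirmed by two or more mate pairs, the following sketch counts the mate pairs that link each pair of U-unitigs and keeps only links with sufficient support. It is a simplified assumption rather than the Scaffolder module: it ignores fragment orientation and the expected insert distance, both of which a practical scaffolder would check. The function name, the data layout, and the identifiers in the example are hypothetical.

```python
from collections import Counter


def confirmed_links(mate_pairs, fragment_to_unitig, min_confirming=2):
    """Return the U-unitig pairs linked by at least `min_confirming` mate pairs.

    mate_pairs         -- iterable of (fragment_id, fragment_id) tuples
    fragment_to_unitig -- dict mapping a fragment id to its U-unitig id
    """
    support = Counter()
    for left, right in mate_pairs:
        u = fragment_to_unitig.get(left)
        v = fragment_to_unitig.get(right)
        # Skip mates that are unplaced or that fall in the same U-unitig.
        if u is None or v is None or u == v:
            continue
        support[tuple(sorted((u, v)))] += 1
    return {pair for pair, count in support.items() if count >= min_confirming}


# Hypothetical placements: two mate pairs join U1 and U2, one joins U1 and U3.
placement = {"f1": "U1", "f2": "U2", "f3": "U1", "f4": "U2",
             "f5": "U3", "f6": "U1"}
pairs = [("f1", "f2"), ("f3", "f4"), ("f6", "f5"), ("f1", "f9")]
links = confirmed_links(pairs, placement)
```

In this example only the link between U1 and U2 is retained, because the single mate pair joining U1 and U3 falls below the threshold of two confirming mate pairs.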